CN117082156B - Intelligent analysis method for network flow big data - Google Patents

Intelligent analysis method for network flow big data

Info

Publication number
CN117082156B
Authority
CN
China
Prior art keywords
dictionary
compression
data
character
characters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311351287.8A
Other languages
Chinese (zh)
Other versions
CN117082156A (en
Inventor
陈小星
李俊
朱晓峰
黄佳
范向东
吴志坚
曹彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
JIANGSU YITONG HIGH-TECH CO LTD
Original Assignee
JIANGSU YITONG HIGH-TECH CO LTD
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by JIANGSU YITONG HIGH-TECH CO LTD filed Critical JIANGSU YITONG HIGH-TECH CO LTD
Priority to CN202311351287.8A priority Critical patent/CN117082156B/en
Publication of CN117082156A publication Critical patent/CN117082156A/en
Application granted granted Critical
Publication of CN117082156B publication Critical patent/CN117082156B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L69/00: Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
    • H04L69/04: Protocols for data compression, e.g. ROHC

Abstract

The invention relates to the technical field of data compression, and in particular to an intelligent analysis method for network traffic big data, comprising the following steps: acquiring the network traffic data to be analyzed and each dictionary character in the LZW dictionary; determining a compression contribution degree according to the number of times each dictionary character was used for compression before the current moment, its single compressible data amount and the total compressed data amount; determining a correlation index based on the difference between the time at which each dictionary character was last used for compression and the current time; and determining the retrieval priority of each dictionary character according to the compression contribution degree and the correlation index, and performing LZW dictionary compression on the network traffic data to be analyzed according to that priority to obtain the compressed network traffic data. By quantifying the retrieval priority of each dictionary character in the LZW dictionary, the invention reduces the time spent on dictionary retrieval and improves the compression efficiency of network traffic data.

Description

Intelligent analysis method for network flow big data
Technical Field
The invention relates to the technical field of data compression, in particular to an intelligent analysis method for network traffic big data.
Background
With the continuous development of information technology and the continuous growth of internet user data, the scale of network traffic keeps expanding, which can easily cause network congestion and outages and can also pose a serious threat to network security. To guard against such threats, large amounts of network traffic data can be collected, stored and analyzed in order to understand the characteristics and trends of the traffic and to manage it. Network traffic management typically requires processing large amounts of data; when the data volume is huge, computation and resource consumption increase, and compressing the network traffic data reduces the data volume.
In the prior art, LZW (Lempel-Ziv-Welch) compression is used to compress network traffic data. Because LZW is a dictionary-based compression method, the dictionary grows as compression proceeds, so each dictionary search takes longer; the reduced retrieval efficiency in turn lowers the compression efficiency of network traffic data to a certain extent.
Disclosure of Invention
In order to solve the technical problem of low network traffic data compression efficiency, the invention aims to provide a network traffic big data intelligent analysis method, which adopts the following technical scheme:
the embodiment of the invention provides a network traffic big data intelligent analysis method, which comprises the following steps:
acquiring network flow data to be analyzed and each dictionary character in the LZW dictionary;
determining the compression contribution degree of each dictionary character according to the number of times each dictionary character is used for compression, the single compressible data quantity and the total compressed data quantity of each dictionary character before the current moment;
determining a correlation index between each dictionary character and the network traffic data to be analyzed according to the difference between the time when each dictionary character is used for compression for the last time and the current time;
determining the retrieval priority of each dictionary character in the LZW dictionary at the current moment according to the compression contribution degree of each dictionary character and the correlation index between each dictionary character and the network flow data to be analyzed;
and carrying out LZW dictionary compression processing on the network traffic data to be analyzed according to the retrieval priority of each dictionary character to obtain compressed network traffic data to be analyzed.
Further, determining the compression contribution degree of each dictionary character according to the number of times each dictionary character was used for compression before the current moment, the single compressible data amount and the total compressed data amount of each dictionary character comprises:
for any dictionary character, determining the ratio of the single compressible data amount of the dictionary character to its total compressed data amount as a first compression contribution factor; determining the number of times the dictionary character was used for compression before the current moment as a second compression contribution factor; and determining the compression contribution degree of the dictionary character according to its first and second compression contribution factors.
Further, determining the compression contribution degree of the dictionary character according to the first compression contribution factor and the second compression contribution factor of the dictionary character comprises:
determining the product of the first compression contribution factor and the second compression contribution factor of the dictionary character as the compression contribution degree of the dictionary character.
Further, determining a correlation index between each dictionary character and the network traffic data to be analyzed according to the difference between the time at which each dictionary character was last used for compression and the current time comprises:
for any dictionary character, calculating the time difference between the current time and the time at which the dictionary character was last used for compression, and determining the time difference as a correlation factor; and determining the correlation index between the dictionary character and the network traffic data to be analyzed according to the correlation factor of the dictionary character.
Further, determining the correlation index between the dictionary character and the network traffic data to be analyzed according to the correlation factor of the dictionary character comprises:
performing inverse-proportion normalization on the correlation factor of the dictionary character, and determining the processed correlation factor as the correlation index between the dictionary character and the network traffic data to be analyzed.
Further, the calculation formula of the correlation index is as follows:
where $i_{Rel}$ is the correlation index between the $i$-th dictionary character and the network traffic data to be analyzed, $e$ is the natural constant, and $T_{i,INR}$ is the correlation factor of the $i$-th dictionary character.
Further, the determining the retrieval priority of each dictionary character in the LZW dictionary at the current moment according to the compression contribution degree of each dictionary character and the correlation index between each dictionary character and the network traffic data to be analyzed includes:
for any dictionary character, normalizing the correlation index between the dictionary character and the network traffic data to be analyzed to obtain a normalized correlation index; and determining the product of the normalized correlation index and the compression contribution degree of the dictionary character as the retrieval priority of the dictionary character.
Further, normalizing the correlation index between the dictionary character and the network traffic data to be analyzed to obtain the normalized correlation index comprises:
processing the correlation index between the dictionary character and the network traffic data to be analyzed with a tangent function to obtain a tangent value; and determining the product of the tangent value and π as the normalized correlation index.
Further, during the LZW dictionary compression processing, retrieval according to the positional order of the dictionary characters is replaced by retrieval according to the retrieval priority of each dictionary character.
Further, the current time is the acquisition time of the network traffic data to be analyzed.
The invention has the following beneficial effects:
The invention provides an intelligent analysis method for network traffic big data. It addresses the problem that, as LZW coding proceeds, the dictionary becomes increasingly redundant and the search time grows, which lowers the compression efficiency of network traffic data. First, the compression contribution degree of each dictionary character is quantified from its usage frequency in the historical compression process, its single compressible data amount and the total compressed data amount. The compression contribution degree characterizes how much a dictionary character has contributed to the overall compression process, so it is one of the key inputs to the retrieval priority computed later. Then, the correlation index between each dictionary character and the network traffic data to be analyzed is determined from the difference between the time at which the dictionary character was last used for compression and the current time. The correlation index also affects the value of the retrieval priority; determining it ties the dictionary characters more closely to the network traffic data to be analyzed and allows the compression efficiency to be improved adaptively. Combining the compression contribution degree and the correlation index quantifies the retrieval priority of each dictionary character and improves the accuracy of that priority. Rearranging the original dictionary characters based on their retrieval priority speeds up the retrieval of the expected characters, improves the compression efficiency of the network traffic data to be analyzed, and facilitates the analysis of network traffic big data. The method is mainly applicable to the technical field of data compression.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions and advantages of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are only some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method for intelligent analysis of network traffic big data according to the present invention;
FIG. 2 is a graph of the tangent function $\tan(i_{Rel})$.
Detailed Description
In order to further describe the technical means and effects adopted by the present invention to achieve the preset purpose, the following detailed description is given below of the specific implementation, structure, features and effects of the technical solution according to the present invention with reference to the accompanying drawings and preferred embodiments. In the following description, different "one embodiment" or "another embodiment" means that the embodiments are not necessarily the same. Furthermore, the particular features, structures, or characteristics of one or more embodiments may be combined in any suitable manner.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
The specific scenario addressed by the invention is as follows: LZW compression relies on a dictionary that it builds itself. As compression proceeds, the capacity of the LZW dictionary and the number of dictionary characters it contains gradually increase, so the time needed to find the expected character in each search may grow; that is, the dictionary retrieval time becomes longer, which leads to low compression efficiency when network traffic data is compressed.
In order to overcome the defect of low compression efficiency, the embodiment provides a network traffic big data intelligent analysis method, as shown in fig. 1, comprising the following steps:
S1, acquiring the network traffic data to be analyzed and each dictionary character in the LZW dictionary.
In this embodiment, managing network traffic and understanding its characteristics and trends requires collecting, storing and analyzing a large amount of network traffic data. LZW dictionary compression is applied when the network traffic data is stored in order to improve the efficiency of network traffic management, so the network traffic data that needs to be compressed and stored is taken as the network traffic data to be analyzed. Because the retrieval time of the LZW dictionary limits the compression efficiency, the arrangement of the dictionary characters in the LZW dictionary needs to be changed: dictionary characters that are more likely to be the expected characters used for compression are moved toward the front, which shortens each search and improves the compression efficiency of the network traffic data. This embodiment therefore obtains the network traffic data to be analyzed, that is, the network traffic data to be compressed, together with each dictionary character in the LZW dictionary.
It should be noted that LZW dictionary compression proceeds by matching the dictionary against the network traffic data to be analyzed. The dictionary is constructed during the LZW compression process, and the retrieval priority order of the dictionary characters is adjusted by analyzing the relationship between the dictionary characters and the compression process. The construction of the LZW dictionary is prior art, is outside the scope of the present invention, and is not described in detail here.
S2, determining compression contribution degree of each dictionary character according to the number of times each dictionary character is used for compression, single compressible data quantity and total compressed data quantity of each dictionary character before the current moment.
After dictionary characters have been built in the LZW dictionary, whenever the input data to be compressed contains data that matches a dictionary character, the code value of the matched dictionary character can directly replace the corresponding data, which completes the compression of that data. Some data to be compressed becomes a dictionary character on entering the dictionary but is never used afterwards as the base for building new dictionary characters; such dictionary characters enlarge the dictionary while contributing very little to compression efficiency. When a dictionary character is used to build a new dictionary character, the dictionary capacity is expanded and compression can then be performed with the new dictionary character; however, building the new dictionary character happens only once, whereas a dictionary character can be used for compression many times. The compression contribution degree of a dictionary character is therefore assessed mainly by whether and how often it is used in the compression process, that is, by its historical compression usage frequency. The data to be compressed here may be network traffic data.
In summary, the more often a dictionary character is retrieved to compress the data to be compressed, the higher its contribution to the overall compression, since it is frequently used to encode data. Conversely, if a dictionary character is rarely or never retrieved for compression after the corresponding data entered the dictionary, its contribution to the overall compression is low. It is therefore necessary to count the number of times each dictionary character was used for compression before the current time in order to quantify its compression contribution degree. Furthermore, the length of data that each dictionary character can compress differs: the longer the data a dictionary character represents, the more data it compresses each time it is used; this length is the single compressible data amount of the dictionary character. The compression contribution degree can thus be analyzed from the number of times each dictionary character is used for compression, its single compressible data amount and the total compressed data amount. The greater the compression contribution degree of a dictionary character, the more important it is and the higher its retrieval priority should be in subsequent searches.
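The usage statistics described above can be gathered while compressing. The following is a minimal sketch of textbook LZW over bytes, extended with per-entry bookkeeping; it is not the patent's implementation. The function name, the use of the output step index as a stand-in for time, and the Python `dict` (which uses hashing rather than the sequential search the description assumes) are all illustrative assumptions.

```python
def lzw_compress_with_stats(data: bytes):
    """Textbook LZW compression that also records, for every emitted code,
    how often it was used and at which output step it was last used."""
    # Initial dictionary: every single byte, codes 0..255.
    dictionary = {bytes([b]): b for b in range(256)}
    use_count = {}   # code -> number of times the code was emitted
    last_used = {}   # code -> index of the output step of the last emission
    codes = []
    current = b""

    def emit(entry: bytes) -> None:
        code = dictionary[entry]
        codes.append(code)
        use_count[code] = use_count.get(code, 0) + 1
        last_used[code] = len(codes) - 1  # assumed proxy for "time of last use"

    for byte in data:
        candidate = current + bytes([byte])
        if candidate in dictionary:
            current = candidate          # keep extending the match
        else:
            emit(current)                # output the longest known match
            dictionary[candidate] = len(dictionary)  # new entry gets next code
            current = bytes([byte])
    if current:
        emit(current)
    return codes, dictionary, use_count, last_used


codes, dictionary, use_count, last_used = lzw_compress_with_stats(b"ABABABA")
print(codes)            # emitted code sequence
print(len(dictionary))  # dictionary size after compression
```

The length of each entry in `dictionary` corresponds to the single compressible data amount, and `use_count` to the number of times an entry was used for compression.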
In this embodiment, the specific implementation steps of the compression contribution degree of the dictionary characters may include:
for any dictionary character, determining the ratio of the single compressible data amount of the dictionary character to its total compressed data amount as a first compression contribution factor; determining the number of times the dictionary character was used for compression before the current moment as a second compression contribution factor; and determining the product of the first compression contribution factor and the second compression contribution factor of the dictionary character as the compression contribution degree of the dictionary character.
As an example, the calculation formula of the compression contribution of dictionary characters may be:
$$C_i = N_{T,i} \times \frac{R_i}{\mathrm{Sum}_{T,i}}$$
where $C_i$ is the compression contribution degree of the $i$-th dictionary character; $N_{T,i}$ is the number of times the $i$-th dictionary character was used for compression before the current time, which is also its second compression contribution factor; $R_i$ is the single compressible data amount of the $i$-th dictionary character; $\mathrm{Sum}_{T,i}$ is the total compressed data amount of the $i$-th dictionary character before the current time; and $\frac{R_i}{\mathrm{Sum}_{T,i}}$ is the first compression contribution factor of the $i$-th dictionary character.
In this formula, the $i$-th dictionary character can be any dictionary character. The first compression contribution factor $\frac{R_i}{\mathrm{Sum}_{T,i}}$ characterizes how much the dictionary character contributes, each time it is used, to the data already compressed. The larger the single compressible data amount $R_i$ and the greater the number of historical uses $N_{T,i}$, the more of the compressed data was compressed by the $i$-th dictionary character, and the higher its contribution to the overall compression process compared with dictionary characters that entered the dictionary but were not used for compression.
Thus, the contribution of each dictionary character to the entire compression process can be quantified from the number of times it was used for compression before the current time, its single compressible data amount and the total compressed data amount. The current time is the acquisition time of the network traffic data to be analyzed, that is, the point in time at which compression of the network traffic data to be analyzed begins.
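The compression contribution degree, defined in step S2 as the product of the usage count and the ratio of the single compressible data amount to the total compressed data amount, can be sketched as follows. The function and parameter names are hypothetical, and returning 0 when nothing has been compressed yet is an assumed convention that the description does not address.

```python
def compression_contribution(use_count: int,
                             single_amount: float,
                             total_amount: float) -> float:
    """Compression contribution degree C_i = N_{T,i} * (R_i / Sum_{T,i}).

    use_count     -- N_{T,i}: times the entry was used for compression so far
    single_amount -- R_i: data amount the entry compresses in a single use
    total_amount  -- Sum_{T,i}: total compressed data amount so far
    """
    if total_amount <= 0:
        return 0.0  # assumed convention: no history means no contribution
    first_factor = single_amount / total_amount   # first contribution factor
    second_factor = use_count                     # second contribution factor
    return second_factor * first_factor


# Hypothetical numbers: two 3-character entries, 12 units compressed so far.
print(compression_contribution(4, 3, 12))  # entry used four times
print(compression_contribution(2, 3, 12))  # entry used twice
```

With equal entry lengths, the entry used more often gets the larger contribution degree, which is the behaviour the description attributes to the formula.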
And S3, determining a correlation index between each dictionary character and the network traffic data to be analyzed according to the difference between the time when each dictionary character is used for compression last time and the current time.
For example, suppose the dictionary character ABB has the code value 278 and was retrieved for compression at 0.5 ms, 1.5 ms, 2.5 ms and 7.5 ms, with no retrieval afterwards, whereas the dictionary character BCC has the code value 305 and was retrieved for compression at 10.5 ms and 10.9 ms. The current time is 12 ms.
Dictionary character ABB has been retrieved 4 times and dictionary character BCC 2 times, so ABB has been retrieved more often than BCC. However, ABB has not been used for compression since 7.5 ms, whereas BCC was used twice within the last 1.5 ms. Over time ABB may be retrieved again, but for the data to be compressed at the current moment the correlation of BCC is clearly higher than that of ABB. Because ABB sits earlier in the dictionary than BCC, i.e. the code value of BCC is larger than that of ABB, BCC is reached only after ABB in every search. To reduce the retrieval time, the ideal retrieval order would reach BCC first so that it can be returned and used for compression sooner.
Thus, by quantifying the difference between the time at which a dictionary character was last used for compression and the current time, a correlation index between the dictionary character and the network traffic data to be analyzed can be obtained. The specific implementation steps for obtaining the correlation index may include:
for any dictionary character, calculating the time difference between the current time and the time at which the dictionary character was last used for compression, and determining the time difference as a correlation factor; and performing inverse-proportion normalization on the correlation factor of the dictionary character, and determining the processed correlation factor as the correlation index between the dictionary character and the network traffic data to be analyzed.
As an example, the calculation formula of the correlation factor may be:
$$T_{i,INR} = T_{now} - T_{i,final}$$
where $T_{i,INR}$ is the correlation factor of the $i$-th dictionary character, $T_{now}$ is the time at which the network traffic data to be analyzed is acquired, and $T_{i,final}$ is the time at which the $i$-th dictionary character was last used for compression.
It should be noted that the current time advances as compression proceeds, whereas the time at which a dictionary character was last used for compression is fixed at any given moment. A dictionary character may be used for compression many times, but its last use does not necessarily coincide with the current time; it lies at or before the current time and is a fixed value.
In this formula, the correlation factor $T_{i,INR}$ characterizes the correlation between a dictionary character and the network traffic data to be analyzed. For a dictionary character that is highly correlated with the network traffic data to be analyzed, the time at which it was last used for compression is close to the current time, which is the latest time node. For a dictionary character with low correlation, the time at which it was last used for compression is far from the current time; as that interval keeps growing, the correlation decreases.
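As a worked instance of the correlation factor, take the two dictionary characters from the example in step S3: ABB was last used for compression at 7.5 ms, BCC at 10.9 ms, and the current time is 12 ms. The subscripts ABB and BCC are used here only as labels for those two entries.

```latex
\begin{align*}
T_{ABB,INR} &= T_{now} - T_{ABB,final} = 12 - 7.5 = 4.5\ \text{ms} \\
T_{BCC,INR} &= T_{now} - T_{BCC,final} = 12 - 10.9 = 1.1\ \text{ms}
\end{align*}
```

BCC has the smaller correlation factor, which agrees with the observation that it is the more relevant entry at the current time even though ABB has been used more often overall.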
As an example, the calculation formula of the correlation index may be:
where $i_{Rel}$ is the correlation index between the $i$-th dictionary character and the network traffic data to be analyzed, $e$ is the natural constant, and $T_{i,INR}$ is the correlation factor of the $i$-th dictionary character.
In the calculation formula of the correlation index, the smaller the correlation factor $T_{i,INR}$ of a dictionary character, the larger the correlation index between the dictionary character and the network traffic data to be analyzed, so the correlation factor and the correlation index are negatively correlated. Under this normalization, the correlation index gradually approaches 0 as the correlation factor increases, so the term of the formula that depends on $T_{i,INR}$ asymptotically approaches 1 as $T_{i,INR}$ increases. The larger the correlation index $i_{Rel}$, the stronger the correlation between the $i$-th dictionary character and the network traffic data to be analyzed; the smaller the index, the weaker the correlation. Of course, the implementer may also realize the inverse-proportion normalization of the correlation factor by other means, which is not specifically limited here.
It should therefore be noted that, when dictionary characters are retrieved, besides how often a dictionary character has been retrieved as the expected character for compression, the effect on compression efficiency of the time interval since the dictionary character was last used should also be considered; this is why the correlation index between each dictionary character and the network traffic data to be analyzed is determined. The correlation index is positively correlated with the subsequent retrieval priority: the larger the correlation index, the shorter the interval between the most recent use of the dictionary character for compression and the current time, and the higher the retrieval priority of the dictionary character.
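The following sketch illustrates one possible inverse-proportion normalization of the correlation factor. The exact formula of the correlation index is not legible in this text, so $e^{-T_{i,INR}}$ is used purely as an assumed stand-in: it involves the natural constant, decreases as the correlation factor grows, and approaches 0 for large factors, which are the properties the description states. The description also allows other normalizations. The function name and the millisecond unit are assumptions.

```python
import math


def correlation_index(t_now: float, t_last_used: float) -> float:
    """Assumed inverse-proportion normalization: exp(-T_INR).

    T_INR = t_now - t_last_used is the correlation factor. The exponential
    is an illustrative choice, not the formula from the patent.
    """
    t_inr = t_now - t_last_used
    if t_inr < 0:
        raise ValueError("last use cannot lie after the current time")
    return math.exp(-t_inr)  # 1.0 when just used, tends to 0 as T_INR grows


# Times in milliseconds, taken from the ABB / BCC example.
index_abb = correlation_index(12.0, 7.5)
index_bcc = correlation_index(12.0, 10.9)
print(index_abb < index_bcc)  # the more recently used entry scores higher
```

Any other monotonically decreasing mapping of the correlation factor into a bounded range would fit the same role.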
And S4, determining the retrieval priority of each dictionary character in the LZW dictionary at the current moment according to the compression contribution degree of each dictionary character and the correlation index between each dictionary character and the network flow data to be analyzed.
It should be noted that the number of dictionary characters in the dictionary grows over time, and each time data is compressed the dictionary must be searched once to find the dictionary character that can be used for compression. If the expected character is close to the entry point of the search, the search takes much less time; conversely, if the expected character is near the end of the dictionary, the search is relatively inefficient. Therefore, the arrangement of dictionary characters in the dictionary is adjusted dynamically according to each dictionary character's contribution to the data already compressed and its correlation with the data to be compressed, so that each search reaches the expected character sooner, the dictionary is used more efficiently overall, and network traffic data is compressed faster.
In the data that has already been compressed, dictionary characters with a higher compression contribution degree were used for compression more frequently. Compared with other dictionary characters, a dictionary character with a high compression contribution degree provided a large amount of compression for certain data segments. However, as new data to be compressed keeps entering the LZW dictionary, some dictionary characters that used to be used frequently may no longer help in compressing the current data, so the interval between the time at which a high-contribution dictionary character was last used for compression and the current time, that is, its correlation with the network traffic data to be analyzed, must also be considered. A dictionary character with a high compression contribution degree but poor correlation that sits near the front of the dictionary slows down the retrieval of dictionary characters with better correlation. The retrieval priority of dictionary characters that have excellent correlation but sit toward the back of the dictionary therefore needs to be raised so that they can be retrieved more quickly.
In this embodiment, the priority order in which dictionary characters are arranged in the dictionary is determined by considering the compression contribution degree together with the correlation between the dictionary characters and the network traffic data to be analyzed. Specifically, LZW compression continuously compresses incoming data, and the dictionary characters that help compression efficiency most are those with a high correlation with the network traffic data to be analyzed. Correlation and compression contribution degree are related: if a dictionary character with a high compression contribution degree does not take part in compression for a long time, its compression contribution degree gradually decreases as the amount of compressed data grows; if the correlation between a dictionary character and the network traffic data to be analyzed is rising, its compression contribution degree also rises as it is frequently used for compression.
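The effect of an entry's position on search cost can be illustrated with a plain sequential lookup that counts comparisons. This is a sketch under the assumption, implied by the description, that the dictionary is scanned in order; the function name and the three-entry list are hypothetical.

```python
def sequential_lookup(entries, target):
    """Scan entries front to back; return (index, comparisons made).

    Returns (None, len(entries)) when the target is not present.
    """
    for comparisons, entry in enumerate(entries, start=1):
        if entry == target:
            return comparisons - 1, comparisons
    return None, len(entries)


original_order = ["ABB", "BCA", "BCC"]   # BCC sits at the end
reordered = ["BCC", "ABB", "BCA"]        # BCC moved to the front

print(sequential_lookup(original_order, "BCC"))  # found after 3 comparisons
print(sequential_lookup(reordered, "BCC"))       # found after 1 comparison
```

The number of comparisons grows with the position of the expected character, which is why moving likely matches toward the front shortens each search.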
The specific steps for determining the retrieval priority of dictionary characters may include:
for any dictionary character, normalizing the correlation index between the dictionary character and the network traffic data to be analyzed to obtain a normalized correlation index; and determining the product of the normalized correlation index and the compression contribution degree of the dictionary character as the retrieval priority of the dictionary character. The normalized correlation index is determined as follows: the correlation index between the dictionary character and the network traffic data to be analyzed is processed with a tangent function to obtain a tangent value; the product of the tangent value and π is determined as the normalized correlation index.
As an example, the calculation formula of the retrieval priority of dictionary characters may be:

$$P_i = C_i \times \pi \times \tan\left(Rel_i\right)$$

where $P_i$ is the retrieval priority of the i-th dictionary character, $C_i$ is the compression contribution degree of the i-th dictionary character, $Rel_i$ is the correlation index between the i-th dictionary character and the network traffic data to be analyzed, π is the circular constant, and tan is the tangent function.
In the calculation formula of the retrieval priority, the higher the retrieval priority $P_i$ of a dictionary character, the earlier the character is examined during the search for compression. If a dictionary character does not participate in compression for a long time, its compression contribution degree $C_i$ gradually decreases and its correlation index $Rel_i$ with the network traffic data to be analyzed also gradually decreases, so its retrieval priority $P_i$ decreases as well; the compression contribution degree $C_i$ and the correlation index $Rel_i$ are therefore both positively correlated with the retrieval priority $P_i$. While the compression contribution degree $C_i$ is decreasing, an increase in the correlation between the dictionary character and the network traffic data to be analyzed can, to a certain extent, slow the decline of the retrieval priority. As the correlation index rises, its influence on compression efficiency gradually exceeds that of the compression contribution degree, so the correlation index needs to change more steeply in order to affect the retrieval priority more strongly. The graph of the tangent function tan matches this expected behavior of the correlation index, its steepness can be adjusted, and the value interval of the correlation index is limited to between 0 and 1. The factor π is a positive constant that scales the tangent value.
The tangent function $\tan(Rel_i)$ is shown in Fig. 2, where the abscissa represents the correlation index and the ordinate represents the retrieval priority. It can be observed that a small increase in the correlation index produces an increasingly sharp rise in the retrieval priority, for example by comparing the ordinate values at x=0.1, x=0.5, and x=0.8.
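The priority rule described above can be tried out numerically. The following is a minimal sketch, assuming the retrieval priority is the product of the compression contribution degree, π, and the tangent of a correlation index in [0, 1]; the function name `retrieval_priority` and the sample values are hypothetical and chosen only for illustration.

```python
import math


def retrieval_priority(contribution: float, rel: float) -> float:
    """Sketch of P_i = C_i * pi * tan(Rel_i); Rel_i is assumed to lie in [0, 1]."""
    if not 0.0 <= rel <= 1.0:
        raise ValueError("correlation index is expected in [0, 1]")
    return contribution * math.pi * math.tan(rel)


# Same contribution, three correlation indexes (the x values mentioned for Fig. 2).
priorities = [retrieval_priority(1.0, rel) for rel in (0.1, 0.5, 0.8)]

# Hypothetical comparison: a long-unused character with a high contribution
# versus a recently used character with a lower contribution.
stale = retrieval_priority(contribution=10.0, rel=0.05)
fresh = retrieval_priority(contribution=2.0, rel=0.8)
```

With these illustrative inputs the recently used character ends up with the higher priority even though its contribution is five times smaller, which is the reordering effect the description aims for.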
S5, performing LZW dictionary compression processing on the network traffic data to be analyzed according to the retrieval priority of each dictionary character, to obtain the compressed network traffic data to be analyzed.
In this embodiment, after the retrieval priority of each dictionary character in the LZW dictionary is obtained, the search in LZW dictionary compression, which originally follows the positional order of the dictionary characters, is replaced by a search that follows the retrieval priority of each dictionary character; that is, the arrangement of the dictionary characters is no longer fixed, which constitutes the improvement to the LZW dictionary. The improved LZW dictionary is then used to perform LZW dictionary compression processing on the network traffic data to be analyzed, yielding the compressed network traffic data to be analyzed. The implementation of LZW dictionary compression itself is prior art, is outside the scope of protection of the present invention, and is not described in detail here.
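To make the idea of a priority-ordered search concrete, the sketch below runs a textbook LZW encoder whose dictionary lookup is a linear scan in descending priority order. It is an illustration under stated assumptions, not the patented implementation: the names `find` and `lzw_compress` are hypothetical, a simple hit counter stands in for the retrieval priority, every input character is assumed to be in the initial alphabet, and the order is re-sorted at every step only to keep the code short.

```python
from typing import Dict, List, Tuple


def find(entry: str, order: List[str]) -> Tuple[bool, int]:
    """Linear scan over `order`; returns (found, number of entries examined)."""
    for probes, candidate in enumerate(order, start=1):
        if candidate == entry:
            return True, probes
    return False, len(order)


def lzw_compress(data: str, alphabet: str) -> Tuple[List[int], int]:
    """Textbook LZW encoding with a priority-ordered dictionary scan."""
    codes: Dict[str, int] = {ch: i for i, ch in enumerate(alphabet)}
    hits: Dict[str, int] = {ch: 0 for ch in alphabet}  # placeholder priority
    output: List[int] = []
    total_probes = 0
    w = ""
    for ch in data:
        # Search order: highest placeholder priority first (ties keep insertion order).
        order = sorted(codes, key=lambda e: -hits[e])
        found, probes = find(w + ch, order)
        total_probes += probes
        if found:
            w += ch
            hits[w] += 1
        else:
            output.append(codes[w])        # emit the code of the longest match
            codes[w + ch] = len(codes)     # register the new dictionary entry
            hits[w + ch] = 0
            w = ch
    if w:
        output.append(codes[w])
    return output, total_probes


codes_out, probes_used = lzw_compress("ABABABA", "AB")
print(codes_out)  # [0, 1, 2, 4]
```

The emitted codes are the same as for an unmodified LZW encoder, because the search order only changes how many entries are examined, not which entry is matched.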
It should be noted that raising the retrieval priority of expected characters that have a higher correlation with the network traffic data to be analyzed, and are therefore more likely to be used for compression or for constructing new dictionary characters, allows them to be retrieved faster. This reduces the time spent in the retrieval stage of the overall compression process, improves dictionary compression efficiency, and thereby improves the compression efficiency of the network traffic data to be analyzed.
For example, suppose the dictionary characters in the dictionary are A, B, C, D, E, F, G, H, J, K, L, M, N, with corresponding retrieval priorities A (8), B (5), C (10), D (6), E (4), F (13), G (2), H (3), J (7), K (9), L (11), M (12) and N (1), where A (8) means that the retrieval priority of dictionary character A is 8. Suppose data M now needs to be compressed. When a conventional LZW dictionary is searched sequentially in the original dictionary order, the search starts at dictionary character A and must examine 12 dictionary characters to reach M. When the search follows the retrieval priority instead, F (13), which has the highest retrieval priority, is examined first and M (12) second, so only 2 dictionary characters are examined before the expected character M is found and compression proceeds. This reduces the time spent in the retrieval stage and improves the compression efficiency of network traffic data.
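The counts in this example can be reproduced with a few lines of Python. This is only a sketch of the comparison; the helper name `probes_needed` is hypothetical, and the priority values are those listed in the example.

```python
# Retrieval priorities from the example, listed in original dictionary order.
priorities = {"A": 8, "B": 5, "C": 10, "D": 6, "E": 4, "F": 13, "G": 2,
              "H": 3, "J": 7, "K": 9, "L": 11, "M": 12, "N": 1}


def probes_needed(target: str, order: list) -> int:
    """Number of entries a linear scan examines before reaching `target`."""
    return order.index(target) + 1


positional_order = list(priorities)  # A, B, C, ... as stored in the dictionary
priority_order = sorted(priorities, key=priorities.get, reverse=True)  # F, M, L, ...

print(probes_needed("M", positional_order))  # 12
print(probes_needed("M", priority_order))    # 2
```

Note that the reordering is not free for every character: A is found after 1 probe in positional order but after 6 in priority order, so the benefit depends on how well the priorities track which characters the incoming data actually needs.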
It should be noted that the retrieval priority of each dictionary character in the dictionary may be updated either periodically, at a preset time interval, or conditionally, when a preset trigger threshold is reached.
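The two update modes can be combined in a small policy object. The sketch below is one possible reading, not something the text specifies: the class name `RefreshPolicy`, the use of seconds for the timer, and the choice of "number of dictionary lookups since the last update" as the trigger quantity are all assumptions made for illustration.

```python
class RefreshPolicy:
    """Decides when retrieval priorities should be recomputed (hypothetical helper)."""

    def __init__(self, interval_s: float, lookup_threshold: int) -> None:
        self.interval_s = interval_s              # timed update period (assumed unit: seconds)
        self.lookup_threshold = lookup_threshold  # assumed trigger: lookups since last update
        self.last_refresh = 0.0
        self.lookups_since_refresh = 0

    def record_lookup(self) -> None:
        """Call once per dictionary search."""
        self.lookups_since_refresh += 1

    def should_refresh(self, now: float) -> bool:
        """True if either the timer has elapsed or the trigger threshold is reached."""
        timed = now - self.last_refresh >= self.interval_s
        triggered = self.lookups_since_refresh >= self.lookup_threshold
        return timed or triggered

    def mark_refreshed(self, now: float) -> None:
        """Reset both conditions after the priorities have been recomputed."""
        self.last_refresh = now
        self.lookups_since_refresh = 0


policy = RefreshPolicy(interval_s=10.0, lookup_threshold=3)
```

A caller would invoke `record_lookup()` after each search and recompute the priorities whenever `should_refresh(now)` returns true; passing the clock value in explicitly keeps the sketch easy to test.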
The invention provides an intelligent analysis method for network traffic big data. The method analyzes the dictionary characters of the LZW dictionary, quantifies the retrieval priority of each dictionary character according to the number of times it has been used for compression and its correlation with the data to be compressed, and orders the search of dictionary characters in the dictionary by retrieval priority, which speeds up retrieval of the expected dictionary characters and improves the compression efficiency of network traffic data.
The above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention and are intended to be included within the scope of the invention.

Claims (9)

1. The intelligent analysis method for the network traffic big data is characterized by comprising the following steps:
acquiring network flow data to be analyzed and each dictionary character in the LZW dictionary;
determining the compression contribution degree of each dictionary character according to the number of times each dictionary character is used for compression, the single compressible data quantity and the total compressed data quantity of each dictionary character before the current moment;
determining a correlation index between each dictionary character and the network traffic data to be analyzed according to the difference between the time when each dictionary character is used for compression for the last time and the current time;
determining the retrieval priority of each dictionary character in the LZW dictionary at the current moment according to the compression contribution degree of each dictionary character and the correlation index between each dictionary character and the network flow data to be analyzed;
according to the retrieval priority of each dictionary character, performing LZW dictionary compression processing on the network traffic data to be analyzed to obtain compressed network traffic data to be analyzed;
and the current time is the acquisition time of the network flow data to be analyzed.
2. The intelligent analysis method according to claim 1, wherein determining the compression contribution degree of each dictionary character according to the number of times each dictionary character is used for compression, the single compressible data quantity and the total compressed data quantity of each dictionary character before the current moment comprises:
for any dictionary character, determining the ratio of the single compressible data quantity of the dictionary character to the total compressed data quantity as a first compression contribution factor; determining the number of times the dictionary character was used for compression before the current moment as a second compression contribution factor; and determining the compression contribution degree of the dictionary character according to the first compression contribution factor and the second compression contribution factor of the dictionary character.
3. The intelligent analysis method of network traffic big data according to claim 2, wherein determining the compression contribution degree of the dictionary character according to the first compression contribution factor and the second compression contribution factor of the dictionary character comprises:
and determining the product of the first compression contribution factor and the second compression contribution factor of the dictionary characters as the compression contribution degree of the dictionary characters.
4. The intelligent analysis method according to claim 1, wherein determining the correlation index between each dictionary character and the network traffic data to be analyzed according to the difference between the time when each dictionary character was last used for compression and the current time comprises:
for any dictionary character, calculating the time difference between the current time and the time when the dictionary character was last used for compression, and determining the time difference as a correlation factor; and determining the correlation index between the dictionary character and the network traffic data to be analyzed according to the correlation factor of the dictionary character.
5. The intelligent analysis method according to claim 4, wherein determining the correlation index between the dictionary character and the network traffic data to be analyzed according to the correlation factor of the dictionary character comprises:
and carrying out inverse proportion normalization processing on the correlation factors of the dictionary characters, and determining the processed correlation factors as correlation indexes between the dictionary characters and the network traffic data to be analyzed.
6. The intelligent analysis method for network traffic big data according to claim 5, wherein the calculation formula of the correlation index is:
wherein $Rel_i$ is the correlation index between the i-th dictionary character and the network traffic data to be analyzed, e is the natural constant, and $T_{i,INR}$ is the correlation factor of the i-th dictionary character.
7. The intelligent analysis method of network traffic big data according to claim 1, wherein determining the retrieval priority of each dictionary character in the LZW dictionary at the current moment according to the compression contribution degree of each dictionary character and the correlation index between each dictionary character and the network traffic data to be analyzed comprises:
for any dictionary character, carrying out normalization data processing on correlation indexes between the dictionary character and network flow data to be analyzed, and obtaining correlation indexes after normalization data processing; and determining the product of the correlation index after the normalization data processing and the compression contribution degree of the dictionary characters as the retrieval priority of the dictionary characters.
8. The intelligent analysis method of network traffic big data according to claim 7, wherein the normalizing the correlation index between the dictionary character and the network traffic data to be analyzed to obtain the correlation index after normalizing the data processing includes:
processing the correlation index between the dictionary characters and the network flow data to be analyzed by using a tangent function to obtain a tangent value; the product of the tangent value and pi is determined as a correlation index after normalized data processing.
9. The intelligent analysis method of network traffic big data according to claim 1, wherein, in the LZW dictionary compression processing, searching according to the positional order of the dictionary characters is replaced by searching according to the retrieval priority of each dictionary character.
CN202311351287.8A 2023-10-18 2023-10-18 Intelligent analysis method for network flow big data Active CN117082156B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311351287.8A CN117082156B (en) 2023-10-18 2023-10-18 Intelligent analysis method for network flow big data

Publications (2)

Publication Number Publication Date
CN117082156A CN117082156A (en) 2023-11-17
CN117082156B true CN117082156B (en) 2024-01-26

Family

ID=88713869

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311351287.8A Active CN117082156B (en) 2023-10-18 2023-10-18 Intelligent analysis method for network flow big data

Country Status (1)

Country Link
CN (1) CN117082156B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117539913A (en) * 2023-12-08 2024-02-09 杭州易靓好车互联网科技有限公司 Insurance data management method and system for automobile transaction platform
CN117792403A (en) * 2024-02-26 2024-03-29 成都农业科技职业学院 Distributed agricultural data storage management method based on stream big data technology

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104077272A (en) * 2014-06-23 2014-10-01 华为技术有限公司 Method and device for compressing dictionary
CN106687966A (en) * 2014-08-05 2017-05-17 伊卢米纳剑桥有限公司 Methods and systems for data analysis and compression
CN114025024A (en) * 2021-10-18 2022-02-08 中国银联股份有限公司 Data transmission method and device
CN116013488A (en) * 2023-03-27 2023-04-25 中国人民解放军总医院第六医学中心 Intelligent security management system for medical records with self-adaptive data rearrangement function
CN116775589A (en) * 2023-08-23 2023-09-19 湖北华中电力科技开发有限责任公司 Data security protection method for network information

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030030575A1 (en) * 2001-05-07 2003-02-13 Harmonic Data Systems Ltd. Lossless data compression
US7003039B2 (en) * 2001-07-18 2006-02-21 Avideh Zakhor Dictionary generation method for video and image compression

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant