CN115905871B - Matrix similarity-based network transmission file information rapid judging method and system

Info

Publication number: CN115905871B
Application number: CN202211596171.6A
Authority: CN
Inventors: 张宏; 梁元
Original assignee: Zhejiang Lab
Current assignee: Zhejiang Lab
Priority date: 2022-12-12
Filing date: 2022-12-12
Publication date: 2023-08-22
Anticipated expiration: 2042-12-12
Also published as: CN115905871A

Abstract

The invention discloses a method for rapidly judging network transmission file information based on matrix similarity. During network transmission, timing differences cause the received data packets to map to a matrix whose columns are shifted; the method quickly negates dissimilar matrices with a Hamming weight check and then judges matrix similarity exactly. The matrices first undergo a coarse similarity screening, the matrices retained by the coarse screening are then compared exactly, and finally the similarity of the matrices is decided. The invention further comprises a system for rapidly judging network transmission file information based on matrix similarity. The method applies mainly to information theory, encrypted communication, coding theory, cryptography and related fields, and can be extended to judging the similarity of matrices over the integer domain.

Description

Matrix similarity-based network transmission file information rapid judging method and system

Technical Field

The invention relates to a method and a system for quickly judging network transmission file information.

Background

With the advent of the era of big data and cloud computing, and especially with the wide use of mobile terminals, data transmission over networks is becoming ever more common. At the network data protocol layer, data packets must be processed quickly and efficiently to meet real-time business requirements. When a large file is sent over a network, the sender first generates summary information for the file, builds data packets from that summary information, and sends the original file together with the summary information to the receiver. The receiver generates summary information from the received original file, packs it into data packets and abstracts those packets into a matrix; it likewise generates a matrix from the received summary information. The correspondence between a summary and its original file is then judged by comparing the similarity of the matrices. The specific application scenario is shown in FIG. 1.

When a document is divided into fixed-length data packets and transmitted over a network, timing interference causes the received packets to arrive out of order. Quickly judging such packets is one of the application scenarios this method addresses. The problem can be abstracted as judging the similarity of column-transformed matrices over the binary field: because the columns are displaced, the two matrices cannot be compared directly row by row.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provides a method and a system for quickly judging network transmission file information based on matrix similarity.

Binary-field matrix operations are widely used in information theory, encrypted communication, coding theory, cryptography and related fields. The method is mainly used to judge the similarity of two column-transformed matrices that cannot be compared directly, and it can be extended to the integer domain.

Because the Hamming weight of each row vector of the matrix data is unchanged by a column transformation, a Hamming weight comparison algorithm computes the Hamming weight of each row vector, and as soon as two weights are unequal the matrix similarity is quickly negated; this is the basis for judging the similarity of two column-transformed matrices. In addition, a Hamming weight table is constructed so that the weights need not be recomputed for every comparison. This is a space-time trade-off: space is exchanged for time, and one calculation is reused many times. The method can be extended to judging the similarity of integer-domain matrices and is generally applicable to discrimination problems involving this kind of transformation.
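As a small worked illustration of the invariance this paragraph relies on (the matrices $B$ and $C$ and the permutation $\pi$ below are made-up examples, not taken from the patent), permuting the columns of a bit matrix reorders the bits inside each row but leaves every row's Hamming weight unchanged:

```latex
B = \begin{pmatrix} 1 & 0 & 1 & 1 \\ 0 & 1 & 0 & 0 \end{pmatrix},
\qquad
C = \begin{pmatrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 0 & 0 \end{pmatrix}
\quad\text{(first column of $B$ moved to the last position)}

HW(B_1) = 3 = HW(C_1), \qquad HW(B_2) = 1 = HW(C_2)

\text{For any column permutation } \pi:\quad
HW(c) = \sum_{i=1}^{N} b_{\pi(i)} = \sum_{i=1}^{N} b_i = HW(b)
```

Equal row weights are necessary but not sufficient for similarity, which is why an exact comparison stage follows the coarse screening.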

The invention discloses a method for rapidly judging network transmission file information based on matrix similarity. It uses the fact that the row vectors of column-transformed matrices have equal Hamming weights to coarsely screen matrix similarity, compares the matrices exactly by checking the Hamming distances of the transposed matrices, and thereby judges matrix similarity quickly. The overall implementation steps are shown in FIG. 2 and the processing flow in FIG. 3.

The time complexity of traditional matrix similarity discrimination is combinatorial in n. By performing data preprocessing, coarse data screening and exact data matching in turn, the invention reduces the time complexity to O(n).

(S1) Constructing a Hamming weight table by using the invariance of the Hamming weights of the matrix row vectors.

(S1.1) mapping the data packets into a data matrix.

Given k = 1, 2, 3, …, S, there are S data (b) matrices in total, each with M rows and N columns of bits:

Given l = 1, 2, 3, …, T, there are T data (c) matrices in total, each with M rows and N columns of bits:

(S1.2) Designing the calculation method for the matrix Hamming weight table: the data (b) matrix set contains S matrices of dimension M × N, and the data (c) matrix set to be compared with it for similarity contains T matrices of dimension M × N. A Hamming weight table is generated from each row vector of the matrices in the data (b) matrix set according to the Hamming weight algorithm. Let the elements of a row vector of a data (b) matrix be $b = (b_1, b_2, b_3, \ldots, b_N)$; the Hamming weight of the row vector is calculated according to formula (1):

$$HW(b) = \sum_{i=1}^{N} b_i \qquad (1)$$

The calculation method is shown in formula (2):
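Formula (2) is not reproduced in this text. One common way to compute the Hamming weight of a packed bit row is the bit-parallel population count that also appears in formula (4) of the claims; the following is a minimal sketch under that assumption. The function names `popcount32` and `row_hamming_weight`, and the layout of a row as an array of 32-bit words, are illustrative choices rather than the patent's own.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-parallel population count of one 32-bit word. */
uint32_t popcount32(uint32_t v)
{
    v = v - ((v >> 1) & 0x55555555u);                  /* sums of bit pairs   */
    v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u);  /* sums of 4-bit groups */
    /* fold to per-byte sums, then add the four bytes into the top byte */
    return (((v + (v >> 4)) & 0x0F0F0F0Fu) * 0x01010101u) >> 24;
}

/* Hamming weight of a row stored as `words` packed 32-bit words
 * (hypothetical storage layout). */
uint32_t row_hamming_weight(const uint32_t *row, size_t words)
{
    uint32_t weight = 0;
    size_t i;

    for (i = 0; i < words; ++i)
        weight += popcount32(row[i]);
    return weight;
}
```

For rows of 512 bits, as in the embodiment described later, `words` would be 16.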

(S1.3) Constructing the matrix Hamming weight table: the data (b) matrix set contains S matrices, each with M row vectors, so S tables must be generated, occupying S × M table entries in total. The row vectors of each matrix in the data (b) matrix set are extracted, and the Hamming weight of each row is calculated and stored in the corresponding Hamming weight table array. Depending on the number of matrix columns, the weight table array uses a character, short integer or integer type, so that storage space is kept to a minimum. The detailed construction method is shown in FIG. 4, and the table format is as follows:
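The table construction described in step (S1.3) might look like the sketch below. It assumes the S matrices are stored back to back as packed 32-bit words and that each weight fits in a 16-bit entry; `build_weight_table` and its parameters are hypothetical names, and the actual table format of the patent is not reproduced in this text.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-parallel population count of one 32-bit word. */
uint32_t popcount32(uint32_t v)
{
    v = v - ((v >> 1) & 0x55555555u);
    v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u);
    return (((v + (v >> 4)) & 0x0F0F0F0Fu) * 0x01010101u) >> 24;
}

/*
 * Fill table[s * m_rows + r] with the Hamming weight of row r of matrix s.
 * `matrices` holds s_count matrices back to back, each made of m_rows rows
 * of words_per_row packed 32-bit words (assumed layout).
 * A uint16_t entry is enough as long as a row has fewer than 65536 bits.
 */
void build_weight_table(const uint32_t *matrices, size_t s_count,
                        size_t m_rows, size_t words_per_row,
                        uint16_t *table)
{
    size_t total_rows = s_count * m_rows;
    size_t row, w;

    for (row = 0; row < total_rows; ++row) {
        const uint32_t *p = matrices + row * words_per_row;
        uint32_t weight = 0;

        for (w = 0; w < words_per_row; ++w)
            weight += popcount32(p[w]);
        table[row] = (uint16_t)weight;
    }
}
```

The table is computed once and then reused for every data (c) matrix, which is the space-for-time trade-off the text describes.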

(S2) coarse screening discrimination of matrix similarity.

(S2.1) The similar matrices are coarsely screened by looking up the matrix row vectors in the Hamming weight table. A data (c) matrix is read out row vector by row vector; the Hamming weight $HW(C_i)$ of a row is calculated and looked up in the corresponding Hamming weight table HWT. If $HW(C_i) = HW(B_i)$, the next row vector is fetched, its Hamming weight $HW(C_j)$ is calculated, and the lookup in the Hamming weight table HWT continues until all rows satisfy the equality, namely $HW(C_i) = HW(B_i)$ for every row $i = 1, \ldots, M$, at which point the coarse screening ends. If during this process some row has unequal Hamming weights, $HW(C_j) \neq HW(B_j)$, the comparison stops, the matrix is negated directly, and the next data (c) matrix is taken out for comparison.
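The early-exit behaviour of step (S2.1) can be sketched as follows. This is only an illustration under assumed data layouts: `coarse_screen` is a hypothetical name, the data (c) matrix is taken to be packed into 32-bit words, and `b_weights` stands for the precomputed Hamming weight table entries of one data (b) matrix.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-parallel population count of one 32-bit word. */
uint32_t popcount32(uint32_t v)
{
    v = v - ((v >> 1) & 0x55555555u);
    v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u);
    return (((v + (v >> 4)) & 0x0F0F0F0Fu) * 0x01010101u) >> 24;
}

/*
 * Coarse screening of one data (c) matrix against the stored row weights
 * of one data (b) matrix. Returns 1 if every row weight matches (the pair
 * goes on to exact comparison) and 0 at the first mismatch.
 */
int coarse_screen(const uint32_t *c_matrix, const uint16_t *b_weights,
                  size_t m_rows, size_t words_per_row)
{
    size_t r, w;

    for (r = 0; r < m_rows; ++r) {
        const uint32_t *row = c_matrix + r * words_per_row;
        uint32_t weight = 0;

        for (w = 0; w < words_per_row; ++w)
            weight += popcount32(row[w]);
        if (weight != (uint32_t)b_weights[r])
            return 0;   /* unequal weight: negate the matrix immediately */
    }
    return 1;
}
```

A return value of 1 only means the pair survived the coarse screening; it does not by itself establish similarity.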

(S3) accurate discrimination of matrix similarity

(S3.1) Constructing the transposes of the data matrices: for the matrices that the coarse screening did not negate, matrix transposition and Hamming distance comparison are applied next to compare matrix similarity exactly. The data (b) matrix and the data (c) matrix are transposed; the detailed transformation is shown in FIG. 5 and proceeds as follows:

The data (b) matrix is transposed as follows:

The data (c) matrix is transposed as follows:
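The transposed matrices themselves are not reproduced in this text. As a sketch of what the transposition does to a packed bit matrix, the following turns an M × N bit matrix into an N × M one, so that each original column becomes a row that can be compared as a unit. The function name `transpose_bits`, the 32-bit word packing and the convention that bit 0 of a word is the lowest-numbered column are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Transpose an m_rows x n_cols bit matrix into an n_cols x m_rows one.
 * Rows are packed into 32-bit words, least significant bit first
 * (assumed layout). dst must hold n_cols * ((m_rows + 31) / 32) words.
 */
void transpose_bits(const uint32_t *src, size_t m_rows, size_t n_cols,
                    uint32_t *dst)
{
    size_t src_words = (n_cols + 31) / 32;   /* words per source row */
    size_t dst_words = (m_rows + 31) / 32;   /* words per result row */
    size_t r, c;

    memset(dst, 0, n_cols * dst_words * sizeof(uint32_t));
    for (r = 0; r < m_rows; ++r) {
        for (c = 0; c < n_cols; ++c) {
            uint32_t bit = (src[r * src_words + c / 32] >> (c % 32)) & 1u;
            if (bit)
                dst[c * dst_words + r / 32] |= (uint32_t)1 << (r % 32);
        }
    }
}
```

This bit-by-bit version favours clarity over speed; a production implementation would likely transpose in blocks.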

(S3.2) Designing the exact comparison calculation for the data matrices: the row vectors of the transposed data (c) matrix are traversed against the row vectors of the transposed data (b) matrix. The first row vector of the transposed data (c) matrix is taken out and compared by Hamming distance with the first row vector of the transposed data (b) matrix, checking whether the condition of formula (3) is met:

That is, the Hamming distance of the compared vectors is 0; the calculation method is shown in algorithm formula (4):
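Formula (4) is not reproduced at this point, but it appears in claim 1 as a bit-parallel population count applied to the XOR of two words. A self-contained sketch along those lines is given below; the names `popcount32` and `row_hamming_distance` and the packed 32-bit row layout are assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-parallel population count of one 32-bit word. */
uint32_t popcount32(uint32_t v)
{
    v = v - ((v >> 1) & 0x55555555u);
    v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u);
    return (((v + (v >> 4)) & 0x0F0F0F0Fu) * 0x01010101u) >> 24;
}

/*
 * Hamming distance between two rows of `words` packed 32-bit words:
 * XOR marks the differing bit positions, the population count adds them up.
 */
uint32_t row_hamming_distance(const uint32_t *a, const uint32_t *b,
                              size_t words)
{
    uint32_t dist = 0;
    size_t i;

    for (i = 0; i < words; ++i)
        dist += popcount32(a[i] ^ b[i]);
    return dist;
}
```

Two rows match in the sense of formula (3) exactly when this distance is 0.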

(S3.3) Exact comparison of the data matrices: the N row vectors of the transposed data (c) matrix are compared in turn until all have been processed. Each row vector that is matched successfully is flagged and recorded in a matching flag table, and that row is skipped when the next row is compared; a detailed schematic is shown in FIG. 6. If any row vector fails to find a match during this process, the matrix is not a similar matrix. When every row vector of the transposed data (c) matrix has found a matching row vector in the transposed data (b) matrix, the exact comparison succeeds. The criterion for a successful comparison is that the two matrices satisfy:

(1) for any one of the N columns of the data (c) matrix, there must be a column of the data (b) matrix equal to it;

(2) for any one of the N columns of the data (b) matrix, there must be a column of the data (c) matrix equal to it.

A matrix with similarity features is found and retained.
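The matching procedure of step (S3.3) can be sketched as below. It works on the transposed matrices, so each row is one original column, and it uses a flag array so that a row of the transposed data (b) matrix is matched at most once. The names `rows_equal` and `exact_match` are hypothetical, and testing for a zero XOR is used here as an equivalent of a Hamming distance of 0.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if the two packed rows are identical (Hamming distance 0). */
int rows_equal(const uint32_t *a, const uint32_t *b, size_t words)
{
    size_t i;

    for (i = 0; i < words; ++i)
        if (a[i] ^ b[i])
            return 0;
    return 1;
}

/*
 * Exact comparison of two transposed matrices, each with n_rows rows of
 * `words` packed 32-bit words. `matched` is a caller-supplied flag table
 * of n_rows bytes; it is cleared here before use. Returns 1 if every row
 * of ct finds its own unmatched equal row in bt, otherwise 0.
 */
int exact_match(const uint32_t *bt, const uint32_t *ct,
                size_t n_rows, size_t words, unsigned char *matched)
{
    size_t i, j;

    memset(matched, 0, n_rows);
    for (i = 0; i < n_rows; ++i) {
        int found = 0;

        for (j = 0; j < n_rows; ++j) {
            if (matched[j])
                continue;                   /* already paired: skip */
            if (rows_equal(ct + i * words, bt + j * words, words)) {
                matched[j] = 1;             /* record the successful match */
                found = 1;
                break;
            }
        }
        if (!found)
            return 0;                       /* one unmatched row negates the pair */
    }
    return 1;
}
```

Because both matrices have N columns and each flag is used once, a successful run pairs the columns one to one.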

(S4) complete restoration of the file information.

(S4.1) Extracting the index header file according to the matrix similarity information. For each data (c) matrix found to be similar to a data (b) matrix, the index information is extracted, and a file index header file in the same order as data (b) is generated.

(S4.2) Generating the complete file information according to the index header file. The content files corresponding to each index header file are taken out and merged to regenerate the source document, which completes the restoration of the file information.

The invention also comprises a network transmission file information rapid judging system based on matrix similarity, which comprises:

the Hamming weight table construction module is used for constructing a Hamming weight table by utilizing the constant characteristic of matrix row vector Hamming weight;

the matrix similarity coarse screening judging module is used for coarse screening to judge the matrix similarity;

the matrix similarity accurate judging module is used for accurately judging the matrix similarity;

and the file information complete restoration module is used for completely restoring the file information.

The invention also comprises a network transmission file information rapid judging device based on the matrix similarity, which comprises a memory and one or more processors, wherein executable codes are stored in the memory, and the one or more processors are used for realizing the network transmission file information rapid judging method based on the matrix similarity when executing the executable codes.

The invention also includes a computer readable storage medium having a program stored thereon that, when executed by a processor, implements a matrix similarity-based method for quickly determining information of a network transmission file according to the invention.

The main advantage of the invention is that a computational complexity that is combinatorial in n is reduced to one that is linear in n. In particular, when the matrices to be compared each form a data set, the efficiency gain can reach 1 to 2 orders of magnitude. The probability that the Hamming weight of a data row vector equals $w$ is $P_w = \binom{M}{w} / 2^{M}$, which is largest at $w = M/2$, i.e. $\max(\{P_w\}) = P_{M/2}$. When $M = 384$, $P_{M/2} \approx 4.0\%$, so the average probability of negating a row by its weight is $1 - 4.0\% = 96\%$. As a rough estimate with a 96% negation probability for the Hamming weight check, each check negates 96% of the $N$ candidates, so the average number of checks needed to complete the negation is $\log_{1/4.0\%} N$. When $N = 512$, $\log_{1/4.0\%} 512 \approx 1.93$; that is, for $M = 384$ and $N = 512$, fewer than 2 checks are needed on average for a complete negation. For n groups of data, on average 2 matrix rows are generated and compared per group, so the workload is n × 2 rows of 384-bit data generation, weight calculation, table lookup and Hamming distance calculation. The computational complexity is O(n × c) = O(n), a marked improvement in computational efficiency over the traditional combinatorial complexity.
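The figures in this paragraph can be checked by hand. The check below assumes, although the text does not say so explicitly, that each bit is independently 0 or 1 with probability 1/2, so that the weight of an M-bit vector is binomially distributed; the square-root expression is the standard Stirling approximation for the central binomial coefficient.

```latex
P_w = \binom{M}{w}\, 2^{-M}, \qquad
P_{M/2} = \binom{M}{M/2}\, 2^{-M} \approx \sqrt{\frac{2}{\pi M}}

M = 384:\quad P_{M/2} \approx \sqrt{\frac{2}{384\,\pi}} \approx 0.0407 \approx 4.0\%

\log_{1/4.0\%} 512 = \frac{\ln 512}{\ln 25} \approx \frac{6.238}{3.219} \approx 1.94 < 2
```

This agrees with the value of about 1.93 quoted in the text up to rounding, and confirms that fewer than 2 weight checks are expected on average.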

Drawings

Fig. 1 is a schematic diagram of an application scenario of the method for rapidly discriminating network transmission file information according to the present invention.

FIG. 2 is a diagram of the steps in the practice of the method of the present invention.

FIG. 3 is a block diagram of the overall design flow of the method for fast discriminating matrix similarity of the present invention.

FIG. 4 is a diagram of a hamming weight table construction method of the present invention.

Fig. 5 is a schematic diagram of a method for constructing a transformation matrix according to the present invention.

Fig. 6 is a diagram of a matrix exact alignment row match flag of the present invention.

Fig. 7 is a block diagram of a server device in which the method of the present invention operates.

Fig. 8 is a schematic diagram of the system of the present invention.

Detailed Description

The hardware environment in which the invention is implemented is an Inspur (Langchao) NF5280M4 server with a Xeon E5-2640 CPU, 512 ECC DDR4 memory and a 6 T.6SAs hard disk. The operating system is CentOS 7.6 and the code is compiled with gcc 4.8.5.

The invention discloses a method for rapidly judging network transmission file information based on matrix similarity. It uses the fact that the row vectors of column-transformed matrices have equal Hamming weights to coarsely screen matrix similarity, compares the matrices exactly by checking the Hamming distances of the transposed matrices, and thereby judges matrix similarity quickly. The overall implementation steps are shown in FIG. 2 and the processing flow in FIG. 3.

The time complexity of traditional matrix similarity discrimination is combinatorial in n. By performing data preprocessing, coarse data screening and exact data matching in turn, the invention reduces the time complexity to O(n).

(S1.1) In this implementation, each data (b) matrix has 384 rows and 512 bit columns; there are 20k such matrices, occupying 2.13M of memory space. The data (b) matrices are loaded directly into memory for subsequent computation.

Each data (c) matrix has 384 rows and 512 bit columns; there are 1.60M such matrices, occupying 170.89M of memory space.

The data (c) matrices are divided into 12 groups of 14.24M each and stored as files; one group at a time is loaded into memory for calculation.

(S1.2) A Hamming weight table is generated from each row vector of the matrices in the data (b) matrix set according to the Hamming weight algorithm. Let the elements of a row vector of a data (b) matrix be $b = (b_1, b_2, b_3, \ldots, b_N)$; the Hamming weight of the row vector is calculated according to formula (1):

$$HW(b) = \sum_{i=1}^{N} b_i \qquad (1)$$

The calculation method is shown in formula (2):

(S1.3) The data (b) matrix set contains 20k matrices, each with 384 row vectors, so 20k tables must be generated. Each Hamming weight occupies 2 bytes, and the complete Hamming weight table is loaded directly into memory for calculation; the total memory required is 20k × 512 × 2 = 19.53M.
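The table size given in this step can be reproduced as follows, assuming that "20k" means 20 000 tables and that "M" denotes $2^{20}$ bytes (both are interpretations, not stated in the text):

```latex
20\,000 \times 512 \times 2\ \text{bytes} = 20\,480\,000\ \text{bytes},
\qquad
\frac{20\,480\,000}{2^{20}} \approx 19.53
```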

(S2) coarse screening discrimination of matrix similarity.

(S2.1) the matrix row vector is used for realizing the coarse screening discrimination of the similar matrix by searching the Hamming weight table.

The data (b) matrices occupy 2.1M of space and the Hamming weight table occupies 19.53M; both are loaded directly into memory. Each group of data (c) occupies 14.24M, and the 12 groups are loaded into memory one at a time for comparison. A data (c) matrix is read out row vector by row vector; the Hamming weight $HW(C_i)$ of a row is calculated and looked up in the corresponding Hamming weight table HWT. If $HW(C_i) = HW(B_i)$, the next row vector is fetched, its Hamming weight $HW(C_j)$ is calculated, and the lookup in the Hamming weight table HWT continues until all rows satisfy the equality, namely $HW(C_i) = HW(B_i)$ for every row $i = 1, \ldots, M$, at which point the coarse screening ends. If during this process some row has unequal Hamming weights, $HW(C_j) \neq HW(B_j)$, the comparison stops, the matrix is negated directly, and the next data (c) matrix is taken out for comparison.

(S3) accurately judging the similarity of the matrix.

(S3.1) For the matrices that the coarse screening did not negate, matrix transposition and Hamming distance comparison are applied next to compare matrix similarity exactly. The data (b) matrix and the data (c) matrix are transposed as follows:

The data (b) matrix is transposed as follows:

After transposition, the data (b) matrices again occupy 2.13M of space and are loaded into memory for calculation.

The data (c) matrix is transposed as follows:

The transposed data (c) matrices are likewise stored to files in 12 groups, and one group of 14.24M is loaded into memory at a time.

(S3.2) The row vectors of the transposed data (c) matrix are traversed against the row vectors of the transposed data (b) matrix. The first row vector of the transposed data (c) matrix is taken out and compared by Hamming distance with the first row vector of the transposed data (b) matrix, checking whether the condition of formula (3) is met.

The N row vectors of the transposed data (c) matrix are compared in turn until all have been processed. Each successfully matched row vector is flagged and recorded in the matching flag table, and that row is skipped when the next row is compared. The matching flag table occupies 512 bytes and is cleared to 0 before each use; all matrix pairs share this one table as temporary storage.

(S3.3) If any row vector fails to find a match during this process, the matrix is not a similar matrix.

When every row vector of the transposed data (c) matrix has found a matching row vector in the transposed data (b) matrix, the exact comparison succeeds. The criterion for a successful comparison is that every column of the data (c) matrix has an equal column in the data (b) matrix, and every column of the data (b) matrix has an equal column in the data (c) matrix.

A matrix with similarity features is found and retained.

(S4) complete restoration of the file information.

The method of the invention is sensitive to the memory capacity of the device: when the memory space is too small, the method cannot run efficiently. With less memory, the Hamming weight table and the transposed matrices must be split into more blocks, the number of loads from the file system into memory increases, and processing efficiency suffers. In addition, when the data (b) and data (c) matrix sets are large, the method is clearly I/O-intensive; if the device is equipped with higher-performance I/O devices such as a solid-state disk (SSD), the screening efficiency can be improved greatly.

The matrix similarity-based method for rapidly judging network transmission file information needs a supporting computing device comprising one or more processors, memory and IP equipment, which together implement the method, as shown in FIG. 7.

As shown in fig. 8, the present invention further includes a system for quickly determining network transmission file information based on matrix similarity, including:

Claims

1. A network transmission file information rapid judging method based on matrix similarity comprises the following steps:

(S1) constructing a hamming weight table by utilizing the matrix row vector hamming weight invariant characteristic;

(S2) coarse screening and judging matrix similarity; specifically: the similar matrices are coarsely screened by looking up the matrix row vectors in the Hamming weight table; a data (c) matrix is read out row vector by row vector, and the Hamming weight $HW(C_i)$ of a row is calculated and looked up in the corresponding Hamming weight table HWT; if $HW(C_i) = HW(B_i)$, the next row vector is fetched, its Hamming weight $HW(C_j)$ is calculated, and the lookup in the Hamming weight table HWT continues until all rows satisfy the equality, namely $HW(C_i) = HW(B_i)$ for every row $i = 1, \ldots, M$, whereupon the coarse screening ends; if during this process some row has unequal Hamming weights, $HW(C_j) \neq HW(B_j)$, the comparison is not continued, the matrix is negated directly, and the next data (c) matrix is taken for comparison;

(S3) accurately judging the similarity of the matrix; the method specifically comprises the following steps:

(S3.1) constructing the transposes of the data matrices: for the matrices that the coarse screening did not negate, matrix transposition and Hamming distance comparison are applied next to compare matrix similarity exactly; the data (b) matrix and the data (c) matrix are transposed, and the specific transformation process is as follows:

the data (b) matrix is transposed as follows:

the data (c) matrix is transposed as follows:

v = *HD(b) ^ *HD(c);

v = v - ((v >> 1) & 0x55555555);

v = (v & 0x33333333) + ((v >> 2) & 0x33333333);

dist += (((v + (v >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24;    (4)

(S3.3) exact comparison of the data matrices: the N row vectors of the transposed data (c) matrix are compared in turn until all have been processed; each row vector that is matched successfully is flagged and recorded in a matching flag table, and that row is skipped when the next row is compared; if any row vector fails to find a match during this process, the matrix is not a similar matrix, and when every row vector of the transposed data (c) matrix has found a matching row vector in the transposed data (b) matrix, the exact comparison succeeds; the criterion for a successful comparison is that the two matrices satisfy:

(2) for any one of the N columns of the data (b) matrix, there must be a column of the data (c) matrix equal to it;

a matrix with similarity characteristics is found and reserved;

(S4) completely restoring the file information.

2. The method for quickly judging network transmission file information based on matrix similarity according to claim 1, wherein the method comprises the following steps: the step (S1) specifically comprises:

(S1.1) mapping the data packets into a data matrix;

given l=1, 2, 3, … T, for a total of T data (c) matrices, each matrix M rows and N columns of bits:

(S1.2) designing the calculation method for the matrix Hamming weight table: the data (b) matrix set contains S matrices of dimension M × N, and the data (c) matrix set to be compared with it for similarity contains T matrices of dimension M × N; a Hamming weight table is generated from each row vector of the matrices in the data (b) matrix set according to the Hamming weight algorithm; let the elements of a row vector of a data (b) matrix be $b = (b_1, b_2, b_3, \ldots, b_N)$; the Hamming weight of the row vector is calculated according to formula (1):

the calculation method is shown in formula (2)

(S1.3) constructing the matrix Hamming weight table: the data (b) matrix set contains S matrices, each with M row vectors, so S tables must be generated, occupying S × M table entries in total; the row vectors of each matrix in the data (b) matrix set are extracted, and the Hamming weight of each row is calculated and stored in the corresponding Hamming weight table array; depending on the number of matrix columns, the weight table array uses a character, short integer or integer type, so that storage space is kept to a minimum.

3. The method for quickly judging network transmission file information based on matrix similarity according to claim 2, wherein the method comprises the following steps: the specific table format of the hamming weight table array described in step (S1.3) is as follows:

4. The method for quickly judging network transmission file information based on matrix similarity according to claim 1, wherein the method comprises the following steps: the step (S4) specifically comprises:

(S4.1) extracting an index header file according to the matrix similarity information; extracting index information of a matrix of data (c) with similarity against a matrix of data (b) to generate a file index header file with the same sequence as the data (b);

(S4.2) generating complete file information according to the index header file; and taking out the content files corresponding to each index header file, merging to generate a source document, and finishing the restoration of file information.

5. A network transmission file information rapid judging system based on matrix similarity is characterized in that: comprising the following steps:

the matrix similarity coarse screening judging module is used for coarse screening to judge the matrix similarity; specifically: the similar matrices are coarsely screened by looking up the matrix row vectors in the Hamming weight table; a data (c) matrix is read out row vector by row vector, and the Hamming weight $HW(C_i)$ of a row is calculated and looked up in the corresponding Hamming weight table HWT; if $HW(C_i) = HW(B_i)$, the next row vector is fetched, its Hamming weight $HW(C_j)$ is calculated, and the lookup in the Hamming weight table HWT continues until all rows satisfy the equality, namely $HW(C_i) = HW(B_i)$ for every row $i = 1, \ldots, M$, whereupon the coarse screening ends; if during this process some row has unequal Hamming weights, $HW(C_j) \neq HW(B_j)$, the comparison is not continued, the matrix is negated directly, and the next data (c) matrix is taken for comparison;

the matrix similarity accurate judging module is used for accurately judging the matrix similarity; the method specifically comprises the following steps:

the data (b) matrix is transposed as follows:

the data (c) matrix is transposed as follows:

a matrix with similarity characteristics is found and reserved;

6. A network transmission file information rapid judging device based on matrix similarity, characterized in that: the device comprises a memory and one or more processors, wherein executable codes are stored in the memory, and the one or more processors, when executing the executable codes, are used for realizing the network transmission file information rapid judging method based on matrix similarity according to any one of claims 1-4.

7. A computer-readable storage medium, having stored thereon a program which, when executed by a processor, implements a matrix similarity based network transmission file information fast determination method as claimed in any one of claims 1 to 4.