CN111666404A - File clustering method, device and equipment - Google Patents


Info

Publication number
CN111666404A
Authority
CN
China
Prior art keywords
clustered
file
files
application program
clustering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910163113.6A
Other languages
Chinese (zh)
Inventor
韩孟玲
魏向前
程虎
谭昱
彭宁
许天胜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201910163113.6A
Publication of CN111666404A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a file clustering method, device and equipment. The method comprises the following steps: acquiring application program interface sequence information called when a plurality of files to be clustered are executed, wherein the application program interface sequence information comprises a plurality of application program interfaces ordered by calling time; combining the application program interface sequence information of each file to be clustered into a plurality of interface sequence tuples according to the order of the application program interfaces corresponding to that file; determining a plurality of feature vectors of the plurality of interface sequence tuples corresponding to each file to be clustered; determining a feature vector of each file to be clustered based on the plurality of feature vectors corresponding to that file; and clustering the files to be clustered by using the feature vectors of the files to be clustered. This solves the prior-art problem that files using obfuscation techniques such as packing (shell adding) and junk-instruction insertion (flower instructions) are difficult to cluster or are clustered inaccurately, and improves the accuracy of file clustering.

Description

File clustering method, device and equipment
Technical Field
The invention relates to the technical field of file clustering, in particular to a file clustering method, a file clustering device and file clustering equipment.
Background
Current file clustering methods are based on the static information of files. For example, such a method collects the static information items of an executable file according to its PE (Portable Executable) structure, performs a weighted calculation, and judges whether two files are similar by comparing their static information.
Clustering by static information is broadly applicable and does not need to consider the system environment on which file execution depends. However, it cannot cluster files that use obfuscation techniques such as packing (shell adding) and junk-instruction insertion (flower instructions): although the dynamic behaviors of such files may be similar, their static features are not distinctive, which easily leads to false positives.
To improve clustering accuracy, static clustering usually selects as many features as possible, up to dozens. The more features there are, the higher the computational complexity and the more complex the index, and the amount of clustering computation becomes exceptionally large as the volume of files grows.
Therefore, a clustering scheme that can readily handle packed files and files with junk instructions is required.
Disclosure of Invention
The invention provides a file clustering method, a file clustering device and file clustering equipment, and provides a new file clustering scheme which can easily deal with massive files to be clustered. The invention is realized by the following technical scheme:
in a first aspect, the present invention provides a file clustering method, including:
acquiring application program interface sequence information called when a plurality of files to be clustered are executed, wherein the application program interface sequence information comprises a plurality of application program interfaces which are sequenced according to calling time sequence;
combining the application program interface sequence information of each file to be clustered into a plurality of interface sequence tuples according to the sequencing of a plurality of application program interfaces corresponding to each file to be clustered, wherein the interface sequence tuples at least comprise two application program interfaces;
determining a plurality of characteristic vectors of the plurality of interface sequence tuples corresponding to each file to be clustered;
determining a feature vector of each file to be clustered based on a plurality of feature vectors corresponding to each file to be clustered;
and clustering the files to be clustered by utilizing the characteristic vectors of the files to be clustered.
In a second aspect, the present invention provides a file clustering apparatus, including:
the acquisition module is used for acquiring application program interface sequence information called when a plurality of files to be clustered are executed, wherein the application program interface sequence information comprises a plurality of application program interfaces which are sequenced according to calling time sequence;
the combination module is used for combining the application program interface sequence information of each file to be clustered into a plurality of interface sequence tuples according to the sequence of a plurality of application program interfaces corresponding to each file to be clustered, and the interface sequence tuples at least comprise two application program interfaces;
the first determining module is used for determining a plurality of characteristic vectors of the plurality of interface sequence tuples corresponding to each file to be clustered;
the second determining module is used for determining the characteristic vector of each file to be clustered based on a plurality of characteristic vectors corresponding to each file to be clustered;
and the clustering module is used for clustering the files to be clustered by utilizing the characteristic vectors of the files to be clustered.
Further, the combination module further includes:
a third determining module, configured to determine a first number of application program interfaces included in the interface sequence tuple;
a fourth determination module to determine an application program interface extraction window based on the first number;
and the fifth determining module is used for sequentially extracting a first number of adjacent application program interfaces from the application program interface sequence information of each file to be clustered by using the application program interface extracting window to obtain a plurality of interface sequence tuples, wherein the moving step length of the application program interface extracting window when the first number of adjacent application program interfaces are sequentially extracted is one application program interface.
Further, the first determining module further comprises:
and the mapping module is used for mapping the plurality of interface sequence tuples corresponding to each file to be clustered by using a message digest algorithm to obtain a plurality of characteristic vectors of the plurality of interface sequence tuples.
Further, the second determining module further comprises:
and the conversion module is used for converting the plurality of characteristic vectors corresponding to each file to be clustered into the characteristic vector of each file to be clustered by utilizing a locality sensitive hashing algorithm.
In a third aspect, the present invention provides a file clustering device, which comprises a processor and a memory, wherein the memory stores at least one instruction, at least one program, a set of codes, or a set of instructions, and the at least one instruction, the at least one program, the set of codes, or the set of instructions is loaded and executed by the processor to implement the file clustering method according to the first aspect.
The invention provides a file clustering method, a file clustering device and file clustering equipment, which solve the prior-art problem that files using obfuscation techniques such as packing (shell adding) and junk-instruction insertion (flower instructions) are difficult to cluster or are clustered inaccurately, and which improve the accuracy of file clustering.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions and advantages of the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of a system provided by an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a file clustering method according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating a method for determining a document feature vector according to an embodiment of the present invention;
FIG. 4 is a schematic flow chart of a method for clustering files based on feature vectors of the files to be clustered according to an embodiment of the present invention;
FIG. 5 is a schematic flow chart of another method for clustering files based on feature vectors of the files to be clustered according to an embodiment of the present invention;
FIG. 6 is a schematic flow chart of another method for clustering files based on feature vectors of the files to be clustered according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of an architecture of a file clustering method according to an embodiment of the present invention;
FIG. 8 is a flowchart illustrating another method for clustering files according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of a file clustering device according to an embodiment of the present invention;
FIG. 10 is a schematic structural diagram of another file clustering device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or server that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Referring to fig. 1, fig. 1 is a schematic diagram of a system according to an embodiment of the present invention, and as shown in fig. 1, the system may include a sample server 01, a dynamic system server 02, a cluster calculation server 03, and a database 04.
Specifically, the sample server 01 may include a server operating independently, or a distributed server, or a server cluster composed of a plurality of servers. The sample server 01 may include a network communication unit, a processor, a memory, and the like. Specifically, the sample server 01 may obtain a file to be clustered to provide a sample file for the dynamic system server 02.
Specifically, the dynamic system server 02 may include a server operating independently, or a distributed server, or a server cluster composed of a plurality of servers. The dynamic system server 02 may include a network communication unit, a processor, a memory, and the like. Specifically, the dynamic system server 02 may generate sample file behavior data according to a sample obtained from the sample server 01, and send the sample file behavior data to the cluster calculation server 03.
Specifically, the cluster calculation server 03 may include a server that operates independently, or a distributed server, or a server cluster composed of a plurality of servers. The cluster calculation server 03 may comprise a network communication unit, a processor, a memory, and the like. Specifically, the cluster calculation server 03 may cluster the sample file behavior data obtained from the dynamic system server 02, send the clustering result to the database 04 for storage, and read the file clustering result from the database 04.
Specifically, the database 04 may be a software or hardware entity, and may operate independently as a hardware entity, or may be integrated into another server. The database 04 may receive and store the clustering result sent by the cluster calculation server 03, or may send the corresponding clustering result to the cluster calculation server 03 in response to a read request from the cluster calculation server 03.
The following describes a method for clustering files based on the above system. This specification provides the method operation steps as described in the embodiments or flowcharts, but more or fewer operation steps may be included on the basis of conventional or non-creative work. The order of steps recited in the embodiments is merely one of many possible orders of execution and does not represent the only order. In practice, a system or server product may execute the steps sequentially or in parallel (e.g., in a parallel-processor or multi-threaded environment) according to the embodiments or the methods shown in the figures.
Fig. 2 is a schematic flow chart of a file clustering method according to an embodiment of the present invention, and as shown in fig. 2, the method specifically includes:
S201: acquiring application program interface sequence information called when a plurality of files to be clustered are executed, wherein the application program interface sequence information comprises a plurality of application program interfaces which are sequenced according to calling time sequence.
In a specific example, suppose the number of files to be clustered is M. As shown in fig. 3, the application program interface sequence information called when file A to be clustered is executed is P1-P2-P3-P4. The interface sequence information includes the calling order of each interface, so a sequence P1-P2-P4-P3 would be different application program interface sequence information.
S203: and combining the application program interface sequence information of each file to be clustered into a plurality of interface sequence tuples according to the sequencing of a plurality of application program interfaces corresponding to each file to be clustered, wherein the interface sequence tuples at least comprise two application program interfaces.
Specifically, the step of combining the application program interface sequence information of each file to be clustered into a plurality of interface sequence tuples according to the sequence of the plurality of application program interfaces corresponding to each file to be clustered includes the following steps:
(1) determining a first number of application program interfaces contained in an interface sequence tuple;
The interface sequence tuple may be a 2-tuple or a 3-tuple, that is, it may contain two application program interfaces or three application program interfaces; the specific setting depends on actual requirements. Taking the 2-tuple as an example, the first number of application program interfaces contained in the interface sequence tuple is determined to be 2.
(2) Determining an application program interface extraction window based on the first number;
(3) and sequentially extracting a first number of adjacent application program interfaces from the application program interface sequence information of each file to be clustered by using the application program interface extraction window to obtain a plurality of interface sequence tuples, wherein the moving step length of the application program interface extraction window when the first number of adjacent application program interfaces are sequentially extracted is one application program interface.
Further, continuing the foregoing example, the application program interface extraction window extracts interface sequence tuples for each of the M files to be clustered. As shown in fig. 3, for the application program interface sequence information P1-P2-P3-P4 of file A to be clustered, the window sequentially extracts two adjacent application program interfaces with a moving step of one application program interface, obtaining three 2-tuples:
P1-P2, P2-P3, P3-P4.
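The window extraction in step S203 can be sketched as follows. This is a minimal illustration rather than the patented implementation; the function name `extract_tuples` and the representation of application program interfaces as plain strings are assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def extract_tuples(api_sequence: Sequence[str], n: int = 2) -> List[Tuple[str, ...]]:
    """Slide a window of n adjacent API calls over the sequence, one call per step.

    `n` plays the role of the "first number" in the text (hypothetical parameter name).
    """
    if n < 2:
        raise ValueError("an interface sequence tuple contains at least two APIs")
    # A sequence shorter than n produces no tuples.
    return [tuple(api_sequence[i:i + n]) for i in range(len(api_sequence) - n + 1)]


# File A from the example: P1-P2-P3-P4 with a window of two interfaces.
print(extract_tuples(["P1", "P2", "P3", "P4"], n=2))
# prints [('P1', 'P2'), ('P2', 'P3'), ('P3', 'P4')]
```

Because the window advances by one interface at a time, a sequence of length L yields L - n + 1 tuples, and each tuple keeps the calling order of its members.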
S205: determining a plurality of feature vectors of the plurality of interface sequence tuples corresponding to each file to be clustered.
Specifically, the determining a plurality of feature vectors of the interface sequence tuples corresponding to each file to be clustered includes: and mapping the plurality of interface sequence tuples corresponding to each file to be clustered by using a message digest algorithm to obtain a plurality of characteristic vectors of the plurality of interface sequence tuples.
The message digest algorithm may specifically be the MD5 algorithm, the SHA-1 algorithm, or the like. If hash mapping is performed on the three 2-tuples obtained in step S203 by using the MD5 algorithm, three corresponding feature vectors md5_hash1, md5_hash2, and md5_hash3 are obtained, as shown in fig. 3.
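One possible way to map a tuple to a feature vector with MD5 is sketched below. The source does not say how a tuple is serialized before hashing or how the digest is laid out as a vector, so joining the interface names with "-", encoding as UTF-8, and expanding the 128-bit digest into a list of bits are all assumptions of this example, as is the function name `tuple_to_bits`.

```python
import hashlib
from typing import List, Sequence


def tuple_to_bits(api_tuple: Sequence[str]) -> List[int]:
    """Map one interface sequence tuple to a 128-element 0/1 vector via MD5."""
    # Assumed serialization: interface names joined with "-" in call order.
    digest = hashlib.md5("-".join(api_tuple).encode("utf-8")).digest()  # 16 bytes
    # Expand each byte into 8 bits, most significant bit first.
    return [(byte >> (7 - k)) & 1 for byte in digest for k in range(8)]


tuples = [("P1", "P2"), ("P2", "P3"), ("P3", "P4")]
vectors = [tuple_to_bits(t) for t in tuples]
print(len(vectors), len(vectors[0]))  # prints 3 128
```

Since the serialization preserves order, the tuples P1-P2 and P2-P1 hash to different vectors, which is how the calling order survives the mapping.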
S207: and determining the characteristic vector of each file to be clustered based on a plurality of characteristic vectors corresponding to each file to be clustered.
Specifically, the determining the feature vector of each file to be clustered based on the plurality of feature vectors corresponding to each file to be clustered includes: and converting the plurality of feature vectors corresponding to each file to be clustered into the feature vector of each file to be clustered by using a locality sensitive hashing algorithm.
The locality sensitive hashing algorithm may be the simhash algorithm. The three feature vectors obtained in step S205 are fused and converted into the feature vector of file A to be clustered, namely its simhash vector, which is equivalent to the fingerprint of file A. In this way, M files to be clustered yield M feature vectors.
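The fusion step can be illustrated with a standard unweighted simhash. The source does not specify weights, bit width, or serialization, so the 128-bit width (matching MD5), equal weight per tuple, and the helper names `md5_int` and `simhash` are assumptions of this sketch.

```python
import hashlib
from typing import Iterable, Sequence

BITS = 128  # width of an MD5 digest


def md5_int(api_tuple: Sequence[str]) -> int:
    """MD5 of a tuple (assumed "-"-joined serialization) as a 128-bit integer."""
    return int.from_bytes(hashlib.md5("-".join(api_tuple).encode("utf-8")).digest(), "big")


def simhash(api_tuples: Iterable[Sequence[str]]) -> int:
    """Fuse the tuple hashes of one file into a single 128-bit fingerprint."""
    counts = [0] * BITS
    for api_tuple in api_tuples:
        h = md5_int(api_tuple)
        for bit in range(BITS):
            # Each tuple votes +1 for a set bit and -1 for a clear bit.
            counts[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit, votes in enumerate(counts):
        if votes > 0:  # majority of tuples had this bit set
            fingerprint |= 1 << bit
    return fingerprint


file_a = [("P1", "P2"), ("P2", "P3"), ("P3", "P4")]
print(f"{simhash(file_a):032x}")
```

Files that share most of their tuples cast mostly the same votes, so their fingerprints differ in only a few bit positions; this is the property the later distance comparison relies on.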
S209: and clustering the files to be clustered by utilizing the characteristic vectors of the files to be clustered.
Many different clustering methods exist, depending on which dimension of the files is used; the clustering method here is based on the dynamic behavior features of the files. The files are usually executable files or script files. When a file is executed, it calls operating system application program interfaces (APIs) to acquire system resources and accomplish its purpose, so the called API sequence (behavior sequence) has certain characteristics; that is, the behavior sequences of files of the same class are similar. The scheme provided in this specification clusters files based on their dynamic behavior, so the clustering effect is not affected by whether a file is packed or contains junk instructions. With the simhash algorithm selected, testing showed an accuracy of 95%, whereas current static clustering achieves only 80%.
The method splits the behavior sequence of a file to be clustered into a plurality of tuples, maps the tuples into corresponding vectors, converts those vectors into the feature vector of the file, and clusters the files according to their feature vectors. This solves the prior-art problem that files using obfuscation techniques such as packing (shell adding) and junk-instruction insertion (flower instructions) are difficult to cluster or are clustered inaccurately, and improves the accuracy of file clustering.
Fig. 4 is a schematic flow chart of a method for clustering files based on feature vectors of files to be clustered according to an embodiment of the present invention, and as shown in fig. 4, the method specifically includes:
S401: calculating the distance between the feature vectors of every two files to be clustered.
Specifically, the distance between the feature vectors of every two files to be clustered may be calculated as a Euclidean distance, a Hamming distance, a cosine of the included angle, or the like.
S403: and judging whether the distance between every two feature vectors of the files to be clustered is greater than or equal to a preset distance threshold value.
In the embodiment of the present specification, a distance threshold for determining a distance between every two feature vectors of a file to be clustered when the file to be clustered is clustered needs to be preset, and a specific threshold may be set according to actual needs.
S405: and clustering the two files to be clustered when the distance between the characteristic vectors of the two files to be clustered is smaller than or equal to a preset distance threshold.
The shorter the distance between the feature vectors of two files to be clustered, the higher the similarity of the two files, and the two files are classified into one class.
In this specification, simhash values are readily comparable: similarity can be measured by Euclidean distance, Hamming distance, cosine of the included angle and the like, so files can be clustered according to similarity. Therefore, the M files to be clustered can be clustered by using the M feature vectors.
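Steps S401 to S405 can be sketched as an all-pairs comparison. The example assumes fingerprints are held as Python integers and compared by Hamming distance; merging pairs transitively with a union-find structure is a design choice of this sketch, since the source only states that two files within the threshold are clustered together. The names `hamming` and `cluster_pairwise` are hypothetical.

```python
from typing import Dict, List


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def cluster_pairwise(fingerprints: Dict[str, int], threshold: int) -> List[List[str]]:
    """Group files whose fingerprints are within `threshold` bits of each other."""
    names = list(fingerprints)
    parent = {name: name for name in names}

    def find(x: str) -> str:
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if hamming(fingerprints[a], fingerprints[b]) <= threshold:
                parent[find(a)] = find(b)  # merge the two groups

    groups: Dict[str, List[str]] = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return list(groups.values())


samples = {"a": 0b11110000, "b": 0b11110001, "c": 0b00001111}
print(cluster_pairwise(samples, threshold=2))
```

This approach compares M(M-1)/2 pairs for M files, which is the cost that the bucketing variant of fig. 5 is meant to reduce.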
Fig. 5 is a schematic flow chart of another method for clustering files based on feature vectors of files to be clustered according to an embodiment of the present invention, and as shown in fig. 5, the method specifically includes:
S501: performing minimum hash calculation on the feature vector of each file to be clustered to obtain a plurality of minimum hash values.
The minimum hash calculation process is as follows:
[Figure BDA0001985363590000091: minimum hash calculation example; image not reproduced]
wherein S1 and S2 represent behavior sequences, and A, B, C and D are binary representations of the behavior sequences.
S503: and dividing the feature vector of each file to be clustered into a plurality of interface sequence buckets according to the minimum hash values.
Among the multiple minimum hash values obtained in step S501, feature vectors with the same minimum hash value are put into the same interface sequence bucket. Thus, as shown in fig. 6, a plurality of interface sequence buckets are obtained: bucket 1, bucket 2, ..., bucket N.
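The bucketing idea can be illustrated with a single-hash minhash. The figure defining the exact calculation is not reproduced in the source, so this sketch rests on assumptions: the minimum is taken over the MD5 hex digests of a file's interface sequence tuples, only one hash function is used, and the names `min_hash` and `bucketize` are hypothetical.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def min_hash(api_tuples: Sequence[Tuple[str, ...]]) -> str:
    """Smallest MD5 hex digest among the file's tuples (assumed minhash input).

    Expects at least one tuple; min() raises ValueError on an empty sequence.
    """
    return min(hashlib.md5("-".join(t).encode("utf-8")).hexdigest() for t in api_tuples)


def bucketize(files: Dict[str, List[Tuple[str, ...]]]) -> Dict[str, List[str]]:
    """Put files with the same minimum hash value into the same bucket."""
    buckets: Dict[str, List[str]] = defaultdict(list)
    for name, api_tuples in files.items():
        buckets[min_hash(api_tuples)].append(name)
    return dict(buckets)


files = {
    "a": [("P1", "P2"), ("P2", "P3")],
    "b": [("P2", "P3"), ("P1", "P2")],  # same tuple set as "a"
    "c": [("X", "Y")],
}
for key, members in bucketize(files).items():
    print(key[:8], members)
```

With one hash function, two files land in the same bucket with probability equal to the Jaccard similarity of their tuple sets, so similar files tend to share a bucket while dissimilar files are rarely compared.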
S505: and calculating the distance between every two feature vectors of the files to be clustered in the same interface sequence bucket.
Specifically, the distance between the feature vectors of every two files to be clustered in the same interface sequence bucket may be calculated as a Euclidean distance, a Hamming distance, a cosine of the included angle, or the like.
S507: and judging whether the distance between every two feature vectors of the files to be clustered in the same interface sequence bucket is smaller than or equal to a preset distance threshold value.
In the embodiment of the present specification, a distance threshold for determining a distance between every two feature vectors of files to be clustered in the same interface sequence bucket when the files to be clustered are clustered needs to be preset, and a specific threshold may be set according to actual needs.
S509: and clustering the two files to be clustered when the distance between the characteristic vectors of the two files to be clustered is smaller than or equal to a preset distance threshold.
The shorter the distance between the feature vectors of two files to be clustered, the higher the similarity of the two files, and the two files are classified into one class.
In this specification, simhash values are readily comparable: similarity can be measured by Euclidean distance, Hamming distance, cosine of the included angle and the like, so files can be clustered according to similarity. Therefore, the M files to be clustered can be clustered by using the M feature vectors. In addition, file clustering generally involves a comparison process, and for massive data the cost of comparison is very high. This specification therefore adopts strategies such as bucketing to meet the high-performance computing requirement: after minhash bucketing, pairwise comparison only needs to be carried out within each bucket, so the amount of data to be compared is reduced. The time complexity of simhash is O(n); in addition, the bucketing strategy is introduced, and the minhash algorithm is used to construct an index and optimize similarity calculation, so that massive samples can be handled easily.
The embodiment of the invention also provides an architecture schematic diagram of a file clustering method; see fig. 7. The architecture works like a sorter on a production line: samples are sorted into families, and a cluster id is used as the family identifier. After a sample enters the scheduling queue, its simhash value and minhash value are calculated from its dynamic sequence. The minhash is then used as an index, similarity is calculated by traversing the cluster ids (simhash values) in the database, and the sample is classified under a similar cluster id. As for the choice of cluster id, the simhash of the first sample is selected as the cluster id.
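The sorting behavior described for fig. 7 can be sketched as an incremental assignment. An in-memory dictionary stands in for the database here, and the class name `ClusterIndex`, the method `assign`, and the use of "less than or equal to" for the threshold test are assumptions of the example, not details confirmed by the source.

```python
from typing import Dict, List


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two simhash values."""
    return bin(a ^ b).count("1")


class ClusterIndex:
    """Toy stand-in for the database: minhash -> cluster ids stored under it."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self.index: Dict[str, List[int]] = {}

    def assign(self, min_hash: str, simhash: int) -> int:
        """Return the cluster id for a sample, creating a new cluster if needed."""
        candidates = self.index.setdefault(min_hash, [])
        # Only cluster ids indexed under the same minhash are traversed.
        for cluster_id in candidates:
            if hamming(cluster_id, simhash) <= self.threshold:
                return cluster_id
        # No similar cluster: the first sample's simhash becomes the cluster id.
        candidates.append(simhash)
        return simhash


idx = ClusterIndex(threshold=2)
print(idx.assign("m1", 0b1010))  # first sample founds a cluster
print(idx.assign("m1", 0b1011))  # one bit away, joins the same cluster
print(idx.assign("m1", 0b0101))  # four bits away, founds a new cluster
```

Comparing each new sample only against the cluster ids, rather than against every stored sample, keeps the per-sample work proportional to the number of clusters under its minhash.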
The embodiment of the invention also provides a flow diagram of another file clustering method, and particularly refers to fig. 8.
First, the dynamic sequence of the file is converted into a computable vector. The behavior sequence is divided into single behaviors, and the behaviors are combined into N-tuples in N-gram fashion, which preserves timing information. A hash is then generated for each N-gram (using the MD5 algorithm), and the generated hashes are fused into a simhash. The resulting simhash is the computable vector of the file and is equivalent to its fingerprint.
Calculating the similarity of two files means calculating the Hamming distance between their two simhashes. During clustering, when the Hamming distance is smaller than a set threshold, the two files are judged to be of the same type, that is, of the same family.
The final actual clustering result presentation can be seen in the following table:
[Figures BDA0001985363590000111 and BDA0001985363590000121: table of actual clustering results; images not reproduced]
In practical application, it is first checked whether the min_hash values of two files are the same. If so, it is then calculated whether the Hamming distance between the simhashes of the two files is smaller than a threshold (the threshold used in practice is 5.5); if so, the two files are of the same type.
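The two-stage check described above can be written as a small predicate. The function name `same_family` and the representation of min_hash as a string and simhash as an integer are assumptions of this sketch; the default threshold of 5.5 and the strict "smaller than" comparison follow the text.

```python
def same_family(min_hash_a: str, simhash_a: int,
                min_hash_b: str, simhash_b: int,
                threshold: float = 5.5) -> bool:
    """Two files match if their min_hash values agree and their simhashes are close."""
    if min_hash_a != min_hash_b:
        return False  # different buckets are never compared
    distance = bin(simhash_a ^ simhash_b).count("1")  # Hamming distance
    return distance < threshold


print(same_family("m1", 0b11111, "m1", 0b00000))   # distance 5, prints True
print(same_family("m1", 0b111111, "m1", 0b000000)) # distance 6, prints False
print(same_family("m1", 0b1, "m2", 0b1))           # min_hash differs, prints False
```

Because the Hamming distance is always an integer, a threshold of 5.5 behaves the same as requiring a distance of at most 5.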
An embodiment of the present invention further provides a file clustering device, as shown in fig. 9, the device includes:
an obtaining module 901, configured to obtain sequence information of application program interfaces called when multiple files to be clustered are executed, where the sequence information of application program interfaces includes multiple application program interfaces ordered according to a calling time sequence;
the combination module 903 is configured to combine the application program interface sequence information of each file to be clustered into a plurality of interface sequence tuples according to the ordering of the plurality of application program interfaces corresponding to each file to be clustered, where the interface sequence tuples include at least two application program interfaces;
a first determining module 905, configured to determine a plurality of feature vectors of the plurality of interface sequence tuples corresponding to each file to be clustered;
a second determining module 907, configured to determine a feature vector of each file to be clustered based on a plurality of feature vectors corresponding to each file to be clustered;
a clustering module 909, configured to cluster the multiple files to be clustered by using the feature vectors of the multiple files to be clustered.
Further, the clustering module 909, as shown in fig. 10, includes:
a first calculating module 1001, configured to perform minimum hash calculation on the feature vector of each file to be clustered to obtain multiple minimum hash values;
a bucket dividing module 1003, configured to divide the feature vector of each file to be clustered into a plurality of interface sequence buckets according to the plurality of minimum hash values;
the second calculating module 1005 is configured to calculate a distance between every two feature vectors of the files to be clustered in the same interface sequence bucket;
the first judging and processing module 1007 is configured to judge whether a distance between every two feature vectors of files to be clustered in the same interface sequence bucket is smaller than or equal to a preset distance threshold; and clustering the two files to be clustered when the distance between the characteristic vectors of the two files to be clustered is smaller than or equal to a preset distance threshold.
Or, further, the clustering module includes:
the third calculation module is used for calculating the distance between every two feature vectors of the files to be clustered;
the second judgment and processing module is used for judging whether the distance between every two feature vectors of the files to be clustered is smaller than or equal to a preset distance threshold value or not; and clustering the two files to be clustered when the distance between the feature vectors of the two files to be clustered is smaller than or equal to the preset distance threshold.
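The alternative without buckets compares every pair of files directly. A minimal sketch follows, assuming bit-list feature vectors and the Hamming distance (the source does not fix the distance measure); the function name and the relabelling strategy used to merge clusters are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Sequence


def cluster_all_pairs(vectors: Dict[str, Sequence[int]], threshold: int) -> List[List[str]]:
    """Cluster files by comparing every pair of feature vectors."""
    label = {name: name for name in vectors}  # each file starts in its own cluster
    for a, b in combinations(vectors, 2):
        distance = sum(x != y for x, y in zip(vectors[a], vectors[b]))
        if distance <= threshold and label[a] != label[b]:
            old, new = label[a], label[b]
            for name in label:  # relabel the whole cluster containing a
                if label[name] == old:
                    label[name] = new
    clusters: Dict[str, List[str]] = {}
    for name, lab in label.items():
        clusters.setdefault(lab, []).append(name)
    return sorted(sorted(members) for members in clusters.values())
```

The number of comparisons grows quadratically with the number of files, which is the cost the bucketed variant is meant to reduce.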
Further, the combination module further includes:
a third determining module, configured to determine a first number of application program interfaces included in the interface sequence tuple;
a fourth determining module, configured to determine an application program interface extraction window based on the first number;
and a fifth determining module, configured to sequentially extract a first number of adjacent application program interfaces from the application program interface sequence information of each file to be clustered by using the application program interface extraction window to obtain a plurality of interface sequence tuples, wherein the moving step length of the application program interface extraction window when the first number of adjacent application program interfaces are sequentially extracted is one application program interface.
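The window extraction described above amounts to taking overlapping n-grams of the call sequence with a step of one. A minimal sketch, in which the function name, the default window size of 3, and the API names in the usage note are illustrative assumptions:

```python
from typing import List, Sequence, Tuple


def api_tuples(api_sequence: Sequence[str], first_number: int = 3) -> List[Tuple[str, ...]]:
    """Slide a window of `first_number` adjacent APIs over the sequence, one API per step."""
    if first_number < 2:
        raise ValueError("an interface sequence tuple contains at least two APIs")
    last_start = len(api_sequence) - first_number
    # A sequence shorter than the window yields no tuples.
    return [tuple(api_sequence[i:i + first_number]) for i in range(last_start + 1)]
```

For example, the hypothetical call sequence `["CreateFileW", "WriteFile", "CloseHandle", "RegOpenKeyExW"]` with a window of 3 yields two tuples, the second starting at `WriteFile`.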
Further, the first determining module further comprises:
and the mapping module is used for mapping the plurality of interface sequence tuples corresponding to each file to be clustered by using a message digest algorithm to obtain a plurality of characteristic vectors of the plurality of interface sequence tuples.
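One way to picture the mapping step is to hash each tuple and read the digest as a fixed-length bit vector. The sketch below assumes MD5 as the message digest algorithm, a unit-separator character for joining the API names, and a 128-element 0/1 list as the vector form; none of these details are stated in this section, and the function name is hypothetical.

```python
import hashlib
from typing import List, Sequence


def tuple_feature_vector(api_tuple: Sequence[str]) -> List[int]:
    """Map one interface sequence tuple to a 128-bit vector via its MD5 digest."""
    joined = "\x1f".join(api_tuple)  # separator avoids ambiguous concatenations
    digest = hashlib.md5(joined.encode("utf-8")).digest()  # 16 bytes
    # Expand each byte into its 8 bits, most significant bit first.
    return [(byte >> (7 - bit)) & 1 for byte in digest for bit in range(8)]
```

Identical tuples always map to the same vector, so files that share call patterns share tuple-level features.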
Further, the second determining module further comprises:
and the conversion module is used for converting the plurality of characteristic vectors corresponding to each file to be clustered into the characteristic vector of each file to be clustered by utilizing a locality sensitive hashing algorithm.
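The source names only "a locality sensitive hashing algorithm" for this conversion. The sketch below swaps in a SimHash-style vote as one possible instance: each bit position is decided by majority over the tuple-level vectors. The function name and the tie-breaking rule (ties become 0) are assumptions.

```python
from typing import List, Sequence


def file_feature_vector(tuple_vectors: Sequence[Sequence[int]]) -> List[int]:
    """Fold many tuple-level bit vectors into one file-level bit vector by majority vote."""
    if not tuple_vectors:
        raise ValueError("at least one tuple feature vector is required")
    width = len(tuple_vectors[0])
    totals = [0] * width
    for vec in tuple_vectors:
        if len(vec) != width:
            raise ValueError("all tuple feature vectors must have the same length")
        for i, bit in enumerate(vec):
            totals[i] += 1 if bit else -1  # a set bit votes +1, a clear bit votes -1
    return [1 if total > 0 else 0 for total in totals]
```

Because each output bit depends on a vote, files whose tuple sets overlap heavily tend to receive file-level vectors that differ in only a few positions.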
The device embodiments described above and the method embodiments are based on the same inventive concept.
An embodiment of the present invention further provides a file clustering device, where the device includes a processor and a memory, where the memory stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement the file clustering method as described above.
As can be seen from the above embodiments of the file clustering method, apparatus, and device provided by the present invention, the scheme of the present invention acquires application program interface sequence information called when a plurality of files to be clustered are executed, where the application program interface sequence information includes a plurality of application program interfaces ordered according to calling time sequence; combines the application program interface sequence information of each file to be clustered into a plurality of interface sequence tuples according to the ordering of the plurality of application program interfaces corresponding to each file to be clustered; determines a plurality of feature vectors of the plurality of interface sequence tuples corresponding to each file to be clustered; determines a feature vector of each file to be clustered based on the plurality of feature vectors corresponding to each file to be clustered; and clusters the plurality of files to be clustered by using the feature vectors of the plurality of files to be clustered. This solves the prior-art problem that files using obfuscation techniques such as packing (shell adding) and junk-instruction insertion (flower instructions) are difficult to cluster or are clustered inaccurately, and improves the accuracy of file clustering. Moreover, dividing the files into buckets and clustering only within each bucket greatly reduces the amount of computation and improves data processing efficiency.
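The reduction in computation can be illustrated by counting pairwise distance calculations. The sketch below is a simplified estimate: it assumes each file lands in exactly one bucket, whereas with several minimum hash values a file may appear in more than one. The file counts in the usage note are hypothetical, not measurements from the source.

```python
from typing import Iterable


def pair_count(n: int) -> int:
    """Number of unordered pairs among n files."""
    return n * (n - 1) // 2


def bucketed_comparisons(bucket_sizes: Iterable[int]) -> int:
    """Total in-bucket pairs when files are compared only within their bucket."""
    return sum(pair_count(size) for size in bucket_sizes)
```

Under these assumptions, 10,000 files compared all against all need `pair_count(10000)` = 49,995,000 distance calculations, while the same files split evenly into 100 buckets need `bucketed_comparisons([100] * 100)` = 495,000.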
It should be noted that the order of the above embodiments of the present invention is for description only and does not represent the relative merits of the embodiments. Specific embodiments of this specification have been described above. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, as for the device and apparatus embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference may be made to some descriptions of the method embodiments for relevant points.
It will be understood by those skilled in the art that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing relevant hardware; the program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (10)

1. A method for clustering files, the method comprising:
acquiring application program interface sequence information called when a plurality of files to be clustered are executed, wherein the application program interface sequence information comprises a plurality of application program interfaces which are sequenced according to calling time sequence;
combining the application program interface sequence information of each file to be clustered into a plurality of interface sequence tuples according to the sequencing of a plurality of application program interfaces corresponding to each file to be clustered, wherein the interface sequence tuples at least comprise two application program interfaces;
determining a plurality of characteristic vectors of the plurality of interface sequence tuples corresponding to each file to be clustered;
determining a feature vector of each file to be clustered based on a plurality of feature vectors corresponding to each file to be clustered;
and clustering the files to be clustered by utilizing the characteristic vectors of the files to be clustered.
2. The method according to claim 1, wherein the clustering the plurality of files to be clustered by using the feature vectors of the plurality of files to be clustered comprises:
performing minimum hash calculation on the feature vector of each file to be clustered to obtain a plurality of minimum hash values;
dividing the feature vector of each file to be clustered into a plurality of interface sequence buckets according to the minimum hash values;
calculating the distance between every two feature vectors of the files to be clustered in the same interface sequence bucket;
judging whether the distance between every two feature vectors of the files to be clustered in the same interface sequence bucket is smaller than or equal to a preset distance threshold value or not;
and clustering the two files to be clustered when the distance between the characteristic vectors of the two files to be clustered is smaller than or equal to a preset distance threshold.
3. The method according to claim 1, wherein the clustering the plurality of files to be clustered by using the feature vectors of the plurality of files to be clustered comprises:
calculating the distance between every two feature vectors of the files to be clustered;
judging whether the distance between every two feature vectors of the files to be clustered is smaller than or equal to a preset distance threshold value or not;
and clustering the two files to be clustered when the distance between the characteristic vectors of the two files to be clustered is smaller than or equal to a preset distance threshold.
4. The method according to claim 1, wherein the combining the application program interface sequence information of each file to be clustered into a plurality of interface sequence tuples according to the ordering of the plurality of application program interfaces corresponding to each file to be clustered comprises:
determining a first number of application program interfaces contained in an interface sequence tuple;
determining an application program interface extraction window based on the first number;
and sequentially extracting a first number of adjacent application program interfaces from the application program interface sequence information of each file to be clustered by using the application program interface extraction window to obtain a plurality of interface sequence tuples, wherein the moving step length of the application program interface extraction window when the first number of adjacent application program interfaces are sequentially extracted is one application program interface.
5. The method of claim 1, wherein the determining a plurality of feature vectors of the interface sequence tuples corresponding to each file to be clustered comprises:
and mapping the plurality of interface sequence tuples corresponding to each file to be clustered by using a message digest algorithm to obtain a plurality of characteristic vectors of the plurality of interface sequence tuples.
6. The method according to claim 1, wherein the determining the feature vector of each file to be clustered based on the plurality of feature vectors corresponding to each file to be clustered comprises:
and converting the plurality of feature vectors corresponding to each file to be clustered into the feature vector of each file to be clustered by using a locality sensitive hashing algorithm.
7. An apparatus for clustering files, the apparatus comprising:
the acquisition module is used for acquiring application program interface sequence information called when a plurality of files to be clustered are executed, wherein the application program interface sequence information comprises a plurality of application program interfaces which are sequenced according to calling time sequence;
the combination module is used for combining the application program interface sequence information of each file to be clustered into a plurality of interface sequence tuples according to the sequence of a plurality of application program interfaces corresponding to each file to be clustered, and the interface sequence tuples at least comprise two application program interfaces;
the first determining module is used for determining a plurality of characteristic vectors of the plurality of interface sequence tuples corresponding to each file to be clustered;
the second determining module is used for determining the characteristic vector of each file to be clustered based on a plurality of characteristic vectors corresponding to each file to be clustered;
and the clustering module is used for clustering the files to be clustered by utilizing the characteristic vectors of the files to be clustered.
8. The apparatus of claim 7, wherein the clustering module comprises:
the first calculation module is used for performing minimum hash calculation on the feature vector of each file to be clustered to obtain a plurality of minimum hash values;
the bucket dividing module is used for dividing the feature vector of each file to be clustered into a plurality of interface sequence buckets according to the minimum hash values;
the second calculation module is used for calculating the distance between every two feature vectors of the files to be clustered in the same interface sequence bucket;
the first judging and processing module is used for judging whether the distance between every two feature vectors of the files to be clustered in the same interface sequence bucket is smaller than or equal to a preset distance threshold value or not; and clustering the two files to be clustered when the distance between the characteristic vectors of the two files to be clustered is smaller than or equal to a preset distance threshold.
9. The apparatus of claim 7, wherein the clustering module comprises:
the third calculation module is used for calculating the distance between every two feature vectors of the files to be clustered;
the second judgment and processing module is used for judging whether the distance between every two feature vectors of the files to be clustered is smaller than or equal to a preset distance threshold value or not; and clustering the two files to be clustered when the distance between the feature vectors of the two files to be clustered is smaller than or equal to the preset distance threshold.
10. A file clustering device, comprising a processor and a memory, the memory having stored therein at least one instruction, at least one program, set of codes, or set of instructions, the at least one instruction, the at least one program, set of codes, or set of instructions being loaded and executed by the processor to implement the file clustering method according to any one of claims 1 to 6.
CN201910163113.6A 2019-03-05 2019-03-05 File clustering method, device and equipment Pending CN111666404A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910163113.6A CN111666404A (en) 2019-03-05 2019-03-05 File clustering method, device and equipment

Publications (1)

Publication Number Publication Date
CN111666404A true CN111666404A (en) 2020-09-15

Family

ID=72381543

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910163113.6A Pending CN111666404A (en) 2019-03-05 2019-03-05 File clustering method, device and equipment

Country Status (1)

Country Link
CN (1) CN111666404A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103473346A (en) * 2013-09-24 2013-12-25 北京大学 Android re-packed application detection method based on application programming interface
CN104008334A (en) * 2013-02-21 2014-08-27 腾讯科技(深圳)有限公司 Clustering method and device of files
CN105512555A (en) * 2014-12-12 2016-04-20 哈尔滨安天科技股份有限公司 Homologous family dividing and mutation method and system based on file string cluster
CN106778241A (en) * 2016-11-28 2017-05-31 东软集团股份有限公司 The recognition methods of malicious file and device
CN107153789A (en) * 2017-04-24 2017-09-12 西安电子科技大学 The method for detecting Android Malware in real time using random forest grader
CN107506414A (en) * 2017-08-11 2017-12-22 武汉大学 A kind of code based on shot and long term memory network recommends method
US20180189481A1 (en) * 2016-01-26 2018-07-05 Huawei Technologies Co., Ltd. Program File Classification Method, Program File Classification Apparatus, and Program File Classification System
CN109101817A (en) * 2018-08-13 2018-12-28 亚信科技(成都)有限公司 A kind of identification malicious file class method for distinguishing and calculate equipment

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
TRUNG KIEN TRAN, ET AL.: "NLP-based Approaches for Malware Classification from API Sequences", ASIA PACIFIC SYMPOSIUM ON INTELLIGENT AND EVOLUTIONARY SYSTEMS (IES), 31 December 2017 (2017-12-31), pages 101 - 105 *
姚晓杭;: "基于代码行为病毒监控技术", 实验室研究与探索, no. 12, 15 December 2009 (2009-12-15) *
孙贺等: "恶意程序相似性分析技术研究进展", 军事通信技术, no. 01, 25 March 2017 (2017-03-25), pages 43 - 51 *
熊俊;: "基于分类的未知病毒检测方法研究", 电脑开发与应用, no. 11, 25 November 2012 (2012-11-25) *
王冲;李炳辰;王进保;: "基于文本挖掘的恶意软件分类方法", 中国民航大学学报, no. 01, 15 February 2018 (2018-02-15) *
王硕等: "基于API序列分析和支持向量机的未知病毒检测", 计算机应用, no. 08, 1 August 2007 (2007-08-01), pages 124 - 125 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination