CN116821053B - Data reporting method, device, computer equipment and storage medium - Google Patents

Data reporting method, device, computer equipment and storage medium

Publication number
CN116821053B
Authority
CN
China
Prior art keywords
data
reported
file
reporting
clusters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311103374.1A
Other languages
Chinese (zh)
Other versions
CN116821053A (en)
Inventor
韩孟玲
白冰
张兴明
申大坤
孙天宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab
Priority to CN202311103374.1A
Publication of CN116821053A
Application granted
Publication of CN116821053B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/11 File system administration, e.g. details of archiving or snapshots
    • G06F16/122 File system administration, e.g. details of archiving or snapshots using management policies
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/13 File access structures, e.g. distributed indices
    • G06F16/137 Hash-based
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a data reporting method, a device, computer equipment and a storage medium. Data to be reported of a file is obtained and its characteristic value is extracted. According to the characteristic value, the data to be reported is classified into different bucket files for storage. Within each bucket file, the data to be reported is clustered by similarity to obtain a plurality of groups of data clusters to be reported. Each group of data clusters is scored according to the ratio of normal data samples to malicious data samples in the group, and a plurality of groups are selected for reporting according to their scores. Clustering reduces the reporting of repeated or similar useless data, and scoring filters the reported data. This solves the problem of low file data reporting efficiency in the related art, reduces the space required for storing data, and improves file data reporting efficiency.

Description

Data reporting method, device, computer equipment and storage medium
Technical Field
The present application relates to the field of data reporting technologies, and in particular, to a data reporting method, apparatus, computer device, and storage medium.
Background
In cloud scenarios, many services need to report file data from cloud servers. However, reporting all of the massive file data on a cloud server is costly, and full reporting occupies a large share of the detection engine's resources. File data therefore needs to be reported selectively according to the application scenario.
In the prior art, the file data to be reported in a cloud scenario is generally selected according to historical experience data. However, this approach has low reporting efficiency, easily overlooks important file data, and has limited applicability.
At present, no effective solution has been proposed for the problem of low file data reporting efficiency in the related art.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a data reporting method, apparatus, computer device, and computer readable storage medium that can improve the efficiency of reporting file data.
In a first aspect, the present application provides a data reporting method. The method comprises the following steps:
acquiring data to be reported of a file;
extracting the characteristic value of the data to be reported;
classifying the data to be reported into different bucket files for storage according to the characteristic value;
performing similarity calculation on the data to be reported in the same bucket file, clustering the data to be reported according to the similarity, and generating a plurality of groups of data clusters to be reported, wherein the data clusters to be reported comprise normal data samples and malicious data samples;
and scoring each group of data clusters to be reported according to the ratio of normal data samples to malicious data samples in that group, and selecting a plurality of groups of data clusters to be reported according to the scores for reporting.
In one embodiment, the data to be reported includes normal file reporting path data and malicious file path data, each of which comprises a first identification code, a file name, a file path, a directory and a generation time of the file.
In one embodiment, extracting the characteristic value of the data to be reported includes:
extracting the directory of the file by taking the first identification code as a primary key;
dividing the directory to obtain a plurality of byte fragments;
and calculating second identification codes of the byte fragments, combining the second identification codes, and generating a first characteristic value of the data to be reported.
In one embodiment, extracting the characteristic value of the data to be reported further includes:
randomly shuffling the rows of the first characteristic value a plurality of times;
mapping a first set corresponding to the first characteristic value obtained after each shuffle into a second set, wherein the mapping values in the second set are not repeated;
searching the mapping values in the second set in ascending order until the first characteristic value corresponding to the found mapping value is a first preset value;
obtaining the bit position corresponding to the found mapping value, and combining a plurality of the bit positions to obtain a second characteristic value;
and classifying the data to be reported according to the second characteristic value.
In one embodiment, clustering the data to be reported in the bucket file includes:
selecting a piece of data to be reported that has not yet been clustered, and calculating the similarity between the selected data and each group of already clustered data clusters to be reported in the same bucket file;
when the similarity is greater than a first threshold, merging the selected data to be reported into the similar data cluster to be reported;
and when the similarity is less than the first threshold, creating a new data cluster to be reported from the selected data.
In one embodiment, selecting a plurality of groups of data clusters to be reported according to the score for reporting includes:
and selecting the first N data clusters with the highest scores for reporting, or selecting the data clusters with the scores exceeding a second threshold for reporting.
In one embodiment, reporting the data cluster to be reported includes:
dividing the path of each piece of data to be reported in the data cluster to be reported into a plurality of directory names;
performing regular-expression substitution on the paths according to the directory names, and calculating the merging degree of the substituted paths;
when the merging degree is lower than a third threshold, continuing the regular-expression substitution of the paths until the merging degree of the substituted paths is higher than the third threshold;
extracting a regular expression of the merged paths, and calculating the coverage rate of the regular expression over its corresponding data to be reported and the global coverage rate of the regular expression over all the data to be reported in the data cluster to be reported;
and selecting the regular expression according to the coverage rate and the global coverage rate, and reporting the data cluster to be reported according to the selected regular expression.
In a second aspect, the application further provides a data reporting device. The device comprises:
the acquisition module is used for acquiring data to be reported of the file;
the extraction module is used for extracting the characteristic value of the data to be reported;
the bucketing module is used for classifying the data to be reported into different bucket files for storage according to the characteristic value;
the clustering module is used for calculating the similarity of the data to be reported in the same bucket file, clustering the data to be reported according to the similarity, and generating a plurality of groups of data clusters to be reported, wherein the data clusters to be reported comprise normal data samples and malicious data samples;
and the scoring module is used for scoring each group of data clusters to be reported according to the ratio of normal data samples to malicious data samples in that group, and selecting a plurality of groups of data clusters to be reported according to the scores for reporting.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor which when executing the computer program performs the steps of:
acquiring data to be reported of a file;
extracting the characteristic value of the data to be reported;
classifying the data to be reported into different bucket files for storage according to the characteristic value;
performing similarity calculation on the data to be reported in the same bucket file, clustering the data to be reported according to the similarity, and generating a plurality of groups of data clusters to be reported, wherein the data clusters to be reported comprise normal data samples and malicious data samples;
and scoring each group of data clusters to be reported according to the ratio of normal data samples to malicious data samples in that group, and selecting a plurality of groups of data clusters to be reported according to the scores for reporting.
In a fourth aspect, the present application also provides a computer-readable storage medium. The computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:
acquiring data to be reported of a file;
extracting the characteristic value of the data to be reported;
classifying the data to be reported into different bucket files for storage according to the characteristic value;
performing similarity calculation on the data to be reported in the same bucket file, clustering the data to be reported according to the similarity, and generating a plurality of groups of data clusters to be reported, wherein the data clusters to be reported comprise normal data samples and malicious data samples;
and scoring each group of data clusters to be reported according to the ratio of normal data samples to malicious data samples in that group, and selecting a plurality of groups of data clusters to be reported according to the scores for reporting.
According to the above data reporting method, device, computer equipment and storage medium, data to be reported of a file is obtained and its characteristic value is extracted. According to the characteristic value, the data to be reported is classified into different bucket files for storage. Within each bucket file, the data to be reported is clustered by similarity to obtain a plurality of groups of data clusters to be reported. Each group of data clusters is scored according to the ratio of normal data samples to malicious data samples in the group, and a plurality of groups are selected for reporting according to their scores. Clustering reduces the reporting of repeated or similar useless data, and scoring filters the reported data. This solves the problem of low file data reporting efficiency in the related art, reduces the space required for storing data, and improves file data reporting efficiency.
Drawings
FIG. 1 is an application environment diagram of a data reporting method in one embodiment;
FIG. 2 is a flow chart of a data reporting method in an embodiment;
FIG. 3 is a flow chart of calculating the first characteristic value in a data reporting method in one embodiment;
FIG. 4 is a flow chart of calculating the second characteristic value in a data reporting method in one embodiment;
FIG. 5 is a flowchart of a clustering process in a bucket of a data reporting method in one embodiment;
FIG. 6 is a regular expression extraction diagram of a data reporting method in one embodiment;
FIG. 7 is an overall flowchart of a data reporting method in one embodiment;
FIG. 8 is a block diagram of a data reporting device in one embodiment;
FIG. 9 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
The data reporting method provided by the embodiment of the application can be applied to the application environment shown in FIG. 1, in which the terminal 102 communicates with the server 104 via a network. The data storage system may store data that the server 104 needs to process. The data storage system may be integrated on the server 104 or may be located on a cloud or other network server. The terminal 102 may be, but is not limited to, a personal computer, notebook computer, smart phone, tablet computer, or the like. The server 104 may be implemented as a stand-alone server or as a server cluster of multiple servers.
In one embodiment, as shown in FIG. 2, a data reporting method is provided. The method is described as applied to the terminal in FIG. 1 by way of example, and includes the following steps:
step S202, obtaining data to be reported of a file.
The data to be reported of the file comprises the file name, the file path, whether the file is malicious, the file generation time, and the like. After the data to be reported is obtained, it is also cleaned to remove dirty data. Dirty data includes, but is not limited to, null data, incomplete path data, garbled characters, and other types of abnormal data.
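The cleaning step can be illustrated with a minimal Python sketch. The record layout (`FileRecord`) and the three concrete rules below are assumptions made for illustration only; the text names the categories of dirty data but does not define how each is detected.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class FileRecord:
    """Hypothetical record layout for one piece of data to be reported."""
    md5: Optional[str]          # first identification code
    file_name: Optional[str]
    file_path: Optional[str]
    is_malicious: bool
    created_at: Optional[str]   # generation time, kept as an opaque string


def is_clean(rec: FileRecord) -> bool:
    """Return True if the record passes the assumed cleaning rules."""
    # Rule 1 (null data): required fields must be present and non-empty.
    if not rec.md5 or not rec.file_name or not rec.file_path:
        return False
    # Rule 2 (incomplete path, assumed definition): the path must be
    # absolute and must end with the file name.
    if not rec.file_path.startswith("/") or not rec.file_path.endswith(rec.file_name):
        return False
    # Rule 3 (garbled characters, approximated): reject non-printable
    # characters and the Unicode replacement character.
    if not rec.file_path.isprintable() or "\ufffd" in rec.file_path:
        return False
    return True


def clean(records: Iterable[FileRecord]) -> List[FileRecord]:
    """Keep only the records that pass every rule."""
    return [r for r in records if is_clean(r)]
```

In practice the rules would be tuned to the dirty data actually observed; the sketch only shows where such a filter sits in the pipeline.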
Step S204, extracting the characteristic value of the data to be reported.
Features are extracted from the acquired data to be reported. The features include the file directory, the number of files generated globally under that directory within a set period, the numbers of malicious and normal files, their proportion of the total number of files, and the like. The period can be adjusted as needed, for example to a week, a day, or an hour.
Step S206, classifying the data to be reported into different bucket files for storage according to the characteristic value.
Taking a cloud scenario as an example, the amount of file data reported every day is 10 billion, so reporting all of it is inefficient and wastes a large amount of resources. The data to be reported therefore needs to be screened so that suitable, non-repeated data is selected for reporting. To this end, in the embodiment of the application the data to be reported is bucketed according to its characteristic value, and data that may be similar is placed in the same bucket file.
Step S208, performing similarity calculation on the data to be reported in the same bucket file, clustering the data to be reported according to the similarity, and generating a plurality of groups of data clusters to be reported, wherein the data clusters to be reported comprise normal data samples and malicious data samples.
After the data to be reported is bucketed, only data in the same bucket file is compared for similarity, and data in different bucket files is regarded as dissimilar. In a distributed environment, the data to be reported in the same bucket file is assigned to the same distributed computing node. Within a bucket file, all data is processed piece by piece in sequence and clustered into a plurality of groups of data clusters to be reported. After clustering is completed, the samples in each group of data clusters to be reported have similar directory structures, and each cluster contains normal data samples and malicious data samples.
Step S210, scoring each group of data clusters to be reported according to the ratio of the normal data sample to the malicious data sample under each group of data clusters to be reported, and selecting a plurality of groups of data clusters to be reported according to the score for reporting.
After the groups of data clusters to be reported are obtained, a score is calculated for each data cluster from the numbers and proportions of normal and malicious data samples in the cluster, the number of directories, and similar information. The score determines the rank of the data cluster among all data clusters: the higher the score, the higher the rank, and the more the data cluster should be reported.
In the malicious file detection scenario, the reporting principle is that the more samples a data cluster to be reported represents, and the more malicious data samples it contains, the more the data cluster is worth reporting; otherwise, the less it is worth reporting. This is because, in malicious file detection, the back end needs to pay more attention to directories from which malicious files have been reported, and to reduce reporting as far as possible for directories without malicious file reports, so as to save resources and improve detection service performance. In this case, the scoring formula for a data cluster to be reported is: number of malicious data samples × M - number of normal data samples × 0.1 / (number of directories × 10), where M is a hyperparameter that can be adjusted as needed and defaults to 100.
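The scoring formula can be written as a short Python function. This is a sketch under one explicit assumption: the formula in the text has no parentheses around the numerator, so it is read here with ordinary operator precedence. The function and parameter names are illustrative.

```python
def cluster_score(n_malicious: int, n_normal: int, n_dirs: int, m: float = 100.0) -> float:
    """Score one data cluster; a higher score means more worth reporting.

    Literal reading of the formula in the text with ordinary operator
    precedence (an assumption, since the source does not parenthesize
    the numerator):
        n_malicious * M - n_normal * 0.1 / (n_dirs * 10)
    """
    if n_dirs <= 0:
        raise ValueError("a cluster must contain at least one directory")
    return n_malicious * m - n_normal * 0.1 / (n_dirs * 10)
```

With the default M of 100, a cluster holding 2 malicious and 100 normal samples in one directory scores far higher than a cluster with the same normal samples and no malicious ones, which matches the stated principle that malicious samples dominate the ranking.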
In the above data reporting method, the data to be reported is not all clustered directly. Instead, the characteristic values of the data to be reported are extracted first, the data is pre-screened according to the characteristic values so that possibly similar data is placed in the same bucket file, and similarity calculation and clustering are then performed only within each bucket file, which improves the efficiency of the clustering computation. After the data to be reported in the same bucket file is clustered by similarity into a plurality of groups of data clusters to be reported, the clusters are scored according to the proportions and numbers of malicious and normal data samples they contain, and the clusters to report are selected according to the scores, which greatly reduces the reporting of repeated or similar useless data.
In one embodiment, the data to be reported comprises normal file reporting path data and malicious file path data, each of which comprises a first identification code, a file name, a file path, a directory and a generation time of the file.
The first identification code is the MD5 of a normal or malicious file, that is, the hash value produced by processing the file with the MD5 hash algorithm, which serves as the unique identification code of the file. If two files have the same first identification code, they are regarded as the same file; otherwise they are different files. If a file is modified, the value of its first identification code changes.
In this embodiment, the acquired data to be reported is stored in the data platform as original data, providing data support for subsequent data processing and reporting.
In one embodiment, extracting the characteristic value of the data to be reported includes: extracting the directory of the file by taking the first identification code as a primary key; dividing the directory to obtain a plurality of byte fragments; and calculating second identification codes of the plurality of byte fragments, combining the plurality of second identification codes, and generating a first characteristic value of the data to be reported.
With the first identification code as the primary key, the data to be reported of the file is preprocessed and the file path is segmented. In the embodiment of the application, the path is segmented by path separator and by n-gram. N-gram is an algorithm based on a statistical language model; its basic idea is to slide a window of n bytes over the content of a text, forming a sequence of byte fragments of length n. The size of n can be adjusted as needed. A second identification code, namely the MD5 of each segmented fragment, is then calculated, and the N generated second identification codes are combined to generate a simhash, which is the first characteristic value. Simhash is a fingerprint generation algorithm that reduces the dimensionality of a text to a simhash value; the similarity between texts can be judged by comparing their simhash values.
For example, FIG. 3 is a flowchart of calculating the first characteristic value in the data reporting method according to the embodiment of the present application. As shown in FIG. 3, taking 2-gram as an example, the directory of the file is obtained with the first identification code of the file as the primary key, and the directory is divided into a plurality of byte fragments such as /u, us, le and et. The MD5 of each byte fragment is calculated to obtain N second identification codes. The N second identification codes are combined bit by bit to generate the simhash, which is the first characteristic value.
In this embodiment, the first characteristic value of the data is obtained by segmenting the file path and calculating identification codes, which facilitates the subsequent similarity calculation. Because simhash is less sensitive to the order of the file data, the classification accuracy for the file data is higher.
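The first characteristic value can be sketched in Python as follows. The sketch assumes a conventional simhash in which each bit of the fingerprint is decided by a majority vote over the fragment digests, a 64-bit fingerprint taken from the high bits of each MD5 digest, and n-grams taken over the whole directory string; the text only states that the second identification codes are combined bit by bit, so these details and all function names are illustrative.

```python
import hashlib
from typing import List


def byte_ngrams(directory: str, n: int = 2) -> List[bytes]:
    """Slide a window of n bytes over the UTF-8 encoded directory string."""
    data = directory.encode("utf-8")
    if len(data) < n:
        return [data] if data else []
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def simhash(directory: str, n: int = 2, bits: int = 64) -> int:
    """Combine the MD5 digests of all n-grams bit by bit (bits <= 128)."""
    counts = [0] * bits
    for gram in byte_ngrams(directory, n):
        # Second identification code: MD5 of the fragment, high `bits` bits.
        digest = int.from_bytes(hashlib.md5(gram).digest(), "big") >> (128 - bits)
        for i in range(bits):
            counts[i] += 1 if (digest >> i) & 1 else -1
    # A fingerprint bit is 1 when more fragments voted 1 than 0.
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Directories that share most of their fragments tend to produce fingerprints with a small Hamming distance, which is what makes the fingerprint usable for the later similarity calculation.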
In one embodiment, extracting the characteristic value of the data to be reported further includes: randomly shuffling the rows of the first characteristic value a plurality of times; mapping a first set corresponding to the first characteristic value obtained after each shuffle into a second set, wherein the mapping values in the second set are not repeated; searching the mapping values in the second set in ascending order until the first characteristic value corresponding to the found mapping value is a first preset value; obtaining the bit position corresponding to the found mapping value, and combining a plurality of the bit positions to obtain a second characteristic value; and classifying the data to be reported according to the second characteristic value.
The second characteristic value is a minhash value. The minhash algorithm is a minimum hash function algorithm: the rows of a column vector are randomly permuted, and the row number of the first non-zero element after the permutation is the minimum hash value. In the embodiment of the application, after the first characteristic value is calculated, the rows of its column vector (its binary bits) are randomly shuffled. The first set corresponding to the shuffled first characteristic value is mapped into the second set, and the mapping is a perfect hash function. A perfect hash function maps each element of a set S into another set without collisions. For example, the set {0,1,2,3,4,5,6} is mapped into a new set {3,2,5,1,0,6,4}; no number in the new set is repeated, so the function is a perfect hash function. In the present application, the mapping values in the second set are therefore distinct from one another. The mapping values in the second set are searched in ascending order until the row of the column vector of the first characteristic value corresponding to the found mapping value holds the first preset value, which is 1 in the embodiment of the application. Once the first preset value is found in the first set in this order, the binary bit position at which it occurs is taken. Because the shuffling is performed multiple times, the bit positions obtained from the mapping and search after each shuffle are combined, and the combined result is the second characteristic value. Each pass of shuffling, mapping, searching and obtaining one bit position is one minhash operation.
The number of minhash operations determines the granularity of the bucketing. The more operations are performed, the longer the combined second characteristic value, the more bucket files there are, and the less data to be reported each bucket file holds, but the recall rate of the clustering decreases. The fewer operations are performed, the shorter the combined second characteristic value and the higher the recall rate of the clustering. In general, 2 to 3 minhash operations are enough to meet clustering requirements on the order of 1 billion. According to the second characteristic value, the data to be reported can be classified into different bucket files. For example, the bucket ID calculated from the second characteristic value may be 12_73_51, and all data whose ID is 12_73_51 is placed in the same bucket file.
FIG. 4 is a flowchart of calculating the second characteristic value in the data reporting method according to an embodiment of the present application. As shown in FIG. 4, column A is the binary representation of the first characteristic value, column B is the bit position within the first characteristic value, and column C is the value to which that bit position is mapped by the perfect hash function. Taking two minhash operations as an example, in the first operation the search proceeds in ascending order of the values in column C; when the first 1 appears in column A, the corresponding bit position in column B is 5, so the result is 5. In the second operation the search again proceeds in ascending order of the values in column C; when the first 1 appears in column A, the corresponding bit position in column B is 3, so the result is 3. Combining the results of the two minhash operations gives the final second characteristic value, 5-3.
In this embodiment, the second characteristic value is obtained by applying minhash to the first characteristic value after perfect hash function mapping. The minhash algorithm is computationally efficient for large volumes of data and places no mandatory requirement on the order of the data, so it is widely applicable. Bucketing the data as a preprocessing step also improves the efficiency of the subsequent clustering.
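The minhash-based bucketing can be sketched in Python as follows. The sketch assumes that the perfect hash function is realized as a seeded random permutation of the bit positions, that bit position 0 is the least significant bit, and that the per-round results are joined with underscores as in the example ID 12_73_51. The function names, the seed and the number of rounds are illustrative.

```python
import random
from typing import List


def minhash_bit_index(fingerprint: int, permutation: List[int]) -> int:
    """Return the position of the first 1-bit met when scanning the
    fingerprint in ascending order of mapped value.

    permutation[i] is the mapped value of bit position i; because it is a
    permutation, the mapping is collision-free. Returns -1 if no bit is set.
    """
    # Visit bit positions in ascending order of their mapped value.
    for bit in sorted(range(len(permutation)), key=permutation.__getitem__):
        if (fingerprint >> bit) & 1:
            return bit
    return -1


def bucket_id(fingerprint: int, bits: int = 64, rounds: int = 3, seed: int = 0) -> str:
    """Join the results of several minhash rounds into one bucket ID."""
    # Fixed seed: every record must be scanned with the same permutations,
    # otherwise identical fingerprints would land in different buckets.
    rng = random.Random(seed)
    parts = []
    for _ in range(rounds):
        permutation = list(range(bits))
        rng.shuffle(permutation)
        parts.append(str(minhash_bit_index(fingerprint, permutation)))
    return "_".join(parts)
```

Using the mapping {3,2,5,1,0,6,4} from the text on a 7-bit fingerprint with bits 3 and 5 set, the scan visits bit 4 first (mapped value 0, bit is 0) and then bit 3 (mapped value 1, bit is 1), so the result of that round is 3.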
In one embodiment, clustering the data to be reported in a bucket file includes: selecting a piece of data to be reported that has not yet been clustered, and calculating the similarity between the selected data and each group of already clustered data clusters to be reported in the same bucket file; when the similarity is greater than a first threshold, merging the selected data into the similar data cluster to be reported; and when the similarity is less than the first threshold, creating a new data cluster to be reported from the selected data.
After the data to be reported is bucketed according to the second characteristic value, only data in the same bucket file is compared for similarity; data in different bucket files is regarded as dissimilar. In a distributed environment, the data in one bucket file can be assigned to the same distributed computing node (worker). All data to be reported in the same bucket file is processed piece by piece. When a piece of data is about to be clustered, the already clustered data clusters to be reported are first checked for a similar cluster. If one exists, the piece of data is merged into it; if not, a new, separate data cluster is created from it. Once all the data has been processed in sequence, the clustering of the data to be reported in the bucket file is complete.
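The piece-by-piece clustering inside one bucket file can be sketched in Python as follows. The text does not specify how the similarity between a piece of data and a cluster is computed, so the sketch makes three assumptions: similarity is the fraction of matching fingerprint bits, each cluster is represented by its first member, and a similarity exactly equal to the threshold starts a new cluster. The threshold value and function names are illustrative.

```python
from typing import List, Tuple


def similarity(a: int, b: int, bits: int = 64) -> float:
    """Fraction of fingerprint bits on which a and b agree."""
    return 1.0 - bin(a ^ b).count("1") / bits


def cluster_bucket(fingerprints: List[int], threshold: float = 0.9,
                   bits: int = 64) -> List[List[int]]:
    """Cluster the records of one bucket file in a single pass.

    Returns clusters as lists of indices into `fingerprints`. Each cluster
    is represented by the fingerprint of its first member (an assumption).
    """
    clusters: List[Tuple[int, List[int]]] = []  # (representative, member indices)
    for idx, fp in enumerate(fingerprints):
        best, best_sim = None, threshold
        for rep, members in clusters:
            sim = similarity(fp, rep, bits)
            if sim > best_sim:  # keep the most similar cluster above the threshold
                best, best_sim = members, sim
        if best is not None:
            best.append(idx)              # merge into the similar cluster
        else:
            clusters.append((fp, [idx]))  # otherwise start a new cluster
    return [members for _, members in clusters]
```

Because each bucket is processed independently, one call per bucket file can run on a separate worker, and the per-bucket results can simply be concatenated.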
Fig. 5 is a flowchart of the bucketed clustering method of the data reporting method according to an embodiment of the present application. As shown in fig. 5, a first feature value is generated by splitting the file directory path, a second feature value is calculated from the first feature value, the data to be reported is divided into buckets 1 to N according to the second feature value, each bucket file is clustered, and finally the clustering results of the bucket files are combined to obtain the final clustering result.
In the embodiment, the data in the same bucket are distributed to the same distributed computing node for clustering computation, and a hierarchical clustering method is adopted to combine and aggregate according to whether the data are similar, so that the clustering efficiency is improved.
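The flow of fig. 5 (divide into buckets, cluster each bucket, combine the results) can be outlined as below. This is only a single-process sketch: in the distributed setting described above, each bucket would be handed to one worker instead of being looped over locally. The function name `cluster_by_bucket` and the `cluster_fn` parameter are hypothetical, and the bucket key is assumed to have been computed already.

```python
from collections import defaultdict

def cluster_by_bucket(records, cluster_fn):
    """Group records by bucket key, cluster each bucket, and combine the clusters.

    records: iterable of (bucket_key, item) pairs.
    cluster_fn: callable mapping a list of items to a list of clusters.
    """
    buckets = defaultdict(list)
    for key, item in records:
        buckets[key].append(item)
    combined = []
    # Items in different buckets are never compared with each other.
    for key in sorted(buckets):
        combined.extend(cluster_fn(buckets[key]))
    return combined
```

Because no comparison crosses a bucket boundary, the per-bucket calls are independent and could run in parallel without coordination.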
In one embodiment, selecting a plurality of groups of data clusters to be reported according to the score for reporting includes: selecting the N data clusters to be reported with the highest scores for reporting, or selecting the data clusters to be reported whose scores exceed a second threshold for reporting.
After clustering is completed, each data cluster to be reported represents one class. After the score of each data cluster is calculated, either of two mechanisms may be used to select data clusters for reporting according to the score. One is TopN ranking: the clusters are sorted by score and the N data clusters with the highest scores are selected for reporting. The other is thresholding: a second threshold is set, and the data clusters whose scores exceed the second threshold are reported.
In this embodiment, each cluster is scored according to the proportion of malicious data it contains, and the score determines whether the cluster is reported. Two selection mechanisms are available for choosing the data to be reported, which facilitates the screening of malicious file data, saves computing resources, and improves the performance of the detection service.
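The two selection mechanisms can be illustrated with the short sketch below. The exact scoring formula is not given in this passage, so the score is assumed here to be the fraction of malicious samples in a cluster, with a higher score meaning a stronger candidate for reporting. The names `score_cluster`, `select_top_n`, and `select_above` are hypothetical.

```python
def score_cluster(labels):
    """labels: list of booleans, True for a malicious sample.

    Assumed score: the proportion of malicious samples in the cluster.
    """
    return sum(labels) / len(labels) if labels else 0.0

def select_top_n(clusters, n):
    """Return the names of the n clusters with the highest scores."""
    ranked = sorted(clusters, key=lambda name: score_cluster(clusters[name]), reverse=True)
    return ranked[:n]

def select_above(clusters, second_threshold):
    """Return the names of the clusters whose score exceeds the second threshold."""
    return [name for name, labels in clusters.items()
            if score_cluster(labels) > second_threshold]
```

TopN bounds the reporting volume regardless of how the scores are distributed, while the threshold variant lets the volume follow the data; which one fits better depends on whether the downstream service has a fixed capacity.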
In one embodiment, reporting the data cluster to be reported includes: splitting the path of each piece of data to be reported in the data cluster into a plurality of directory names; performing regular-expression substitution on the paths according to the directory names, and calculating the merging degree of the substituted paths; when the merging degree is lower than a third threshold, continuing the regular-expression substitution until the merging degree of the substituted paths is higher than the third threshold; extracting the regular expression of the merged path, and calculating the coverage rate of the regular expression on the corresponding data to be reported and its global coverage rate on all the data to be reported in the data cluster to be reported; and selecting a regular expression according to the coverage rate and the global coverage rate, and reporting the data cluster to be reported according to the selected regular expression.
Regular expressions are extracted from the paths of all samples in a data cluster that has been selected for reporting. First, each directory is split into N directory names by the path separator (such as "/"). A regular expression is then derived from the directory names: numbers in the directory are replaced with the regular expression \d+, and the merging degree of the substituted complete paths is calculated. If, after this substitution, the merging degree of the sample directories in the data cluster is still lower than the third threshold, further substitution is applied, for example replacing the "_" and "-" parts of the N directory names with \w+, and the complete directories of the samples in the cluster are merged again. After the regular expressions are extracted, their coverage rate on the samples in the current data cluster and their global coverage rate are calculated; a regular expression with high intra-cluster coverage and low global coverage is selected, and the data to be reported is reported according to the selected regular expression.
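The coverage-based selection can be sketched as follows. How the two coverage figures are combined is not stated in the text, so this sketch assumes a simple lexicographic rule: prefer the highest intra-cluster coverage and break ties by the lowest global coverage. The names `coverage` and `pick_pattern` are hypothetical, and "global" is taken to mean all data to be reported, including the current cluster.

```python
import re

def coverage(pattern, paths):
    """Fraction of paths that the regular expression matches in full."""
    if not paths:
        return 0.0
    regex = re.compile(pattern)
    return sum(1 for p in paths if regex.fullmatch(p)) / len(paths)

def pick_pattern(candidates, cluster_paths, all_paths):
    """Choose the candidate with high intra-cluster and low global coverage."""
    return max(
        candidates,
        key=lambda pat: (coverage(pat, cluster_paths), -coverage(pat, all_paths)),
    )
```

A pattern that covers the whole cluster but also matches many paths outside it is too general to identify the cluster, which is why the global coverage acts as a penalty here.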
Illustratively, fig. 6 is a regular expression extraction diagram of the data reporting method according to an embodiment of the present application. As shown in fig. 6, taking /var/www/sites/exam.com/up/16/propic_0/1209923724 as an example, the directory is first split into N directory names by the path separator (e.g., "/"); the resulting names are var, www, sites, exam.com, up, 16, propic_0, and 1209923724. A regular expression is then derived from the directory names by replacing numbers with \d+: 16 is replaced with \d+, and 1209923724 is replaced with \d+. The merging degree of the substituted complete paths is calculated, and the directory after the first round of merging is: /var/www/sites/exam.com/up/\d+/propic_0/\d+. If the merging degree of the sample directories in the cluster is still lower than the third threshold after this substitution, further substitution is applied, for example replacing the "_" and "-" parts of the N directory names with \w+, and the complete directories of the samples in the cluster are merged again. The final merged directory regular expression is: /var/www/sites/exam.com/up/\d+/propic\w+/\d+.
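The two substitution rounds in this example can be reproduced with the sketch below. Several points are assumptions: the merging degree is taken to be the share of paths that collapse onto the most common generalized path, only directory names consisting entirely of digits are replaced in the first round, and the second round handles "_" only, because `\w` does not match "-" (hyphenated names would need a wider class such as `[\w-]+`). Literal directory names are left unescaped, as in the example. The names `generalize`, `merging_degree`, `extract_pattern`, and `third_threshold` are hypothetical.

```python
import re
from collections import Counter

def generalize(path, level):
    """Replace directory names with regex fragments; level 2 is more aggressive."""
    out = []
    for name in path.strip("/").split("/"):
        if re.fullmatch(r"\d+", name):
            out.append(r"\d+")                        # round 1: all-digit names
        elif level >= 2 and "_" in name:
            out.append(name.split("_", 1)[0] + r"\w+")  # round 2: keep prefix before "_"
        else:
            out.append(name)                          # literal name, left unescaped
    return "/" + "/".join(out)

def merging_degree(generalized):
    """Assumed definition: share of paths equal to the most common generalized path."""
    return Counter(generalized).most_common(1)[0][1] / len(generalized)

def extract_pattern(paths, third_threshold=0.8):
    """Generalize round by round until the merging degree exceeds the threshold."""
    for level in (1, 2):
        generalized = [generalize(p, level) for p in paths]
        if merging_degree(generalized) > third_threshold:
            break
    return Counter(generalized).most_common(1)[0][0]
```

With the path from fig. 6 and a hypothetical sibling path `/var/www/sites/exam.com/up/17/propic_1/1209923725`, the first round leaves two distinct patterns (merging degree 0.5), so the second round runs and both paths collapse onto `/var/www/sites/exam.com/up/\d+/propic\w+/\d+`.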
Fig. 7 is an overall flowchart of the data reporting method according to an embodiment of the present application. As shown in fig. 7, the data to be reported of existing files is first collected; the collected data is cleaned and features are extracted; the data to be reported is clustered according to the extracted features and the file paths; the clustering result is calculated and analyzed, and the file data to be reported is selected according to the clustering result; when a file needs to be reported, the regular expression of the file reporting path is extracted and reporting is performed according to that regular expression. This process greatly reduces the reporting of repeated or similar useless logs, improves data reporting efficiency, reduces the space required for storing and reporting data, and has high applicability. In addition, the data reporting method of the embodiment of the present application can be used not only for reporting file log data but also for reporting log data of users and of the operating system, such as system process logs; the present application is not limited in this respect.
It should be understood that, although the steps in the flowcharts of the embodiments described above are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the execution order of the steps is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts may include a plurality of sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different times; these sub-steps or stages are also not necessarily executed sequentially, and may be executed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
Based on the same inventive concept, an embodiment of the present application also provides a data reporting device for implementing the data reporting method described above. The solution provided by the device is implemented in a manner similar to that described for the method, so for the specific limitations in the one or more embodiments of the data reporting device provided below, reference may be made to the limitations of the data reporting method above; they are not repeated here.
In one embodiment, as shown in fig. 8, there is provided a data reporting apparatus, including:
an obtaining module 81, configured to obtain data to be reported of a file;
the extracting module 82 is configured to extract a feature value of data to be reported;
the bucketing module 83 is configured to classify the data to be reported into different bucket files for storage according to the feature values;
the clustering module 84 is configured to perform similarity calculation on the data to be reported in the same bucket file, cluster the data to be reported according to the similarity, and generate multiple groups of data clusters to be reported, where the data clusters to be reported include normal data samples and malicious data samples;
the scoring module 85 is configured to score each group of data clusters to be reported according to the ratio of the normal data samples to the malicious data samples in each group, and to select multiple groups of data clusters to be reported according to the score for reporting.
All or part of the modules in the data reporting device may be implemented by software, hardware, or a combination thereof. The above modules may be embedded in, or independent of, a processor of the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and the internal structure of which may be as shown in fig. 9. The computer device includes a processor, a memory, an Input/Output interface (I/O) and a communication interface. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is used for storing data to be reported. The input/output interface of the computer device is used to exchange information between the processor and the external device. The communication interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a data reporting method.
It will be appreciated by persons skilled in the art that the architecture shown in fig. 9 is merely a block diagram of some of the architecture relevant to the present inventive arrangements and is not limiting as to the computer device to which the present inventive arrangements are applicable, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In one embodiment, a computer device is provided comprising a memory and a processor, the memory having stored therein a computer program, the processor when executing the computer program performing the steps of:
acquiring data to be reported of a file; extracting a characteristic value of the data to be reported; classifying the data to be reported into different bucket files for storage according to the characteristic values; performing similarity calculation on the data to be reported in the same bucket file, clustering the data to be reported according to the similarity, and generating a plurality of groups of data clusters to be reported, wherein the data clusters to be reported comprise normal data samples and malicious data samples; and scoring each group of data clusters to be reported according to the ratio of the normal data samples to the malicious data samples in each group, and selecting a plurality of groups of data clusters to be reported according to the score for reporting.
In one embodiment, the processor when executing the computer program further performs the steps of:
obtaining data to be reported of a file, wherein the data to be reported comprises file report path data and malicious file path data, each of which comprises a first identification code, a file name, a file path, a directory, and the generation time of the file.
In one embodiment, the processor when executing the computer program further performs the steps of:
extracting the directory of the file by taking the first identification code as the primary key; splitting the directory to obtain a plurality of byte fragments; and calculating second identification codes of the plurality of byte fragments, combining the plurality of second identification codes, and generating a first characteristic value of the data to be reported.
In one embodiment, the processor when executing the computer program further performs the steps of:
randomly scrambling each row of the first characteristic values for a plurality of times; mapping a first set corresponding to the first characteristic value obtained after each scrambling into a second set, wherein each mapping value in the second set is not repeated; searching the mapping values in the second set in order from small to large until the first characteristic value corresponding to the searched mapping value is a first preset value; obtaining a bit number corresponding to the searched mapping value, and combining the bit numbers to obtain a second characteristic value; and classifying the data to be reported according to the second characteristic value.
In one embodiment, the processor when executing the computer program further performs the steps of:
selecting a piece of data to be reported that has not yet been clustered, and calculating the similarity between the selected data and each group of already clustered data clusters to be reported in the same bucket file; when the similarity is greater than a first threshold, merging the selected data into the similar data cluster to be reported; and when the similarity is smaller than the first threshold, creating a new data cluster to be reported from the selected data.
In one embodiment, the processor when executing the computer program further performs the steps of:
and selecting the first N data clusters to be reported with the highest score for reporting, or selecting the data clusters to be reported with the score exceeding a second threshold for reporting.
In one embodiment, the processor when executing the computer program further performs the steps of:
dividing the data to be reported into a plurality of directory names according to the paths of the data to be reported in the data cluster to be reported; regular substitution is carried out on the paths according to the directory names, and the merging degree of the substituted paths is calculated; when the merging degree is lower than a third threshold value, continuing regular substitution of the paths until the merging degree of the substituted paths is higher than the third threshold value; extracting a regular expression of the merging path, and calculating the coverage rate of the regular expression on corresponding data to be reported and the global coverage rate of the regular expression on all the data to be reported in the data cluster to be reported; and selecting a regular expression according to the coverage rate and the global coverage rate, and reporting the data cluster to be reported according to the selected regular expression.
In one embodiment, a computer readable storage medium is provided having a computer program stored thereon, which when executed by a processor, performs the steps of:
acquiring data to be reported of a file; extracting a characteristic value of the data to be reported; classifying the data to be reported into different bucket files for storage according to the characteristic values; performing similarity calculation on the data to be reported in the same bucket file, clustering the data to be reported according to the similarity, and generating a plurality of groups of data clusters to be reported, wherein the data clusters to be reported comprise normal data samples and malicious data samples; and scoring each group of data clusters to be reported according to the ratio of the normal data samples to the malicious data samples in each group, and selecting a plurality of groups of data clusters to be reported according to the score for reporting.
In one embodiment, the computer program when executed by the processor further performs the steps of:
obtaining data to be reported of a file, wherein the data to be reported comprises file report path data and malicious file path data, each of which comprises a first identification code, a file name, a file path, a directory, and the generation time of the file.
In one embodiment, the computer program when executed by the processor further performs the steps of:
extracting the directory of the file by taking the first identification code as the primary key; splitting the directory to obtain a plurality of byte fragments; and calculating second identification codes of the plurality of byte fragments, combining the plurality of second identification codes, and generating a first characteristic value of the data to be reported.
In one embodiment, the computer program when executed by the processor further performs the steps of:
randomly scrambling each row of the first characteristic values for a plurality of times; mapping a first set corresponding to the first characteristic value obtained after each scrambling into a second set, wherein each mapping value in the second set is not repeated; searching the mapping values in the second set in order from small to large until the first characteristic value corresponding to the searched mapping value is a first preset value; obtaining a bit number corresponding to the searched mapping value, and combining the bit numbers to obtain a second characteristic value; and classifying the data to be reported according to the second characteristic value.
In one embodiment, the computer program when executed by the processor further performs the steps of:
selecting a piece of data to be reported that has not yet been clustered, and calculating the similarity between the selected data and each group of already clustered data clusters to be reported in the same bucket file; when the similarity is greater than a first threshold, merging the selected data into the similar data cluster to be reported; and when the similarity is smaller than the first threshold, creating a new data cluster to be reported from the selected data.
In one embodiment, the computer program when executed by the processor further performs the steps of:
and selecting the first N data clusters to be reported with the highest score for reporting, or selecting the data clusters to be reported with the score exceeding a second threshold for reporting.
In one embodiment, the computer program when executed by the processor further performs the steps of:
dividing the data to be reported into a plurality of directory names according to the paths of the data to be reported in the data cluster to be reported; regular substitution is carried out on the paths according to the directory names, and the merging degree of the substituted paths is calculated; when the merging degree is lower than a third threshold value, continuing regular substitution of the paths until the merging degree of the substituted paths is higher than the third threshold value; extracting a regular expression of the merging path, and calculating the coverage rate of the regular expression on corresponding data to be reported and the global coverage rate of the regular expression on all the data to be reported in the data cluster to be reported; and selecting a regular expression according to the coverage rate and the global coverage rate, and reporting the data cluster to be reported according to the selected regular expression.
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards of the related country and region.
Those skilled in the art will appreciate that all or part of the methods described above may be implemented by a computer program stored on a non-transitory computer readable storage medium; when executed, the program may include the steps of the method embodiments described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, Resistive Random Access Memory (ReRAM), Magnetoresistive Random Access Memory (MRAM), Ferroelectric Random Access Memory (FRAM), Phase Change Memory (PCM), graphene memory, and the like. Volatile memory may include Random Access Memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database. The processor referred to in the embodiments provided in the present application may be, but is not limited to, a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, or a data processing logic unit based on quantum computing.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered to be within the scope of this description.
The foregoing examples illustrate only a few embodiments of the application and are described in detail herein without thereby limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of the application should be assessed as that of the appended claims.

Claims (10)

1. The data reporting method is characterized by comprising the following steps:
acquiring data to be reported of a file;
extracting the characteristic value of the data to be reported;
classifying the data to be reported into different bucket files for storage according to the characteristic values;
performing similarity calculation on the data to be reported in the same bucket file, clustering the data to be reported according to the similarity, and generating a plurality of groups of data clusters to be reported, wherein the data clusters to be reported comprise normal data samples and malicious data samples;
and scoring each group of data clusters to be reported according to the ratio of the normal data samples to the malicious data samples in each group of data clusters to be reported, and selecting a plurality of groups of data clusters to be reported according to the score for reporting.
2. The data reporting method according to claim 1, wherein the data to be reported comprises file report path data and malicious file path data, each of which comprises a first identification code, a file name, a file path, a directory, and the generation time of the file.
3. The method for reporting data according to claim 1, wherein extracting the characteristic value of the data to be reported comprises:
extracting the directory of the file by taking the first identification code as the primary key;
splitting the directory to obtain a plurality of byte fragments;
and calculating second identification codes of the byte fragments, combining the second identification codes, and generating a first characteristic value of the data to be reported.
4. The data reporting method according to claim 3, wherein extracting the characteristic value of the data to be reported further comprises:
randomly scrambling each row of the first characteristic value for a plurality of times;
mapping a first set corresponding to the first characteristic value obtained after each scrambling into a second set, wherein each mapping value in the second set is not repeated;
searching the mapping values in the second set in order from small to large until the first characteristic value corresponding to the searched mapping value is a first preset value;
obtaining the bit number corresponding to the searched mapping value, and combining a plurality of the bit numbers to obtain a second characteristic value;
and classifying the data to be reported according to the second characteristic value.
5. The data reporting method of claim 1, wherein clustering the data to be reported in the bucket file comprises:
selecting a piece of data to be reported that has not yet been clustered, and calculating the similarity between the selected data to be reported and each group of already clustered data clusters to be reported in the same bucket file;
when the similarity is greater than a first threshold, merging the selected data to be reported into the similar data cluster to be reported;
and when the similarity is smaller than the first threshold, creating the selected data to be reported as a new data cluster to be reported.
6. The method for reporting data according to claim 1, wherein selecting a plurality of groups of data clusters to be reported according to the score for reporting comprises:
and selecting the first N data clusters with the highest scores for reporting, or selecting the data clusters with the scores exceeding a second threshold for reporting.
7. The method for reporting data according to claim 1, wherein reporting the data cluster to be reported comprises:
dividing the data to be reported into a plurality of directory names according to the path of the data to be reported in the data cluster to be reported;
regular substitution is carried out on the paths according to the directory names, and the merging degree of the paths after substitution is calculated;
when the merging degree is lower than a third threshold value, continuing regular substitution of the paths until the merging degree of the substituted paths is higher than the third threshold value;
extracting a regular expression of a merging path, and calculating the coverage rate of the regular expression on the corresponding data to be reported and the global coverage rate of the regular expression on all the data to be reported in the data cluster to be reported;
and selecting the regular expression according to the coverage rate and the global coverage rate, and reporting the data cluster to be reported according to the selected regular expression.
8. A data reporting apparatus, comprising:
the acquisition module is used for acquiring data to be reported of the file;
the extraction module is used for extracting the characteristic value of the data to be reported;
the bucketing module is used for classifying the data to be reported into different bucket files for storage according to the characteristic values;
the clustering module is used for calculating the similarity of the data to be reported in the same bucket file, clustering the data to be reported according to the similarity, and generating a plurality of groups of data clusters to be reported, wherein the data clusters to be reported comprise normal data samples and malicious data samples;
and the scoring module is used for scoring each group of data clusters to be reported according to the ratio of the normal data sample to the malicious data sample under each group of data clusters to be reported, and selecting a plurality of groups of data clusters to be reported according to the score for reporting.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the data reporting method of any one of claims 1 to 7 when the computer program is executed by the processor.
10. A computer-readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the steps of the data reporting method as claimed in any one of claims 1 to 7.
CN202311103374.1A 2023-08-30 2023-08-30 Data reporting method, device, computer equipment and storage medium Active CN116821053B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311103374.1A CN116821053B (en) 2023-08-30 2023-08-30 Data reporting method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN116821053A CN116821053A (en) 2023-09-29
CN116821053B true CN116821053B (en) 2023-11-21

Family

ID=88116991

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311103374.1A Active CN116821053B (en) 2023-08-30 2023-08-30 Data reporting method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116821053B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117573875A (en) * 2023-12-05 2024-02-20 安芯网盾(北京)科技有限公司 Method and device for optimizing homonymy file clustering algorithm

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1711536A (en) * 2002-10-03 2005-12-21 古格公司 Method and apparatus for characterizing documents based on clusters of related words
CN107807939A (en) * 2016-09-09 2018-03-16 阿里巴巴集团控股有限公司 The method for sorting and equipment of data object
CN110647626A (en) * 2019-07-30 2020-01-03 浙江工业大学 REST data service clustering method based on Internet service domain
WO2021169173A1 (en) * 2020-02-29 2021-09-02 深圳壹账通智能科技有限公司 Data clustering storage method and apparatus, computer device, and storage medium
CN113591082A (en) * 2021-07-06 2021-11-02 之江实验室 Text classification-based Android mixed feature malicious code classification method
CN113821630A (en) * 2020-06-19 2021-12-21 菜鸟智能物流控股有限公司 Data clustering method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11216491B2 (en) * 2016-03-31 2022-01-04 Splunk Inc. Field extraction rules from clustered data samples

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
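Scoring based unsupervised approach to classify research papers; K.M. Anil Kumar et al.; 2016 International Conference on Advanced Robotics and Mechatronics (ICARM); 505-511 *
Feature-based multi-dimensional file browsing in massive file systems; He Yang; He Lianyue; Chen Bo; Xu Jun; Xu Zhaomiao; Computer Engineering & Science (05); 32-37 *
Research on entity similarity analysis techniques for the Internet of Things; Liu Suyan; China Doctoral Dissertations Electronic Journal Network; I136-127 *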

Also Published As

Publication number Publication date
CN116821053A (en) 2023-09-29

Similar Documents

Publication Publication Date Title
Chakrabarti et al. An efficient filter for approximate membership checking
CN110941959B (en) Text violation detection, text restoration method, data processing method and equipment
CN109325032B (en) Index data storage and retrieval method, device and storage medium
CN116821053B (en) Data reporting method, device, computer equipment and storage medium
CN109271545A (en) A kind of characteristic key method and device, storage medium and computer equipment
US20220222233A1 (en) Clustering of structured and semi-structured data
CN111666258B (en) Information processing method and device, information query method and device
CN114780606A (en) Big data mining method and system
US20210056085A1 (en) Deduplication of data via associative similarity search
CN112347477A (en) Family variant malicious file mining method and device
US9817855B2 (en) Method and system for determining a measure of overlap between data entries
US20220171815A1 (en) System and method for generating filters for k-mismatch search
US11106703B1 (en) Clustering of structured and semi-structured data
Lee et al. Similar pair identification using locality-sensitive hashing technique
Van Dam et al. Duplicate detection in web shops using LSH to reduce the number of computations
Moia et al. A comparative analysis about similarity search strategies for digital forensics investigations
US9830355B2 (en) Computer-implemented method of performing a search using signatures
CN114332745A (en) Near-repetitive video big data cleaning method based on deep neural network
CN113742344A (en) Method and device for indexing power system data
CN116738009B (en) Method for archiving and backtracking data
CN110895573A (en) Retrieval method and device
CN114328076B (en) Log information extraction method, device, computer equipment and storage medium
Tabona et al. Exploring solutions put forth to solve computer forensic investigations of large storage media.
Rajathi et al. Multipoint Bitmap Filter for Large Volume Data Query Processing
CN117216239A (en) Text deduplication method, text deduplication device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant