CN112580027A

CN112580027A - Malicious sample determination method and device, storage medium and electronic equipment

Info

Publication number: CN112580027A
Application number: CN202011477799.5A
Authority: CN
Inventors: 鲍青波
Original assignee: Beijing Topsec Technology Co Ltd; Beijing Topsec Network Security Technology Co Ltd; Beijing Topsec Software Co Ltd
Current assignee: Beijing Topsec Technology Co Ltd; Beijing Topsec Network Security Technology Co Ltd; Beijing Topsec Software Co Ltd
Priority date: 2020-12-15
Filing date: 2020-12-15
Publication date: 2021-03-30

Abstract

Embodiments of the present disclosure provide a method, a device, a storage medium, and an electronic device for determining a malicious sample. The method comprises: acquiring network behavior data of two samples to obtain two sample files, wherein the network behavior data is behavior data generated when communicating based on a network protocol; determining the matching degree of the two sample files according to an ESIM algorithm; detecting whether the matching degree is greater than a preset matching degree threshold; and determining that the two sample files are malicious samples if the matching degree is greater than the preset matching degree threshold. Because the matching degree of the sample files is calculated and the two sample files are determined to be malicious samples when it exceeds the preset threshold, the analysis process is efficient and accurate, hacker intrusion can be effectively inhibited, and network security is improved.

Description

Malicious sample determination method and device, storage medium and electronic equipment

Technical Field

The present disclosure relates to the field of network security, and in particular, to a method and an apparatus for determining a malicious sample, a storage medium, and an electronic device.

Background

When performing correlation analysis of attackers, samples (i.e., malicious samples) are often an effective means of correlating an attacker or the attacker's behavior. The attack means and tools of the attacker can be learned through sample analysis. However, an attacker can evade automatic correlation analysis by continuously changing small details, such as changing a single byte or changing where some IOCs (indicators of compromise) point, which makes effective analysis difficult.

In the existing scheme for sample correlation analysis, the following two schemes are generally adopted:

In the first scheme, static file comparison detects similarity in a manner similar to text comparison. In many cases, however, the original code of a sample cannot be obtained, so the comparison is performed in a simplified way. For example, files can be compared quickly by their hash values, but changing a single byte of a sample file produces a different MD5 or other hash value, so files that are essentially the same go undetected. As another example, a malicious sample is decompiled to obtain smali files, class names and method names are extracted, and combinations of class name and method name are used as feature dimensions for comparing samples.

In the second scheme, analysis is based on modeling the behavior of a malicious sample or malicious software on the terminal: behavior similarity is analyzed from the sequence of calls made at the operating system API (Application Programming Interface) level, in combination with a machine learning algorithm. In many cases, however, acquiring behavior data from the terminal is prohibited, so this method cannot be widely applied.

In short, static file comparison is easily disturbed: if a single byte in a file changes, or even if only one space is added, the hash value differs, which leads to the erroneous conclusion that the two files are dissimilar, so the method has significant limitations in application. Correlation analysis based on terminal behavior data suffers from the difficulty of obtaining that data and often cannot be applied. Existing schemes therefore cannot analyze samples effectively, do little to inhibit hacker attacks, and cannot guarantee network security.
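The sensitivity of hash comparison to a one-byte change can be illustrated with a minimal sketch. The byte strings below are invented stand-ins for sample files, and the helper name `md5_hex` is hypothetical.

```python
import hashlib

# Two hypothetical "sample files" that differ only by one trailing space.
original = b"GET /update.php?id=42 HTTP/1.1"
modified = original + b" "


def md5_hex(data: bytes) -> str:
    """Return the hexadecimal MD5 digest of the given bytes."""
    return hashlib.md5(data).hexdigest()


digest_original = md5_hex(original)
digest_modified = md5_hex(modified)

# Exact-hash comparison treats the two files as unrelated,
# even though their contents are nearly identical.
hashes_equal = digest_original == digest_modified

print(digest_original)
print(digest_modified)
print("hashes equal:", hashes_equal)
```

A comparison of this kind yields only "identical" or "different", which is why it cannot express that two samples are nearly the same.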

Disclosure of Invention

In view of this, the embodiments of the present disclosure provide a method, an apparatus, a storage medium, and an electronic device for determining a malicious sample, so as to solve the following problems in the prior art: existing schemes cannot analyze samples effectively, do little to inhibit hacker attacks, and cannot guarantee network security.

In one aspect, an embodiment of the present disclosure provides a method for determining a malicious sample, including: acquiring network behavior data of two samples to obtain two sample files, wherein the network behavior data is behavior data generated when communicating based on a network protocol; determining the matching degree of the two sample files according to an ESIM (Enhanced Sequential Inference Model, a short-text matching) algorithm; detecting whether the matching degree is greater than a preset matching degree threshold; and determining that the two sample files are malicious samples if the matching degree is greater than the preset matching degree threshold.

In some embodiments, the determining a degree of matching of two sample files according to the ESIM algorithm includes: performing word segmentation processing on the network behavior data in each sample file according to a preset classification rule so as to determine sample data in a set with different attributes according to word segmentation results; calculating the matching degree of sample data in a first set in a first sample file and sample data in a second set in a second sample file by an ESIM algorithm, wherein the first set and the second set have the same attribute; and determining the matching degree of the first sample file and the second sample file according to the matching degree of each attribute.

In some embodiments, the determining the matching degree of the first sample file and the second sample file according to the matching degree of each attribute includes: and determining the matching degree of the first sample file and the second sample file according to the preset weight value of each attribute and the matching degree of each attribute.

In some embodiments, before determining the matching degree of the two sample files according to the ESIM algorithm, the method further includes: and performing de-duplication processing on the network behavior data in each sample file, so that the network behavior data in a single sample file is not duplicated.

On the other hand, an embodiment of the present disclosure provides a device for determining a malicious sample, including: the acquisition module is used for acquiring network behavior data of the two samples to obtain two sample files, wherein the network behavior data is behavior data when communication is carried out based on a network protocol; the calculation module is used for determining the matching degree of the two sample files according to an ESIM algorithm; the detection module is used for detecting whether the matching degree is greater than a preset matching degree threshold value; and the determining module is used for determining that the two sample files are malicious samples under the condition that the matching degree is greater than the preset matching degree threshold value.

In some embodiments, the calculation module comprises: the word segmentation unit is used for performing word segmentation processing on the network behavior data in each sample file according to a preset classification rule so as to determine sample data in a set with different attributes according to word segmentation results; the computing unit is used for computing the matching degree of sample data in a first set in a first sample file and sample data in a second set in a second sample file through an ESIM algorithm, wherein the first set and the second set have the same attribute; and the determining unit is used for determining the matching degree of the first sample file and the second sample file according to the matching degree of each attribute.

In some embodiments, the determining unit is specifically configured to: and determining the matching degree of the first sample file and the second sample file according to the preset weight value of each attribute and the matching degree of each attribute.

In some embodiments, further comprising: and the processing module is used for carrying out de-duplication processing on the network behavior data in each sample file so as to ensure that the network behavior data in a single sample file is not duplicated.

In another aspect, an embodiment of the present disclosure provides a storage medium storing a computer program, where the computer program is executed by a processor to implement the method provided in any embodiment of the present disclosure.

On the other hand, the embodiment of the present disclosure provides an electronic device, which at least includes a memory and a processor, where the memory stores a computer program, and the processor implements the method provided in any embodiment of the present disclosure when executing the computer program on the memory.

The embodiment of the disclosure can calculate the matching degree of the sample files, and determine that the two obtained sample files are malicious samples when the matching degree exceeds the preset matching degree threshold value, so that the whole analysis process is efficient and accurate, hacker intrusion can be effectively inhibited, and network security is improved.

Drawings

To illustrate the embodiments of the present disclosure or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present disclosure; those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flowchart of a method for determining a malicious sample according to a first embodiment of the present disclosure;

Fig. 2 is a specific flowchart of a method for determining a malicious sample according to a first embodiment of the present disclosure;

Fig. 3 is a schematic structural diagram of a malicious sample determination apparatus according to a second embodiment of the present disclosure;

Fig. 4 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present disclosure.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described below clearly and completely with reference to the accompanying drawings of the embodiments of the present disclosure. It is to be understood that the described embodiments are only a few embodiments of the present disclosure, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the described embodiments of the disclosure without any inventive step, are within the scope of protection of the disclosure.

Unless otherwise defined, technical or scientific terms used herein shall have the ordinary meaning as understood by one of ordinary skill in the art to which this disclosure belongs. The use of "first," "second," and similar terms in this disclosure is not intended to indicate any order, quantity, or importance, but rather is used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that the element or item listed before the word covers the element or item listed after the word and its equivalents, but does not exclude other elements or items. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", and the like are used merely to indicate relative positional relationships, and when the absolute position of the object being described is changed, the relative positional relationships may also be changed accordingly.

To keep the following description of the embodiments of the present disclosure clear and concise, detailed descriptions of known functions and known components are omitted.

A first embodiment of the present disclosure provides a method for determining a malicious sample, where a flow of the method is shown in fig. 1, and the method includes steps S101 to S104:

S101, obtaining the respective network behavior data of two samples to obtain two sample files, wherein the network behavior data is behavior data generated when communicating based on a network protocol.

Malicious samples generally communicate over various protocols, for example to upload and download data or to send control commands, using common protocols such as TCP, UDP, and HTTP, or protocols such as P2P. The embodiment of the present disclosure takes HTTP protocol data as an example; when obtaining the network behavior data of a malicious sample, the key point is to extract the URL information accessed by the sample.

S102, determining the matching degree of the two sample files according to an ESIM algorithm.

In a specific implementation, the processed data in each sample file is treated as a text, and the matching degree between the two sample files is calculated by the ESIM algorithm. If the amount of data in a sample file is not very large, the matching degree can be calculated over all the data in the sample file together; if the amount of data is very large, the data can first be classified.

If the data are classified, when the matching degree of the two sample files is determined according to an ESIM algorithm, word segmentation processing is firstly carried out on the network behavior data in each sample file according to a preset classification rule, so that sample data in a set with different attributes are determined according to word segmentation results; calculating the matching degree of sample data in a first set in a first sample file and sample data in a second set in a second sample file by an ESIM algorithm, wherein the first set and the second set have the same attribute; and finally, determining the matching degree of the first sample file and the second sample file according to the matching degree of each attribute.

In general, if classification is performed, one sample file has a plurality of attributes, and therefore, a set corresponding to each attribute is finally used as a basis for matching degree calculation, that is, a plurality of sub-matching degrees are calculated for one sample file, and in this case, the sum of all the sub-matching degrees in a single sample file may be used as the matching degree of the entire sample file.

Of course, since different attributes have different degrees of importance for the malicious sample, different weight values may be set for each attribute. Further, when the matching degree of the first sample file and the second sample file is determined based on the matching degree of each attribute, the matching degree of the first sample file and the second sample file may be determined based on a predetermined weight value of each attribute and the matching degree of each attribute.

S103, detecting whether the matching degree is larger than a preset matching degree threshold value.

The predetermined threshold of the degree of match can be set by a person skilled in the art based on empirical values, which will not be described in detail here.

S104, determining that the two sample files are malicious samples if the matching degree is greater than the preset matching degree threshold.

Normally, due to the randomness of network behaviors, the matching degree of the two acquired sample files should be low, and if the matching degree of the two acquired sample files is high and exceeds a preset matching degree threshold value, the two acquired sample files are usually malicious samples. Of course, there may be one sample that has been determined to be a malicious sample in the two samples, and this is considered as a basis for matching, which is also within the scope of the embodiments of the present disclosure.
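Steps S102 to S104 can be sketched as follows, assuming the per-attribute sub-matching degrees have already been computed. The function names, attribute names, weights, and threshold are hypothetical values chosen for illustration, not values taken from the disclosure.

```python
from typing import Dict, Optional


def combine_degrees(
    sub_degrees: Dict[str, float],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Combine per-attribute sub-matching degrees into one file-level degree.

    Without weights, the sub-matching degrees are simply summed.
    With weights, each sub-matching degree is scaled by its attribute weight.
    """
    if weights is None:
        return sum(sub_degrees.values())
    return sum(weights[attr] * degree for attr, degree in sub_degrees.items())


def exceeds_threshold(degree: float, threshold: float) -> bool:
    """S103/S104: the pair is flagged only if the degree exceeds the threshold."""
    return degree > threshold


# Hypothetical sub-matching degrees for three attributes.
sub_degrees = {"domain": 1.0, "path": 0.5, "parameter": 0.0}
# Hypothetical weights reflecting the assumed importance of each attribute.
weights = {"domain": 0.5, "path": 0.3, "parameter": 0.2}

plain_degree = combine_degrees(sub_degrees)
weighted_degree = combine_degrees(sub_degrees, weights)
flagged = exceeds_threshold(weighted_degree, threshold=0.6)

print(plain_degree, weighted_degree, flagged)
```

Note that a plain sum and a weighted sum live on different scales, so the threshold has to be chosen for whichever combination is actually used.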

In specific implementation, in order to reduce the data processing amount, before determining the matching degree of two sample files according to the ESIM algorithm, the network behavior data in each sample file may be subjected to deduplication processing, so that the network behavior data in a single sample file is not duplicated.
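A minimal sketch of the de-duplication step is shown below. It assumes each record of network behavior data is a plain string (here, a URL); the function name and the example URLs are hypothetical.

```python
from typing import Iterable, List


def deduplicate(records: Iterable[str]) -> List[str]:
    """Drop repeated records while keeping the first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+),
    # so duplicates are removed without reshuffling the remaining records.
    return list(dict.fromkeys(records))


# Hypothetical URLs visited by one sample; the first one appears twice.
visited = [
    "http://c2.example/gate.php?id=1",
    "http://c2.example/gate.php?id=1",
    "http://cdn.example/payload.bin",
]
unique_urls = deduplicate(visited)
print(unique_urls)
```

Keeping the original order is a design choice of this sketch; the disclosure only requires that the data in a single sample file not be duplicated.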

The above process is described in detail with reference to specific examples.

The embodiment of the present disclosure provides a method for determining a malicious sample that detects the matching degree based on dynamic network behavior: the ESIM algorithm is used to model the behavior data of malicious samples in network communication, so as to detect similarity between malicious samples. Because detection is based on the dynamic behavior of samples, an attacker cannot bypass it by changing the static characteristics of a sample; and because the data used is network traffic data and a deep learning algorithm is applied, both general applicability in practice and accuracy are higher.

The embodiment of the present disclosure uses all URL information lists accessed by two malicious sample files as raw data input, and the main processes are as follows (1) to (4):

(1) network behavior data is obtained.

(2) Data preprocessing. The acquired data is de-duplicated to obtain the full set of URLs accessed by each sample, and each URL in the set is then segmented as part of preprocessing the input data. Each URL is divided into three categories: a domain name part, a path part, and a parameter part. Because URL links are limited to English letters, digits, and a few symbols such as "-" and "?", word segmentation is simple and only requires splitting on the separator "/".

(3) The degree of match between the two samples is calculated.

First, the similarity between each pair of sentences in the two texts is calculated; then different weights are set according to category; finally, a weighted calculation yields the matching degree of the two texts. The matching degrees of the domain name part, the path part, and the parameter part are calculated separately. The calculation is explained below using the domain name part as an example; the other parts are calculated similarly and are not described again here.

In the disclosed embodiment, the following calculation process is performed using the ESIM algorithm:

A malicious sample may visit multiple URLs within a period of time. Suppose that, after the visited URLs of two samples A and B are split, the domain name text lists are $L_A$ and $L_B$, with list lengths $M$ and $N$ respectively.

For $L_A$ and $L_B$, the ESIM algorithm is used to detect whether each pair of domain names matches; the results are then summed and averaged to obtain the classification matching degree of the domain name part, recorded as follows (written out here from the sum-and-average description above):

$$S_{\text{domain}} = \frac{1}{M \times N} \sum_{i=1}^{M} \sum_{j=1}^{N} s(L_{A_i}, L_{B_j})$$

where $s(L_{A_i}, L_{B_j})$ indicates whether the $i$-th domain name in $L_A$ matches the $j$-th domain name in $L_B$, and takes the value 0 or 1.

Following the same calculation process, the classification matching degrees of the path part and the parameter part are calculated and recorded as $S_{\text{path}}$ and $S_{\text{parameter}}$; their expressions are analogous to that of $S_{\text{domain}}$.

Different weight values are set for the three categories (domain name part, path part, and parameter part), and the comprehensive matching degree is calculated and recorded as:

$$S = w_1 \times S_{\text{domain}} + w_2 \times S_{\text{path}} + w_3 \times S_{\text{parameter}}$$

where $w_1$, $w_2$, and $w_3$ are the weight values of the respective categories. Let $a$ be the comprehensive threshold; when $S > a$, the two samples are judged to match.

(4) And outputting the result.

The above processes (1) - (4) can be schematically shown in fig. 2, and fig. 2 does not limit the embodiment of the present disclosure.
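The preprocessing in step (2) can be sketched as follows. This is an illustration under assumptions: it uses `urllib.parse.urlsplit` from the Python standard library as a convenience in place of the plain "/"-based splitting described above, and the function names, category keys, and example URLs are hypothetical.

```python
from typing import Dict, Iterable, List
from urllib.parse import urlsplit


def segment_url(url: str) -> Dict[str, str]:
    """Split one URL into its domain name, path, and parameter parts."""
    parts = urlsplit(url)
    return {
        "domain": parts.hostname or "",
        "path": parts.path,
        "parameter": parts.query,
    }


def build_category_lists(urls: Iterable[str]) -> Dict[str, List[str]]:
    """De-duplicate the URLs, then collect each part into its category list."""
    lists: Dict[str, List[str]] = {"domain": [], "path": [], "parameter": []}
    for url in dict.fromkeys(urls):  # removes repeated URLs, keeps order
        for category, text in segment_url(url).items():
            if text:  # skip parts that are absent, e.g. a URL with no query
                lists[category].append(text)
    return lists


# Hypothetical URLs accessed by one sample; the first one is repeated.
sample_urls = [
    "http://c2.example/gate.php?id=1&os=win",
    "http://c2.example/gate.php?id=1&os=win",
    "http://c2.example/cfg/update",
]
category_lists = build_category_lists(sample_urls)
print(category_lists)
```

Each category list then plays the role of one text list (such as the domain name list of a sample) in the matching calculation of step (3).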

In summary, the embodiment of the disclosure utilizes the ESIM algorithm to perform modeling through behavior data of the malicious sample on network communication, thereby achieving the goal of malicious sample similarity detection; meanwhile, in the detection process, data are split according to the access characteristics, the data are divided into different text sentence categories, matching is carried out among the different categories, and different weight values are set, so that the detection process is more suitable for the actual safe environment, and the detection accuracy rate of malicious sample similarity is improved.
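The per-category averaging and the weighted combination of step (3) can be sketched as below. The pairwise matcher `exact_match` is only a placeholder returning 0 or 1; it stands in for the ESIM model, which is not implemented here. The weights, threshold, and example lists are hypothetical.

```python
from typing import Callable, Dict, List

# A pairwise matcher returns 1 if two text items match, otherwise 0.
PairMatcher = Callable[[str, str], int]


def category_degree(list_a: List[str], list_b: List[str], s: PairMatcher) -> float:
    """Average the 0/1 match results over all M x N pairs of one category."""
    if not list_a or not list_b:
        return 0.0  # assumption: an empty category contributes no match
    total = sum(s(a, b) for a in list_a for b in list_b)
    return total / (len(list_a) * len(list_b))


def overall_degree(
    lists_a: Dict[str, List[str]],
    lists_b: Dict[str, List[str]],
    weights: Dict[str, float],
    s: PairMatcher,
) -> float:
    """Weighted sum of the per-category matching degrees."""
    return sum(
        w * category_degree(lists_a[c], lists_b[c], s) for c, w in weights.items()
    )


def exact_match(a: str, b: str) -> int:
    """Placeholder for the ESIM decision: 1 if the strings are equal, else 0."""
    return 1 if a == b else 0


lists_a = {"domain": ["c2.example", "cdn.example"], "path": ["/gate.php"], "parameter": ["id=1"]}
lists_b = {"domain": ["c2.example"], "path": ["/gate.php", "/cfg"], "parameter": ["id=2"]}
weights = {"domain": 0.5, "path": 0.3, "parameter": 0.2}  # hypothetical w1, w2, w3

S = overall_degree(lists_a, lists_b, weights, exact_match)
matched = S > 0.3  # hypothetical comprehensive threshold a
print(S, matched)
```

Because the matcher is passed in as a function, a trained model could replace `exact_match` without changing the averaging or weighting logic.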

The second embodiment of the present disclosure also provides a device for determining a malicious sample, where a structural schematic of the device is shown in fig. 3, and the device includes:

an acquisition module 10, configured to acquire the respective network behavior data of two samples to obtain two sample files, wherein the network behavior data is behavior data generated when communicating based on a network protocol; a calculation module 20, coupled to the acquisition module 10 and configured to determine the matching degree of the two sample files according to the ESIM algorithm; a detection module 30, coupled to the calculation module 20 and configured to detect whether the matching degree is greater than a preset matching degree threshold; and a determining module 40, coupled to the detection module 30 and configured to determine that the two sample files are malicious samples if the matching degree is greater than the preset matching degree threshold.

In a specific implementation, the calculation module may include: a word segmentation unit, configured to perform word segmentation processing on the network behavior data in each sample file according to a preset classification rule, so as to determine, from the word segmentation results, the sample data in sets with different attributes; a calculation unit, coupled to the word segmentation unit and configured to calculate, by the ESIM algorithm, the matching degree between sample data in a first set of a first sample file and sample data in a second set of a second sample file, wherein the first set and the second set have the same attribute; and a determining unit, coupled to the calculation unit and configured to determine the matching degree of the first sample file and the second sample file according to the matching degree of each attribute.

In general, if classification is performed, one sample file has a plurality of attributes, and therefore, a set corresponding to each attribute is finally used as a basis for matching degree calculation, that is, a plurality of sub-matching degrees are calculated for one sample file, and in this case, the sum of all the sub-matching degrees in a single sample file may be used as the matching degree of the entire sample file. The determining unit is specifically configured to: and determining the matching degree of the first sample file and the second sample file according to the preset weight value of each attribute and the matching degree of each attribute.

In a specific implementation, in order to reduce the amount of data to be processed, the apparatus may further include a processing module, coupled to the acquisition module and the calculation module and configured to perform de-duplication processing on the network behavior data in each sample file, so that the network behavior data in a single sample file is not duplicated.

A third embodiment of the present disclosure provides a storage medium, which is a computer-readable medium storing a computer program, which when executed by a processor implements the method provided in any embodiment of the present disclosure, including the following steps S11 to S14:

S11, acquiring network behavior data of two samples to obtain two sample files, wherein the network behavior data is behavior data generated when communicating based on a network protocol;

S12, determining the matching degree of the two sample files according to the ESIM algorithm;

S13, detecting whether the matching degree is greater than a preset matching degree threshold;

S14, determining that the two sample files are malicious samples if the matching degree is greater than the preset matching degree threshold.

When the computer program is executed by the processor to determine the matching degree of the two sample files according to the ESIM algorithm, the processor specifically executes the following steps: performing word segmentation processing on the network behavior data in each sample file according to a preset classification rule so as to determine sample data in a set with different attributes according to word segmentation results; calculating the matching degree of sample data in a first set in a first sample file and sample data in a second set in a second sample file by an ESIM algorithm, wherein the first set and the second set have the same attribute; and determining the matching degree of the first sample file and the second sample file according to the matching degree of each attribute.

When the computer program is executed by the processor to determine the matching degree of the first sample file and the second sample file according to the matching degree of each attribute, the computer program is specifically executed by the processor to: and determining the matching degree of the first sample file and the second sample file according to the preset weight value of each attribute and the matching degree of each attribute.

Before the step of determining the matching degree of the two sample files according to the ESIM algorithm is executed by the processor, the computer program further executes the following steps by the processor: and performing de-duplication processing on the network behavior data in each sample file, so that the network behavior data in a single sample file is not duplicated.

Optionally, in this embodiment, the storage medium may include, but is not limited to: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes. Optionally, in this embodiment, the processor executes the method steps described in the above embodiments according to the program code stored in the storage medium. Optionally, the specific examples in this embodiment may refer to the examples described in the above embodiments and optional implementation manners, and this embodiment is not described herein again. It will be apparent to those skilled in the art that the modules or steps of the present disclosure described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and alternatively, they may be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. As such, the present disclosure is not limited to any specific combination of hardware and software.

A fourth embodiment of the present disclosure provides an electronic device, a schematic structural diagram of which may be as shown in Fig. 4. The electronic device includes at least a memory 901 and a processor 902; the memory 901 stores a computer program, and the processor 902 implements the method provided in any embodiment of the present disclosure when executing the computer program on the memory 901. Illustratively, the steps of the computer program of the electronic device are the following S21 to S24:

S21, acquiring network behavior data of two samples to obtain two sample files, wherein the network behavior data is behavior data generated when communicating based on a network protocol;

S22, determining the matching degree of the two sample files according to the ESIM algorithm;

S23, detecting whether the matching degree is greater than a preset matching degree threshold;

S24, determining that the two sample files are malicious samples if the matching degree is greater than the preset matching degree threshold.

When the processor executes the computer program which is stored in the memory and used for determining the matching degree of the two sample files according to the ESIM algorithm, the following computer program is specifically executed: performing word segmentation processing on the network behavior data in each sample file according to a preset classification rule so as to determine sample data in a set with different attributes according to word segmentation results; calculating the matching degree of sample data in a first set in a first sample file and sample data in a second set in a second sample file by an ESIM algorithm, wherein the first set and the second set have the same attribute; and determining the matching degree of the first sample file and the second sample file according to the matching degree of each attribute.

When the processor executes the computer program stored in the memory and used for determining the matching degree of the first sample file and the second sample file according to the matching degree of each attribute, the following computer programs are specifically executed: and determining the matching degree of the first sample file and the second sample file according to the preset weight value of each attribute and the matching degree of each attribute.

The processor, before executing the computer program stored on the memory for determining the degree of matching of two sample files according to the ESIM algorithm, further executes the computer program of: and performing de-duplication processing on the network behavior data in each sample file, so that the network behavior data in a single sample file is not duplicated.

Moreover, although exemplary embodiments have been described herein, the scope thereof includes any and all embodiments based on the disclosure with equivalent elements, modifications, omissions, combinations (e.g., of various embodiments across), adaptations or alterations. The elements of the claims are to be interpreted broadly based on the language employed in the claims and not limited to examples described in the present specification or during the prosecution of the application, which examples are to be construed as non-exclusive. It is intended, therefore, that the specification and examples be considered as exemplary only, with a true scope and spirit being indicated by the following claims and their full scope of equivalents.

The above description is intended to be illustrative and not restrictive. For example, the above-described examples (or one or more versions thereof) may be used in combination with each other. For example, other embodiments may be used by those of ordinary skill in the art upon reading the above description. In addition, in the foregoing detailed description, various features may be grouped together to streamline the disclosure. This should not be interpreted as an intention that a disclosed feature not claimed is essential to any claim. Rather, the subject matter of the present disclosure may lie in less than all features of a particular disclosed embodiment. Thus, the following claims are hereby incorporated into the detailed description as examples or embodiments, with each claim standing on its own as a separate embodiment, and it is contemplated that these embodiments may be combined with each other in various combinations or permutations. The scope of the disclosure should be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.

While the present disclosure has been described in detail with reference to the embodiments, the present disclosure is not limited to the specific embodiments, and those skilled in the art can make various modifications and alterations based on the concept of the present disclosure, and the modifications and alterations should fall within the scope of the present disclosure as claimed.

Claims

1. A method for determining a malicious sample, comprising:

acquiring network behavior data of two samples to obtain two sample files, wherein the network behavior data is behavior data generated when communication is carried out based on a network protocol;

determining the matching degree of the two sample files according to an ESIM algorithm;

detecting whether the matching degree is larger than a preset matching degree threshold value;

and determining that the two sample files are malicious samples if the matching degree is greater than the preset matching degree threshold value.

2. The method for determining a malicious sample according to claim 1, wherein determining the matching degree of the two sample files according to the ESIM algorithm includes:

performing word segmentation processing on the network behavior data in each sample file according to a preset classification rule so as to determine sample data in a set with different attributes according to word segmentation results;

calculating the matching degree of sample data in a first set in a first sample file and sample data in a second set in a second sample file by an ESIM algorithm, wherein the first set and the second set have the same attribute;

and determining the matching degree of the first sample file and the second sample file according to the matching degree of each attribute.

3. The method for determining a malicious sample according to claim 2, wherein determining the matching degree of the first sample file and the second sample file according to the matching degree of each attribute includes:

determining the matching degree of the first sample file and the second sample file according to the preset weight value of each attribute and the matching degree of each attribute.

4. The method for determining a malicious sample according to any one of claims 1 to 3, wherein before determining the matching degree of the two sample files according to the ESIM algorithm, the method further comprises:

performing de-duplication processing on the network behavior data in each sample file, so that no network behavior data is repeated within a single sample file.

5. An apparatus for determining a malicious sample, comprising:

the acquisition module is used for acquiring network behavior data of the two samples to obtain two sample files, wherein the network behavior data is behavior data when communication is carried out based on a network protocol;

the calculation module is used for determining the matching degree of the two sample files according to an ESIM algorithm;

the detection module is used for detecting whether the matching degree is greater than a preset matching degree threshold value;

and the determining module is used for determining that the two sample files are malicious samples under the condition that the matching degree is greater than the preset matching degree threshold value.

6. The apparatus for determining a malicious sample according to claim 5, wherein the calculation module includes:

the word segmentation unit is used for performing word segmentation processing on the network behavior data in each sample file according to a preset classification rule so as to determine sample data in a set with different attributes according to word segmentation results;

the computing unit is used for computing the matching degree of sample data in a first set in a first sample file and sample data in a second set in a second sample file through an ESIM algorithm, wherein the first set and the second set have the same attribute;

and the determining unit is used for determining the matching degree of the first sample file and the second sample file according to the matching degree of each attribute.

7. The apparatus for determining a malicious sample according to claim 6, wherein the determining unit is specifically configured to determine the matching degree of the first sample file and the second sample file according to the preset weight value of each attribute and the matching degree of each attribute.

8. The apparatus for determining a malicious sample according to any one of claims 5 to 7, further comprising:

a processing module used for performing de-duplication processing on the network behavior data in each sample file, so that no network behavior data is repeated within a single sample file.

9. A storage medium storing a computer program, characterized in that the computer program realizes the steps of the method of any one of claims 1 to 4 when executed by a processor.

10. An electronic device comprising at least a memory and a processor, the memory having a computer program stored thereon, wherein the processor, when executing the computer program on the memory, carries out the steps of the method of any one of claims 1 to 4.