CN111988341A

CN111988341A - Data processing method, apparatus, computer system and storage medium

Info

Publication number: CN111988341A
Application number: CN202010950419.9A
Authority: CN
Inventors: 白敏�; 黄朝文; 白皓文; 汪列军
Original assignee: Secworld Information Technology Beijing Co Ltd; Qax Technology Group Inc
Current assignee: Secworld Information Technology Beijing Co Ltd; Qax Technology Group Inc
Priority date: 2020-09-10
Filing date: 2020-09-10
Publication date: 2020-11-24
Anticipated expiration: 2040-09-10
Also published as: CN111988341B

Abstract

The present disclosure provides a data processing method, including: acquiring security-related raw data; extracting a plurality of entity objects from the raw data; processing the plurality of entity objects by using a threat intelligence data set to obtain respective tag features of each of the plurality of entity objects, wherein the tag features characterize security attributes and/or malicious attributes of the entity object; and determining threat information of each entity object according to the tag features of that entity object. The present disclosure also provides a data processing apparatus, a computer system, and a storage medium.

Description

Data processing method, apparatus, computer system and storage medium

Technical Field

The present disclosure relates to the technical field of network security, and more particularly to a data processing method, apparatus, computer system, and storage medium.

Background

With the rapid development of computer and Internet technology, network security has increasingly become a focus of attention. Attack methods are growing more diverse, and malicious information may exist in network data streams in many forms. It is therefore particularly important to identify and mark malicious information in network data.

In the related art, information about malicious families and attack groups is incomplete, and identifying and marking malicious information in network data relies on manual review, which is inefficient and prone to misjudgment.

Summary of the Invention

In view of this, the present disclosure provides a data processing method, apparatus, computer system, and storage medium.

One aspect of the present disclosure provides a data processing method, including: acquiring security-related raw data; extracting a plurality of entity objects from the raw data; processing the plurality of entity objects by using a threat intelligence data set to obtain respective tag features of each of the plurality of entity objects, wherein the tag features characterize security attributes and/or malicious attributes of the entity object; and determining threat information of each entity object according to the tag features of that entity object.

According to an embodiment of the present disclosure, the threat intelligence data set includes a plurality of knowledge bases. Processing the plurality of entity objects by using the threat intelligence data set includes: for each of the plurality of entity objects, processing the entity object by using at least one of the plurality of knowledge bases to obtain a processing result from each of the at least one knowledge base, wherein each knowledge base includes a plurality of entity objects marked with tag features; and determining the tag features of the entity object according to the processing result of each knowledge base.

According to an embodiment of the present disclosure, processing the entity object by using at least one of the plurality of knowledge bases includes: for each of the at least one knowledge base, determining whether the knowledge base includes a target entity object identical to the entity object; and if it is determined that the knowledge base includes such a target entity object, marking the entity object with the tag features of the target entity object.

According to an embodiment of the present disclosure, determining the threat information of each entity object according to its tag features includes: for each entity object, processing the tag features of the entity object by using a network model to obtain the threat information of the entity object.

According to an embodiment of the present disclosure, the entity object includes a file, and the method further includes: performing feature extraction on the file to obtain static features and dynamic features of the file; running the file in a sandbox to obtain behavioral features of the file; and determining threat information of the file according to at least one of the static features, dynamic features, behavioral features, and tag features of the file.

According to an embodiment of the present disclosure, the method further includes: integrating the entity objects that have been processed by using the threat intelligence data set; and associating the integrated entity objects according to their tag features to obtain an entity object relationship data set.

According to an embodiment of the present disclosure, the method further includes: acquiring a new entity object; processing the new entity object by using the entity object relationship data set to obtain tag features of the new entity object; and determining threat information of the new entity object according to its tag features.

According to an embodiment of the present disclosure, the entity object includes at least one of a file, a domain name, an IP address, and a web page address; the at least one knowledge base includes at least one of a whitelist library, a blacklist library, a registered domain name library, a reputation file library, a compromised host library, and a reputation IP library; and the threat information includes at least one of a malicious type, attacker information, and an attack method.

Another aspect of the present disclosure provides a data processing apparatus, including: a first acquisition module configured to acquire security-related raw data; a first extraction module configured to extract a plurality of entity objects from the raw data; a first processing module configured to process the plurality of entity objects by using a threat intelligence data set to obtain respective tag features of each of the plurality of entity objects, wherein the tag features characterize security attributes and/or malicious attributes of the entity object; and a determining module configured to determine threat information of each entity object according to the tag features of that entity object.

Another aspect of the present disclosure provides a computer-readable storage medium storing computer-executable instructions which, when executed, implement the method described above.

Another aspect of the present disclosure provides a computer program including computer-executable instructions which, when executed, implement the method described above.

Another aspect of the present disclosure provides a computer system, including: one or more processors; and a storage device for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method described above.

According to embodiments of the present disclosure, a plurality of entity objects are extracted from raw data, the entity objects are processed by using a threat intelligence data set to obtain the tag features of each entity object, and the threat information of each entity object is determined according to its tag features. Because the threat intelligence data set is used to process the entity objects and to tag their malicious information, the technical problem in the related art of low efficiency and low accuracy caused by relying on manual review to identify and mark malicious information is at least partially overcome, thereby achieving the technical effect of improving the efficiency and accuracy of identifying and marking malicious information.

Brief Description of the Drawings

The above and other objects, features, and advantages of the present disclosure will become more apparent from the following description of embodiments of the present disclosure with reference to the accompanying drawings, in which:

FIG. 1 schematically shows an exemplary system architecture to which the data processing method of an embodiment of the present disclosure can be applied;

FIG. 2 schematically shows a flowchart of a data processing method according to an embodiment of the present disclosure;

FIG. 3 schematically shows a flowchart of a method for processing a plurality of entity objects by using a threat intelligence data set according to an embodiment of the present disclosure;

FIG. 4 schematically shows a flowchart of a method for processing an entity object by using at least one of a plurality of knowledge bases according to an embodiment of the present disclosure;

FIG. 5 schematically shows a flowchart of a data processing method according to another embodiment of the present disclosure;

FIG. 6 schematically shows a flowchart of a data processing method according to another embodiment of the present disclosure;

FIG. 7 schematically shows a flowchart of a data processing method according to another embodiment of the present disclosure;

FIG. 8 schematically shows a block diagram of a data processing apparatus according to an embodiment of the present disclosure; and

FIG. 9 schematically shows a block diagram of a computer system according to an embodiment of the present disclosure.

Detailed Description

Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood, however, that these descriptions are exemplary only and are not intended to limit the scope of the present disclosure. In the following detailed description, for ease of explanation, numerous specific details are set forth to provide a thorough understanding of the embodiments of the present disclosure. It will be apparent, however, that one or more embodiments may be practiced without these specific details. In addition, descriptions of well-known structures and techniques are omitted below to avoid unnecessarily obscuring the concepts of the present disclosure.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the present disclosure. The terms "including", "comprising", and the like as used herein indicate the presence of the stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.

Unless otherwise defined, all terms (including technical and scientific terms) used herein have the meanings commonly understood by those skilled in the art. It should be noted that terms used herein should be construed as having meanings consistent with the context of this specification and should not be construed in an idealized or overly rigid manner.

Where an expression such as "at least one of A, B, and C, etc." is used, it should generally be interpreted as commonly understood by those skilled in the art (for example, "a system having at least one of A, B, and C" shall include, but not be limited to, systems having A alone, B alone, C alone, A and B, A and C, B and C, and/or A, B, and C). Where an expression such as "at least one of A, B, or C, etc." is used, it should likewise be interpreted as commonly understood by those skilled in the art (for example, "a system having at least one of A, B, or C" shall include, but not be limited to, systems having A alone, B alone, C alone, A and B, A and C, B and C, and/or A, B, and C).

An APT (Advanced Persistent Threat), also called a "targeted" attack, is a new type of attack that is organized, aimed at specific targets, and extremely long in duration. APT attacks continue unabated and malicious information emerges endlessly, while security vendors continually use various tools to detect network attacks in order to track and locate malicious information.

Security vendors generally maintain an internal operation platform for network attack detection. The platform may run a variety of detection tools, for example detecting attacks with a sandbox, detecting attacks with a web crawler, and detecting attacks from sandbox run logs, web crawler logs, and the like. All of these, however, require manual participation, such as manually setting sandbox operating conditions and manually writing crawler code for different situations. The detection results also need further manual analysis and review, which is inefficient and may lead to misjudgments or omissions.

Based on this, embodiments of the present disclosure provide a data processing method and apparatus. The method includes: acquiring security-related raw data; extracting a plurality of entity objects from the raw data; processing the plurality of entity objects by using a threat intelligence data set to obtain respective tag features of each of the plurality of entity objects, wherein the tag features characterize security attributes and/or malicious attributes of the entity object; and determining threat information of each entity object according to the tag features of that entity object.

FIG. 1 schematically shows an exemplary system architecture 100 to which the data processing method of an embodiment of the present disclosure can be applied. It should be noted that FIG. 1 is only an example of a system architecture to which embodiments of the present disclosure can be applied, intended to help those skilled in the art understand the technical content of the present disclosure; it does not mean that the embodiments cannot be used in other devices, systems, environments, or scenarios.

As shown in FIG. 1, the system architecture 100 according to this embodiment may include an electronic device 101, a detection platform 102 based on a threat intelligence data set, and a manual operation marking platform 103.

After acquiring security-related raw data, the electronic device 101 may first perform data extraction to obtain a plurality of IOC (Indicator of Compromise) entity objects. IOC entity objects may be of various types, such as file, IP, domain name, HOST, and web page address. A file may be represented by its MD5 (Message Digest Algorithm) digest, and a web page address may include a main domain name HOST, a URL (Uniform Resource Locator), and a URI (Uniform Resource Identifier). The IOC data may then be checked for malicious information by the detection platform 102 and by the manual operation marking platform 103. Specifically, the raw data may be a network data stream, a file, a message, or the like.
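As an illustration of the IOC representations mentioned above, the following minimal Python sketch computes the MD5 digest of a file's content and splits a web page address into HOST, URL, and URI parts. The function names are hypothetical, and treating the URI as "path plus query string" is an assumption made for this example; the disclosure does not define the split.

```python
import hashlib
from urllib.parse import urlsplit


def file_md5(content: bytes) -> str:
    """Return the hex MD5 digest used here to identify a file IOC."""
    return hashlib.md5(content).hexdigest()


def split_web_address(url: str) -> dict:
    """Split a web address into HOST, full URL, and URI.

    Assumption: the URI is taken to be the path plus the query string.
    """
    parts = urlsplit(url)
    uri = parts.path or "/"
    if parts.query:
        uri += "?" + parts.query
    # urlsplit lowercases the hostname, which also normalizes the HOST field.
    return {"host": parts.hostname, "url": url, "uri": uri}
```

A caller might pass the bytes of a captured sample to `file_md5` and each observed address to `split_web_address`, then store the results as separate IOC entity objects.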

The detection platform 102 may include a plurality of knowledge bases, each of which may be a database storing threat intelligence data of a different business type, such as a whitelist database, a blacklist database, a compromised host database, a reputation file library, a reputation IP library, and a registered domain name library.

The threat intelligence data may be data obtained by processing, analyzing, judging, and/or marking source data of multiple dimensions. The source data of different dimensions may include open source intelligence data, commercial intelligence data (such as attack events published by security vendors), operating data of security products (such as attack alarm information), data output by a deep file parsing engine (such as file types), verdicts from multi-engine virus scanning, and manual operation data (such as crawler data and sandbox logs).

Security entity objects, such as IPs, network addresses, domain names, and files, are extracted from these source data. Through field comparison and extraction, useful attribute information is obtained for each entity object, such as security attributes, malicious attributes, and context attributes. Whether an object is malicious is determined according to its attributes, and the object is marked accordingly. After an entity object has been judged and marked, it can be stored in a database, thereby forming the threat intelligence data set.

The manual operation marking platform 103 runs web crawler tools, sandbox tools, and the like. It can be used to assist the detection performed by the detection platform 102, and can also adjust and correct the detection results of the detection platform 102. The manual operation marking platform 103 mainly uses web crawler data, detection logs, network activity information obtained from sandbox runs, and the like to identify malicious information, which is then judged and marked manually.

An IOC entity object is checked by the detection platform 102 and/or the manual operation marking platform 103, and at each detection stage it is tagged according to the detection result. The tag can characterize the security attributes and/or malicious attributes of the IOC object. For example, a file may be checked against the whitelist database, blacklist library, and/or reputation file library in the detection platform 102; if the file is malicious, it can be marked with its file type, attack method, and so on, for example with tags such as "black file" or "used in a certain exploit attack". After the file is checked by the manual operation marking platform 103, if it is found to be a malicious sample generated by a certain malicious family, it can be tagged to indicate that the attacker is that known malicious family.

It should be noted that the data processing method provided by the embodiments of the present disclosure may generally be executed by the electronic device 101. Correspondingly, the data processing apparatus provided by the embodiments of the present disclosure may generally be provided in the electronic device 101. The data processing method may also be executed by a server or server cluster that is different from the electronic device 101 and capable of communicating with it. Correspondingly, the data processing apparatus may also be provided in such a server or server cluster.

It should be understood that the numbers of electronic devices 101, detection platforms 102, knowledge bases in the detection platform 102, manual operation marking platforms 103, and detection tools in the manual operation marking platform 103 shown in FIG. 1 are merely illustrative. There may be any number of each, according to implementation needs.

FIG. 2 schematically shows a flowchart of a data processing method according to an embodiment of the present disclosure.

As shown in FIG. 2, the method includes operations S201 to S204.

In operation S201, security-related raw data is acquired.

In operation S202, a plurality of entity objects are extracted from the raw data.

According to an embodiment of the present disclosure, the raw data may be, for example, a network data stream or a service data packet. IOC entity objects are extracted from the raw data and may include MD5, IP, domain name, HOST, URL, and the like. The IOC data may also undergo normalization, denoising, deduplication, field completion, and similar processing to standardize the data.
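The extraction and cleanup described for operation S202 can be sketched as follows. This is a minimal illustration, not the method of the disclosure: the regular expressions, the `extract_iocs` name, and the specific cleanup rules (lowercasing MD5 values, dropping out-of-range IP octets, removing duplicates) are all assumptions chosen for the example.

```python
import re

# Hypothetical patterns for three IOC types; real extractors are stricter.
IOC_PATTERNS = {
    "md5": re.compile(r"\b[a-fA-F0-9]{32}\b"),
    "ip": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "url": re.compile(r"https?://[^\s\"'<>]+"),
}


def _valid_ip(value):
    """Denoising step: reject dotted quads with an octet above 255."""
    return all(0 <= int(octet) <= 255 for octet in value.split("."))


def extract_iocs(raw_text):
    """Return deduplicated (type, value) pairs found in raw text."""
    seen = set()
    entities = []
    for ioc_type, pattern in IOC_PATTERNS.items():
        for value in pattern.findall(raw_text):
            if ioc_type == "md5":
                value = value.lower()  # normalization
            if ioc_type == "ip" and not _valid_ip(value):
                continue
            key = (ioc_type, value)
            if key not in seen:  # deduplication
                seen.add(key)
                entities.append(key)
    return entities
```

Each returned pair corresponds to one IOC entity object that can then be passed to the knowledge base processing of operation S203.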

In operation S203, the plurality of entity objects are processed by using a threat intelligence data set to obtain respective tag features of each of the plurality of entity objects, wherein the tag features characterize security attributes and/or malicious attributes of the entity object.

According to an embodiment of the present disclosure, the threat intelligence data set includes a plurality of knowledge bases, such as a whitelist database, a blacklist database, a compromised host database, a reputation file library, a reputation IP library, and a registered domain name library.

Threat intelligence data is obtained by processing multi-dimensional source data, and may take the form of a compound expression of several attributes. The attributes involved in threat intelligence data may include: observable behavior characteristics, such as network congestion or damage suffered by a system; threat characteristic indicators, which can be examined to determine whether an asset has actually been attacked by the threat; security event descriptions, including the malicious attack behavior, the victim target, the weaknesses exploited, the impact and consequences, and kill chain information; attack intent descriptions, that is, why the attack was launched, including the attacker's characteristics, intent, and organization; vulnerability characteristic descriptions, that is, the actions taken against the attacked system; and traceability information, that is, specific information about the attack initiator, including characteristics such as organization, country, email address, and account.

Among the above attributes, those that characterize the malicious or security characteristics of the entity object itself can be used as tag features, and those that characterize the contextual relationships of the entity object can be used as context attributes.
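The split between tag features and context attributes can be pictured with a small data structure. The sketch below is only an illustration; the `ThreatIntelRecord` class, its field names, and the sample values are hypothetical and do not come from the disclosure.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class ThreatIntelRecord:
    """One marked entity object as it might be stored in a knowledge base."""
    entity_type: str  # e.g. "file", "ip", "domain", "url"
    value: str        # e.g. an MD5 digest or a domain name
    # Attributes describing the object's own malicious or security nature.
    tag_features: Set[str] = field(default_factory=set)
    # Attributes describing relationships to other objects or events.
    context_attributes: Dict[str, str] = field(default_factory=dict)


record = ThreatIntelRecord("domain", "evil.example")
record.tag_features.add("c2-server")
record.context_attributes["resolves_to"] = "10.0.0.1"
```

Keeping the two attribute groups in separate fields makes it straightforward to copy only the tag features, or the tag features plus context, onto a newly matched entity object.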

According to an embodiment of the present disclosure, the multi-dimensional source data may come from open source intelligence data, commercial intelligence data, operating data of security products, data output by a deep file parsing engine, verdicts from multi-engine virus scanning, web crawler data, detection logs, network activity information obtained from sandbox runs, and the like. The source data thus covers multiple stages, and data from full-link scenarios is acquired automatically, ensuring that the data sources are multi-dimensional, broad, and deep.

According to an embodiment of the present disclosure, the threat intelligence data set is used to check IOC entity objects. Because the threat intelligence data set is obtained by processing multi-dimensional source data, it ensures greater breadth and depth of data than the traditional approach of checking IOC entity objects against single-source data. This in turn ensures the continuity and freshness of the source data and improves the comprehensiveness and accuracy of the analysis and detection of IOC entity objects.

According to an embodiment of the present disclosure, each entity object is processed by using at least one knowledge base in the threat intelligence data set. For example, the entity object can be compared with the entity objects in the knowledge base; if a match is found, the tag attributes of the matching entity object in the knowledge base can be assigned to the current entity object.

According to an embodiment of the present disclosure, the context attributes of the matching entity object in the knowledge base can also be assigned to the current entity object, so as to enrich the attribute information of the current entity object.
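The matching and enrichment just described can be sketched as a lookup across several knowledge bases. This is a minimal illustration under stated assumptions: each knowledge base is modeled as an in-memory dictionary keyed by `(type, value)`, and the `label_entity` function, the knowledge base names, and the sample tags are hypothetical.

```python
def label_entity(entity, knowledge_bases):
    """Copy tags and context from every knowledge base that contains the entity.

    entity: a (type, value) pair, e.g. ("ip", "10.0.0.1").
    knowledge_bases: mapping of knowledge base name to a dict that maps
        (type, value) pairs to {"tags": [...], "context": {...}}.
    """
    tags_by_kb = {}
    context = {}
    for kb_name, kb in knowledge_bases.items():
        hit = kb.get(entity)  # exact match against a stored entity object
        if hit is None:
            continue
        tags_by_kb[kb_name] = list(hit.get("tags", []))
        context.update(hit.get("context", {}))  # enrichment
    return {"entity": entity, "tags": tags_by_kb, "context": context}


# Hypothetical sample data.
knowledge_bases = {
    "blacklist": {
        ("ip", "10.0.0.1"): {
            "tags": ["c2-server"],
            "context": {"first_seen": "2020-01-01"},
        },
    },
    "whitelist": {
        ("domain", "example.com"): {"tags": ["trusted"], "context": {}},
    },
}
labeled = label_entity(("ip", "10.0.0.1"), knowledge_bases)
```

Keeping the tags grouped by knowledge base preserves which source produced which tag, which is useful when the tags are later weighed against each other in operation S204.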

In operation S204, threat information of each entity object is determined according to the tag features of that entity object.

According to an embodiment of the present disclosure, after each entity object has been judged by at least one knowledge base and by the manual operation marking platform, it carries multiple tag attributes. For example, an IOC object may be marked with tag 1, tag 2, and tag 3 after being processed by knowledge base 1; with tag 4, tag 5, and tag 6 after being processed by knowledge base 3; and with tag 7 and tag 8 after being processed by the manual operation marking platform. These tags may include ordinary attribute tags as well as malicious tags.

According to an embodiment of the present disclosure, attention can be focused on the malicious tags, and the most representative malicious tags can be selected to mark the entity object. For example, tag 1, tag 4, and tag 7 may be selected to mark the entity object. From these malicious tags, threat information such as attacker information, attack method, and malicious type can be determined, which may specifically include the attack group's name, country, email address, and account, the vulnerabilities exploited by the attacker, the attack tools used by the attacker, and so on.
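The disclosure does not specify how the "most representative" malicious tags are chosen. The sketch below shows one possible rule purely as an assumption: keep only tags from a known set of malicious tags and rank them by how many independent sources reported them. The function name, the tag names, and the ranking rule are all hypothetical.

```python
from collections import Counter


def representative_malicious_tags(tags_by_source, malicious_tags, top_n=3):
    """Rank malicious tags by the number of sources that reported them.

    Assumption: agreement between sources is used as a proxy for how
    representative a tag is. Ties are broken alphabetically.
    """
    counts = Counter(
        tag
        for tags in tags_by_source.values()
        for tag in set(tags)  # count each source at most once per tag
        if tag in malicious_tags
    )
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [tag for tag, _ in ranked[:top_n]]


# Hypothetical sample data mirroring the multi-source example above.
tags_by_source = {
    "knowledge_base_1": ["trojan", "pe-file"],
    "knowledge_base_3": ["trojan", "cve-exploit"],
    "manual_platform": ["apt-family-x"],
}
malicious = {"trojan", "cve-exploit", "apt-family-x"}
selected = representative_malicious_tags(tags_by_source, malicious, top_n=2)
```

Other selection rules, such as per-source priorities or analyst-assigned weights, would fit the same interface.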

According to an embodiment of the present disclosure, operation S204 may include: for each entity object, processing the tag features of the entity object by using a network model to obtain the threat information of the entity object. The network model may be a classification model trained based on a machine learning clustering algorithm. The input of the classification model may be the tag features of the entity object, and the output may be threat information such as the attack group, attack method, and malicious type to which the tag features belong.

根据本公开实施例，通过机器学习聚类算法，将不同知识库返回的标签进行聚类后，可以自动判定数据标签的归属信息及家族团伙信息。According to the embodiment of the present disclosure, after clustering the tags returned by different knowledge bases through a machine learning clustering algorithm, the attribution information and family gang information of the data tags can be automatically determined.

The machine learning clustering algorithm may be a random forest algorithm, or a random forest algorithm augmented with the IBk (k-nearest-neighbor classification) algorithm, which can compensate for the random forest's weakness of falling into a local optimum and producing false positives. The IBk algorithm is trained on the same training set used by the random forest algorithm to produce an IBk model, which supports looking up, for an unknown sample, the three most similar samples in the training set. Combining the random forest algorithm with the IBk algorithm enables high-precision, high-accuracy detection of APT samples and malicious-family samples. Other machine learning classification or clustering algorithms may also be used; the present disclosure does not limit the type of machine learning algorithm.
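As a non-limiting illustration, the nearest-neighbor lookup described above (finding the three training samples most similar to an unknown sample) might be sketched as follows. The Jaccard similarity over tag sets, the majority vote, the function names, and the sample data are all assumptions made for illustration; the disclosure does not specify a distance measure, and the random forest half of the combination is not shown.

```python
from collections import Counter

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two tag sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def nearest_samples(query: set, training: list, k: int = 3) -> list:
    """Return the k training samples whose tag sets are most similar to the query."""
    ranked = sorted(training, key=lambda s: jaccard(query, s[0]), reverse=True)
    return ranked[:k]

def vote(neighbors: list) -> str:
    """Return the most common label among the neighbors."""
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# Hypothetical training set: (tag set, attack-group label).
TRAINING = [
    ({"spearphish", "macro", "c2-http"}, "group-A"),
    ({"spearphish", "macro", "dropper"}, "group-A"),
    ({"ransomware", "smb-exploit"}, "group-B"),
    ({"ransomware", "dropper"}, "group-B"),
]

neighbors = nearest_samples({"spearphish", "macro"}, TRAINING)
print(vote(neighbors))  # group-A
```

In a combined scheme, the neighbor vote could serve as a cross-check on the random forest's prediction; how the two results are reconciled is left open by the text.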

According to embodiments of the present disclosure, multiple entity objects are extracted from the original data, the entity objects are processed with the threat intelligence data set to obtain the tag features of each entity object, and the threat information of each entity object is determined from its tag features. Because the entity objects are processed with the threat intelligence data set and their malicious information is expressed as tags, both the efficiency and the accuracy of determining and marking malicious information can be improved.

Further, based on the tag features of an entity object, threat information associated with the malicious information, such as attack group information, malicious family information, and attack techniques, can be located quickly. For attacks by organizations such as APT groups, this enables fast tracking and accurate localization.

FIG. 3 schematically shows a flowchart of a method for processing multiple entity objects with a threat intelligence data set according to an embodiment of the present disclosure.

As shown in FIG. 3, operation S203 includes operations S301 to S302.

In operation S301, the entity object is processed with at least one of the multiple knowledge bases to obtain a processing result from each of the at least one knowledge base, where each knowledge base includes multiple entity objects marked with tag features.

According to embodiments of the present disclosure, different IOC objects require different evaluation flows. For example, if the IOC object is a file, it goes through the whitelist library, the file reputation library, and the manual-operation evaluation flow. If the IOC object is an IP, it goes through the whitelist library, the IP reputation library, and the manual-operation tagging flow. If the IOC object is a domain name, it goes through the whitelist library, the compromised host library, the registered domain name library, and the manual-operation evaluation flow. If the IOC object is a URL, it goes through the whitelist library and the manual-operation evaluation flow.
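The per-type evaluation flows listed above can be expressed, for illustration only, as a lookup table. The identifier strings and the function name are hypothetical; only the mapping from IOC type to knowledge bases follows the text.

```python
# Hypothetical knowledge-base identifiers; the routing itself follows the text above.
EVALUATION_FLOW = {
    "file":   ["whitelist", "file_reputation", "manual"],
    "ip":     ["whitelist", "ip_reputation", "manual"],
    "domain": ["whitelist", "compromised_host", "registered_domain", "manual"],
    "url":    ["whitelist", "manual"],
}

def evaluation_flow(ioc_type: str) -> list:
    """Return the ordered list of evaluation stages for an IOC type."""
    try:
        return list(EVALUATION_FLOW[ioc_type])
    except KeyError:
        raise ValueError(f"unsupported IOC type: {ioc_type!r}") from None

print(evaluation_flow("domain"))
```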

According to embodiments of the present disclosure, for each type of IOC, the evaluation results from different knowledge bases may be the same or different. After evaluation by at least one knowledge base, the manual-operation tagging platform can be used to adjust the result and improve tagging accuracy.

For example, if a URL object receives no malicious tag after checking against the whitelist library but is given a malicious-download tag after checking by the manual-operation tagging platform, the URL is determined to be malicious.
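One possible reading of the URL example above is that an entity is treated as malicious as soon as any source, automated or manual, assigns a malicious tag. The sketch below encodes that reading; the rule, the tag names, and the function name are assumptions, not a statement of how the disclosed platform resolves conflicts.

```python
# Hypothetical set of tags treated as malicious; real tag names are not given in the text.
MALICIOUS_TAGS = {"malicious-download", "phishing", "c2"}

def is_malicious(results: dict) -> bool:
    """Return True if any source, automated or manual, assigned a malicious tag."""
    return any(tag in MALICIOUS_TAGS for tags in results.values() for tag in tags)

# The URL example from the text: the whitelist library assigns no malicious tag,
# while the manual-operation tagging platform assigns a malicious-download tag.
url_results = {"whitelist": [], "manual": ["malicious-download"]}
print(is_malicious(url_results))  # True
```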

In operation S302, the tag features of the entity object are determined according to the processing result of each knowledge base.

According to embodiments of the present disclosure, each IOC may be marked with rich tag features each time it is processed by a knowledge base, and processing by different knowledge bases may yield the same or different tags. Identical tags can be deduplicated and different tags merged to obtain the complete tag features of the entity object.
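The deduplicate-and-merge step of operation S302 might look like the following minimal sketch. The function name and the choice to keep first-seen order are assumptions made for illustration.

```python
def merge_tags(results: dict) -> list:
    """Merge per-source tag lists, dropping duplicates but keeping first-seen order."""
    merged = []
    seen = set()
    for tags in results.values():
        for tag in tags:
            if tag not in seen:
                seen.add(tag)
                merged.append(tag)
    return merged

# Hypothetical processing results from two knowledge bases and the manual platform.
results = {"kb1": ["a", "b"], "kb3": ["b", "c"], "manual": ["c", "d"]}
print(merge_tags(results))  # ['a', 'b', 'c', 'd']
```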

FIG. 4 schematically shows a flowchart of a method for processing an entity object with at least one of multiple knowledge bases according to an embodiment of the present disclosure.

As shown in FIG. 4, operation S301 includes operations S401 to S402.

In operation S401, for each of the at least one knowledge base, it is determined whether the knowledge base includes a target entity object identical to the entity object.

In operation S402, if it is determined that the knowledge base includes a target entity object identical to the entity object, the entity object is marked with the tag features of the target entity object.

For example, a domain name is compared with the host domain names in the compromised host library. If the domain name exists in the compromised host library, the tag attributes recorded for that domain name in the library are applied to the current domain name.
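Operations S401 and S402 amount to an exact-match lookup followed by copying tags. A minimal sketch for the domain-name example is shown below. The normalization step (lower-casing and removing a trailing dot), the function names, and the sample library are assumptions added for illustration.

```python
def normalize_domain(domain: str) -> str:
    """Lower-case a domain and drop surrounding whitespace and a trailing dot."""
    return domain.strip().lower().rstrip(".")

def tags_from_knowledge_base(domain: str, knowledge_base: dict) -> list:
    """S401: check whether an identical domain exists in the knowledge base.
    S402: if it does, return a copy of its tags; otherwise return no tags."""
    target_tags = knowledge_base.get(normalize_domain(domain))
    return list(target_tags) if target_tags is not None else []

# Hypothetical compromised host library keyed by normalized domain name.
COMPROMISED_HOSTS = {"bad.example.test": ["compromised-host", "botnet"]}

print(tags_from_knowledge_base("Bad.Example.Test.", COMPROMISED_HOSTS))
```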

According to embodiments of the present disclosure, operation S301 further includes: for each of the at least one knowledge base, completing the context features of the current entity object with the context features of the corresponding entity object in the knowledge base, so as to enrich the tag information generated by marking and to ensure the stability and accuracy of the tagged output.
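The context completion described above can be pictured as filling in only the fields that the current entity object lacks. The following sketch assumes context features are held in dictionaries and that existing values are never overwritten; both choices, and all field names, are illustrative.

```python
def complete_context(current: dict, reference: dict) -> dict:
    """Fill missing or empty context fields of `current` from `reference`.
    Fields that already have a value in `current` are left untouched."""
    completed = dict(current)
    for key, value in reference.items():
        if completed.get(key) in (None, ""):
            completed[key] = value
    return completed

# Hypothetical context records for the same domain name.
current = {"domain": "bad.example.test", "registrar": None}
reference = {"domain": "BAD.EXAMPLE.TEST", "registrar": "ExampleRegistrar", "first_seen": "2020-01-01"}
print(complete_context(current, reference))
```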

FIG. 5 schematically shows a flowchart of a data processing method according to another embodiment of the present disclosure.

As shown in FIG. 5, the method includes operations S501 to S503.

In operation S501, feature extraction is performed on the file to obtain static features and dynamic features of the file.

In operation S502, the file is run in a sandbox to obtain behavior features of the file.

According to embodiments of the present disclosure, an IOC object of the file type can additionally be processed with a deep file parsing engine and sandbox execution.

Specifically, the deep file parsing engine can perform static and dynamic analysis on the file to obtain its static and dynamic features. Static features may include the file name, file size, time of first detection, and so on. Dynamic features may be information about the file's execution and information it maps to, including maximum and minimum streams, stream type, compiler type, PDB (Program Database File) length, string length, number of dictionary elements, array size, associated file information, and so on.

Specifically, running the file in the sandbox yields features such as the submission region, sandbox network information, and information about files released (dropped) during execution.

In operation S503, the threat information of the file is determined according to at least one of the static features, dynamic features, behavior features, and tag features of the file.

According to embodiments of the present disclosure, a combined determination is made over the static and dynamic features obtained from the deep file parsing engine, the behavior features obtained from sandbox execution, and the tag features obtained from the threat intelligence data set. All the context information needed to label the file can thus be collected, and after multi-dimensional statistical correlation, the final tag attributes and context attributes are obtained.
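To make the combination of feature groups concrete, the sketch below gathers static, dynamic, behavior, and tag features of a file into one context record. The class, its field names, and the namespaced-key layout are hypothetical; the statistical correlation and the final determination are not specified in the text and are not modeled here.

```python
from dataclasses import dataclass, field

@dataclass
class FileProfile:
    """Feature groups collected for one file-type IOC (illustrative layout)."""
    static: dict = field(default_factory=dict)    # e.g. file name, file size
    dynamic: dict = field(default_factory=dict)   # e.g. compiler type, PDB length
    behavior: dict = field(default_factory=dict)  # e.g. sandbox network information
    tags: list = field(default_factory=list)      # tags from the threat intelligence data set

    def context(self) -> dict:
        """Flatten all feature groups into one record with namespaced keys."""
        record = {}
        for group in ("static", "dynamic", "behavior"):
            for key, value in getattr(self, group).items():
                record[f"{group}.{key}"] = value
        record["tags"] = sorted(set(self.tags))
        return record

profile = FileProfile(
    static={"file_size": 1024},
    behavior={"dropped_files": 2},
    tags=["trojan", "trojan"],
)
print(profile.context())
```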

FIG. 6 schematically shows a flowchart of a data processing method according to another embodiment of the present disclosure.

As shown in FIG. 6, the method includes operations S601 to S602.

In operation S601, the entity objects that have been processed with the threat intelligence data set are integrated.

In operation S602, the integrated entity objects are associated according to their tag features to obtain an entity object relationship data set.

According to embodiments of the present disclosure, after IOC data has been detected and tagged by the detection platform 102 and/or the manual-operation tagging platform 103, the IOC data can be integrated, for example by deduplicating records of the same IOC, filling in missing fields, and standardizing them, so that the data for the same IOC is consolidated. Associated IOCs can also be aggregated according to their tag features. For example, IOC data belonging to the same malicious family is associated together, and an IP may be associated with a file or a URL. The result is a new IOC data set with association relationships.
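Operations S601 and S602 can be illustrated with a small sketch that first consolidates duplicate IOC records and then groups IOCs that share a family tag. The record layout, the function names, and the idea of passing in a known set of family tags are assumptions for illustration only.

```python
from collections import defaultdict

def integrate(records: list) -> list:
    """S601: merge records that describe the same IOC, taking the union of their tags."""
    merged = {}
    for rec in records:
        key = (rec["type"], rec["value"].strip())
        entry = merged.setdefault(key, {"type": key[0], "value": key[1], "tags": set()})
        entry["tags"].update(rec.get("tags", ()))
    return list(merged.values())

def associate(iocs: list, family_tags: set) -> dict:
    """S602: group IOCs by the family tags they carry."""
    relations = defaultdict(list)
    for ioc in iocs:
        for tag in sorted(ioc["tags"] & family_tags):
            relations[tag].append((ioc["type"], ioc["value"]))
    return dict(relations)

# Hypothetical tagged IOC records; addresses and hosts use documentation-reserved ranges.
records = [
    {"type": "ip", "value": "203.0.113.7", "tags": ["family-x"]},
    {"type": "ip", "value": "203.0.113.7 ", "tags": ["c2"]},
    {"type": "url", "value": "http://example.test/a", "tags": ["family-x"]},
]
iocs = integrate(records)
print(associate(iocs, {"family-x"}))
```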

FIG. 7 schematically shows a flowchart of a data processing method according to another embodiment of the present disclosure.

As shown in FIG. 7, the method includes operations S701 to S703.

In operation S701, a new entity object is acquired.

In operation S702, the new entity object is processed with the entity object relationship data set to obtain the tag features of the new entity object.

In operation S703, the threat information of the entity object is determined according to the tag features of the new entity object.

According to embodiments of the present disclosure, the IOC data set with association relationships can be exposed through a unified API, which can connect to different business analysis systems and provide different query-processing services. When a new IOC entity object is obtained, threat information such as its malicious type, attack means, and the malicious family it belongs to can then be queried through the API.
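A query against the associated IOC data set might be sketched as follows. The in-memory dictionary, the field names, and the function signature are hypothetical stand-ins; the disclosure describes a unified API but does not define its interface.

```python
def query_threat_info(value: str, dataset: dict):
    """Look up an IOC and return its threat information plus IOCs of the same family.
    Returns None when the IOC is not present in the data set."""
    record = dataset.get(value)
    if record is None:
        return None
    family = record.get("family")
    related = sorted(
        other for other, rec in dataset.items()
        if other != value and family is not None and rec.get("family") == family
    )
    return {
        "malicious_type": record.get("malicious_type"),
        "attack_means": record.get("attack_means"),
        "family": family,
        "related": related,
    }

# Hypothetical relationship data set keyed by IOC value.
DATASET = {
    "203.0.113.7": {"malicious_type": "c2", "attack_means": "http-beacon", "family": "family-x"},
    "http://example.test/a": {"malicious_type": "malicious-download", "attack_means": "drive-by", "family": "family-x"},
    "198.51.100.9": {"malicious_type": "scanner", "attack_means": "port-scan", "family": None},
}

print(query_threat_info("203.0.113.7", DATASET))
```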

FIG. 8 schematically shows a block diagram of a data processing apparatus according to an embodiment of the present disclosure.

As shown in FIG. 8, the data processing apparatus 800 includes a first acquisition module 801, a first extraction module 802, a first processing module 803, and a determination module 804.

The first acquisition module 801 is configured to acquire security-related original data.

The first extraction module 802 is configured to extract multiple entity objects from the original data.

The first processing module 803 is configured to process the multiple entity objects with the threat intelligence data set to obtain the tag features of each of the multiple entity objects, where the tag features represent the security attributes and/or malicious attributes of the entity object.

The first determination module 804 is configured to determine the threat information of each entity object according to the tag features of each entity object.

According to embodiments of the present disclosure, the first processing module 803 includes a processing unit and a determination unit.

The processing unit is configured to process the entity object with at least one of the multiple knowledge bases to obtain a processing result from each of the at least one knowledge base, where each knowledge base includes multiple entity objects marked with tag features.

The determination unit is configured to determine the tag features of the entity object according to the processing result of each knowledge base.

According to embodiments of the present disclosure, the processing unit includes a determination subunit and a marking subunit.

The determination subunit is configured to determine, for each of the at least one knowledge base, whether the knowledge base includes a target entity object identical to the entity object.

The marking subunit is configured to mark the entity object with the tag features of the target entity object if it is determined that the knowledge base includes a target entity object identical to the entity object.

According to embodiments of the present disclosure, the first determination module 804 is configured to process, for each entity object, the tag features of the entity object with the network model to obtain the threat information of the entity object.

According to embodiments of the present disclosure, the data processing apparatus 800 further includes a second extraction module, a second processing module, and a second determination module.

The second extraction module is configured to perform feature extraction on the file to obtain static features and dynamic features of the file.

The second processing module is configured to run the file in a sandbox to obtain behavior features of the file.

The second determination module is configured to determine the threat information of the file according to at least one of the static features, dynamic features, behavior features, and tag features of the file.

According to embodiments of the present disclosure, the data processing apparatus 800 further includes an integration module and an association module.

The integration module is configured to integrate the entity objects that have been processed with the threat intelligence data set.

The association module is configured to associate the integrated entity objects according to their tag features to obtain an entity object relationship data set.

According to embodiments of the present disclosure, the data processing apparatus 800 further includes a second acquisition module, a third processing module, and a third determination module.

The second acquisition module is configured to acquire a new entity object.

The third processing module is configured to process the new entity object with the entity object relationship data set to obtain the tag features of the new entity object.

The third determination module is configured to determine the threat information of the entity object according to the tag features of the new entity object.

According to embodiments of the present disclosure, the entity object includes at least one of a file, a domain name, an IP, and a web page address. The at least one knowledge base includes at least one of a whitelist library, a blacklist library, a registered domain name library, a file reputation library, a compromised host library, and an IP reputation library. The threat information includes at least one of a malicious type, attacker information, and attack means.

Any number of the modules, sub-modules, units, and sub-units according to embodiments of the present disclosure, or at least part of the functions of any number of them, may be implemented in one module. Any one or more of the modules, sub-modules, units, and sub-units according to embodiments of the present disclosure may be split into multiple modules for implementation. Any one or more of the modules, sub-modules, units, and sub-units according to embodiments of the present disclosure may be implemented at least in part as a hardware circuit, such as a field programmable gate array (FPGA), a programmable logic array (PLA), a system on a chip, a system on a substrate, a system on a package, or an application specific integrated circuit (ASIC); or may be implemented in hardware or firmware by any other reasonable way of integrating or packaging circuits; or may be implemented in any one of software, hardware, and firmware, or in an appropriate combination of any of them. Alternatively, one or more of the modules, sub-modules, units, and sub-units according to embodiments of the present disclosure may be implemented at least in part as a computer program module which, when run, performs the corresponding function.

For example, any number of the first acquisition module 801, the first extraction module 802, the first processing module 803, and the determination module 804 may be combined and implemented in one module/unit/subunit, or any one of these modules/units/subunits may be split into multiple modules/units/subunits. Alternatively, at least part of the functions of one or more of these modules/units/subunits may be combined with at least part of the functions of other modules/units/subunits and implemented in one module/unit/subunit. According to embodiments of the present disclosure, at least one of the first acquisition module 801, the first extraction module 802, the first processing module 803, and the determination module 804 may be implemented at least in part as a hardware circuit, such as a field programmable gate array (FPGA), a programmable logic array (PLA), a system on a chip, a system on a substrate, a system on a package, or an application specific integrated circuit (ASIC); or may be implemented in hardware or firmware by any other reasonable way of integrating or packaging circuits; or may be implemented in any one of software, hardware, and firmware, or in an appropriate combination of any of them. Alternatively, at least one of the first acquisition module 801, the first extraction module 802, the first processing module 803, and the determination module 804 may be implemented at least in part as a computer program module which, when run, performs the corresponding function.

It should be noted that the data processing apparatus part of the embodiments of the present disclosure corresponds to the data processing method part. For the description of the data processing apparatus part, refer to the data processing method part; it is not repeated here.

FIG. 9 schematically shows a block diagram of a computer system suitable for implementing the methods described above, according to an embodiment of the present disclosure. The computer system shown in FIG. 9 is only an example and should not impose any limitation on the function or scope of use of the embodiments of the present disclosure.

As shown in FIG. 9, a computer system 900 according to an embodiment of the present disclosure includes a processor 901, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 902 or a program loaded from a storage section 908 into a random access memory (RAM) 903. The processor 901 may include, for example, a general-purpose microprocessor (e.g., a CPU), an instruction set processor and/or related chipset, and/or a special-purpose microprocessor (e.g., an application specific integrated circuit (ASIC)). The processor 901 may also include on-board memory for caching purposes. The processor 901 may include a single processing unit or multiple processing units for performing the different actions of the method flow according to embodiments of the present disclosure.

The RAM 903 stores the various programs and data required for the operation of the system 900. The processor 901, the ROM 902, and the RAM 903 are connected to one another through a bus 904. The processor 901 performs the various operations of the method flow according to embodiments of the present disclosure by executing the programs in the ROM 902 and/or the RAM 903. Note that the programs may also be stored in one or more memories other than the ROM 902 and the RAM 903. The processor 901 may also perform the various operations of the method flow according to embodiments of the present disclosure by executing programs stored in the one or more memories.

According to embodiments of the present disclosure, the system 900 may also include an input/output (I/O) interface 905, which is also connected to the bus 904. The system 900 may also include one or more of the following components connected to the I/O interface 905: an input section 906 including a keyboard, a mouse, and the like; an output section 907 including a display such as a cathode ray tube (CRT) or liquid crystal display (LCD), a speaker, and the like; a storage section 908 including a hard disk and the like; and a communication section 909 including a network interface card such as a LAN card or a modem. The communication section 909 performs communication processing via a network such as the Internet. A drive 910 is also connected to the I/O interface 905 as needed. A removable medium 911, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 910 as needed, so that a computer program read from it can be installed into the storage section 908 as needed.

According to embodiments of the present disclosure, the method flow according to embodiments of the present disclosure may be implemented as a computer software program. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a computer-readable storage medium, the computer program containing program code for performing the method illustrated in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section 909, and/or installed from the removable medium 911. When the computer program is executed by the processor 901, the above-described functions defined in the system of the embodiments of the present disclosure are performed. According to embodiments of the present disclosure, the systems, devices, apparatuses, modules, units, and the like described above may be implemented by computer program modules.

The present disclosure also provides a computer-readable storage medium. The computer-readable storage medium may be included in the device/apparatus/system described in the above embodiments, or it may exist on its own without being assembled into that device/apparatus/system. The computer-readable storage medium carries one or more programs which, when executed, implement the method according to embodiments of the present disclosure.

According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium. Examples include, but are not limited to, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by, or in conjunction with, an instruction execution system, apparatus, or device.

For example, according to embodiments of the present disclosure, the computer-readable storage medium may include the ROM 902 and/or the RAM 903 described above, and/or one or more memories other than the ROM 902 and the RAM 903.

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code that contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the figures. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams or flowcharts, and combinations of such blocks, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.

Those skilled in the art will appreciate that the features recited in the various embodiments and/or claims of the present disclosure may be combined in many ways, even if such combinations are not expressly recited in the present disclosure. In particular, the features recited in the various embodiments and/or claims of the present disclosure may be combined in many ways without departing from the spirit and teachings of the present disclosure. All such combinations fall within the scope of the present disclosure.

Embodiments of the present disclosure have been described above. These embodiments are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described separately above, this does not mean that the measures in the different embodiments cannot be used in combination to advantage. The scope of the present disclosure is defined by the appended claims and their equivalents. Those skilled in the art can make various substitutions and modifications without departing from the scope of the present disclosure, and all such substitutions and modifications fall within the scope of the present disclosure.

Claims

1. A method of data processing, comprising:

acquiring security-related original data;

extracting a plurality of entity objects from the original data;

processing the plurality of entity objects by using a threat intelligence data set to obtain a tag feature of each of the plurality of entity objects, wherein the tag feature represents a security attribute and/or a malicious attribute of the entity object; and

determining threat information of each entity object according to the tag feature of each entity object.
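
By way of non-limiting illustration, the four steps of claim 1 can be sketched in Python as follows. The regular expressions, the in-memory threat intelligence mapping, the tag names, and the decision rule in `determine_threat` are all hypothetical choices made for this sketch; the disclosure does not prescribe any of them.

```python
import re
from typing import Dict, List, Set

# Hypothetical, deliberately simple patterns; a real extractor would be stricter.
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
URL_RE = re.compile(r"https?://[^\s\"'<>]+")
MD5_RE = re.compile(r"\b[a-fA-F0-9]{32}\b")


def extract_entities(raw: str) -> List[str]:
    """Extract entity objects (IPs, URLs, file hashes) from the original data."""
    found = IP_RE.findall(raw) + URL_RE.findall(raw) + MD5_RE.findall(raw)
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys(found))


def tag_entities(entities: List[str],
                 intel: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Attach tag features from an (assumed) entity -> tags intelligence mapping."""
    return {entity: set(intel.get(entity, set())) for entity in entities}


def determine_threat(tags: Set[str]) -> str:
    """Assumed rule: a malicious tag outweighs a benign one; no tags means unknown."""
    if "malicious" in tags:
        return "malicious"
    if "benign" in tags:
        return "benign"
    return "unknown"


if __name__ == "__main__":
    raw = ("GET http://evil.example/a.exe from 203.0.113.7 "
           "md5 d41d8cd98f00b204e9800998ecf8427e")
    intel = {"203.0.113.7": {"malicious"}}
    for entity, tags in tag_entities(extract_entities(raw), intel).items():
        print(entity, sorted(tags), determine_threat(tags))
```

Keeping extraction, tagging, and determination as separate functions mirrors the separate steps of the claim, so each step could be replaced independently.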

2. The method of claim 1, wherein the threat intelligence data set comprises a plurality of knowledge bases;

and processing the plurality of entity objects by using the threat intelligence data set comprises: for each of the plurality of entity objects,

processing the entity object by using at least one knowledge base of the plurality of knowledge bases to obtain a processing result of each knowledge base of the at least one knowledge base, wherein each knowledge base comprises a plurality of entity objects marked with tag features; and

determining the tag feature of the entity object according to the processing result of each knowledge base.

3. The method of claim 2, wherein processing the entity object by using at least one knowledge base of the plurality of knowledge bases comprises: for each knowledge base of the at least one knowledge base,

determining whether the knowledge base includes a target entity object identical to the entity object; and

if the knowledge base includes the target entity object identical to the entity object, marking the entity object with the tag feature of the target entity object.
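
By way of non-limiting illustration, the per-knowledge-base lookup of claim 3 and the combination of results in claim 2 can be sketched as follows. Representing each knowledge base as a dictionary from entity to tag set, and combining the per-knowledge-base results by set union, are assumptions of this sketch; the claims leave both the storage format and the combination rule open.

```python
from typing import Dict, Set

# Assumed representation: a knowledge base maps each known entity object
# to the tag features it has been marked with.
KnowledgeBase = Dict[str, Set[str]]


def lookup(entity: str, kb: KnowledgeBase) -> Set[str]:
    """Claim 3: if the knowledge base holds an identical target entity,
    return a copy of its tags; otherwise return an empty set."""
    return set(kb.get(entity, set()))


def tag_entity(entity: str, kbs: Dict[str, KnowledgeBase]) -> Set[str]:
    """Claim 2: process the entity with every knowledge base and combine
    the results (set union is an assumption of this sketch)."""
    tags: Set[str] = set()
    for kb in kbs.values():
        tags |= lookup(entity, kb)
    return tags


if __name__ == "__main__":
    # Hypothetical knowledge bases and tag names.
    kbs = {
        "blacklist": {"evil.example": {"malicious", "c2"}},
        "whitelist": {"good.example": {"benign"}},
        "compromised_hosts": {"evil.example": {"compromised"}},
    }
    print(sorted(tag_entity("evil.example", kbs)))
```

Exact-match lookup keeps each knowledge base independent, so a new knowledge base can be added to `kbs` without changing the tagging logic.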

4. The method of claim 1, wherein determining threat information of each entity object according to the tag feature of each entity object comprises: for each of the plurality of entity objects,

processing the tag feature of the entity object by using a network model to obtain the threat information of the entity object.
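
By way of non-limiting illustration, one way to feed tag features to a model is to encode them as a multi-hot vector and apply a classifier. The claim does not specify the network model, so the sketch below substitutes a single-layer softmax classifier as a stand-in; the tag vocabulary, class names, and hand-picked weights are hypothetical and untrained.

```python
import math
from typing import Dict, List, Set

# Hypothetical tag vocabulary and output classes.
VOCAB = ["blacklisted", "whitelisted", "compromised_host", "low_reputation"]
CLASSES = ["benign", "malicious"]
# Hand-picked weights (one row per class), for illustration only, not trained.
WEIGHTS = [
    [-2.0, 3.0, -1.0, -1.0],   # benign
    [2.0, -3.0, 1.0, 1.0],     # malicious
]
BIAS = [0.0, 0.0]


def encode(tags: Set[str]) -> List[float]:
    """Multi-hot encode the tag features over the fixed vocabulary."""
    return [1.0 if tag in tags else 0.0 for tag in VOCAB]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(tags: Set[str]) -> Dict[str, float]:
    """Return a class -> probability mapping for one entity object."""
    x = encode(tags)
    logits = [sum(w * xi for w, xi in zip(row, x)) + b
              for row, b in zip(WEIGHTS, BIAS)]
    return dict(zip(CLASSES, softmax(logits)))


if __name__ == "__main__":
    print(predict({"blacklisted", "low_reputation"}))
```

In practice the weights would be learned from labelled entity objects, and the output classes could be extended to the malicious types, attacker information, or attack means named elsewhere in the claims.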

5. The method of claim 1 or 2, wherein the entity object comprises a file;

the method further comprises:

performing feature extraction on the file to obtain static features and dynamic features of the file;

running the file in a sandbox to obtain behavior features of the file; and

determining threat information of the file according to at least one of the static features, the dynamic features, the behavior features, and the tag feature of the file.

6. The method of claim 1, further comprising:

integrating the entity objects processed by using the threat intelligence data set; and

associating the integrated entity objects according to the tag features of the entity objects to obtain an entity object relation data set.

7. The method of claim 6, further comprising:

acquiring a new entity object;

processing the new entity object by using the entity object relation data set to obtain a tag feature of the new entity object; and

determining threat information of the new entity object according to the tag feature of the new entity object.
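
By way of non-limiting illustration, claims 6 and 7 can be sketched as follows. Two assumptions are made that the claims do not state: entity objects are associated when they share at least one tag feature, and a new entity object arrives together with a list of known entity objects it was observed with (for example, a new domain name resolving to a known IP), from which it inherits tag features. All names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, Set


def build_relations(tagged: Dict[str, Set[str]]) -> Dict[str, Dict[str, Set[str]]]:
    """Claim 6 (assumed reading): link entity objects that share a tag feature."""
    by_tag: Dict[str, Set[str]] = defaultdict(set)
    for entity, tags in tagged.items():
        for tag in tags:
            by_tag[tag].add(entity)

    edges: Dict[str, Set[str]] = defaultdict(set)
    for members in by_tag.values():
        for a, b in combinations(sorted(members), 2):
            edges[a].add(b)
            edges[b].add(a)

    return {
        "tags": {entity: set(tags) for entity, tags in tagged.items()},
        "edges": {entity: set(peers) for entity, peers in edges.items()},
    }


def tag_new_entity(linked_to: Iterable[str],
                   relations: Dict[str, Dict[str, Set[str]]]) -> Set[str]:
    """Claim 7 (assumed reading): a new entity object inherits the tag features
    of the known entity objects it was observed with."""
    tags: Set[str] = set()
    for known in linked_to:
        tags |= relations["tags"].get(known, set())
    return tags


if __name__ == "__main__":
    tagged = {
        "203.0.113.7": {"c2"},
        "evil.example": {"c2", "phishing"},
        "good.example": {"benign"},
    }
    relations = build_relations(tagged)
    # A new domain observed resolving to a known IP (hypothetical observation).
    print(sorted(tag_new_entity(["203.0.113.7"], relations)))
```

The inherited tag features can then be passed to whatever threat determination step is used for entity objects tagged directly from the knowledge bases.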

8. The method of any one of claims 1-7, wherein:

the entity object comprises at least one of a file, a domain name, an IP address, and a web page address;

the at least one knowledge base comprises at least one of a whitelist base, a blacklist base, a registered domain name base, a file reputation base, a compromised host base, and an IP reputation base; and

the threat information comprises at least one of a malicious type, attacker information, and attack means.

9. A data processing apparatus, comprising:

a first acquisition module, configured to acquire security-related original data;

a first extraction module, configured to extract a plurality of entity objects from the original data;

a first processing module, configured to process the plurality of entity objects by using a threat intelligence data set to obtain a tag feature of each of the plurality of entity objects, wherein the tag feature represents a security attribute and/or a malicious attribute of the entity object; and

a determining module, configured to determine threat information of each entity object according to the tag feature of each entity object.

10. A computer system, comprising:

one or more processors;

a memory for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-8.

11. A computer readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to carry out the method of any one of claims 1 to 8.