CN115062002A

CN115062002A - Stream data processing method and device

Info

Publication number: CN115062002A
Application number: CN202210524751.8A
Authority: CN
Inventors: 刘进
Original assignee: Secworld Information Technology Beijing Co Ltd; Qax Technology Group Inc
Current assignee: Secworld Information Technology Beijing Co Ltd; Qax Technology Group Inc
Priority date: 2022-05-13
Filing date: 2022-05-13
Publication date: 2022-09-16

Abstract

An embodiment of the invention provides a streaming data processing method and device. The method comprises: determining corresponding configuration parameters according to the current target log; in response to receiving at least one rule selected by a user, determining a target model according to the at least one rule and the configuration parameters, wherein the data processing capacity of the target model matches the target log; processing the target log through a correlation analysis engine according to the target model to obtain and store a corresponding traffic analysis result; and reading the traffic analysis results of a preset time period, performing global deduplication and global statistics on them through a Bloom filter to obtain a global traffic analysis result, and storing the global traffic analysis result in a database. The method and the device implement deduplication, merging, counting, whitelisting and grouping of real-time streaming data.

Description

Stream data processing method and device

Technical Field

The present invention relates to the field of computer technology, and in particular to a streaming data processing method and device.

Background

A Bloom filter is essentially a compact probabilistic data structure (a binary vector) in which every position stores either 0 or 1.

At present, Bloom filters are mostly used for problems such as deduplication over a limited amount of data. For example, Google's distributed database Bigtable uses Bloom filters to look up non-existent rows or columns; the Google Chrome browser uses Bloom filters to speed up its Safe Browsing service; the SPIN model checker uses Bloom filters to track the reachable state space in large-scale verification problems; and the Venti document storage system uses Bloom filters to detect previously stored data. Such uses inevitably run into problems such as a hard limit on the amount of data a Bloom filter can hold and insufficient scalability.

Summary of the Invention

In view of the problems in the prior art, embodiments of the present invention provide a streaming data processing method and device.

Specifically, the embodiments of the present invention provide the following technical solutions:

In a first aspect, an embodiment of the present invention provides a streaming data processing method, including: determining corresponding configuration parameters according to a current target log; in response to receiving at least one rule selected by a user, determining a target model according to the at least one rule and the configuration parameters, the data processing capability of the target model matching the size of the target log; processing the target log through a correlation analysis engine according to the target model, and obtaining and storing a corresponding traffic analysis result; and reading the traffic analysis results of a preset time period, performing global deduplication and global statistics on the traffic analysis results of the preset time period through a Bloom filter to obtain a global traffic analysis result, and saving the global traffic analysis result in a database.

Further, the correlation analysis engine includes a Sabre engine.

Further, the configuration parameters include at least one of the following: a traffic size corresponding to the target log, a blacklist corresponding to the target log, a storage address corresponding to the target log, a target field corresponding to the target log, and a merge field corresponding to the target log.

Further, before determining the corresponding configuration parameters according to the current target log, the method further includes: presetting at least one initial rule, the at least one initial rule being available for selection by the user.

Further, reading the traffic analysis results of the preset time period, performing global deduplication and global statistics on them through the Bloom filter to obtain the global traffic analysis result, and saving the global traffic analysis result in the database includes: reading the traffic analysis results of the preset time period, and performing global deduplication and global statistics on the traffic analysis results of the preset time period through the Bloom filter to obtain the global traffic analysis result; converting the global traffic analysis result into a corresponding binary vector and saving it in the Bloom filter; and saving the global traffic analysis result in the database.

Further, the method further includes: setting a scheduled deletion task, cleaning up the data in the database according to the scheduled deletion task, and setting the binary vector corresponding to that data in the Bloom filter to zero.

Further, processing the target log through the correlation analysis engine according to the target model includes: performing deduplication, merging, counting, whitelisting and grouping on the target log through the correlation analysis engine according to the target model.

In a second aspect, an embodiment of the present invention further provides a streaming data processing device, including: a first processing module, configured to determine corresponding configuration parameters according to a current target log; a second processing module, configured to, in response to receiving at least one rule selected by a user, determine a target model according to the at least one rule and the configuration parameters, the data processing capability of the target model matching the size of the target log; a third processing module, configured to process the target log through a correlation analysis engine according to the target model, and to obtain and store a corresponding traffic analysis result; and a fourth processing module, configured to read the traffic analysis results of a preset time period, perform global deduplication and global statistics on the traffic analysis results of the preset time period through a Bloom filter to obtain a global traffic analysis result, and save the global traffic analysis result in a database.

In a third aspect, an embodiment of the present invention further provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the streaming data processing method described in the first aspect.

In a fourth aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the streaming data processing method described in the first aspect.

In a fifth aspect, an embodiment of the present invention further provides a computer program product on which executable instructions are stored, wherein the instructions, when executed by a processor, cause the processor to implement the steps of the streaming data processing method described in the first aspect.

With the streaming data processing method and device provided by the embodiments of the present invention, the correlation analysis engine performs real-time computation and real-time statistics on streaming data according to the different target models that have been built, thereby screening and filtering the data set. The current streaming data is then processed through the Bloom filter, which extends the data processing capability of the Bloom filter. By marking these data efficiently, the Bloom filter provides global statistics and global deduplication for the streaming data of the preset time period. The Bloom filter only performs global deduplication and global statistics on the traffic analysis results to obtain the global traffic analysis result, while the data corresponding to the global traffic analysis result is saved in the database, which solves the problem of the limited amount of data a Bloom filter can store.

Brief Description of the Drawings

In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flowchart of an embodiment of the streaming data processing method of the present invention;

FIG. 2 is a schematic diagram of the framework of the streaming data processing method;

FIG. 3 is a schematic diagram of the design of the business module of the streaming data processing method;

FIG. 4 is a schematic structural diagram of an embodiment of the streaming data processing device of the present invention;

FIG. 5 is a schematic structural diagram of a physical embodiment of the electronic device of the present invention.

Detailed Description

In order to make the purposes, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

FIG. 1 is a flowchart of an embodiment of the streaming data processing method of the present invention. As shown in FIG. 1, the streaming data processing method of the embodiment includes:

S101: Determine corresponding configuration parameters according to the current target log.

The target log is a traffic log, that is, a log that records traffic, including the volume of inbound and outbound traffic, clickstream data, and so on. The larger the traffic data, the larger the corresponding traffic log.

As an example, a traffic log may include a timestamp, source IP, destination IP, source port, destination port, inbound and outbound traffic, and so on. Typically, each flow is consolidated into a single record before being sent to the log server. A flow is identified by the same source IP, destination IP and destination port.
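The consolidation of a flow into a single record can be illustrated with a minimal Python sketch. The `Packet` type, its field names and the summary fields (`first_ts`, `last_ts`, `bytes`, `packets`) are assumptions made for illustration and do not come from the patent; only the flow key (source IP, destination IP, destination port) follows the text above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    # Hypothetical raw observation; field names are illustrative only.
    ts: int        # timestamp in seconds
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    nbytes: int


def aggregate_flows(packets):
    """Fold observations into one record per flow.

    A flow is keyed by (src_ip, dst_ip, dst_port), as in the description;
    the source port is deliberately not part of the key.
    """
    flows = defaultdict(
        lambda: {"first_ts": None, "last_ts": None, "bytes": 0, "packets": 0}
    )
    for p in packets:
        rec = flows[(p.src_ip, p.dst_ip, p.dst_port)]
        rec["first_ts"] = p.ts if rec["first_ts"] is None else min(rec["first_ts"], p.ts)
        rec["last_ts"] = p.ts if rec["last_ts"] is None else max(rec["last_ts"], p.ts)
        rec["bytes"] += p.nbytes
        rec["packets"] += 1
    return dict(flows)


packets = [
    Packet(100, "10.0.0.1", "10.0.0.9", 50001, 443, 1200),
    Packet(101, "10.0.0.1", "10.0.0.9", 50002, 443, 800),
    Packet(101, "10.0.0.2", "10.0.0.9", 50003, 80, 500),
]
flows = aggregate_flows(packets)
```

The first two observations differ only in source port, so they fall into the same flow record.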

As an example, the current target log may be the traffic log of the previous second.

As an example, if the traffic log only includes a timestamp and the inbound and outbound traffic size, and the current target log is the traffic log of the previous second, then the configuration parameters corresponding to the current target log can be obtained from the traffic log: target field, time, and inbound and outbound traffic size.

S102: In response to receiving at least one rule selected by the user, determine a target model according to the at least one rule and the configuration parameters, where the data processing capability of the target model matches the size of the target log.

The configuration parameters are adjusted dynamically as the traffic changes, and the at least one rule is a combination of different rules chosen by the user; the target model is therefore determined by two dynamic factors, the at least one rule and the configuration parameters.

As an example, the human-computer interaction interface may display the names of multiple scenarios, and the user may select at least one rule by selecting at least one scenario through the interface. The interface may also display the names and purposes of multiple preset rules (for example, whether a rule is used for deduplication, merging, counting, whitelisting or grouping) for the user to choose from.

As an example, the user may select scenario 1 (whose rule 1 deduplicates traffic data) and scenario 2 (whose rule 2 merges traffic data) through the human-computer interaction interface, and the configuration parameter may be the inbound and outbound traffic size obtained from the traffic log (for example, 10 MB). The target model determined from rule 1, rule 2 and the configuration parameter can then be a model capable of deduplicating and merging 10 MB of traffic data.
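The scenario 1 plus scenario 2 example can be sketched as follows. This is only one way to read the text, not the Sabre engine's actual interface: `TargetModel`, `build_model`, the `RULES` registry, the record layout and the `traffic_size_bytes` parameter are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Record = Dict[str, object]
Rule = Callable[[List[Record]], List[Record]]


def dedup_rule(records):
    """Rule 1 (hypothetical): keep the first record of each flow key."""
    seen, out = set(), []
    for r in records:
        key = (r["src_ip"], r["dst_ip"], r["dst_port"])
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out


def merge_rule(records):
    """Rule 2 (hypothetical): merge records per source IP, summing bytes."""
    merged = {}
    for r in records:
        m = merged.setdefault(r["src_ip"], {"src_ip": r["src_ip"], "bytes": 0})
        m["bytes"] += r["bytes"]
    return list(merged.values())


# Scenario names map to preset rules; the user picks scenarios, not code.
RULES = {"scenario_1": dedup_rule, "scenario_2": merge_rule}


@dataclass
class TargetModel:
    rules: List[Rule]
    max_batch_bytes: int  # capacity taken from the "traffic size" parameter

    def process(self, records):
        if sum(r["bytes"] for r in records) > self.max_batch_bytes:
            raise ValueError("batch exceeds the capacity this model was built for")
        for rule in self.rules:  # apply the selected rules in order
            records = rule(records)
        return records


def build_model(selected, config):
    return TargetModel([RULES[name] for name in selected], config["traffic_size_bytes"])


config = {"traffic_size_bytes": 10 * 1024 * 1024}  # 10 MB, as in the example
model = build_model(["scenario_1", "scenario_2"], config)
logs = [
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.9", "dst_port": 443, "bytes": 1200},
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.9", "dst_port": 443, "bytes": 1200},
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.8", "dst_port": 80, "bytes": 300},
]
result = model.process(logs)
```

Here the duplicate of the first record is dropped by rule 1, and rule 2 then merges the two remaining records for the same source IP.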

As an example, the target model may be determined according to the at least one rule selected by the user, the configuration parameters, and authentication information. The authentication information is used for security authentication during data exchange.

Determining the target model according to the at least one rule and the configuration parameters makes it possible to encapsulate multiple rules into different models, screen out multiple data structures, and adapt to different scenarios.

S103: Process the target log through the correlation analysis engine according to the target model, and obtain and store the corresponding traffic analysis result.

The correlation analysis engine (for example, the Sabre engine) is a distributed big-data streaming correlation analysis engine capable of real-time computation and real-time statistics.

As an example, the target model corresponding to the current target log is determined; the correlation analysis engine then deduplicates, merges, counts, whitelists and groups the current target log according to the target model, completing the real-time analysis, and the resulting traffic analysis result is stored. This application does not limit the storage method.

The correlation analysis engine may be a big-data streaming correlation analysis engine capable of real-time computation and real-time statistics.

S104: Read the traffic analysis results of the preset time period, perform global deduplication and global statistics on the traffic analysis results of the preset time period through the Bloom filter to obtain the global traffic analysis result, and save the global traffic analysis result in the database.

In some examples, the traffic analysis results above are stored in real time, for example once per second; once a certain amount of traffic analysis results has been stored, the results for a preset time period can be read. As an example, if the traffic analysis results of the previous two hours have been stored and the preset time period is two hours, those two hours of results are read, global deduplication and global statistics are performed on them through the Bloom filter (yielding the deduplication and statistics results for the previous two hours), and the resulting global traffic analysis result is saved in the database. The present invention does not limit the database type or the manner in which the global traffic analysis result is saved.

A Bloom filter is essentially a binary data structure used to determine whether an element (key) belongs to a set. The Bloom algorithm it uses is a deduplication algorithm built on a set of binary data. Through the Bloom filter, global deduplication and global statistics of streaming data are achieved.
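A minimal sketch of how a Bloom filter can back global deduplication and a simple global count is given below. The bit-vector size, the number of hash functions, the double-hashing scheme over SHA-256 and the function name `global_dedup_and_count` are assumptions chosen for illustration; the patent does not specify them. A Bloom filter can report a false positive (a new key mistaken for a duplicate) but never a false negative.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # the binary vector

    def _positions(self, key):
        # Double hashing: derive k positions from two halves of one digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def global_dedup_and_count(result_keys, bloom):
    """Return the keys not seen before and the number of duplicates dropped."""
    unique, duplicates = [], 0
    for key in result_keys:
        if key in bloom:
            duplicates += 1
        else:
            bloom.add(key)
            unique.append(key)
    return unique, duplicates


bloom = BloomFilter()
window = [
    "10.0.0.1->10.0.0.9:443",
    "10.0.0.2->10.0.0.9:80",
    "10.0.0.1->10.0.0.9:443",
    "10.0.0.3->10.0.0.9:22",
    "10.0.0.2->10.0.0.9:80",
]
unique, duplicates = global_dedup_and_count(window, bloom)
```

Because the filter object outlives a single window, keys added while processing one preset time period are still recognised as duplicates in later periods, which is what makes the deduplication global.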

On the basis of the above embodiments, the correlation analysis engine may be a Sabre engine.

The Sabre engine is a big-data streaming correlation analysis engine capable of real-time computation and real-time statistics.

The streaming data processing method provided by the embodiment of the present invention takes advantage of the Sabre engine's real-time computation and real-time statistics over big data, which further facilitates real-time computation and real-time statistics on the current target log.

On the basis of the above embodiments, the configuration parameters may include at least one of the following: a traffic size corresponding to the target log, a blacklist corresponding to the target log, a storage address corresponding to the target log, a target field corresponding to the target log, and a merge field corresponding to the target log.

The configuration parameters can be obtained from the current target log. For example, the traffic size, storage address and target field can be read directly from the corresponding target log, and the blacklist can be determined by analyzing the target field. Different types of target logs correspond to different blacklists. Information is extracted according to the target field in the target log and consolidated according to the merge field. Likewise, the configuration parameters may also include a whitelist corresponding to the target log. This allows the configuration parameters for data processing to be adjusted cyclically, so as to suit target models for different data volumes.

On the basis of the above embodiments, before determining the corresponding configuration parameters according to the current target log, the method may further include: presetting at least one initial rule, the at least one initial rule being available for selection by the user.

The at least one preset initial rule may be deduplication, merging, counting, whitelisting or grouping for different types of target logs. Different initial rules selected by the user can be combined into new rules, so that the set of selectable rules keeps growing dynamically.

On the basis of the above embodiments, reading the traffic analysis results of the preset time period, performing global deduplication and global statistics on them through the Bloom filter to obtain the global traffic analysis result, and saving the global traffic analysis result in the database may include: reading the traffic analysis results of the preset time period, and performing global deduplication and global statistics on the traffic analysis results of the preset time period through the Bloom filter to obtain the global traffic analysis result; converting the global traffic analysis result into a corresponding binary vector and saving it in the Bloom filter; and saving the global traffic analysis result in the database.

The binary vector saved in the Bloom filter corresponds to the global traffic analysis result saved in the database.

On the basis of the above embodiments, the method may further include: setting a scheduled deletion task, cleaning up the data in the database according to the scheduled deletion task, and setting the binary vector corresponding to that data in the Bloom filter to zero.

The scheduled deletion task may be a deletion task defined by time. For example, the Bloom filter and the database may be set to store only one week of data; when the next week's data needs to be stored, the data with the earliest storage time is deleted.

The scheduled deletion task may also set a maximum amount of data to be stored in the Bloom filter and the database. If the Bloom filter and the database have reached this storage limit, then when new data needs to be stored, the data that has been stored the longest is deleted accordingly.

Setting the binary vector of the corresponding data in the Bloom filter to zero implements the delete (reset) function of the Bloom filter.
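The text does not say how the bits belonging to expired data are identified. In a standard Bloom filter, bit positions are shared between keys, so clearing the bits of one key can make other, still-valid keys look absent. The sketch below therefore shows one common alternative, offered as an assumption and not as the patent's mechanism: one small filter per time bucket (for example, per day), with the oldest bucket's rows and bits dropped together. `TinyBloom`, `RetainedStore` and the bucket identifiers are hypothetical, and buckets are assumed to arrive in time order.

```python
import hashlib
from collections import OrderedDict


class TinyBloom:
    """Compact Bloom filter that uses a Python int as the bit vector."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits, self.num_hashes, self.bits = num_bits, num_hashes, 0

    def _positions(self, key):
        d = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def __contains__(self, key):
        return all((self.bits >> p) & 1 for p in self._positions(key))


class RetainedStore:
    """Keeps at most `max_buckets` time buckets, each with its own filter."""

    def __init__(self, max_buckets=7):
        self.max_buckets = max_buckets
        self.buckets = OrderedDict()  # bucket id -> (TinyBloom, stored rows)

    def add(self, bucket_id, key):
        """Store `key` unless it is already marked; return True if stored."""
        if bucket_id not in self.buckets:
            self.buckets[bucket_id] = (TinyBloom(), [])
            while len(self.buckets) > self.max_buckets:
                # Evict the oldest bucket: its rows and its bits go together.
                self.buckets.popitem(last=False)
        if key in self:
            return False
        bloom, rows = self.buckets[bucket_id]
        bloom.add(key)
        rows.append(key)
        return True

    def __contains__(self, key):
        return any(key in bloom for bloom, _ in self.buckets.values())


store = RetainedStore(max_buckets=2)
first = store.add("day-1", "a")      # stored
repeat = store.add("day-2", "a")     # duplicate of day-1, rejected
other = store.add("day-2", "b")      # stored
rolled = store.add("day-3", "c")     # creating day-3 evicts day-1
```

After the last call, the key "a" is no longer marked because the only bucket that held it has been dropped, while "b" is still recognised.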

On the basis of the above embodiments, processing the target log through the correlation analysis engine according to the target model includes: performing deduplication, merging, counting, whitelisting and grouping on the target log through the correlation analysis engine according to the target model.

Deduplication may be determining whether the data in the log contains duplicates; as an example, the contents of the target field in the log are compared and duplicate contents are removed.

Merging may be consolidating the data in the log that meets a given condition.

Counting may be counting how many data segments exist in the log, or counting the total amount of duplicate data in the log, and so on. Data with specified features can be counted as needed.

Whitelisting may be directly filtering out the data that appears in a preset whitelist, or retaining the data that appears in a preset blacklist. Log traffic may also be screened according to both the whitelist and the blacklist, which is not limited in the present invention. The contents of the whitelist and the blacklist can be set according to specific needs.

Grouping may be grouping the log traffic data according to a target field in the log. For example, suppose the target field is id and each log traffic record contains an id and the data content for that id. If the target is all even ids, all records with even ids can be placed in one group and all remaining records, with odd ids, in another.
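The five operations can be sketched as small Python functions over a list of log records. The record layout (`id`, `src_ip`, `bytes`) and all function names are assumptions for illustration; the grouping example mirrors the even/odd id case described above.

```python
from collections import Counter, defaultdict


def dedup(records, field):
    """Keep the first record for each distinct value of `field`."""
    seen, out = set(), []
    for r in records:
        if r[field] not in seen:
            seen.add(r[field])
            out.append(r)
    return out


def merge(records, key_field, sum_field):
    """Consolidate records sharing `key_field` by summing `sum_field`."""
    totals = defaultdict(int)
    for r in records:
        totals[r[key_field]] += r[sum_field]
    return dict(totals)


def count(records, field):
    """Count how many records carry each value of `field`."""
    return Counter(r[field] for r in records)


def apply_whitelist(records, field, whitelist):
    """Filter out records whose `field` value is on the whitelist."""
    return [r for r in records if r[field] not in whitelist]


def group(records, key_fn):
    """Partition records by the label that `key_fn` returns."""
    groups = defaultdict(list)
    for r in records:
        groups[key_fn(r)].append(r)
    return dict(groups)


records = [
    {"id": 1, "src_ip": "10.0.0.1", "bytes": 100},
    {"id": 2, "src_ip": "10.0.0.2", "bytes": 200},
    {"id": 3, "src_ip": "10.0.0.1", "bytes": 50},
    {"id": 4, "src_ip": "10.0.0.3", "bytes": 70},
]
unique = dedup(records, "src_ip")
totals = merge(records, "src_ip", "bytes")
per_ip = count(records, "src_ip")
kept = apply_whitelist(records, "src_ip", {"10.0.0.3"})
by_parity = group(records, lambda r: "even" if r["id"] % 2 == 0 else "odd")
```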

As shown in FIG. 2, the framework of the solution in the above embodiments combines a Bloom filter with a streaming analysis engine (the Sabre engine) to implement filtering, global deduplication and global statistics for streaming (traffic) data. At initialization, the system presets a number of configurable rules, which serve as the source from which other rules are composed. The configuration parameters are adjusted automatically as the volume of streaming data changes and are synchronized to the model at regular intervals; the model is updated as the rules and configuration change. In response to changes in the volume of streaming data, the model adjusts automatically, and the corresponding data processing capacity and database write capacity are adjusted as well, until the incoming data volume and the processing capacity are in balance.

For the design of the business module of the solution in the above embodiments, refer to FIG. 3. The content of the Request in FIG. 3 may be: subject attributes (for example, the request format), object resource attributes (for example, the id of the selected rule), the rule name, and the start/stop information of the rule. The content of the Response may be: allow, deny, or error.
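The Request and Response contents listed above can be modelled with a small data class and an enumeration. The patent does not state when each response is returned, so the decision logic in `handle`, the `KNOWN_RULES` table and the choice of "json" as the accepted request format are purely hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Response(Enum):
    ALLOW = "allow"
    DENY = "deny"
    ERROR = "error"


@dataclass
class Request:
    request_format: str  # subject attribute
    rule_id: int         # object resource attribute: id of the selected rule
    rule_name: str
    enabled: bool        # start/stop information of the rule


# Hypothetical table of preset rules known to the business module.
KNOWN_RULES = {1: "dedup", 2: "merge"}


def handle(req):
    """Illustrative policy only: malformed -> error, unknown rule -> deny."""
    if req.request_format != "json":
        return Response.ERROR
    if KNOWN_RULES.get(req.rule_id) != req.rule_name:
        return Response.DENY
    return Response.ALLOW


ok = handle(Request("json", 1, "dedup", True))
unknown = handle(Request("json", 9, "group", True))
malformed = handle(Request("xml", 1, "dedup", False))
```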

1. The system presets the rules it supports by default, and the front end can directly select and combine different scenarios;

2. When the user selects different scenarios, different rules are combined; on receiving these rules, the Sabre engine builds different data processing models;

3. The engine processes the data (deduplication, merging, counting, whitelisting and grouping) according to the model that has been built; different models output different data sets;

4. The output data is converted into a corresponding binary vector according to the Bloom algorithm and saved in the Bloom filter;

5. The database write object saves the filtered and deduplicated data to the database for persistence and, at the same time, records it in the Bloom filter as a marker;

6. The business module queries the database to obtain the latest data set;

7. Based on the amount of data stored in the database and how long it has been stored, the scheduled task object cleans up the database and deletes (zeroes) the corresponding data in the Bloom filter in step;

8. Changes in the volume of streaming data are monitored and the corresponding configuration parameters are adjusted adaptively; the Sabre engine applies the latest model and configuration parameters and dynamically regulates the output of data processing, so that the volume of streaming data and the data processing capacity reach a balance.
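Step 8 describes a feedback loop but gives no control rule. The following sketch shows one simple possibility under stated assumptions: capacity is scaled up or down by a fixed factor until utilisation falls inside a target band. The function name `adjust_capacity`, the thresholds and the scaling factor are all hypothetical.

```python
import math


def adjust_capacity(capacity, observed, low=0.5, high=0.9, step=1.5, floor=1):
    """Return a new capacity that moves observed/capacity toward [low, high]."""
    utilisation = observed / capacity
    if utilisation > high:
        return math.ceil(capacity * step)        # overloaded: scale up
    if utilisation < low:
        return max(floor, int(capacity / step))  # underused: scale down
    return capacity                              # balanced: leave unchanged


def settle(capacity, observed, max_rounds=20):
    """Re-apply the adjustment until the capacity stops changing."""
    for _ in range(max_rounds):
        new_capacity = adjust_capacity(capacity, observed)
        if new_capacity == capacity:
            break
        capacity = new_capacity
    return capacity


busy = settle(1000, observed=2000)   # traffic doubles: capacity grows
quiet = settle(busy, observed=300)   # traffic drops: capacity shrinks
```

With these parameters the capacity grows from 1000 to 1500 and then 2250 under the heavier load, at which point utilisation is about 0.89 and the loop stops.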

In summary, building a variety of data processing models on top of the preset rules provides a more efficient and convenient way to screen out the data needed from streaming data and to adapt to different scenarios. As the volume of streaming data changes, the model adjusts automatically, so that streaming data is processed quickly.

FIG. 4 is a schematic structural diagram of an embodiment of the streaming data processing device of the present invention. As shown in FIG. 4, the streaming data processing device includes:

a first processing module 401, configured to determine corresponding configuration parameters according to the current target log;

a second processing module 402, configured to, in response to receiving at least one rule selected by the user, determine a target model according to the at least one rule and the configuration parameters, the data processing capability of the target model matching the size of the target log;

a third processing module 403, configured to process the target log through the correlation analysis engine according to the target model, and to obtain and store the corresponding traffic analysis result;

a fourth processing module 404, configured to read the traffic analysis results of the preset time period, perform global deduplication and global statistics on the traffic analysis results of the preset time period through the Bloom filter to obtain the global traffic analysis result, and save the global traffic analysis result in the database.

Optionally, the correlation analysis engine includes a Sabre engine.

Optionally, the configuration parameters include at least one of the following: the traffic size corresponding to the target log, the blacklist corresponding to the target log, the storage address corresponding to the target log, the target field corresponding to the target log, and the merge field corresponding to the target log.
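As an illustration only, the configuration parameters listed above could be held in a structure like the one below. The class name `LogConfig`, the field names, the field types, and the `is_valid` helper are assumptions; the source lists the parameters but does not define their format or units.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LogConfig:
    """Configuration derived from a target log (all names are hypothetical)."""
    traffic_size: Optional[int] = None                       # traffic volume of the target log; unit not given in the source
    blacklist: List[str] = field(default_factory=list)       # blacklist entries for the target log
    storage_address: Optional[str] = None                    # where analysis results are stored
    target_fields: List[str] = field(default_factory=list)   # log fields of interest
    merge_fields: List[str] = field(default_factory=list)    # fields used as the merge key

    def is_valid(self) -> bool:
        """True when at least one of the listed parameters is present."""
        return any([
            self.traffic_size is not None,
            bool(self.blacklist),
            self.storage_address is not None,
            bool(self.target_fields),
            bool(self.merge_fields),
        ])
```

For example, `LogConfig(merge_fields=["src_ip", "dst_ip"])` carries only a merge key and still satisfies the "at least one of" condition.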

Optionally, the apparatus further includes:

a fifth processing module 405, configured to preset at least one initial rule, the at least one initial rule being available for selection by the user.

Optionally, the fourth processing module 404 is configured to:

read the traffic analysis results of a preset time period, and perform global deduplication and global statistics on those results through the Bloom filter to obtain a global traffic analysis result;

convert the global traffic analysis result into a corresponding binary vector and save it in the Bloom filter;

save the global traffic analysis result in a database.
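The global deduplication and statistics steps above can be sketched with a plain Bloom filter. This is a minimal illustration under assumptions, not the patented implementation: the bit-array size, the number of hash functions, the salted SHA-256 hashing scheme, the `key` field name, and the function `global_dedup_and_count` are all hypothetical, and the database write is omitted.

```python
import hashlib
from collections import Counter


class BloomFilter:
    """Plain Bloom filter over a bit array stored in a bytearray."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: str):
        # Derive k bit positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


def global_dedup_and_count(results, bloom, key_field="key"):
    """Split a window of results into first-seen records and per-key counts."""
    unique, counts = [], Counter()
    for record in results:
        key = str(record[key_field])
        counts[key] += 1              # statistics: occurrences within this window
        if key not in bloom:          # a Bloom filter may rarely report a false positive
            bloom.add(key)            # set the key's bits in the binary vector
            unique.append(record)
    return unique, counts
```

Because the filter persists across windows, a key first seen in an earlier time period is treated as a duplicate in later ones, which is what makes the deduplication global rather than per-window. The trade-off is that a Bloom filter can report false positives, so a small fraction of new keys may be dropped as duplicates.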

Optionally, the apparatus further includes:

a sixth processing module 406, configured to set a scheduled deletion task, clean up data in the database according to the scheduled deletion task, and set the binary vector of the corresponding data in the Bloom filter to zero.
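The cleanup performed by the scheduled deletion task can be sketched as below. The source does not say how the bits are zeroed. In a plain Bloom filter, several keys can share a bit, so clearing only the bits of deleted keys could make the filter forget keys that are still stored. This sketch therefore assumes a different approach, named plainly: the whole vector is zeroed and rebuilt from the rows that remain. The table name `results`, its columns, the hashing scheme, and all function names are hypothetical, and the scheduling mechanism itself (for example a cron job or timer) is left out.

```python
import hashlib
import sqlite3

NUM_BITS = 1 << 16   # assumed size of the binary vector
NUM_HASHES = 4       # assumed number of hash functions


def bit_positions(key: str):
    """Bit positions for a key, using salted SHA-256 (an assumption)."""
    for i in range(NUM_HASHES):
        digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
        yield int.from_bytes(digest[:8], "big") % NUM_BITS


def set_key(bits: bytearray, key: str) -> None:
    for pos in bit_positions(key):
        bits[pos // 8] |= 1 << (pos % 8)


def has_key(bits: bytearray, key: str) -> bool:
    return all(bits[pos // 8] & (1 << (pos % 8)) for pos in bit_positions(key))


def purge_expired(conn: sqlite3.Connection, bits: bytearray, cutoff: float) -> int:
    """Delete rows older than cutoff, then rebuild the bit vector from survivors."""
    cur = conn.execute("DELETE FROM results WHERE created_at < ?", (cutoff,))
    deleted = cur.rowcount
    conn.commit()
    # Zero the whole vector, then re-add every key that is still stored.
    bits[:] = bytes(len(bits))
    for (key,) in conn.execute("SELECT key FROM results"):
        set_key(bits, key)
    return deleted
```

Rebuilding costs one pass over the retained rows per cleanup. A counting Bloom filter would allow per-key removal instead, at the price of more memory per position.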

Optionally, the third processing module 403 is further configured to: according to the target model, perform deduplication, merging, counting, whitelisting, and grouping on the target log through the correlation analysis engine.
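The five operations named above can be illustrated with a small in-memory function. This is a sketch under assumptions and does not reflect how the Sabre engine works internally: logs are assumed to be flat dictionaries with hashable values, whitelisting is assumed to match on a `src_ip` field, and the function name `process_logs` and its parameters are hypothetical.

```python
from collections import defaultdict


def process_logs(logs, whitelist, merge_fields, group_field):
    """Apply whitelisting, deduplication, merging with counts, and grouping."""
    seen = set()
    merged = {}
    for log in logs:
        if log.get("src_ip") in whitelist:
            continue                       # whitelisting: skip trusted sources
        fingerprint = tuple(sorted(log.items()))
        if fingerprint in seen:
            continue                       # deduplication: drop exact repeats
        seen.add(fingerprint)
        merge_key = tuple(log.get(f) for f in merge_fields)
        if merge_key not in merged:
            # merging: the first log for a key becomes the merged record
            merged[merge_key] = {**log, "count": 0}
        merged[merge_key]["count"] += 1    # counting: distinct logs merged per key
    groups = defaultdict(list)
    for record in merged.values():
        groups[record.get(group_field)].append(record)   # grouping
    return dict(groups)
```

For instance, calling `process_logs(logs, {"2.2.2.2"}, ["src_ip", "dst_ip"], "event")` drops every log from `2.2.2.2`, collapses logs that share a source and destination into one record with a `count`, and returns the merged records keyed by their `event` value.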

An example is as follows:

FIG. 5 illustrates a schematic diagram of the physical structure of an electronic device. As shown in FIG. 5, the electronic device may include a processor 501, a communications interface 502, a memory 503, and a communication bus 504, where the processor 501, the communications interface 502, and the memory 503 communicate with one another through the communication bus 504. The processor 501 may invoke logic instructions in the memory 503 to perform the following method: determining corresponding configuration parameters according to the current target log; in response to receiving at least one rule selected by a user, determining a target model according to the at least one rule and the configuration parameters, where the data processing capability of the target model matches the size of the target log; processing the target log through a correlation analysis engine according to the target model to obtain and store a corresponding traffic analysis result; and reading the traffic analysis results of a preset time period, performing global deduplication and global statistics on those results through a Bloom filter to obtain a global traffic analysis result, and saving the global traffic analysis result in a database.

In addition, the logic instructions in the memory 503 may be implemented in the form of software functional units and, when sold or used as an independent product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

In another aspect, an embodiment of the present invention further provides a computer program product. The computer program product includes a computer program stored on a non-transitory computer-readable storage medium, and the computer program includes program instructions. When the program instructions are executed by a computer, the computer can execute the streaming data processing method provided by the above embodiments, for example including: determining corresponding configuration parameters according to the current target log; in response to receiving at least one rule selected by a user, determining a target model according to the at least one rule and the configuration parameters, where the data processing capability of the target model matches the size of the target log; processing the target log through a correlation analysis engine according to the target model to obtain and store a corresponding traffic analysis result; and reading the traffic analysis results of a preset time period, performing global deduplication and global statistics on those results through a Bloom filter to obtain a global traffic analysis result, and saving the global traffic analysis result in a database.

In yet another aspect, the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements the streaming data processing method provided by the above embodiments, for example including: determining corresponding configuration parameters according to the current target log; in response to receiving at least one rule selected by a user, determining a target model according to the at least one rule and the configuration parameters, where the data processing capability of the target model matches the size of the target log; processing the target log through a correlation analysis engine according to the target model to obtain and store a corresponding traffic analysis result; and reading the traffic analysis results of a preset time period, performing global deduplication and global statistics on those results through a Bloom filter to obtain a global traffic analysis result, and saving the global traffic analysis result in a database.

The apparatus embodiments described above are merely illustrative. The modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical modules; that is, they may be located in one place or distributed across multiple network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.

From the above description of the embodiments, those skilled in the art can clearly understand that each embodiment may be implemented by software plus a necessary general-purpose hardware platform, and may of course also be implemented by hardware. Based on this understanding, the above technical solutions in essence, or the part that contributes to the prior art, may be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods of the embodiments or of certain parts of the embodiments.

Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method of streaming data processing, the method comprising:

determining corresponding configuration parameters according to the current target log;

in response to receiving at least one rule selected by a user, determining a target model according to the at least one rule and the configuration parameters, wherein the data processing capacity of the target model is matched with the size of the target log;

processing the target log through a correlation analysis engine according to the target model to obtain and store a corresponding traffic analysis result;

reading the traffic analysis results of a preset time period, performing global deduplication and global statistics on the traffic analysis results of the preset time period through a Bloom filter to obtain a global traffic analysis result, and storing the global traffic analysis result in a database.

2. The streaming data processing method of claim 1, wherein the correlation analysis engine comprises a Sabre engine.

3. The streaming data processing method according to any one of claims 1 to 2, wherein the configuration parameters include at least one of: the traffic size corresponding to the target log, the blacklist corresponding to the target log, the storage address corresponding to the target log, the target field corresponding to the target log, and the merge field corresponding to the target log.

4. The streaming data processing method according to any one of claims 1 to 2, wherein before determining the corresponding configuration parameter according to the current target log, the method further comprises:

presetting at least one initial rule, the at least one initial rule being for selection by a user.

5. The streaming data processing method according to any one of claims 1 to 2, wherein the reading the traffic analysis results of the preset time period, performing global deduplication and global statistics on the traffic analysis results of the preset time period through a Bloom filter to obtain a global traffic analysis result, and storing the global traffic analysis result in a database comprises:

reading the traffic analysis results of the preset time period, and performing global deduplication and global statistics on the traffic analysis results of the preset time period through the Bloom filter to obtain the global traffic analysis result;

converting the global traffic analysis result into a corresponding binary vector and storing the binary vector in the Bloom filter;

and storing the global traffic analysis result in the database.

6. The streaming data processing method of claim 5, wherein the method further comprises:

setting a scheduled deletion task, clearing the data in the database according to the scheduled deletion task, and setting the binary vector corresponding to that data in the Bloom filter to zero.

7. The streaming data processing method of claim 1, wherein the processing the target log by a correlation analysis engine according to the target model comprises:

deduplicating, merging, counting, whitelisting, and grouping the target log through the correlation analysis engine according to the target model.

8. A streaming data processing apparatus, characterized in that the apparatus comprises:

a first processing module, configured to determine corresponding configuration parameters according to the current target log;

a second processing module, configured to, in response to receiving at least one rule selected by a user, determine a target model according to the at least one rule and the configuration parameters, wherein the data processing capacity of the target model is matched with the size of the target log;

a third processing module, configured to process the target log through a correlation analysis engine according to the target model to obtain and store a corresponding traffic analysis result;

and a fourth processing module, configured to read the traffic analysis results of a preset time period, perform global deduplication and global statistics on the traffic analysis results of the preset time period through a Bloom filter to obtain a global traffic analysis result, and store the global traffic analysis result in a database.

9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the streaming data processing method according to any of claims 1 to 6 are implemented when the processor executes the program.

10. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the streaming data processing method according to any one of claims 1 to 6.

11. A computer program product having stored thereon executable instructions, characterized in that the instructions, when executed by a processor, cause the processor to carry out the steps of the streaming data processing method according to any of claims 1 to 6.