CA3057038C - Data filtering method, apparatus, electronic apparatus and storage medium - Google Patents

Info

Publication number
CA3057038C
Authority
CA
Canada
Prior art keywords
data
identification information
storage medium
computer
readable storage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CA3057038A
Other languages
French (fr)
Other versions
CA3057038A1 (en)
Inventor
Wanqiang LIU
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
10353744 Canada Ltd
Original Assignee
10353744 Canada Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 10353744 Canada Ltd
Publication of CA3057038A1
Application granted
Publication of CA3057038C

Abstract

The embodiments of the present invention disclose a data filtering method, a data filter, an electronic device, and a storage medium and relate to the technical field of big data. The method includes the following steps: generating a broadcast variable based on identification information of multiple pieces of data in a first data table and broadcasting said broadcast variable to each working node; extracting the identification information of newly added data generated at said working nodes and determining whether it exists in said broadcast variable; if yes, filtering the corresponding newly added data to a resilient distributed dataset to be processed. The present invention can solve the problems of delay and poor data processing efficiency due to occupation of a considerable amount of memory by the temporary table during data filtering.

Description

DATA FILTERING METHOD, APPARATUS, ELECTRONIC APPARATUS AND STORAGE MEDIUM
Technical Field
[0001] The present invention relates to the technical field of big data, in particular to a data filtering method, a data filter, an electronic device, and a computer-readable storage medium.
Background
[0002] The rapid development of Internet technology has ushered us into an era of big data, in which massive rapidly-updating real-time data is created. Against this backdrop, data filtering technology has emerged.
[0003] At present, related data filtering technologies function as follows: a Spark program reads the newly added monitoring and alerting data from Kafka, and Spark SQL directly generates a temporary table from that data. Then, the temporary table and the alerting table in the database are subjected to a join query, and finally, the results of the join query are added into the database. However, a considerable amount of memory is required when directly converting the newly added data into a temporary table, and more reading and writing operations are needed when conducting a join query on two data tables, which results in frequent delays and poorer data processing efficiency.
[0004] It should be noted that the information disclosed in the Background section above is only meant to enhance the understanding of the background of the invention, and thus it can include information that does not constitute the existing technology known to those having ordinary skill in the art.
Summary
[0005] The embodiments of the present invention aim to provide a data filtering method, a data filter, an electronic device, and a computer-readable storage medium, which can, to some extent, overcome the problems of severe memory occupation and frequent delays during data filtering caused by the limitations and defects of the related technologies.
[0006] Other features and advantages of the present invention will be apparent from the following detailed descriptions or learned by practicing the present disclosure.

Date Recue/Date Received 2023-03-09
[0007] According to the first aspect of the embodiments of the present invention, a data filtering method is provided. The method comprises: generating a broadcast variable based on identification information of multiple pieces of data in a first data table and broadcasting said broadcast variable to each working node; extracting the identification information of newly added data generated at said working nodes and determining whether it exists in said broadcast variable; if yes, filtering the corresponding newly added data to a resilient distributed dataset to be processed.
[0008] In some embodiments of the present invention, the step of generating a broadcast variable based on identification information of multiple pieces of data in a first data table according to the above scheme comprises: acquiring the identification information of multiple pieces of data in the first data table as a first keyword; generating a BitSet corresponding to the identification information by putting said first keyword into a hash function; and generating a broadcast variable with said BitSet as initial data.
[0009] In some embodiments of the present invention, the step of determining whether the identification information of the newly added data exists in the broadcast variable according to the above scheme comprises: using the identification information of the newly added data as a second keyword and putting said second keyword into a hash function; and determining whether the second keyword exists in said BitSet according to the hashing results.
[0010] According to the above scheme, the data filtering method in some embodiments of the present invention further comprises: generating a temporary table based on the resilient distributed dataset to be processed and performing a join query on said temporary table and a second data table.
[0011] In some embodiments of the present invention, the step of generating a temporary table based on the resilient distributed dataset to be processed according to the above scheme comprises: creating a sub-thread, and converting the to-be-processed resilient distributed dataset into a DataFrame by said sub-thread; and generating a temporary table based on said DataFrame.
[0012] According to the above scheme, the data filtering method in some embodiments of the present invention is characterized by: said first data table being an alerting rule table, and said identification information referring to date, IP address, and alert type.
[0013] According to the above scheme, the data filtering method in some embodiments of the present invention is characterized by: said hash function specifically referring to the MurmurHash algorithm.

[0014] According to the second aspect of the embodiments of the present invention, a data filter is provided. The data filter comprises: a broadcast unit for generating a broadcast variable based on identification information of multiple pieces of data in the first data table and broadcasting said broadcast variable to each working node; a determination unit for extracting the identification information of newly added data generated at said working nodes and determining whether it exists in the broadcast variable; and a filtering unit for filtering the corresponding newly added data to a resilient distributed dataset to be processed when the identification information of the newly added data exists in said broadcast variable.
[0015] According to the third aspect of the embodiments of the present invention, an electronic device is provided. Said electronic device includes a processor and a memory having stored thereon computer-readable instructions, which can be used to implement the data filtering method when executed by said processor.
[0016] According to the fourth aspect of the embodiments of the present invention, a computer-readable medium is provided. Said computer-readable medium includes a computer program, which can be used to implement the data filtering method described in the above embodiments when executed by a processor.
[0017] The technical scheme provided in the embodiments of the present invention achieves the following beneficial effects:
[0018] According to the technical scheme of some embodiments in the present invention, a broadcast variable is generated based on identification information of multiple pieces of data in a first data table and broadcast to each working node; the identification information of newly added data is extracted and judged to determine whether it exists in the broadcast variable; and the corresponding newly added data is filtered into a resilient distributed dataset to be processed if its identification information exists in the broadcast variable. Firstly, since the broadcast variable generated according to the identification information of the multiple pieces of data in the first data table is broadcast to each working node, all tasks processed by the executor can share one copy of the data, thereby avoiding copying a large amount of data. Secondly, since the newly added data of each working node is filtered according to the broadcast variable, and the temporary table is generated based on the filtered data, the memory space occupied by the temporary table can be reduced. Thirdly, the join query of the generated temporary table and the second data table can eliminate a large number of reading and writing operations, avoid delays, and improve data processing efficiency.

[0019] It should be understood that both the foregoing general description and the detailed descriptions below are merely exemplary and explanatory and are not intended to limit the present invention.
Brief Description of the Drawings
[0020] The accompanying drawings herein are incorporated into the specification as a part of it; they illustrate the embodiments of the present invention and, together with the specification, explain the principles contained therein. Notably, the drawings described below are only some of the embodiments of the present invention. A person having ordinary skill in the art can infer other drawings based on these drawings without any creative labor. In the accompanying drawings:
[0021] Fig. 1 is a schematic diagram showing the process of the data filtering method as described in some embodiments of the present invention;
[0022] Fig. 2 is a schematic diagram showing the data filtering process as described in some embodiments of the present invention;
[0023] Fig. 3 is a schematic diagram showing the data filter as described in some embodiments of the present invention;
[0024] Fig. 4 is a structural schematic of the computer system of the electronic device as described in some embodiments of the present invention;
[0025] Fig. 5 is a schematic diagram showing the computer-readable storage medium as described in some embodiments of the present invention.
Detailed Description
[0026] Exemplary embodiments will now be described more fully, with reference to the accompanying drawings. However, the exemplary embodiments can be embodied in a variety of forms and should not be construed as being limited to the embodiments set forth herein. Rather, these embodiments are provided so that the present invention will be thorough and complete and will fully convey the concepts of the exemplary embodiments to those skilled in the art. The same reference numbers in the drawings denote the same or similar parts, so each number will only be described once.
[0027] Furthermore, the described features, structures, or characteristics can be combined in any suitable manner in one or more embodiments. Numerous specific details are set forth below to provide a better understanding of the embodiments of the present invention. However, a person skilled in the art shall note that the technical scheme of the present invention may be implemented without one or more of the specific details, or other methods, components, apparatuses, and steps may be employed. In other instances, well-known methods, apparatuses, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the present invention.
[0028] The block diagrams shown in the figures are merely functional entities and do not necessarily correspond to physically separate entities. That is, these functional entities may be implemented as software, implemented in one or more hardware modules or integrated circuits, or implemented in different networks and/or processor devices and/or microcontroller devices.
[0029] The flowcharts shown in the figures are merely illustrative. Not all of the content and operations/steps are necessarily included or performed in the order described. For example, some operations/steps may be divided, and some operations/steps may be combined or partially merged. Therefore, the actual execution order may vary depending on the situation.
[0030] In this exemplary embodiment, a data filtering method is provided first. Fig. 1 is a schematic diagram showing the process of the data filtering method as described in some embodiments of the present invention. As shown in Fig. 1, the data filtering method includes the following steps:
[0031] step S110, generating a broadcast variable based on identification information of multiple pieces of data in a first data table and broadcasting said broadcast variable to each working node;
[0032] step S120, extracting the identification information of newly added data generated at said working nodes and determining whether it exists in the broadcast variable;
[0033] step S130, if yes, filtering the corresponding newly added data to a resilient distributed dataset to be processed.
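Steps S110 to S130 can be illustrated with a minimal single-process sketch. It is not the Spark implementation described in this specification: a plain Python frozenset stands in for the broadcast variable, a list stands in for the resilient distributed dataset, and the names Record, identification, make_shared_ids and filter_new_data are hypothetical.

```python
from typing import FrozenSet, Iterable, List, NamedTuple, Tuple


class Record(NamedTuple):
    """Hypothetical record layout; the field names are assumptions."""
    date: str
    ip: str
    alert_type: str
    value: float


IdInfo = Tuple[str, str, str]


def identification(record: Record) -> IdInfo:
    # Identification information: date, IP address and alert type.
    return (record.date, record.ip, record.alert_type)


def make_shared_ids(first_table: Iterable[Record]) -> FrozenSet[IdInfo]:
    # S110: collect the identification information of the first data table.
    # The frozenset is a stand-in for the broadcast variable.
    return frozenset(identification(r) for r in first_table)


def filter_new_data(new_data: Iterable[Record],
                    shared_ids: FrozenSet[IdInfo]) -> List[Record]:
    # S120: look up each new record's identification information.
    # S130: keep the record only if that information is present.
    return [r for r in new_data if identification(r) in shared_ids]


rule_table = [Record("2019-09-01", "10.0.0.1", "cpu_load", 0.0)]
new_data = [
    Record("2019-09-01", "10.0.0.1", "cpu_load", 0.97),
    Record("2019-09-01", "10.0.0.2", "disk_usage", 0.50),
]
to_process = filter_new_data(new_data, make_shared_ids(rule_table))
print(to_process)
```

Only the first new record is kept, because only its identification information appears in the rule table.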
[0034] According to the data filtering method in this exemplary embodiment, firstly, since the broadcast variable generated according to the identification information of the multiple pieces of data in the first data table is broadcast to each working node, all tasks processed by the executor can share one copy of the data, thereby avoiding copying a large amount of data. Secondly, since the newly added data of each working node is filtered according to the broadcast variable, and the temporary table is generated based on the filtered data, the memory space occupied by the temporary table can be reduced. Thirdly, the join query of the generated temporary table and the second data table can eliminate a large number of reading and writing operations, avoid delays, and improve data processing efficiency.
[0035] The data filtering method in this exemplary embodiment will be further described hereinafter.
[0036] As shown in Fig. 1, in step S110, a broadcast variable is generated based on identification information of multiple pieces of data in the first data table and broadcast to each working node.
[0037] In one exemplary embodiment of the present invention, multiple pieces of data in the first data table are acquired and the identification information contained therein is extracted. For example, when the first data table is an alerting rule table including an alert type field, the identification information of the data extracted from the first data table may be alert type information such as CPU load, excessively high memory occupation, and disk occupation. In addition, the alerting rule table can further include a date field and an IP address field. Thus, the extracted identification information may be date, IP address or alert type contained in each piece of data.
[0038] Furthermore, a first keyword may be generated according to the identification information of each piece of data and then hashed to generate a BitSet corresponding to the identification information, i.e. a Bloom filter. Then the obtained BitSet is used as initial data to generate a broadcast variable. The broadcast variable is a shared variable in Spark, by which all tasks processed by the executor can share a piece of data, thereby reducing the workload of data copying. The use of a Bloom filter during the generation of the broadcast variable can significantly reduce the memory occupied by the broadcast variable.
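The construction of the BitSet from the first keywords can be sketched as follows. This is a simplified illustration under stated assumptions: the bit array size (65,536 bits) and the number of hash functions (3) are chosen arbitrarily, salted SHA-256 from the Python standard library is swapped in for the hash function, and the names bit_positions and build_bitset are hypothetical.

```python
import hashlib
from typing import Iterable, List

NUM_BITS = 1 << 16   # assumed size of the bit array
NUM_HASHES = 3       # assumed number of hash functions


def bit_positions(keyword: str,
                  num_bits: int = NUM_BITS,
                  num_hashes: int = NUM_HASHES) -> List[int]:
    """Map a keyword to one bit position per hash function."""
    positions = []
    for seed in range(num_hashes):
        # Salted SHA-256 is a stand-in hash, used here only because it
        # ships with Python.
        digest = hashlib.sha256(f"{seed}:{keyword}".encode("utf-8")).digest()
        positions.append(int.from_bytes(digest[:8], "big") % num_bits)
    return positions


def build_bitset(first_keywords: Iterable[str],
                 num_bits: int = NUM_BITS) -> bytearray:
    """Set the bits of every first keyword in a fixed-size bit array."""
    bits = bytearray(num_bits // 8)
    for keyword in first_keywords:
        for pos in bit_positions(keyword, num_bits):
            bits[pos // 8] |= 1 << (pos % 8)
    return bits


keyword = "2019-09-01|10.0.0.1|cpu_load"
bitset = build_bitset([keyword])
print(len(bitset))  # 8192 bytes, regardless of how many keywords are added
```

The bit array has a fixed size, so the data to be broadcast does not grow with the number of rows in the first data table.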
[0039] It should be noted that, although this exemplary embodiment adopts the alerting rule table as an example of the first data table, other suitable data tables, such as monitoring indicator tables or filtering rule tables, may also apply. These should also be included in the protection scope of the present invention. The previous data and the newly added data in the first data table can be saved in a MySQL database.
[0040] In step S120, the identification information of newly added data generated at said working nodes is extracted and then judged to determine whether it exists in said broadcast variable.
[0041] In an exemplary embodiment of the present invention, before the newly added data generated at the working nodes forms an RDD (Resilient Distributed Dataset), a Bloom filter may be used to determine whether its identification information exists in the broadcast variable. For example, the identification information of the newly added data can be extracted and used as a second keyword. Then a corresponding Key is generated by putting said second keyword into a hash function and then judged to determine whether it exists in said broadcast variable. Specifically, the second keyword can be put into a number, K (for example 3), of hash functions to obtain K array positions, such as BitSet positions. If any one of the array positions is 0, it is determined that the second keyword does not exist in the broadcast variable; if all K positions are 1, the second keyword is treated as existing in the broadcast variable.
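The check against K array positions can be sketched as below. The sketch assumes K = 3 and a 1,024-bit array held in a Python integer; the three hash functions are built from zlib.crc32 and zlib.adler32 purely for illustration, and the names positions, add and might_contain are hypothetical.

```python
import zlib
from typing import Callable, List, Sequence

NUM_BITS = 1024  # assumed size of the bit array

HashFn = Callable[[bytes], int]

# K = 3 illustrative hash functions; these are stand-ins chosen from the
# standard library, not the hash functions named in the specification.
HASH_FNS: Sequence[HashFn] = (
    zlib.crc32,
    zlib.adler32,
    lambda data: zlib.crc32(data, 0x9747B28C),
)


def positions(keyword: str) -> List[int]:
    """Return the K array positions for a keyword."""
    data = keyword.encode("utf-8")
    return [fn(data) % NUM_BITS for fn in HASH_FNS]


def add(bitset: int, keyword: str) -> int:
    """Return a copy of the bit array with the keyword's K bits set."""
    for pos in positions(keyword):
        bitset |= 1 << pos
    return bitset


def might_contain(bitset: int, keyword: str) -> bool:
    # If any one of the K positions is 0, the keyword is definitely absent.
    # If all are 1, the keyword is treated as present.
    return all((bitset >> pos) & 1 for pos in positions(keyword))


second_keyword = "2019-09-01|10.0.0.1|cpu_load"
known = add(0, second_keyword)
print(might_contain(known, second_keyword))  # True
print(might_contain(0, second_keyword))      # False: every bit is 0
```

A result of True can occasionally be a false positive when unrelated keywords happen to set the same bits, whereas a result of False is always correct.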
[0042] In this exemplary embodiment, the hash function may refer to the MurmurHash algorithm. MurmurHash is a non-cryptographic hash function, which is suitable for general hash-based lookup operations and offers a well-balanced output distribution and a low collision rate for complicated data. This function can be used to realize a Bloom filter.
[0043] It should be noted that the hash function in exemplary embodiments of the present invention may refer to other appropriate hash algorithms as well, such as CityHash, SpookyHash or FNV Hash. Regarding this point, the invention does not have any special limitations.
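For reference, a pure-Python sketch of the 32-bit variant of MurmurHash3 is given below. The specification names only "MurmurHash" and does not state which variant or seed is used, so the choice of the 32-bit MurmurHash3 variant, the function name murmur3_32 and the example bit array size are assumptions.

```python
def murmur3_32(data: bytes, seed: int = 0) -> int:
    """32-bit MurmurHash3 of a byte string, returned as an unsigned int."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    mask = 0xFFFFFFFF
    h = seed & mask
    length = len(data)
    nblocks = length // 4

    # Body: mix the input four bytes at a time (little-endian blocks).
    for i in range(nblocks):
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        k = (k * c1) & mask
        k = ((k << 15) | (k >> 17)) & mask
        k = (k * c2) & mask
        h ^= k
        h = ((h << 13) | (h >> 19)) & mask
        h = (h * 5 + 0xE6546B64) & mask

    # Tail: mix the remaining one to three bytes, if any.
    tail = data[4 * nblocks:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & mask
        k = ((k << 15) | (k >> 17)) & mask
        k = (k * c2) & mask
        h ^= k

    # Finalization: fold in the length and avalanche the bits.
    h ^= length
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & mask
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & mask
    h ^= h >> 16
    return h


NUM_BITS = 1 << 16  # assumed bit array size
key = "2019-09-01|10.0.0.1|cpu_load".encode("utf-8")
# Different seeds yield the K different hash functions of a Bloom filter.
bit_positions = [murmur3_32(key, seed) % NUM_BITS for seed in range(3)]
print(bit_positions)
```

Varying the seed is one common way to derive several hash functions from a single algorithm; other schemes are equally possible.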
[0044] In step S130, the corresponding newly added data is filtered into a to-be-processed resilient distributed dataset when the identification information of the newly added data exists in said broadcast variable.
[0045] In an exemplary embodiment of the present invention, if it is determined that the Key generated from the second keyword by the hashing algorithm exists in the broadcast variable, the newly added data corresponding to the second keyword is filtered into the resilient distributed dataset to be processed, otherwise, it is directly ignored.
[0046] In an exemplary embodiment of the present invention, after the newly added data is filtered into the resilient distributed dataset to be processed, a first sub-thread is created so as to convert said to-be-processed resilient distributed dataset into a DataFrame. The DataFrame is structured data stored in specified columns and can be used to process the data in the SQL manner, just like tables in traditional databases. A temporary table can be generated according to the converted DataFrame and then subjected to a join query (i.e. duplicated data check) together with the second data table so as to filter out the duplicated data and obtain the required data, thus ensuring a more accurate filtering effect. To perform a join query is to extract data from two tables to form a new combined one if related fields in the two tables meet the requirement of join. For example, if the first data table and the second data table are subjected to a join query, a field of the same alert type is obtained.
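The temporary table and join query can be illustrated with an in-memory SQLite database from the Python standard library, used here as a stand-in for Spark SQL. The table names, column names and join condition (equality on date, IP address and alert type) are assumptions, since the specification does not fix them; the function name join_with_second_table is hypothetical.

```python
import sqlite3
from typing import List, Sequence, Tuple


def join_with_second_table(
    filtered_rows: Sequence[Tuple[str, str, str, float]],
    second_table_rows: Sequence[Tuple[str, str, str]],
) -> List[Tuple[str, str, str, float]]:
    """Load the filtered rows into a temporary table and join them with
    the rows of a second table on date, IP address and alert type."""
    conn = sqlite3.connect(":memory:")
    try:
        # Temporary table built from the already filtered data.
        conn.execute(
            "CREATE TEMP TABLE new_alerts "
            "(date TEXT, ip TEXT, alert_type TEXT, value REAL)"
        )
        # Stand-in for the second data table.
        conn.execute(
            "CREATE TABLE alerts (date TEXT, ip TEXT, alert_type TEXT)"
        )
        conn.executemany(
            "INSERT INTO new_alerts VALUES (?, ?, ?, ?)", filtered_rows
        )
        conn.executemany(
            "INSERT INTO alerts VALUES (?, ?, ?)", second_table_rows
        )
        # Inner join: keep rows whose related fields match in both tables.
        return conn.execute(
            "SELECT n.date, n.ip, n.alert_type, n.value "
            "FROM new_alerts AS n "
            "JOIN alerts AS a "
            "ON n.date = a.date AND n.ip = a.ip "
            "AND n.alert_type = a.alert_type "
            "ORDER BY n.date, n.ip, n.alert_type"
        ).fetchall()
    finally:
        conn.close()


filtered = [
    ("2019-09-01", "10.0.0.1", "cpu_load", 0.97),
    ("2019-09-01", "10.0.0.3", "cpu_load", 0.91),
]
second_table = [("2019-09-01", "10.0.0.1", "cpu_load")]
joined = join_with_second_table(filtered, second_table)
print(joined)
```

Because the temporary table holds only the filtered rows, the join touches far fewer rows than it would if all newly added data were loaded first.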
[0047] In an exemplary embodiment of the present invention, the temporary table generated according to the converted DataFrame is subjected to a join query together with the second data table to obtain the required data, which is then processed by an operator and inserted into the target database.
[0048] In an exemplary embodiment of the present invention, a second sub-thread is created to acquire the identification information of multiple pieces of updated data in the first data table; the identification information of each piece of updated data is used as a first keyword and hashed to generate a BitSet corresponding to the identification information; and a new broadcast variable is generated with the BitSet as the initial data and broadcast to each working node, completing the update of the broadcast variable.
[0049] Furthermore, if the data in the first data table is updated periodically, the broadcast variable may also be updated on a regular basis. That is, the data in the updated first data table is read at certain intervals to extract its identification information, and then a BitSet corresponding to the extracted identification information is generated by a hash algorithm and used as the initial data to generate a new broadcast variable.
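The periodic update performed by the second sub-thread can be sketched with the Python threading module. This is a single-process illustration only: a frozenset replaced under a lock stands in for generating and broadcasting a new broadcast variable, and the class name SharedFilter, its methods and the load_ids callback are hypothetical.

```python
import threading
from typing import Callable, Iterable


class SharedFilter:
    """Holds the current set of identification keys and can rebuild it."""

    def __init__(self, load_ids: Callable[[], Iterable[str]]) -> None:
        self._load_ids = load_ids   # re-reads the first data table
        self._lock = threading.Lock()
        self._ids = frozenset(load_ids())

    def contains(self, key: str) -> bool:
        with self._lock:
            return key in self._ids

    def refresh(self) -> None:
        # Build the replacement outside the lock, then swap it in.
        new_ids = frozenset(self._load_ids())
        with self._lock:
            self._ids = new_ids

    def start_periodic_refresh(self, interval_seconds: float,
                               stop_event: threading.Event) -> threading.Thread:
        """Start a daemon thread that refreshes until stop_event is set."""
        def loop() -> None:
            # Event.wait returns False on timeout and True once set.
            while not stop_event.wait(interval_seconds):
                self.refresh()

        thread = threading.Thread(target=loop, daemon=True)
        thread.start()
        return thread


rule_table = ["2019-09-01|10.0.0.1|cpu_load"]
shared = SharedFilter(lambda: list(rule_table))

rule_table.append("2019-09-01|10.0.0.2|disk_usage")
before = shared.contains("2019-09-01|10.0.0.2|disk_usage")
shared.refresh()
after = shared.contains("2019-09-01|10.0.0.2|disk_usage")
print(before, after)  # False True
```

In a long-running job, start_periodic_refresh would be called once with the desired interval, and the returned thread stops after stop_event is set.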
[0050] Fig. 2 is a schematic block diagram showing the data filtering process as described in some embodiments of the present invention.
[0051] As shown in Fig. 2, in step S201, the Streaming Context in Spark is initialized.
[0052] In step S202, the data in the first data table (i.e. the alerting rule table) is read to extract the identification information of each piece of data, wherein the identification information is "date + IP address + alerting type". Then a BitSet is generated by using the identification information of the data as the first keyword.
[0053] In step S203, a broadcast variable is generated with said BitSet as the initial data.

[0054] In step S204, the newly added data in working nodes is read, and a second keyword is generated from the identification information of each piece of data, i.e. "date + IP address + alerting type".
[0055] In step S205, the Key generated from the second keyword in step S204 is judged to determine whether it exists in the broadcast variable.
[0056] In step S206, if the Key generated in step S204 is determined to exist in the broadcast variable, then a first sub-thread (i.e. operator process thread) is created.
[0057] In step S207, in the operator process thread, the filtered resilient distributed dataset is converted into a DataFrame, and a temporary table is generated and subjected to a join query together with the data existing in the alerting table to obtain the data that needs to be stored. Then the data is incorporated into the target database.
[0058] In step S208, the result of step S207 is returned, marking the end of data filtering.
[0059] In step S209, a second sub-thread, i.e. broadcast variable updating thread, is created.
[0060] In step S210, in the broadcast variable updating thread, the data in the alerting rule table is re-read, and a new BitSet is generated by using the identification information of each piece of data as the first keyword Key. Then the broadcast variable is re-assigned with the new BitSet, completing the update of the broadcast variable.
[0061] In step S211, the updating result of step S210 is returned, marking the end of broadcast variable updating.
[0062] It should be noted that although the method of the present invention is described in a particular order in the accompanying drawings, it is not required or implied that it must be executed in that particular order or that the desired results can only be achieved after all the steps have been performed. Additionally or alternatively, some steps may be omitted or combined into one step, and/or one step can be split into several steps for execution.
[0063] In addition, the embodiments of the present invention also disclose a data filter. Fig. 3 is a schematic block diagram showing the data filter as described in some embodiments of the present invention. As shown in Fig. 3, the data filter 300 includes: a broadcast unit 310, a determination unit 320 and a filtering unit 330. The broadcast unit 310 is used for generating a broadcast variable based on identification information of multiple pieces of data in a first data table and broadcasting said broadcast variable to each working node; the determination unit 320 is used for extracting the identification information of newly added data generated at said working nodes and determining whether it exists in said broadcast variable; and the filtering unit 330 is used for filtering the corresponding newly added data to a resilient distributed dataset to be processed when the identification information of the newly added data exists in said broadcast variable.
[0064] According to the above technical scheme, the broadcast unit 310, in an exemplary embodiment of the present invention, is configured to acquire the identification information of multiple pieces of data in the first data table as a first keyword; generate a BitSet corresponding to the identification information by putting said first keyword into a hash function; and generate a broadcast variable with said BitSet as initial data.
[0065] According to the above technical scheme, the determination unit 320, in an exemplary embodiment of the present invention, is configured to put the identification information of the newly added data, as a second keyword, into a hash function; and determine whether said second keyword exists in said BitSet according to the hashing results.
[0066] In an exemplary embodiment of the present invention, the data filter further comprises a join query unit for generating a temporary table based on said resilient distributed dataset to be processed and for performing a join query on said temporary table and a second data table.
[0067] According to the above technical scheme, the join query unit in an exemplary embodiment of the present invention is configured to create a sub-thread and convert said to-be-processed resilient distributed dataset into a DataFrame by the sub-thread; and generate a temporary table based on said DataFrame.
[0068] Since the specific details of each module of the data filter mentioned above have been described in the corresponding data filtering method, they will not be described again here.
[0069] It should be noted that, although several modules or units of the data filter 300 are mentioned in the detailed description above, such divisions are not mandatory. In fact, in accordance with embodiments of the present invention, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one of the modules or units described above may be further divided into multiple modules or units.
[0070] In exemplary embodiments of the present invention, an electronic device that can implement the above-mentioned data filtering method is also provided. Fig. 4 is a structural schematic of the computer system 400 of the electronic device applicable to the embodiments of the present invention. The electronic device 400 shown in Fig. 4 is merely illustrative and exemplary and should not impose any limitations on the function and application scope of the embodiments as described in the present invention.
[0071] A person skilled in the art will understand that all aspects of the invention may be embodied as systems, methods, or program products and may specifically adopt the following forms: a complete hardware embodiment, a complete software embodiment (including firmware, microcode, etc.), or a hardware and software combined embodiment, which may be collectively referred to as a "circuit", "module", or "system".
[0072] In some possible embodiments, the electronic device according to the invention may include at least a processing unit and a memory unit, wherein the memory unit stores program codes thereon, which, when executed by the processing unit, cause said processing unit to execute the data filtering method according to various exemplary embodiments of the invention described in the "exemplary methods" section above in this specification. For example, the processing unit can execute steps as shown in Fig. 1: S110 generating a broadcast variable based on identification information of multiple pieces of data in a first data table and broadcasting said broadcast variable to each working node; S120 extracting the identification information of newly added data generated by said working nodes and determining whether it exists in said broadcast variable; S130 if yes, filtering the corresponding newly added data to a resilient distributed dataset to be processed.
[0073] As shown in Fig. 4, the electronic device 400 is embodied in the form of a general-purpose computing device. The components of the electronic device 400 may include, but are not limited to, at least one processing unit 401, at least one memory unit 402, and a bus 403 that connects different system components (including the memory unit 402 and the processing unit 401).
[0074] The bus 403 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures.
[0075] The memory unit 402 can include a readable medium in the form of volatile memory, such as a random access memory (RAM) 4021 and/or a cache memory 4022 and may further include a read only memory (ROM) 4023.
[0076] The memory unit 402 may also include a program/utility 4025 having a set (at least one) of program modules 4024 which comprise, but are not limited to: an operating system, one or more applications, other program modules, and program data. Each or some of these examples may involve the implementation of a network environment.
[0077] The electronic device 400 may also communicate with one or more external devices 404 (e.g., keyboards, pointing devices, Bluetooth devices, etc.) or one or more devices that enable the user to interact with the electronic device 400 and/or any devices (e.g., routers, modems, etc.) that enable the electronic device 400 to communicate with one or more other computing devices.
These communications can be realized via input/output (I/O) interface 405. In addition, the electronic device 400 can also communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network, such as the Internet) via a network adapter 406. As shown in the figures, the network adapter 406 communicates with other modules of the electronic device 400 via the bus 403. It should be understood that, although not shown in the figures, other hardware and/or software modules may be utilized in conjunction with the electronic device 400. These hardware and/or software modules include, but are not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, etc.
[0078] It should be noted that, although several units/modules or sub-units/sub-modules of the data filter are mentioned in the detailed description above, such divisions are merely exemplary and not mandatory. In fact, in accordance with embodiments of the present invention, the features and functions of two or more units/modules described above may be embodied in one unit/module. Conversely, the features and functions of one of the units/modules described above may be further divided into multiple modules or units.
[0079] In addition, although the method of the present invention is described in a particular order in the accompanying drawings, it is not required or implied that it must be executed in that particular order or that the desired results can only be achieved after all the steps have been performed. Additionally or alternatively, some steps may be omitted or combined into one step, and/or one step can be split into several steps for execution.
[0080] Though the spirit and principles of the invention have been described with reference to several specific embodiments, it should be understood that the invention is not limited to the specific embodiments disclosed herein, and that the division of various aspects is only for the convenience of expression and does not mean that the characteristics of these aspects cannot be combined for benefit. The invention aims to cover any modifications and equivalent arrangements falling in the spirit and scope indicated by the appended claims.

[0081] In another aspect, the present application further provides a computer-readable medium 500, which may be included in the electronic device described in the above embodiments or may be separately present and not assembled in the electronic device. The computer-readable medium 500 carries one or more programs, which allow the electronic device to implement the data filtering method as described in the above embodiments when executed by said electronic device.
[0082] It should be noted that, although several modules or units used by the device to execute the actions are mentioned in the detailed description above, such divisions are not mandatory. In fact, in accordance with embodiments of the present invention, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one of the modules or units described above may be further divided into multiple modules or units.
[0083] Through the description of the above embodiments, a person skilled in the art will readily understand that the example embodiments described above may be implemented by software or by software in combination with necessary hardware. Therefore, the technical scheme according to the embodiments of the present invention may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, or a mobile hard disk, etc.) or on the network and may comprise a number of instructions that can be used to allow a computing device (which may be a personal computer, server, touch terminal, or network device, etc.) to execute the method in accordance with the embodiments of the present invention.
[0084] Other embodiments of the present invention will readily occur to those skilled in the art upon considering the specification and practicing the invention disclosed herein. The present application is intended to cover any variations, uses, or adaptations of the present invention, which are in accordance with the general principles of the invention and include common general knowledge or common technical means in this field that are not disclosed in the present invention. The descriptions and embodiments are to be considered as illustrative only, and the true scope and spirit of the invention is indicated by the appended claims.
[0085] It should be understood that the invention is not limited to the precise structure described above and shown in the accompanying drawings and may be modified and altered in a variety of ways without departing from its scope. The scope of the invention is limited by the appended claims only.

Date Recue/Date Received 2023-03-09

Claims (97)

1. An apparatus for data filtering, the apparatus comprising:
a broadcast unit configured to generate a broadcast variable based on identification information of multiple pieces of data in a first data table and broadcast the broadcast variable to each working node;
a determination unit configured to extract the identification information of newly added data generated at working nodes and determine whether the identification information of the newly added data exists in the broadcast variable; and
a filtering unit configured to, if the identification information of the newly added data exists in the broadcast variable, filter the corresponding newly added data to a resilient distributed dataset to be processed.
2. The apparatus of claim 1, wherein the broadcast unit is further configured to acquire the identification information of multiple pieces of data in the first data table.
3. The apparatus of any one of claims 1 to 2, wherein the broadcast unit is further configured to execute the identification information of each piece of data as a first keyword.
4. The apparatus of any one of claims 1 to 3, wherein the broadcast unit is further configured to generate a BitSet corresponding to the identification information by putting a first keyword into a hash function.
5. The apparatus of any one of claims 1 to 4, wherein the broadcast unit is further configured to generate the broadcast variable with a BitSet as initial data.
6. The apparatus of any one of claims 1 to 5, wherein the determination unit is further configured to execute the identification information of the newly added data as a second keyword.

7. The apparatus of any one of claims 1 to 6, wherein the determination unit is further configured to put a second keyword into a hash function.
8. The apparatus of any one of claims 1 to 7, wherein the determination unit is further configured to determine whether a second keyword exists in a BitSet according to hashing results.
9. The apparatus of any one of claims 1 to 8, wherein the apparatus further includes a join query unit.
10. The apparatus of claim 9, wherein the join query unit is configured to generate a temporary table based on the resilient distributed dataset to be processed.
11. The apparatus of any one of claims 9 to 10, wherein the join query unit is further configured to perform a join query on a temporary table and a second data table.
12. The apparatus of any one of claims 9 to 11, wherein the join query unit is further configured to create a sub-thread.
13. The apparatus of any one of claims 9 to 12, wherein the join query unit is further configured to convert the resilient distributed dataset to be processed into a DataFrame by a sub-thread.
14. The apparatus of any one of claims 9 to 13, wherein the join query unit is further configured to generate a temporary table based on a DataFrame.
15. The apparatus of any one of claims 1 to 14, wherein the first data table is an alerting rule table.
16. The apparatus of any one of claims 1 to 15, wherein the identification information includes a date.
17. The apparatus of any one of claims 1 to 16, wherein the identification information further includes an Internet Protocol (IP) address.
18. The apparatus of any one of claims 1 to 17, wherein the identification information further includes an alert type.
19. The apparatus of any one of claims 4 to 18, wherein the hash function is a Murmur Hash algorithm.
20. The apparatus of any one of claims 4 to 18, wherein the hash function is a City Hash algorithm.
21. The apparatus of any one of claims 4 to 18, wherein the hash function is a Spooky Hash algorithm.
22. The apparatus of any one of claims 4 to 18, wherein the hash function is a FNV (Fowler-Noll-Vo) Hash algorithm.
23. The apparatus of any one of claims 1 to 22, wherein the multiple pieces of data are stored in MySQL (My Structured Query Language) database.
24. The apparatus of any one of claims 1 to 23, wherein the newly added data generated are stored in MySQL database.
25. The apparatus of any one of claims 1 to 24, wherein the determination of the identification information of the newly added data existing in the broadcast variable is executed via a Bloom filter.
26. An electronic device for data filtering, wherein the electronic device includes:
a memory storing data; and
a processor configured to:
generate a broadcast variable based on identification information of multiple pieces of data in a first data table and broadcast the broadcast variable to each working node;
extract the identification information of newly added data generated at working nodes and determine whether the identification information of the newly added data exists in the broadcast variable; and
if the identification information of the newly added data exists in the broadcast variable, filter the corresponding newly added data to a resilient distributed dataset to be processed.
27. The electronic device of claim 26, wherein the processor is further configured to acquire the identification information of multiple pieces of data in the first data table.
28. The electronic device of any one of claims 26 to 27, wherein the processor is further configured to execute the identification information of each piece of data as a first keyword.
29. The electronic device of any one of claims 26 to 28, wherein the processor is further configured to generate a BitSet corresponding to the identification information by putting a first keyword into a hash function.
30. The electronic device of any one of claims 26 to 29, wherein the processor is further configured to generate the broadcast variable with a BitSet as initial data.
31. The electronic device of any one of claims 26 to 30, wherein the processor is further configured to execute the identification information of the newly added data as a second keyword.
32. The electronic device of any one of claims 26 to 31, wherein the processor is further configured to put a second keyword into a hash function.

33. The electronic device of any one of claims 26 to 32, wherein the processor is further configured to determine whether a second keyword exists in a BitSet according to hashing results.
34. The electronic device of any one of claims 26 to 33, wherein the processor is further configured to generate a temporary table based on the resilient distributed dataset to be processed.
35. The electronic device of any one of claims 26 to 34, wherein the processor is further configured to perform a join query on a temporary table and a second data table.
36. The electronic device of any one of claims 26 to 35, wherein the processor is further configured to create a sub-thread.
37. The electronic device of any one of claims 26 to 36, wherein the processor is further configured to convert the resilient distributed dataset to be processed into a DataFrame by a sub-thread.
38. The electronic device of any one of claims 26 to 37, wherein the processor is further configured to generate a temporary table based on a DataFrame.
39. The electronic device of any one of claims 26 to 38, wherein the first data table is an alerting rule table.
40. The electronic device of any one of claims 26 to 39, wherein the identification information includes a date.
41. The electronic device of any one of claims 26 to 40, wherein the identification information further includes an Internet Protocol (IP) address.
42. The electronic device of any one of claims 26 to 41, wherein the identification information further includes an alert type.
43. The electronic device of any one of claims 29 to 42, wherein the hash function is a Murmur Hash algorithm.
44. The electronic device of any one of claims 29 to 42, wherein the hash function is a City Hash algorithm.
45. The electronic device of any one of claims 29 to 42, wherein the hash function is a Spooky Hash algorithm.
46. The electronic device of any one of claims 29 to 42, wherein the hash function is a FNV (Fowler-Noll-Vo) Hash algorithm.
47. The electronic device of any one of claims 26 to 46, wherein the multiple pieces of data are stored in MySQL (My Structured Query Language) database.
48. The electronic device of any one of claims 26 to 47, wherein the newly added data generated are stored in MySQL database.
49. The electronic device of any one of claims 26 to 48, wherein the determination of the identification information of the newly added data existing in the broadcast variable is executed via a Bloom filter.
50. A computer-readable storage medium having recorded thereon instructions for execution by an apparatus for data filtering, wherein the computer-readable storage medium includes the instructions for:
generating a broadcast variable based on identification information of multiple pieces of data in a first data table and broadcasting the broadcast variable to each working node;
extracting the identification information of newly added data generated at working nodes and determining whether the identification information of the newly added data exists in the broadcast variable; and
if the identification information of the newly added data exists in the broadcast variable, filtering the corresponding newly added data to a resilient distributed dataset to be processed.
51. The computer-readable storage medium of claim 50, wherein the computer-readable storage medium includes the instructions for acquiring the identification information of multiple pieces of data in the first data table.
52. The computer-readable storage medium of any one of claims 50 to 51, wherein the computer-readable storage medium includes the instructions for executing the identification information of each piece of data as a first keyword.
53. The computer-readable storage medium of any one of claims 50 to 52, wherein the computer-readable storage medium includes the instructions for generating a BitSet corresponding to the identification information by putting a first keyword into a hash function.
54. The computer-readable storage medium of any one of claims 50 to 53, wherein the computer-readable storage medium includes the instructions for generating the broadcast variable with a BitSet as initial data.
55. The computer-readable storage medium of any one of claims 50 to 54, wherein the computer-readable storage medium includes the instructions for executing the identification information of the newly added data as a second keyword.
56. The computer-readable storage medium of any one of claims 50 to 55, wherein the computer-readable storage medium includes the instructions for putting a second keyword into a hash function.
57. The computer-readable storage medium of any one of claims 50 to 56, wherein the computer-readable storage medium includes the instructions for determining whether a second keyword exists in a BitSet according to hashing results.
58. The computer-readable storage medium of any one of claims 50 to 57, wherein the computer-readable storage medium includes the instructions for generating a temporary table based on the resilient distributed dataset to be processed.
59. The computer-readable storage medium of any one of claims 50 to 58, wherein the computer-readable storage medium includes the instructions for performing a join query on a temporary table and a second data table.
60. The computer-readable storage medium of any one of claims 50 to 59, wherein the computer-readable storage medium includes the instructions for creating a sub-thread.
61. The computer-readable storage medium of any one of claims 50 to 60, wherein the computer-readable storage medium includes the instructions for converting the resilient distributed dataset to be processed into a DataFrame by a sub-thread.
62. The computer-readable storage medium of any one of claims 50 to 61, wherein the computer-readable storage medium includes the instructions for generating a temporary table based on a DataFrame.
63. The computer-readable storage medium of any one of claims 50 to 62, wherein the first data table is an alerting rule table.
64. The computer-readable storage medium of any one of claims 50 to 63, wherein the identification information includes a date.
65. The computer-readable storage medium of any one of claims 50 to 64, wherein the identification information further includes an Internet Protocol (IP) address.
66. The computer-readable storage medium of any one of claims 50 to 65, wherein the identification information further includes an alert type.
67. The computer-readable storage medium of any one of claims 50 to 66, wherein the hash function is a Murmur Hash algorithm.
68. The computer-readable storage medium of any one of claims 53 to 67, wherein the hash function is a City Hash algorithm.
69. The computer-readable storage medium of any one of claims 53 to 68, wherein the hash function is a Spooky Hash algorithm.
70. The computer-readable storage medium of any one of claims 53 to 69, wherein the hash function is a FNV (Fowler-Noll-Vo) Hash algorithm.
71. The computer-readable storage medium of any one of claims 53 to 70, wherein the multiple pieces of data are stored in MySQL (My Structured Query Language) database.
72. The computer-readable storage medium of any one of claims 50 to 71, wherein the newly added data generated are stored in MySQL database.
73. The computer-readable storage medium of any one of claims 50 to 72, wherein the determination of the identification information of the newly added data existing in the broadcast variable is executed via a Bloom filter.
74. A method for data filtering, wherein the method includes:
generating a broadcast variable based on identification information of multiple pieces of data in a first data table and broadcasting the broadcast variable to each working node;
extracting the identification information of newly added data generated at working nodes and determining whether the identification information of the newly added data exists in the broadcast variable; and
if the identification information of the newly added data exists in the broadcast variable, filtering the corresponding newly added data to a resilient distributed dataset to be processed.
75. The method of claim 74, wherein the method further includes acquiring the identification information of multiple pieces of data in the first data table.
76. The method of any one of claims 74 to 75, wherein the method further includes executing the identification information of each piece of data as a first keyword.
77. The method of any one of claims 74 to 76, wherein the method further includes generating a BitSet corresponding to the identification information by putting a first keyword into a hash function.
78. The method of any one of claims 74 to 77, wherein the method further includes generating the broadcast variable with a BitSet as initial data.
79. The method of any one of claims 74 to 78, wherein the method further includes executing the identification information of the newly added data as a second keyword.
80. The method of any one of claims 74 to 79, wherein the method further includes putting a second keyword into a hash function.
81. The method of any one of claims 74 to 80, wherein the method further includes determining whether a second keyword exists in a BitSet according to hashing results.
82. The method of any one of claims 74 to 81, wherein the method further includes generating a temporary table based on the resilient distributed dataset to be processed.

83. The method of any one of claims 74 to 82, wherein the method further includes performing a join query on a temporary table and a second data table.
84. The method of any one of claims 74 to 83, wherein the method further includes creating a sub-thread.
85. The method of any one of claims 74 to 84, wherein the method further includes converting the resilient distributed dataset to be processed into a DataFrame by a sub-thread.
86. The method of any one of claims 74 to 85, wherein the method further includes generating a temporary table based on a DataFrame.
87. The method of any one of claims 74 to 86, wherein the first data table is an alerting rule table.
88. The method of any one of claims 74 to 87, wherein the identification information includes a date.
89. The method of any one of claims 74 to 88, wherein the identification information further includes an Internet Protocol (IP) address.
90. The method of any one of claims 74 to 89, wherein the identification information further includes an alert type.
91. The method of any one of claims 77 to 90, wherein the hash function is a Murmur Hash algorithm.
92. The method of any one of claims 77 to 90, wherein the hash function is a City Hash algorithm.
93. The method of any one of claims 77 to 90, wherein the hash function is a Spooky Hash algorithm.

94. The method of any one of claims 77 to 90, wherein the hash function is a FNV (Fowler-Noll-Vo) Hash algorithm.
95. The method of any one of claims 77 to 94, wherein the multiple pieces of data are stored in MySQL (My Structured Query Language) database.
96. The method of any one of claims 77 to 95, wherein the newly added data generated are stored in MySQL database.
97. The method of any one of claims 77 to 96, wherein the determination of the identification information of the newly added data existing in the broadcast variable is executed via a Bloom filter.
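
By way of non-limiting illustration only, the filtering recited in the claims above may be sketched in Python as follows. This is a minimal sketch under stated assumptions and not the implementation of the embodiments: the names BloomFilter, make_key, build_filter and filter_new_data are hypothetical, a Python integer stands in for the BitSet, a plain list stands in for the resilient distributed dataset, salted SHA-256 from the standard library stands in for the Murmur, City, Spooky or FNV hash functions named in the claims, and the filter is passed as an ordinary argument rather than distributed to working nodes as a broadcast variable.

```python
import hashlib


class BloomFilter:
    """Bit-set membership filter (hypothetical stand-in for the broadcast BitSet)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, key):
        # Assumption: salted SHA-256 replaces the hash functions named in the claims.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        # Set every bit position derived from the keyword.
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # True if all derived bits are set; false positives are possible,
        # false negatives are not.
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def make_key(record):
    # Keyword built from the identification information: date, IP address, alert type.
    return f"{record['date']}|{record['ip']}|{record['alert_type']}"


def build_filter(first_table_rows):
    # Put the first keyword of each row of the first data table into the filter.
    bloom = BloomFilter()
    for row in first_table_rows:
        bloom.add(make_key(row))
    return bloom


def filter_new_data(new_records, bloom):
    # Keep only newly added records whose second keyword may exist in the filter.
    return [r for r in new_records if bloom.might_contain(make_key(r))]
```

In this sketch, a record whose keyword was added to the filter is always retained, while a record whose keyword was never added is usually, but not always, discarded. The occasional false positive is harmless if a subsequent join query against a second data table removes it, which is one reason such a filter can be used as an inexpensive pre-filter.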
CA3057038A 2018-09-29 2019-09-27 Data filtering method, apparatus, electronic apparatus and storage medium Active CA3057038C (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811150166.6 2018-09-29
CN201811150166.6A CN109408711B (en) 2018-09-29 2018-09-29 Data filtering method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CA3057038A1 CA3057038A1 (en) 2020-03-29
CA3057038C CA3057038C (en) 2023-06-27

Family

ID=65465729

Family Applications (1)

Application Number Title Priority Date Filing Date
CA3057038A Active CA3057038C (en) 2018-09-29 2019-09-27 Data filtering method, apparatus, electronic apparatus and storage medium

Country Status (2)

Country Link
CN (1) CN109408711B (en)
CA (1) CA3057038C (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111241163A (en) * 2020-01-17 2020-06-05 平安科技(深圳)有限公司 Distributed computing task response method and device
CN112163176A (en) * 2020-11-02 2021-01-01 北京城市网邻信息技术有限公司 Data storage method and device, electronic equipment and computer readable medium
CN116368790A (en) * 2021-01-22 2023-06-30 Oppo广东移动通信有限公司 Information transmission method, device, equipment and storage medium

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104408190B (en) * 2014-12-15 2018-06-26 北京国双科技有限公司 Data processing method and device based on Spark
CN107015989A (en) * 2016-01-27 2017-08-04 博雅网络游戏开发(深圳)有限公司 Data processing method and device
CN105740424A (en) * 2016-01-29 2016-07-06 湖南大学 Spark platform based high efficiency text classification method
CN107220261B (en) * 2016-03-22 2020-10-30 中国移动通信集团山西有限公司 Real-time mining method and device based on distributed data
CN106296305A (en) * 2016-08-23 2017-01-04 上海海事大学 Electric business website real-time recommendation System and method under big data environment
CN106372190A (en) * 2016-08-31 2017-02-01 华北电力大学(保定) Method and device for querying OLAP (on-line analytical processing) in real time
US10176092B2 (en) * 2016-09-21 2019-01-08 Ngd Systems, Inc. System and method for executing data processing tasks using resilient distributed datasets (RDDs) in a storage device
CN106611064B (en) * 2017-01-03 2020-03-06 北京华胜信泰数据技术有限公司 Data processing method and device for distributed relational database

Also Published As

Publication number Publication date
CN109408711B (en) 2019-12-06
CN109408711A (en) 2019-03-01
CA3057038A1 (en) 2020-03-29

Legal Events

Date Code Title Description
EEER Examination request

Effective date: 20220916
