CN117009409A - Big data real-time duplication elimination processing method and device, electronic equipment and storage medium - Google Patents

Big data real-time duplication elimination processing method and device, electronic equipment and storage medium

Info

Publication number
CN117009409A
Authority
CN
China
Prior art keywords
data
index
weight
stream data
service
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310659635.1A
Other languages
Chinese (zh)
Inventor
周锋
罗浩
朱磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Anyixun Technology Co ltd
Original Assignee
Chengdu Anyixun Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Anyixun Technology Co ltd
Priority to CN202310659635.1A
Publication of CN117009409A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24: Querying
    • G06F16/245: Query processing
    • G06F16/2458: Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462: Approximate or statistical queries
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21: Design, administration or maintenance of databases
    • G06F16/215: Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28: Databases characterised by their database models, e.g. relational or object models
    • G06F16/284: Relational databases
    • G06F16/285: Clustering or classification
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Probability & Statistics with Applications (AREA)
  • Fuzzy Systems (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a big data real-time deduplication processing method and device, electronic equipment, and a storage medium, and relates to the technical field of big data processing. The method consumes service stream data using a real-time stream processing engine; acquires deduplication index stream data in the service stream data according to a set deduplication index; and judges, according to the current mapping relation between the deduplication index and mark data, whether the deduplication index stream data in the service stream data has been marked. If the deduplication index stream data has not been marked, target mark data corresponding to it is added to the mapping relation; if it has been marked, the target mark data corresponding to it is not added to the mapping relation. The embodiment can perform data deduplication accurately and efficiently, and facilitates providing support information for subsequent business decisions.

Description

Big data real-time duplication elimination processing method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of big data processing technologies, and in particular, to a method and apparatus for real-time duplication elimination processing of big data, an electronic device, and a storage medium.
Background
With the development of internet technology, people acquire information and interact through the internet in work, life, and study, and the internet has become an indispensable part of all three. Providers of products or services can place advertisements on the internet and use the advertisement content to publicize and display those products or services to users. Currently, advertisement delivery strategies depend mainly on how users browse advertisements, for example the number of times an advertisement is clicked and the number of people who click it. In practice, the number of people may be counted repeatedly. For example, Zhang San clicks advertisement A at 1:00, Zhang San clicks advertisement A again at 1:02, Li Si clicks advertisement A at 2:01, and Zhang San clicks advertisement A at 2:05. Four records are stored in the real-time database, and statistical analysis reports 4 clicks by 4 people, although only 2 distinct people clicked; the number of people is clearly counted repeatedly. Therefore, how to perform data deduplication accurately and efficiently is a technical problem to be solved.
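The repeated counting in the example above can be reproduced in a few lines. This is only an illustrative sketch: the record layout and the names `clicks`, `user`, and `ad` are assumptions, not part of the application.

```python
# Four click records from the example: Zhang San clicks advertisement A
# three times and Li Si clicks it once.
clicks = [
    {"time": "01:00", "user": "Zhang San", "ad": "A"},
    {"time": "01:02", "user": "Zhang San", "ad": "A"},
    {"time": "02:01", "user": "Li Si", "ad": "A"},
    {"time": "02:05", "user": "Zhang San", "ad": "A"},
]

click_count = len(clicks)                                # every record is one click
naive_user_count = len(clicks)                           # one person counted per record
distinct_user_count = len({c["user"] for c in clicks})   # deduplicated by user

print(click_count, naive_user_count, distinct_user_count)  # prints: 4 4 2
```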
Disclosure of Invention
The present application has been made in view of the above problems, and provides a method and apparatus for real-time de-duplication of big data, an electronic device, and a storage medium, which overcome or at least partially solve the above problems. The technical scheme is as follows:
In a first aspect, a big data real-time deduplication processing method is provided, including:
consuming service stream data using a real-time stream processing engine;
acquiring deduplication index stream data in the service stream data according to a set deduplication index;
judging whether the deduplication index stream data in the service stream data has been marked according to the current mapping relation between the deduplication index and mark data;
if the deduplication index stream data in the service stream data has not been marked, adding target mark data corresponding to the deduplication index stream data in the service stream data to the mapping relation;
and if the deduplication index stream data in the service stream data has been marked, not adding the target mark data corresponding to the deduplication index stream data in the service stream data to the mapping relation, so as to deduplicate the target mark data corresponding to the deduplication index stream data in the service stream data.
In one possible implementation manner, acquiring the deduplication index stream data in the service stream data according to the set deduplication index includes:
screening out, from the service stream data, the deduplication index stream data corresponding to the set deduplication index.
In one possible implementation manner, judging whether the deduplication index stream data in the service stream data has been marked according to the current mapping relation between the deduplication index and the mark data includes:
generating target mark data corresponding to the deduplication index stream data in the service stream data;
searching the current mapping relation between the deduplication index and the mark data for the target mark data corresponding to the deduplication index stream data in the service stream data;
if the target mark data corresponding to the deduplication index stream data in the service stream data is not found, determining that the deduplication index stream data in the service stream data has not been marked;
and if the target mark data corresponding to the deduplication index stream data in the service stream data is found, determining that the deduplication index stream data in the service stream data has been marked.
In one possible implementation manner, generating the target mark data corresponding to the deduplication index stream data in the service stream data includes:
converting the deduplication index stream data in the service stream data into the data type of the mark data, thereby generating the target mark data corresponding to the deduplication index stream data in the service stream data.
In one possible implementation, the current mapping relation between the deduplication index and the mark data is stored in memory; judging whether the deduplication index stream data in the service stream data has been marked according to the current mapping relation between the deduplication index and the mark data includes:
judging whether the deduplication index stream data in the service stream data has been marked according to the current mapping relation between the deduplication index and the mark data held in memory.
In one possible implementation manner, after adding the target mark data corresponding to the deduplication index stream data in the service stream data to the mapping relation, the method further includes:
writing the target mark data corresponding to the deduplication index stream data in the service stream data from memory to disk, and storing the mapping relation between the deduplication index and the mark data on disk.
In one possible implementation, the method further includes:
performing statistical analysis on the mark data corresponding to the deduplication index in the mapping relation, and generating and displaying a statistical analysis result.
In a second aspect, a big data real-time deduplication processing device is provided, including:
a consumption module, used for consuming service stream data using a real-time stream processing engine;
an acquisition module, used for acquiring deduplication index stream data in the service stream data according to a set deduplication index;
a judging module, used for judging whether the deduplication index stream data in the service stream data has been marked according to the current mapping relation between the deduplication index and mark data;
a processing module, used for adding target mark data corresponding to the deduplication index stream data in the service stream data to the mapping relation if the deduplication index stream data in the service stream data has not been marked; and, if the deduplication index stream data in the service stream data has been marked, not adding the target mark data corresponding to the deduplication index stream data in the service stream data to the mapping relation, so as to deduplicate the target mark data corresponding to the deduplication index stream data in the service stream data.
In one possible implementation, the acquisition module is further configured to:
screen out, from the service stream data, the deduplication index stream data corresponding to the set deduplication index.
In one possible implementation manner, the judging module is further configured to:
generate target mark data corresponding to the deduplication index stream data in the service stream data;
search the current mapping relation between the deduplication index and the mark data for the target mark data corresponding to the deduplication index stream data in the service stream data;
if the target mark data corresponding to the deduplication index stream data in the service stream data is not found, determine that the deduplication index stream data in the service stream data has not been marked;
and if the target mark data corresponding to the deduplication index stream data in the service stream data is found, determine that the deduplication index stream data in the service stream data has been marked.
In one possible implementation manner, the judging module is further configured to:
convert the deduplication index stream data in the service stream data into the data type of the mark data, thereby generating the target mark data corresponding to the deduplication index stream data in the service stream data.
In one possible implementation, the current mapping relation between the deduplication index and the mark data is stored in memory; the judging module is further configured to:
judge whether the deduplication index stream data in the service stream data has been marked according to the current mapping relation between the deduplication index and the mark data held in memory.
In one possible implementation, the processing module is further configured to:
after adding the target mark data corresponding to the deduplication index stream data in the service stream data to the mapping relation, write the target mark data corresponding to the deduplication index stream data in the service stream data from memory to disk, and store the mapping relation between the deduplication index and the mark data on disk.
In one possible implementation manner, the device further includes a display module, configured to:
perform statistical analysis on the mark data corresponding to the deduplication index in the mapping relation, and generate and display a statistical analysis result.
In a third aspect, an electronic device is provided, the electronic device comprising a processor and a memory, wherein the memory has stored therein a computer program, the processor being configured to run the computer program to perform the big data real-time deduplication processing method of any of the above.
In a fourth aspect, a storage medium is provided, where the storage medium stores a computer program, and the computer program is configured to execute, when run, the big data real-time deduplication processing method of any of the above.
By means of the above technical scheme, the big data real-time deduplication processing method and device, the electronic equipment, and the storage medium provided by the embodiments of the application can consume service stream data using a real-time stream processing engine; acquire deduplication index stream data in the service stream data according to a set deduplication index; and judge, according to the current mapping relation between the deduplication index and mark data, whether the deduplication index stream data in the service stream data has been marked. If the deduplication index stream data has not been marked, target mark data corresponding to it is added to the mapping relation; if it has been marked, the target mark data corresponding to it is not added to the mapping relation, so that the target mark data is deduplicated. It can be seen that the embodiments of the application judge whether the deduplication index stream data in the service stream data has been marked according to the current mapping relation between the deduplication index and the mark data, and then process the data according to the result, so that data deduplication can be performed accurately and efficiently and support information can be provided for subsequent business decisions.
Drawings
In order to more clearly illustrate the technical solution of the embodiments of the present application, the drawings that are required to be used in the description of the embodiments of the present application will be briefly described below.
FIG. 1 shows a flow chart of a big data real-time duplication elimination processing method provided by an embodiment of the application;
FIG. 2 is a flowchart illustrating a method for real-time duplication elimination of big data according to another embodiment of the present application;
FIG. 3 shows a block diagram of a big data real-time duplication elimination device provided by an embodiment of the application;
FIG. 4 is a block diagram of a real-time big data de-duplication processing apparatus according to another embodiment of the present application;
FIG. 5 shows a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the application to those skilled in the art.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that such use is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the term "include" and variations thereof are to be interpreted as open-ended terms that mean "include, but are not limited to."
In order to solve the above technical problems, an embodiment of the present application provides a big data real-time deduplication processing method, as shown in FIG. 1, which may include the following steps S101 to S105:
Step S101, consuming service stream data using a real-time stream processing engine.
In this step, the real-time stream processing engine may be Flink, which is a framework and distributed processing engine for stateful computation over unbounded and bounded data streams. Flink is designed to run in all common cluster environments and to perform computations at in-memory speed and at any scale.
Here, an unbounded stream has a start but no defined end. It does not terminate, and it provides data as the data is generated. An unbounded stream must be processed continuously, that is, each event must be handled promptly after it is ingested; it is not possible to wait for all input data to arrive, because the input is unbounded and will not be complete at any point in time. Processing unbounded data typically requires that events be ingested in a particular order (for example, the order in which they occurred) so that the completeness of the results can be reasoned about. A bounded stream has a defined start and end, and can be processed by ingesting all of its data before any computation is performed. Processing a bounded stream does not require ordered ingestion, because a bounded data set can always be sorted. Processing of bounded streams is also referred to as batch processing.
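The difference between the two kinds of stream can be illustrated without any stream processing engine. The sketch below uses plain Python rather than Flink; the function names and the doubling operation are hypothetical and serve only to contrast event-at-a-time handling with batch handling.

```python
import itertools
from typing import Iterable, Iterator

def unbounded_events() -> Iterator[int]:
    """Hypothetical source with a start but no defined end."""
    for n in itertools.count(1):
        yield n

def process_as_they_arrive(events: Iterable[int], limit: int) -> list:
    """Handle each event right after ingestion; `limit` only stops the demo."""
    handled = []
    for event in itertools.islice(events, limit):
        handled.append(event * 2)
    return handled

def process_batch(events: list) -> list:
    """A bounded input can be fully ingested, and sorted, before computing."""
    return [event * 2 for event in sorted(events)]

streamed = process_as_they_arrive(unbounded_events(), limit=3)
batched = process_batch([3, 1, 2])
print(streamed, batched)  # prints: [2, 4, 6] [2, 4, 6]
```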
The real-time stream processing engine may also be Spark, which is a fast, general-purpose computing engine designed for large-scale data processing. It has three main characteristics. First, its high-level API (application programming interface) abstracts away the cluster itself, so Spark application developers can focus on the computation the application needs to perform. Second, Spark is fast, and supports interactive computation and complex algorithms. Finally, Spark is a general-purpose engine that can be used for a variety of operations, including queries, text processing, and machine learning; before Spark appeared, it was generally necessary to learn several different engines to handle these needs separately.
Step S102, acquiring deduplication index stream data in the service stream data according to the set deduplication index.
In this step, the set deduplication index may be determined according to actual requirements. For example, the set deduplication index may be a user identifier, an address identifier, or a device identifier, which is not limited in this embodiment.
Step S103, judging whether the deduplication index stream data in the service stream data has been marked according to the current mapping relation between the deduplication index and mark data; if not, proceeding to step S104; if so, proceeding to step S105.
Step S104, if the deduplication index stream data in the service stream data has not been marked, adding the target mark data corresponding to the deduplication index stream data in the service stream data to the mapping relation.
Step S105, if the deduplication index stream data in the service stream data has been marked, not adding the target mark data corresponding to the deduplication index stream data in the service stream data to the mapping relation, so as to deduplicate the target mark data corresponding to the deduplication index stream data in the service stream data.
The embodiment of the application judges whether the deduplication index stream data in the service stream data has been marked according to the current mapping relation between the deduplication index and the mark data, and then processes the data according to the result, so that data deduplication can be performed accurately and efficiently and support information can be provided for subsequent business decisions.
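Steps S101 to S105 can be sketched as a single loop. This is a minimal illustration rather than the application's implementation: an in-process list stands in for the stream consumed by the engine, a Python dict stands in for the mapping relation, and the sequential integer marks and the names `deduplicate`, `records`, and `index` are all assumptions.

```python
from typing import Iterable

def deduplicate(records: Iterable[dict], index: str) -> dict:
    """Return a mapping from each deduplication index value to its mark data."""
    mapping: dict = {}
    for record in records:                      # S101: consume the stream
        value = record[index]                   # S102: acquire the index stream data
        if value not in mapping:                # S103: has it been marked?
            mapping[value] = len(mapping) + 1   # S104: add new mark data
        # S105: already marked, so nothing is added and the duplicate is dropped
    return mapping

stream = [{"user": "Zhang San"}, {"user": "Zhang San"}, {"user": "Li Si"}]
marks = deduplicate(stream, "user")
print(marks)  # prints: {'Zhang San': 1, 'Li Si': 2}
```

The size of the returned mapping is the deduplicated count for the chosen index.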
The embodiment of the application provides a possible implementation manner in which step S102 above, acquiring the deduplication index stream data in the service stream data according to the set deduplication index, specifically screens out from the service stream data the deduplication index stream data corresponding to the set deduplication index. Screening out this data in advance facilitates the subsequent deduplication processing.
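One way to picture the screening step is as a projection of each record onto the set deduplication index. The sketch below is illustrative only; the record fields and the name `screen_index_data` are hypothetical.

```python
def screen_index_data(records, index):
    """Yield only the values of the set deduplication index, skipping
    records that do not carry that field."""
    for record in records:
        if index in record:
            yield record[index]

records = [
    {"user": "Zhang San", "ad": "A", "device": "d-1"},
    {"ad": "A"},  # no user field, so this record is screened out
    {"user": "Li Si", "ad": "A", "device": "d-2"},
]
user_values = list(screen_index_data(records, "user"))
print(user_values)  # prints: ['Zhang San', 'Li Si']
```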
The embodiment of the present application provides a possible implementation manner in which step S103 above, judging whether the deduplication index stream data in the service stream data has been marked according to the current mapping relation between the deduplication index and the mark data, may specifically include the following steps A1 to A4:
Step A1, generating target mark data corresponding to the deduplication index stream data in the service stream data.
In this step, the deduplication index stream data in the service stream data can be converted into target mark data. The target mark data may be of an integer type, which occupies little space; other data types are also possible, which is not limited in this embodiment.
For example, if the set deduplication index is a user identifier and the specific deduplication index stream data is "Zhang San", "Li Si", and so on, the corresponding target mark data may be generated as "0001", "0002", and so on. The examples are illustrative only and are not intended to limit this embodiment.
Step A2, searching the current mapping relation between the deduplication index and the mark data for the target mark data corresponding to the deduplication index stream data in the service stream data; if it is not found, proceeding to step A3; if it is found, proceeding to step A4.
Step A3, if the target mark data corresponding to the deduplication index stream data in the service stream data is not found, determining that the deduplication index stream data in the service stream data has not been marked.
Step A4, if the target mark data corresponding to the deduplication index stream data in the service stream data is found, determining that the deduplication index stream data in the service stream data has been marked.
This embodiment generates the target mark data corresponding to the deduplication index stream data in the service stream data and uses it to judge whether the deduplication index stream data has been marked, which saves space and is efficient and accurate.
The embodiment of the application provides a possible implementation manner in which step A1 above, generating the target mark data corresponding to the deduplication index stream data in the service stream data, specifically converts the deduplication index stream data in the service stream data into the data type of the mark data, thereby generating the target mark data. The target mark data can therefore be generated according to actual requirements, meeting the needs of the service scenario.
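The application does not specify how the conversion to the mark data type is performed. The sketch below shows one conceivable approach under stated assumptions: when the mark type is an integer, a stable 32-bit value is derived from a SHA-256 digest. The function name `to_mark` is hypothetical, and hashing can map distinct values to the same mark, so a collision-free encoding would be needed where exact counts matter.

```python
import hashlib

def to_mark(value: str, mark_type: type = int):
    """Convert a deduplication index value to the configured mark data type.

    Assumption: integer marks come from the first 4 bytes of a SHA-256
    digest. The result is deterministic but not guaranteed unique.
    """
    if mark_type is int:
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big")
    return mark_type(value)

first = to_mark("Zhang San")
second = to_mark("Zhang San")     # same input, same mark
as_text = to_mark("Zhang San", str)
```

Because the conversion is deterministic, the same index value always yields the same mark, which is what allows the lookup in the mapping relation to detect duplicates.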
The embodiment of the present application provides a possible implementation manner in which the current mapping relation between the deduplication index and the mark data mentioned in step S103 is stored in memory. In that case, step S103 specifically judges whether the deduplication index stream data in the service stream data has been marked according to the current mapping relation held in memory. Processing in memory improves the efficiency of data processing, and in turn the efficiency of real-time big data processing.
The embodiment of the present application provides a possible implementation manner in which, after the target mark data corresponding to the deduplication index stream data in the service stream data is added to the mapping relation in step S104, the method may further include the following step B1:
Step B1, writing the target mark data corresponding to the deduplication index stream data in the service stream data from memory to disk, and storing the mapping relation between the deduplication index and the mark data on disk.
In this embodiment, writing the target mark data from memory to disk and storing the mapping relation on disk prevents the in-memory data from being lost after a power failure and thereby ensures data safety.
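Step B1 can be sketched as follows. The application does not prescribe an on-disk format; the JSON file, the temporary-file-then-rename pattern, and the names `save_mapping` and `load_mapping` are assumptions chosen for illustration.

```python
import json
import os
import tempfile

def save_mapping(mapping: dict, path: str) -> None:
    """Write the in-memory mapping to disk, replacing the old file atomically."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(mapping, f)
        f.flush()
        os.fsync(f.fileno())      # push the bytes out of OS buffers
    os.replace(tmp_path, path)    # atomic rename on the same filesystem

def load_mapping(path: str) -> dict:
    """Restore the mapping after a restart; empty if nothing was saved."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as f:
        return json.load(f)

path = os.path.join(tempfile.mkdtemp(), "dedup_mapping.json")
save_mapping({"Zhang San": 1, "Li Si": 2}, path)
restored = load_mapping(path)
print(restored)  # prints: {'Zhang San': 1, 'Li Si': 2}
```

Writing to a temporary file first means a crash during the write leaves the previous copy of the mapping intact.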
The embodiment of the application provides a possible implementation manner in which the method may further include the following step C1:
Step C1, performing statistical analysis on the mark data corresponding to the deduplication index in the mapping relation, and generating and displaying a statistical analysis result.
This embodiment performs statistical analysis on the mark data corresponding to the deduplication index in the mapping relation and generates and displays the result, which facilitates providing support information for business decisions.
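As a simple illustration of step C1, the statistical analysis could be a distinct count per deduplication index. The layout of `mapping` (index name to a set of marks) and the function name `summarize` are hypothetical; the application does not specify which statistics are computed or how they are displayed.

```python
def summarize(mapping: dict) -> dict:
    """Count the distinct marks recorded under each deduplication index."""
    return {index: len(marks) for index, marks in mapping.items()}

# Hypothetical mapping relation: deduplication index -> set of mark data.
mapping = {
    "user": {1, 2},
    "device": {10, 11, 12},
}
report = summarize(mapping)
for index, count in sorted(report.items()):
    print(f"{index}: {count} distinct")
```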
Having introduced the various implementations of each step of the embodiment shown in FIG. 1, the big data real-time deduplication processing method of the embodiments of the present application is further described below through a specific embodiment.
FIG. 2 is a flowchart illustrating a big data real-time deduplication processing method according to another embodiment of the application. As shown in FIG. 2, the method may include the following steps S201 to S208:
Step S201, consuming service stream data using the real-time stream processing engine Flink.
As previously described, Flink is a framework and distributed processing engine for stateful computation over unbounded and bounded data streams. Flink is designed to run in all common cluster environments and to perform computations at in-memory speed and at any scale.
Step S202, acquiring deduplication index stream data in the service stream data according to the set deduplication index.
In this step, the set deduplication index may be determined according to actual requirements. For example, the set deduplication index may be a user identifier, an address identifier, or a device identifier, which is not limited in this embodiment.
Step S203, generating, in memory, target mark data corresponding to the deduplication index stream data in the service stream data.
In this step, the deduplication index stream data can be written into a bitmap. A bitmap uses a single bit to mark the value corresponding to an element, with the element itself serving as the key; because the data is stored as bits, storage space can be greatly reduced.
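The bitmap idea can be sketched with a byte array, where bit i records whether element i has been marked. This is a minimal fixed-size illustration with a hypothetical `Bitmap` class; it assumes the elements have already been converted to non-negative integers smaller than the bitmap size, and it is not the specific bitmap implementation used by the application.

```python
class Bitmap:
    """Fixed-size bitmap: bit i is 1 once element i has been marked."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.bits = bytearray((size + 7) // 8)   # 8 elements per byte

    def test(self, i: int) -> bool:
        return bool(self.bits[i >> 3] & (1 << (i & 7)))

    def set(self, i: int) -> None:
        self.bits[i >> 3] |= 1 << (i & 7)

    def add_if_absent(self, i: int) -> bool:
        """Mark i and return True if it was new, False if it is a duplicate."""
        if self.test(i):
            return False
        self.set(i)
        return True

bitmap = Bitmap(1024)
results = [bitmap.add_if_absent(i) for i in (7, 42, 7)]
print(results, len(bitmap.bits))  # prints: [True, True, False] 128
```

Here 1024 possible elements occupy 128 bytes, which illustrates the space saving described above.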
Step S204, searching whether target mark data corresponding to the duplication elimination index flow data in the service flow data exists in the mapping relation of the duplication elimination index and the mark data; if not, continuing to execute the step S205; if yes, go on to step S206.
In step S205, if the target mark data corresponding to the duplication elimination index flow data in the service flow data is not found, it is determined that the duplication elimination index flow data in the service flow data is not marked, and the target mark data corresponding to the duplication elimination index flow data in the service flow data is added to the mapping relationship, and step S207 is continuously executed.
In step S206, if the target marking data corresponding to the duplication elimination index running data in the service flow data is found, it is determined that the duplication elimination index running data in the service flow data is marked, and the target marking data corresponding to the duplication elimination index running data in the service flow data is not added to the mapping relationship, so as to perform duplication elimination processing on the target marking data corresponding to the duplication elimination index running data in the service flow data.
Step S207, the target mark data corresponding to the duplication eliminating index stream data in the service stream data in the memory is written into the disk, and the mapping relation between the duplication eliminating index and the mark data is stored in the disk.
Step S208, carrying out statistical analysis on the marking data corresponding to the weight-eliminating index in the mapping relation, and generating and displaying a statistical analysis result.
According to the embodiment, whether the weight-removal index stream data in the service stream data is marked or not can be judged according to the mapping relation between the weight-removal index and the marking data, and further corresponding processing is carried out according to the marked result, so that data weight-removal processing can be accurately and efficiently carried out, statistical analysis is carried out on the marking data corresponding to the weight-removal index in the mapping relation, a statistical analysis result is generated and displayed, and support information is conveniently provided for service decision.
It should be noted that the sequence numbers of the steps in the above embodiments do not imply an order of execution; the execution order of each process should be determined by its function and internal logic, and should not limit the implementation of the embodiments of the present application in any way. In practical applications, all of the above optional implementations may be combined in any manner to form possible embodiments of the present application, which are not described in detail here.
Based on the big data real-time deduplication processing method provided by the above embodiments, and based on the same inventive concept, the embodiment of the present application also provides a big data real-time deduplication processing device.
Fig. 3 is a block diagram of a big data real-time deduplication processing device according to an embodiment of the present application. As shown in Fig. 3, the device may specifically include a consumption module 310, an acquisition module 320, a judgment module 330, and a processing module 340.
a consumption module 310, configured to consume service stream data using a real-time stream processing engine;
an acquisition module 320, configured to acquire deduplication index stream data in the service stream data according to a set deduplication index;
a judgment module 330, configured to determine, according to the mapping relation between deduplication indexes and mark data, whether the deduplication index stream data in the service stream data has been marked;
a processing module 340, configured to: if the deduplication index stream data in the service stream data has not been marked, add target mark data corresponding to the deduplication index stream data in the service stream data to the mapping relation; and if the deduplication index stream data in the service stream data has been marked, not add the target mark data corresponding to the deduplication index stream data in the service stream data to the mapping relation, so as to perform deduplication processing on the target mark data corresponding to the deduplication index stream data in the service stream data.
In one possible implementation provided in the embodiment of the present application, the acquisition module 320 is further configured to:
screen out, from the service stream data, the deduplication index stream data corresponding to the set deduplication index.
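As a rough illustration of this screening step, the following Python sketch pulls the value of the configured deduplication index out of each record. The function name `select_index_values`, the dictionary record layout, and the `user_id` and `amount` fields are assumptions for illustration only; the source does not specify a record format.

```python
def select_index_values(records, index_field):
    """Yield the deduplication index value of every record that carries it."""
    for record in records:
        value = record.get(index_field)
        if value is not None:   # records without the configured index are skipped
            yield value


stream = [
    {"user_id": 1, "amount": 5},
    {"amount": 2},
    {"user_id": 2, "amount": 9},
]
values = list(select_index_values(stream, "user_id"))
```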
In one possible implementation provided in the embodiment of the present application, the judgment module 330 is further configured to:
generate target mark data corresponding to the deduplication index stream data in the service stream data;
search the current mapping relation between deduplication indexes and mark data for the target mark data corresponding to the deduplication index stream data in the service stream data;
if the target mark data corresponding to the deduplication index stream data in the service stream data is not found, determine that the deduplication index stream data in the service stream data has not been marked;
and if the target mark data corresponding to the deduplication index stream data in the service stream data is found, determine that the deduplication index stream data in the service stream data has been marked.
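The lookup and the marked/not-marked decision can be sketched as follows, assuming the mapping relation is held as a dictionary from deduplication index name to a set of mark data. That representation and the names `is_marked` and `mark_if_new` are hypothetical choices for this sketch, not something the source prescribes.

```python
def is_marked(mapping, index_name, mark):
    """True if `mark` is already recorded under `index_name`."""
    return mark in mapping.get(index_name, ())


def mark_if_new(mapping, index_name, mark):
    """Add `mark` when it is new; return True if it was added, False if duplicate."""
    if is_marked(mapping, index_name, mark):
        return False   # already marked: do not add it again
    mapping.setdefault(index_name, set()).add(mark)
    return True


mapping = {}
first = mark_if_new(mapping, "user_id", 42)     # new value, gets added
second = mark_if_new(mapping, "user_id", 42)    # duplicate, left out
other = mark_if_new(mapping, "device_id", 42)   # same mark under another index
```

Keeping one set per index name lets the same value be tracked independently for, say, a user identifier and a device identifier.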
In one possible implementation provided in the embodiment of the present application, the judgment module 330 is further configured to:
convert the deduplication index stream data in the service stream data into the data type of the mark data, and thereby generate the target mark data corresponding to the deduplication index stream data in the service stream data.
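One way to picture this type conversion, assuming the mark data type is a non-negative integer (for example a bitmap offset), is dictionary encoding: each distinct raw identifier is assigned the next free integer. Dictionary encoding is an illustrative technique chosen here, not one named in the source, and `IdEncoder` is a hypothetical helper.

```python
class IdEncoder:
    """Map arbitrary identifiers to dense non-negative integers."""

    def __init__(self) -> None:
        self._ids = {}

    def encode(self, raw) -> int:
        # Assumption: identifiers are compared by their string form,
        # so 42 and "42" receive the same integer.
        key = str(raw)
        if key not in self._ids:
            self._ids[key] = len(self._ids)   # next unused integer
        return self._ids[key]


encoder = IdEncoder()
a = encoder.encode("u-1001")
b = encoder.encode("u-2002")
a_again = encoder.encode("u-1001")
```

Dense integers keep a bitmap compact; hashing identifiers directly to offsets would avoid storing the dictionary but can produce collisions, which would make the deduplication approximate.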
In one possible implementation provided in the embodiment of the present application, the current mapping relation between deduplication indexes and mark data is stored in memory; the judgment module 330 is further configured to:
determine, according to the mapping relation between deduplication indexes and mark data held in memory, whether the deduplication index stream data in the service stream data has been marked.
In one possible implementation provided in the embodiment of the present application, the processing module 340 is further configured to:
after the target mark data corresponding to the deduplication index stream data in the service stream data has been added to the mapping relation, write the in-memory target mark data corresponding to the deduplication index stream data in the service stream data to disk, and store the mapping relation between deduplication indexes and mark data on the disk.
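Writing the in-memory mapping to disk could look like the sketch below. The JSON file format, the write-to-temporary-file-then-rename approach, and the function names `save_mapping` and `load_mapping` are all assumptions made for illustration; the source does not describe the on-disk layout.

```python
import json
import os
import tempfile


def save_mapping(mapping, path):
    """Persist {index name: set of integer marks} to `path` as JSON."""
    payload = {name: sorted(marks) for name, marks in mapping.items()}
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory, then rename it over
    # the target so a reader never sees a half-written file.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(payload, handle)
    os.replace(tmp_path, path)


def load_mapping(path):
    """Read a mapping written by save_mapping back into sets."""
    with open(path, encoding="utf-8") as handle:
        payload = json.load(handle)
    return {name: set(marks) for name, marks in payload.items()}
```

Reloading the file at start-up restores the marks, so values seen before a restart are still treated as duplicates.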
In one possible implementation provided in the embodiment of the present application, as shown in Fig. 4, the device shown in Fig. 3 above may further include a display module 410, configured to:
perform statistical analysis on the mark data corresponding to the deduplication index in the mapping relation, and generate and display a statistical analysis result.
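A simple form of such statistical analysis is a distinct count per deduplication index. The sketch below assumes the mapping is a dictionary from index name to a set of mark data; the function names and the report layout are hypothetical, and the source does not say which statistics are computed.

```python
def summarize(mapping):
    """Count the distinct marks recorded under each deduplication index."""
    return {name: len(marks) for name, marks in mapping.items()}


def format_report(summary):
    """Render the counts as plain text, one index per line, sorted by name."""
    lines = [f"{name}: {count} distinct" for name, count in sorted(summary.items())]
    return "\n".join(lines)


summary = summarize({"user_id": {1, 2, 3}, "device_id": {7}})
report = format_report(summary)
```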
Based on the same inventive concept, the embodiment of the present application further provides an electronic device, including a processor and a memory, where the memory stores a computer program, and the processor is configured to run the computer program to execute the big data real-time deduplication processing method of any one of the above embodiments.
In an exemplary embodiment, an electronic device is provided. As shown in Fig. 5, the electronic device 500 includes a processor 501 and a memory 503. The processor 501 is coupled to the memory 503, for example via a bus 502. Optionally, the electronic device 500 may also include a transceiver 504. It should be noted that, in practical applications, the number of transceivers 504 is not limited to one, and the structure of the electronic device 500 does not constitute a limitation on the embodiments of the present application.
The processor 501 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or perform the various exemplary logic blocks, modules and circuits described in connection with this disclosure. The processor 501 may also be a combination that implements computing functionality, such as a combination comprising one or more microprocessors, or a combination of a DSP and a microprocessor.
The bus 502 may include a path for transferring information between the above components. The bus 502 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 502 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in Fig. 5, but this does not mean that there is only one bus or one type of bus.
The memory 503 may be, but is not limited to, a ROM (Read Only Memory) or other type of static storage device that can store static information and instructions, a RAM (Random Access Memory) or other type of dynamic storage device that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
The memory 503 is used to store computer program code for performing the aspects of the present application and is controlled by the processor 501 for execution. The processor 501 is arranged to execute computer program code stored in the memory 503 for implementing what is shown in the foregoing method embodiments.
The electronic device includes, but is not limited to: mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and in-vehicle terminals (e.g., in-vehicle navigation terminals), and stationary terminals such as digital TVs and desktop computers. The electronic device shown in Fig. 5 is only an example and should not be construed as limiting the functionality or scope of use of the embodiments of the present application.
Based on the same inventive concept, the embodiment of the present application further provides a storage medium, in which a computer program is stored, where the computer program is configured to execute the real-time big data de-duplication processing method of any one of the above embodiments when running.
It will be clear to those skilled in the art that the specific working processes of the above-described systems, devices and modules may refer to the corresponding processes in the foregoing method embodiments, and are not described herein for brevity.
Those of ordinary skill in the art will appreciate that the technical solution of the present application may be embodied, in essence or in whole or in part, in a software product stored on a storage medium, comprising program instructions that, when executed, cause an electronic device (e.g., a personal computer, a server, or a network device) to perform all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random-access memory (RAM), a magnetic disk, an optical disc, or the like.
Alternatively, all or part of the steps of implementing the foregoing method embodiments may be implemented by hardware (such as a personal computer, a server, or an electronic device such as a network device) associated with program instructions, where the program instructions may be stored in a computer-readable storage medium, and where the program instructions, when executed by a processor of the electronic device, perform all or part of the steps of the method according to the embodiments of the present application.
The above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all technical features thereof can be replaced by others within the spirit and principle of the present application; such modifications and substitutions do not depart from the scope of the application.

Claims (10)

1. A big data real-time deduplication processing method, characterized by comprising:
consuming service stream data using a real-time stream processing engine;
acquiring deduplication index stream data in the service stream data according to a set deduplication index;
determining, according to a mapping relation between deduplication indexes and mark data, whether the deduplication index stream data in the service stream data has been marked;
if the deduplication index stream data in the service stream data has not been marked, adding target mark data corresponding to the deduplication index stream data in the service stream data to the mapping relation;
and if the deduplication index stream data in the service stream data has been marked, not adding the target mark data corresponding to the deduplication index stream data in the service stream data to the mapping relation, so as to perform deduplication processing on the target mark data corresponding to the deduplication index stream data in the service stream data.
2. The method of claim 1, wherein acquiring the deduplication index stream data in the service stream data according to the set deduplication index comprises:
screening out, from the service stream data, the deduplication index stream data corresponding to the set deduplication index.
3. The method according to claim 1 or 2, wherein determining, according to the mapping relation between deduplication indexes and mark data, whether the deduplication index stream data in the service stream data has been marked comprises:
generating target mark data corresponding to the deduplication index stream data in the service stream data;
searching the current mapping relation between deduplication indexes and mark data for the target mark data corresponding to the deduplication index stream data in the service stream data;
if the target mark data corresponding to the deduplication index stream data in the service stream data is not found, determining that the deduplication index stream data in the service stream data has not been marked;
and if the target mark data corresponding to the deduplication index stream data in the service stream data is found, determining that the deduplication index stream data in the service stream data has been marked.
4. The method of claim 3, wherein generating the target mark data corresponding to the deduplication index stream data in the service stream data comprises:
converting the deduplication index stream data in the service stream data into the data type of the mark data, and thereby generating the target mark data corresponding to the deduplication index stream data in the service stream data.
5. The method according to claim 1 or 2, wherein the current mapping relation between deduplication indexes and mark data is stored in memory; and determining, according to the mapping relation between deduplication indexes and mark data, whether the deduplication index stream data in the service stream data has been marked comprises:
determining, according to the mapping relation between deduplication indexes and mark data held in memory, whether the deduplication index stream data in the service stream data has been marked.
6. The method of claim 5, wherein after adding the target mark data corresponding to the deduplication index stream data in the service stream data to the mapping relation, the method further comprises:
writing the in-memory target mark data corresponding to the deduplication index stream data in the service stream data to disk, and storing the mapping relation between deduplication indexes and mark data on the disk.
7. The method according to claim 1 or 2, further comprising:
performing statistical analysis on the mark data corresponding to the deduplication index in the mapping relation, and generating and displaying a statistical analysis result.
8. A big data real-time deduplication processing device, characterized by comprising:
a consumption module, configured to consume service stream data using a real-time stream processing engine;
an acquisition module, configured to acquire deduplication index stream data in the service stream data according to a set deduplication index;
a judgment module, configured to determine, according to a mapping relation between deduplication indexes and mark data, whether the deduplication index stream data in the service stream data has been marked;
a processing module, configured to: if the deduplication index stream data in the service stream data has not been marked, add target mark data corresponding to the deduplication index stream data in the service stream data to the mapping relation; and if the deduplication index stream data in the service stream data has been marked, not add the target mark data corresponding to the deduplication index stream data in the service stream data to the mapping relation, so as to perform deduplication processing on the target mark data corresponding to the deduplication index stream data in the service stream data.
9. An electronic device comprising a processor and a memory, wherein the memory stores a computer program, and the processor is configured to run the computer program to perform the big data real-time deduplication processing method of any one of claims 1 to 7.
10. A storage medium storing a computer program, wherein the computer program is configured to execute, when run, the big data real-time deduplication processing method of any one of claims 1 to 7.
CN202310659635.1A 2023-06-05 2023-06-05 Big data real-time duplication elimination processing method and device, electronic equipment and storage medium Pending CN117009409A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310659635.1A CN117009409A (en) 2023-06-05 2023-06-05 Big data real-time duplication elimination processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117009409A true CN117009409A (en) 2023-11-07

Family

ID=88575180

Country Status (1)

Country Link
CN (1) CN117009409A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140074801A1 (en) * 2012-09-07 2014-03-13 Oracle International Corporation Data de-duplication system
CN108170826A (en) * 2018-01-08 2018-06-15 北京国信宏数科技有限责任公司 A kind of macro economic analysis method and system based on internet big data
CN108809704A (en) * 2018-05-28 2018-11-13 浙江口碑网络技术有限公司 Data deduplication statistical method based on dynamic time windows and device
CN112069162A (en) * 2020-11-10 2020-12-11 太平金融科技服务(上海)有限公司 Data processing method and device for stream computation, computer equipment and storage medium
CN113420263A (en) * 2021-06-30 2021-09-21 北京百度网讯科技有限公司 Data statistical method, device, equipment and storage medium
US20230073627A1 (en) * 2021-08-30 2023-03-09 Datadog, Inc. Analytics database and monitoring system for structuring and storing data streams
CN115794783A (en) * 2022-09-19 2023-03-14 交控科技股份有限公司 Data deduplication method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination