CN110389946B - Mass data duplication removing method and system for wifi probe acquisition

Info

Publication number: CN110389946B
Application number: CN201910649217.8A
Authority: CN
Inventors: 林树阳
Original assignee: Fujian Weidun Science And Technology Group Co ltd
Current assignee: Fujian Weidun Science And Technology Group Co ltd
Priority date: 2019-07-18
Filing date: 2019-07-18
Publication date: 2023-01-24
Anticipated expiration: 2039-07-18
Also published as: CN110389946A

Abstract

The invention belongs to the technical field of big data and discloses a method and a system for deduplicating mass data collected by wifi probes. The method judges whether each piece of data is a new online record or belongs to an existing online record; periodically takes the key2 online records from Redis and writes them into the track table of Elasticsearch; and, after the cross-day data written in step S102 is taken out of the cross-day Kafka topic, writes that data into the cross-day Elasticsearch index, deletes the corresponding Redis data, and deletes the record originally held in the track index. By deduplicating the data collected by wifi probes according to defined rules, the method filters out a large amount of useless data for the user, reduces the volume of data to be stored, avoids reading useless data, and improves the efficiency of mass-data processing.

Description

Mass data deduplication method and system for wifi probe acquisition

Technical Field

The invention belongs to the technical field of big data, and relates to a duplication removing method and system for mass data acquired by a wifi probe.

Background

Wi-Fi probes collect a large amount of data. Under the current 1-minute deduplication mode, a person who stays at a collection point for 2 hours generates 120 records. For the user, however, the entering time and the leaving time are sufficient, so only one record needs to be stored for the stay, holding both the online (entering) time and the leaving time. In addition, if the time difference between two consecutive records collected for the same terminal on the same collecting device exceeds 30 minutes (the threshold is adjustable; this description assumes 30 minutes), the terminal is regarded as having come online again.
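As an illustration only, the effect of this rule can be sketched in Python. The function name merge_sightings and the sample timestamps are assumptions introduced for the example and do not come from the original disclosure; the sketch works on an in-memory list rather than on a live data stream.

```python
from datetime import datetime, timedelta

# Threshold from the text: a gap of more than 30 minutes means "online again".
OFFLINE_GAP = timedelta(minutes=30)

def merge_sightings(times, gap=OFFLINE_GAP):
    """Collapse the sighting times of one terminal at one probe into
    (online_time, leave_time) stays. Hypothetical helper."""
    stays = []
    for t in sorted(times):
        if stays and t - stays[-1][1] <= gap:
            # Same stay: keep the online time, move the leaving time forward.
            stays[-1] = (stays[-1][0], t)
        else:
            # First sighting, or the gap exceeded the threshold: new stay.
            stays.append((t, t))
    return stays

start = datetime(2019, 1, 1, 9, 0)
# One sighting per minute for 2 hours: 120 raw records.
sightings = [start + timedelta(minutes=i) for i in range(120)]
# The terminal is seen again 45 minutes after its last sighting.
sightings.append(sightings[-1] + timedelta(minutes=45))

stays = merge_sightings(sightings)
print(len(sightings), "raw records ->", len(stays), "stays")
```

In this sample, the 120 per-minute records collapse into one stay, and the later sighting starts a second stay because its gap exceeds the threshold.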

In summary, the problems of the prior art are:

Wi-Fi probes collect a large amount of data, much of it repeated, which lowers the efficiency of mass-data processing and increases usage cost.

The difficulty of solving the technical problems is as follows:

The terminals collected by all the wifi probes must be cached and then compared, and the concurrency of big data must be taken into account.

The significance of solving the technical problems is as follows:

Effective data is mined, data utilization is improved, data storage is reduced, and usage cost is saved.

Disclosure of Invention

Aiming at the problems in the prior art, the invention provides a duplication eliminating method and system for mass data acquired by a wifi probe.

The invention is realized as a deduplication method for mass data collected by a wifi probe, comprising the following steps:

Step one: judge whether each piece of data is a new online record or belongs to an existing online record.

Step two: periodically take the key2 online records from Redis and write them into the track table of Elasticsearch.

Step three: after the cross-day data written in step two is taken out of the cross-day Kafka topic, write the data into the cross-day Elasticsearch index, delete the corresponding Redis data, and delete the record originally held in the track index.

Further, step one specifically includes:

looking up the key1 value in Redis according to the data and judging whether a key1 value was obtained; if not, execution point 1 is performed; if so, execution point 2 is performed.

Execution point 1: key1 is not obtained, so the data is a new online record, and the following steps are executed:

Step 1: insert this piece of data into the track table of Elasticsearch (in practice it is written to Kafka and then from Kafka to Elasticsearch).

Step 2: insert the key1 value (the acquisition time and the online time are both set to the acquisition time of the current data).

Step 3: end.

Execution point 2: key1 is obtained; the time difference between the current acquisition time and the acquisition time stored in key1 is calculated and judged:

In the first case, the time difference exceeds 30 minutes: the data is a new online record, and the steps of execution point 1 are executed.

In the second case, the time difference does not exceed 30 minutes, and the following steps are executed:

Step 1: update the key1 value (the acquisition time becomes the current acquisition time; the online time is unchanged).

Step 2: insert the key2 value (the value is this piece of data plus the online time from key1).

Step 3: end.
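The two execution points can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the invention: plain Python dictionaries stand in for Redis, a list stands in for the Kafka topic that feeds the Elasticsearch track table, and the record fields mac, probe and time are hypothetical names chosen for the example.

```python
from datetime import datetime, timedelta

GAP = timedelta(minutes=30)  # adjustable threshold from the text

key1 = {}       # stand-in for Redis key1: id -> (acquisition_time, online_time)
key2 = {}       # stand-in for Redis key2: id -> (latest record, online_time)
track_out = []  # stand-in for the Kafka topic feeding the track table

def process(record):
    """Handle one collected record (hypothetical fields: mac, probe, time)."""
    ident = (record["mac"], record["probe"])
    now = record["time"]
    state = key1.get(ident)
    if state is None or now - state[0] > GAP:
        # Execution point 1 (also reached from the first case of point 2):
        # write the record out and start a new online period in key1.
        track_out.append(record)
        key1[ident] = (now, now)
        return "new"
    # Execution point 2, second case: the terminal is still online.
    online_time = state[1]
    key1[ident] = (now, online_time)     # refresh acquisition time only
    key2[ident] = (record, online_time)  # latest record plus online time
    return "online"

t0 = datetime(2019, 1, 1, 9, 0)
results = [
    process({"mac": "aa:bb", "probe": "p1", "time": t0 + timedelta(minutes=m)})
    for m in (0, 1, 20, 60)
]
print(results)
```

With the sample offsets of 0, 1, 20 and 60 minutes, the first and last records are treated as new online records (the last one arrives 40 minutes after the previous sighting), while the two in between only update key1 and key2.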

Further, the second step specifically includes:

Every 30 minutes, all current online records are taken from Redis. Each record is judged to be cross-day data or not according to its acquisition time (leaving time) and its online time, and is sent to the cross-day Kafka topic or the non-cross-day Kafka topic accordingly. The non-cross-day Kafka topic is then consumed and written into the track table and the cross-day table of Elasticsearch, and the corresponding Redis data is deleted after the write succeeds.
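The cross-day judgment and routing can be illustrated with a short sketch. It assumes that "cross-day" means the online time and the leaving time fall on different calendar dates; the function names, the record fields, and the two lists standing in for Kafka topics are hypothetical. Writing to Elasticsearch and deleting the Redis entries after a successful write are omitted.

```python
from datetime import datetime

def is_cross_day(online_time, leave_time):
    """True when the stay starts and ends on different calendar days."""
    return online_time.date() != leave_time.date()

def route_online_records(online_records, cross_day_topic, same_day_topic):
    """Send each online record to one of two lists standing in for Kafka topics."""
    for rec in online_records:
        if is_cross_day(rec["online_time"], rec["leave_time"]):
            cross_day_topic.append(rec)
        else:
            same_day_topic.append(rec)

# Hypothetical records as they might be read from Redis on one 30-minute tick.
records = [
    {"mac": "aa:bb", "online_time": datetime(2019, 1, 1, 9, 0),
     "leave_time": datetime(2019, 1, 1, 11, 0)},
    {"mac": "cc:dd", "online_time": datetime(2019, 1, 1, 23, 50),
     "leave_time": datetime(2019, 1, 2, 0, 10)},
]
cross_day, same_day = [], []
route_online_records(records, cross_day, same_day)
print([r["mac"] for r in same_day], [r["mac"] for r in cross_day])
```

The second sample record starts shortly before midnight and ends shortly after it, so it is the only one routed to the cross-day list.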

Another object of the invention is to provide a deduplication control system for mass data collected by a wifi probe, which implements the above deduplication method.

Another object of the invention is to provide an information data processing terminal that implements the above deduplication method for mass data collected by a wifi probe.

Another object of the present invention is to provide a computer-readable storage medium, which includes instructions that, when executed on a computer, cause the computer to execute the deduplication method for mass data collected by a wifi probe.

In summary, the advantages and positive effects of the invention are as follows:

The form in which the raw data collected by the wifi probe exists in Kafka is shown in Fig. 6.

The data stored in Kafka and Redis after processing by the invention is shown in Figs. 7 and 8.

The data integrated into Elasticsearch and displayed in the user interface is shown in Fig. 9.

Each piece of data collected by the wifi probe contains certain main fields. Because the acquisition frequency is high and the field contents are identical apart from the acquisition time, the invention processes and merges these records into a single record and introduces the concepts of entering time and leaving time.

The invention deduplicates the data collected by the wifi probe according to defined rules. The method filters out a large amount of useless data for the user, reduces the volume of data to be stored, avoids reading useless data, and improves the efficiency of mass-data processing.

Drawings

Fig. 1 is a flowchart of a duplication elimination method for mass data acquired by a wifi probe according to an embodiment of the present invention.

Fig. 2 is a schematic diagram of a deduplication task 1 provided by an embodiment of the present invention.

Fig. 3 is a schematic diagram of a deduplication task 2 provided by an embodiment of the present invention.

Fig. 4 is a schematic diagram of a deduplication task 3 provided by an embodiment of the present invention.

Fig. 5 is a schematic diagram of raw data collected by a Wifi probe according to an embodiment of the present invention.

Fig. 6 is a diagram of the form in which the raw data collected by a wifi probe exists in Kafka, provided by an embodiment of the present invention.

Fig. 7 is a first diagram of the data stored in Kafka and Redis after processing by the present invention, provided by an embodiment of the present invention.

Fig. 8 is a second diagram of the data stored in Kafka and Redis after processing by the present invention, provided by an embodiment of the present invention.

Fig. 9 is an interface diagram for integrating data into an Elasticsearch and displaying the integrated data to a user according to an embodiment of the present invention.

Fig. 10 is a schematic diagram showing that, in an actual project with more than 3,800 wifi probes connected, about 300 million records are collected per day, provided by an embodiment of the present invention.

Fig. 11 is a schematic diagram showing that, after the method of the invention is applied, the number of records entering Elasticsearch per day averages about 50 million.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the invention.

The prior art cannot solve the problem that Wi-Fi probes collect a large amount of data, much of which is useless.

Aiming at the problems in the prior art, the invention provides a duplication eliminating method for mass data acquired by a wifi probe, and the method is described in detail below with reference to the accompanying drawings.

As shown in fig. 1, the method for removing duplicate of mass data acquired by a wifi probe provided in the embodiment of the present invention includes the following steps:

S101: judge whether each piece of data is a new online record or belongs to an existing online record.

S102: periodically take the key2 online records from Redis and write them into the track table of Elasticsearch.

S103: after the cross-day data written in S102 is taken out of the cross-day Kafka topic, write the data into the cross-day Elasticsearch index, delete the corresponding Redis data, and delete the record originally held in the track index.

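Further, step S101 specifically includes:

looking up the key1 value in Redis according to the data and judging whether a key1 value was obtained.

Execution point 1: key1 is not obtained, so the data is a new online record, and the following steps are executed:

Step 1: insert this piece of data into the track table of Elasticsearch (in practice it is written to Kafka and then from Kafka to Elasticsearch).

Step 2: insert the key1 value (the acquisition time and the online time are both set to the acquisition time of the current data).

Step 3: end.

Execution point 2: key1 is obtained; the time difference between the current acquisition time and the acquisition time stored in key1 is calculated and judged:

In the first case, the time difference exceeds 30 minutes: the data is a new online record, and the steps of execution point 1 are executed.

In the second case, the time difference does not exceed 30 minutes, and the following steps are executed:

Step 1: update the key1 value (the acquisition time becomes the current acquisition time; the online time is unchanged).

Step 2: insert the key2 value (the value is this piece of data plus the online time from key1).

Step 3: end.

Further, step S102 specifically includes:

every 30 minutes, taking all current online records from Redis, judging from the acquisition time (leaving time) and the online time in each record whether it is cross-day data, and sending it to the cross-day Kafka topic or the non-cross-day Kafka topic accordingly; then consuming the non-cross-day Kafka topic, writing it into the track table and the cross-day table of Elasticsearch, and deleting the corresponding Redis data after the write succeeds.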

The application of the principles of the present invention will now be described in further detail with reference to specific embodiments.

Examples

The track table of Elasticsearch is split into one index table per day, for example track table_20190101, track table_20190102, track table_20190103, and so on. Each record is stored in the index table that corresponds to the date of its acquisition time, as shown in Fig. 5.
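The date-based naming can be illustrated with a small helper. The prefix track_table and the function name are assumptions made for the example; the source only shows that the index name ends in the acquisition date in YYYYMMDD form.

```python
from datetime import datetime

def track_index_name(acquisition_time, prefix="track_table"):
    """Build the per-day index name from a record's acquisition time.

    The prefix is a hypothetical placeholder for the real track table name.
    """
    return f"{prefix}_{acquisition_time:%Y%m%d}"

name = track_index_name(datetime(2019, 1, 1, 8, 30))
print(name)  # track_table_20190101
```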

The deduplication is divided into 3 tasks:

Task 1: as shown in fig. 2.

[ task description ]:

Judge whether each piece of data is a new online record or belongs to an existing online record.

[ implementation process ]:

Two types of keys are stored in Redis:

The first (key1) stores the current latest acquisition time and the online time.

The second (key2) stores the latest piece of data plus the online time of that data.

Redis stores keys as follows:

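Storm acquires Wi-Fi track data from Kafka in real time. Each time a piece of data is acquired, the processing flow is as follows.

The key1 value is looked up in Redis according to the data, and it is judged whether a key1 value was obtained.

Execution point 1: key1 is not obtained, so the data is a new online record, and the following steps are executed:

Step 1: insert this piece of data into the track table of Elasticsearch (in practice it is written to Kafka and then from Kafka to Elasticsearch).

Step 2: insert the key1 value (the acquisition time and the online time are both set to the acquisition time of the current data).

Step 3: end.

Execution point 2: key1 is obtained; the time difference between the current acquisition time and the acquisition time stored in key1 is calculated and judged:

In the first case, the time difference exceeds 30 minutes: the data is a new online record, and the steps of execution point 1 are executed.

In the second case, the time difference does not exceed 30 minutes, and the following steps are executed:

Step 1: update the key1 value (the acquisition time becomes the current acquisition time; the online time is unchanged).

Step 2: insert the key2 value (the value is this piece of data plus the online time from key1).

Step 3: end.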

Task 2: as shown in fig. 3.

[ task description ]:

Periodically take the key2 online records from Redis and write them into the track table of Elasticsearch (in practice they are written to Kafka and then from Kafka to Elasticsearch).

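[ implementation process ]:

Every 30 minutes, all current online records are taken from Redis. Each record is judged to be cross-day data or not according to its acquisition time (leaving time) and its online time, and is sent to the cross-day Kafka topic or the non-cross-day Kafka topic accordingly. The non-cross-day Kafka topic is then consumed and written into the track table and the cross-day table of Elasticsearch, and the corresponding Redis data is deleted after the write succeeds.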

Task 3: as shown in fig. 4.

[ task description ]:

Some terminal devices stay online for a long time, for example fixed devices or devices kept at home. Because the acquisition time and the online time of such a record do not fall on the same day (cross-day), one online record is stored for every day, and duplicate data appears in collision and track results.

[ implementation process ]:

After the cross-day data written by task 2 is taken out of the cross-day Kafka topic, the data is written into the cross-day Elasticsearch index and the corresponding Redis data is deleted. At the same time, the record originally held in the track index must be deleted (the index that holds this record is determined from the day before the acquisition time).
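A sketch of this task is given here under several assumptions: dictionaries stand in for the cross-day index, the per-day track indices and Redis; the document identifier id, the field leave_time and the prefix track_table are hypothetical; and the track index to clean is taken to be the one named after the day before the acquisition (leaving) time, following the parenthetical remark in the text.

```python
from datetime import datetime, timedelta

def previous_day_index(acquisition_time, prefix="track_table"):
    """Name of the track index for the day before the acquisition time."""
    previous_day = acquisition_time - timedelta(days=1)
    return f"{prefix}_{previous_day:%Y%m%d}"

def process_cross_day(doc, cross_day_index, track_indices, redis_key2):
    """Apply the three actions of task 3 to one cross-day document."""
    cross_day_index[doc["id"]] = doc   # write into the cross-day index
    redis_key2.pop(doc["id"], None)    # delete the corresponding Redis data
    stale = previous_day_index(doc["leave_time"])
    # Delete the record originally held in the track index, if present.
    track_indices.get(stale, {}).pop(doc["id"], None)
    return stale

doc = {"id": "aa:bb|p1",
       "online_time": datetime(2019, 1, 1, 23, 50),
       "leave_time": datetime(2019, 1, 2, 0, 10)}
cross_day_index = {}
track_indices = {"track_table_20190101": {"aa:bb|p1": {"id": "aa:bb|p1"}}}
redis_key2 = {"aa:bb|p1": doc}

cleaned = process_cross_day(doc, cross_day_index, track_indices, redis_key2)
print(cleaned, track_indices, list(redis_key2))
```

Keeping the three actions in one function makes the ordering explicit; a real deployment would also need to handle a failure between the write and the two deletions, which this sketch does not attempt.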

The invention is further described below in conjunction with the description of the relevant data.

In an actual project with more than 3,800 wifi probes connected, about 300 million records are collected per day (Fig. 10); after the method is applied, the number of records entering Elasticsearch per day averages about 50 million (Fig. 11).
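The size of the reduction can be worked out directly, assuming the daily volumes are about 300 million raw records and about 50 million stored records, as given in the captions of Fig. 10 and Fig. 11. The variable names are chosen for the example only.

```python
raw_per_day = 300_000_000    # approximate records collected per day (Fig. 10)
stored_per_day = 50_000_000  # approximate records entering Elasticsearch per day (Fig. 11)

reduction = 1 - stored_per_day / raw_per_day  # share of records removed
ratio = raw_per_day / stored_per_day          # raw records per stored record

print(f"{reduction:.1%} fewer records, about {ratio:.0f}x reduction")
```

Under these rounded figures, roughly five out of every six collected records are removed before storage.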

In the above embodiments, the implementation may be realized wholly or partly by software, hardware, firmware, or any combination thereof. When implemented wholly or partly in software, it may take the form of a computer program product that includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, or digital subscriber line (DSL)) or wirelessly (e.g., infrared, radio, or microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (SSD)).

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims

1. A deduplication method for mass data acquired by a wifi probe, characterized in that two keys are stored in Redis:

key1 is used for storing the current latest acquisition time and the online time;

key2 is used for storing the latest piece of data plus the online time of that data;

the duplication eliminating method for the mass data acquired by the wifi probe comprises the following steps:

step one, judging whether each piece of data is an online record;

step two, taking the key2 online records from Redis and writing them into a track table of Elasticsearch;

step three, after the cross-day data written in step two is taken out from the cross-day Kafka topic, writing the data into a cross-day Elasticsearch index, deleting the corresponding Redis data, and deleting the record initially held in the track index;

acquiring a key1 value from Redis according to the data and judging whether the key1 value is acquired; if not, proceeding to execution point 1; if so, proceeding to execution point 2;

in execution point 1, key1 is not acquired, so the data is a new online record, and the following steps are executed:

step 1: inserting the data into a track table of Elasticsearch;

step 2: inserting a key1 value, wherein the acquisition time and the online time are both the acquisition time of the current data;

execution point 2 comprises the following:

acquiring key1, calculating the time difference between the current acquisition time and the acquisition time stored in key1, and judging the time difference;

if the time difference exceeds 30 minutes, the data is a new online record, and the steps of execution point 1 are executed;

if the time difference does not exceed 30 minutes, the following steps are executed:

step I: updating the key1 value, wherein the acquisition time is the current acquisition time, and the online time is unchanged;

step II: inserting a key2 value, which is the data plus the online time of key1;

the second step specifically comprises:

taking all current online records of Redis once every 30 minutes, judging whether each record is cross-day data according to the acquisition time and the online time in the record, sending the record to a cross-day Kafka topic or a non-cross-day Kafka topic accordingly, then obtaining the non-cross-day Kafka topic, writing the non-cross-day Kafka topic into an Elasticsearch track table and a cross-day table, and deleting the corresponding Redis data after the execution is successful.

2. A deduplication control system for mass data acquired by a wifi probe implementing the deduplication method for mass data acquired by a wifi probe of claim 1.

3. An information data processing terminal for implementing the duplication elimination method for mass data acquired by the wifi probe as claimed in claim 1.

4. A computer-readable storage medium comprising instructions that, when executed on a computer, cause the computer to perform the deduplication method of claim 1 for mass data acquired by a wifi probe.