CN110389946B - Mass data duplication removing method and system for wifi probe acquisition - Google Patents

Mass data duplication removing method and system for wifi probe acquisition

Info

Publication number
CN110389946B
Authority
CN
China
Prior art keywords
data
online
day
time
acquired
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910649217.8A
Other languages
Chinese (zh)
Other versions
CN110389946A (en)
Inventor
林树阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujian Weidun Science And Technology Group Co ltd
Original Assignee
Fujian Weidun Science And Technology Group Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujian Weidun Science And Technology Group Co ltd
Priority to CN201910649217.8A
Publication of CN110389946A
Application granted
Publication of CN110389946B
Legal status: Active (current)
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00Arrangements for monitoring or testing data switching networks
    • H04L43/02Capturing of monitoring data
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W24/00Supervisory, monitoring or testing arrangements
    • H04W24/08Testing, supervising or monitoring using real traffic
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00Arrangements for monitoring or testing data switching networks
    • H04L43/02Capturing of monitoring data
    • H04L43/028Capturing of monitoring data by filtering

Abstract

The invention belongs to the technical field of big data and discloses a method and a system for deduplicating mass data acquired by wifi probes. The method judges whether each piece of data is a new online record or belongs to an existing online record; periodically takes the key2 online records from Redis and writes them into the track table of Elasticsearch; and, after the cross-day data written in S102 is taken out of the cross-day Kafka topic, writes that data into the cross-day Elasticsearch index, deletes the corresponding Redis data, and deletes the record originally stored in the track index. The data acquired by the wifi probes is thereby deduplicated according to defined rules, which filters out a large amount of useless data for the user, reduces the volume of data to be stored, avoids reading useless data, and improves the efficiency of processing mass data.

Description

Mass data deduplication method and system for wifi probe acquisition
Technical Field
The invention belongs to the technical field of big data, and relates to a duplication removing method and system for mass data acquired by a wifi probe.
Background
Wi-Fi probes collect a large volume of data. Under the current 1-minute deduplication mode, a person who stays at a collection point for 2 hours generates 120 records. For the user, however, the entering time and the leaving time are sufficient, so only one record needs to be stored, holding both the online (entering) time and the leaving time. In addition, if the time difference between two consecutive records collected from the same terminal on the same collection device exceeds 30 minutes (the threshold is adjustable; this analysis is designed around 30 minutes), the terminal is treated as having come online again.
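The effect of this rule can be illustrated offline with a minimal sketch. It is not the patented pipeline (which works on a stream with Redis and Kafka); it only collapses a list of sighting times for one terminal on one collection device. The function name `collapse_sessions` and the sample times are assumptions made for illustration.

```python
from datetime import datetime, timedelta

GAP = timedelta(minutes=30)  # adjustable threshold from the text


def collapse_sessions(times, gap=GAP):
    """Collapse sighting times of one terminal into (online, leave) pairs."""
    sessions = []
    for t in sorted(times):
        if sessions and t - sessions[-1][1] <= gap:
            # Same stay: keep the online time, move the leaving time forward.
            sessions[-1] = (sessions[-1][0], t)
        else:
            # First sighting, or a gap above the threshold: new online record.
            sessions.append((t, t))
    return sessions


start = datetime(2019, 1, 1, 9, 0)
# One sighting per minute for 2 hours: 120 raw records.
sightings = [start + timedelta(minutes=i) for i in range(120)]
# The terminal is seen again 45 minutes after its last sighting.
sightings.append(sightings[-1] + timedelta(minutes=45))

sessions = collapse_sessions(sightings)
print(len(sightings), "raw records ->", len(sessions), "stored records")
```

The 120 one-minute sightings become a single record with an online time and a leaving time, and the later sighting opens a second record because the gap exceeds 30 minutes.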
In summary, the problems of the prior art are:
Wi-Fi probes collect a large volume of data, much of it repeated, which lowers the efficiency of processing mass data and raises the cost of use.
The difficulty of solving the technical problems is as follows:
The terminals collected by all wifi probes must be cached and then compared, and the concurrency of big data must be taken into account.
The significance of solving the technical problems is as follows:
Effective data is mined, the utilization rate of the data is improved, data storage is reduced, and the cost of use is lowered.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a duplication eliminating method and system for mass data acquired by a wifi probe.
The invention is realized as follows. The deduplication method for mass data acquired by a wifi probe comprises the following steps:
Step one: judge whether each piece of data is a new online record or belongs to an existing online record.
Step two: periodically take the key2 online records from Redis and write them into the track table of Elasticsearch.
Step three: after the cross-day data written in step two (S102) is taken out of the cross-day Kafka topic, write the data into the cross-day Elasticsearch index, delete the corresponding Redis data, and delete the record originally stored in the track index.
Further, the first step specifically includes:
Look up the key1 value in Redis for the piece of data and check whether it is found: if key1 is not found, go to execution point 1; if key1 is found, go to execution point 2.
Execution point 1: key1 is not found, so the data is a new online record, and the following steps are executed:
Step 1: insert this piece of data into the track table of Elasticsearch (in practice it is written to Kafka and then written from Kafka to Elasticsearch).
Step 2: insert the key1 value (the acquisition time and the online time are both set to the acquisition time of the current data).
Step 3: end.
Execution point 2: key1 is found; compute the difference between the current acquisition time and the acquisition time stored in key1, and judge the time difference:
First case: the time difference exceeds 30 minutes, so the data is a new online record, and the steps of execution point 1 are executed.
Second case: the time difference does not exceed 30 minutes, and the following steps are executed:
Step 1: update the key1 value (the acquisition time becomes the current acquisition time; the online time is unchanged).
Step 2: insert the key2 value (the value is this piece of data plus the online time from key1).
Step 3: end.
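The branching in step one can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: plain dictionaries stand in for the Redis key1 and key2 entries, a list stands in for the Kafka topic that feeds the Elasticsearch track table, and the key `(mac, device)` and the field names `mac`, `device`, `collect`, and `online` are hypothetical, since the source does not reproduce the actual key layout.

```python
from datetime import datetime, timedelta

GAP = timedelta(minutes=30)

key1 = {}       # stand-in for Redis key1: latest acquisition time and online time
key2 = {}       # stand-in for Redis key2: latest record plus its online time
track_out = []  # stand-in for the Kafka topic feeding the track table


def handle(record):
    """Process one probe record; return 'new' or 'update'."""
    k = (record["mac"], record["device"])  # hypothetical key layout
    now = record["collect"]
    state = key1.get(k)
    if state is None or now - state["collect"] > GAP:
        # Execution point 1 (key1 missing, or the gap exceeds 30 minutes):
        # emit a new online record and (re)initialise key1.
        track_out.append(dict(record, online=now))
        key1[k] = {"collect": now, "online": now}
        return "new"
    # Execution point 2, second case: still online.
    state["collect"] = now                          # online time unchanged
    key2[k] = dict(record, online=state["online"])  # data plus online time
    return "update"


t0 = datetime(2019, 1, 1, 9, 0)
base = {"mac": "aa:bb", "device": "probe-1"}
results = [
    handle(dict(base, collect=t0)),
    handle(dict(base, collect=t0 + timedelta(minutes=1))),
    handle(dict(base, collect=t0 + timedelta(minutes=40))),
]
print(results)
```

Only the first and third records reach the track output; the second one merely refreshes the cached state, which is where the reduction in stored data comes from.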
Further, the second step specifically includes:
Every 30 minutes, take all current online records from Redis and judge, from the acquisition time (leaving time) and the online time in each record, whether the record is cross-day data; send the record to the cross-day Kafka topic or the non-cross-day Kafka topic accordingly. Then consume the non-cross-day Kafka topic, write its records into the track table and the cross-day table of Elasticsearch, and delete the corresponding Redis data after the write succeeds.
The invention also aims to provide a duplication elimination control system for the mass data acquired by the wifi probe, which implements the duplication elimination method for the mass data acquired by the wifi probe.
The invention also aims to provide the information data processing terminal for realizing the duplication eliminating method aiming at the mass data acquired by the wifi probe.
Another object of the present invention is to provide a computer-readable storage medium, which includes instructions that, when executed on a computer, cause the computer to execute the deduplication method for mass data collected by a wifi probe.
In summary, the advantages and positive effects of the invention are as follows:
The form in which the raw data collected by the wifi probe is stored in Kafka is shown in FIG. 6.
The data stored in Kafka and Redis after processing by the present invention is shown in FIGS. 7 and 8.
The data is consolidated into Elasticsearch and displayed in the user interface, as shown in FIG. 9.
Each piece of data acquired by a wifi probe contains certain main fields. Because the acquisition frequency is high, consecutive records have identical field contents except for the acquisition time, so the invention merges them into a single record and introduces the concepts of entering time and leaving time.
According to the invention, the data acquired by the wifi probe is deduplicated according to defined rules. The deduplication method filters out a large amount of useless data for the user, reduces the volume of data to be stored, avoids reading useless data, and improves the efficiency of processing mass data.
Drawings
Fig. 1 is a flowchart of a duplication elimination method for mass data acquired by a wifi probe according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of a deduplication task 1 provided by an embodiment of the present invention.
Fig. 3 is a schematic diagram of a deduplication task 2 provided by an embodiment of the present invention.
Fig. 4 is a schematic diagram of a deduplication task 3 provided by an embodiment of the present invention.
Fig. 5 is a schematic diagram of raw data collected by a Wifi probe according to an embodiment of the present invention.
Fig. 6 is a diagram of the form in which raw data collected by a wifi probe is stored in Kafka, provided by an embodiment of the present invention.
Fig. 7 is a first diagram of the data stored in Kafka and Redis after processing by the present invention, provided by an embodiment of the present invention.
Fig. 8 is a second diagram of the data stored in Kafka and Redis after processing by the present invention, provided by an embodiment of the present invention.
Fig. 9 is an interface diagram for integrating data into Elasticsearch and displaying the integrated data to a user according to an embodiment of the present invention.
Fig. 10 is a schematic diagram showing that, in a practical project with more than 3800 wifi probes connected, 300 million records are collected per day, provided by an embodiment of the present invention.
Fig. 11 is a schematic diagram showing that, after the method of the invention is applied, an average of 50 million records enter Elasticsearch per day.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the invention.
The prior art cannot solve the problem that Wi-Fi probes collect a large volume of data, much of which is useless.
Aiming at the problems in the prior art, the invention provides a duplication eliminating method for mass data acquired by a wifi probe, and the method is described in detail below with reference to the accompanying drawings.
As shown in fig. 1, the method for removing duplicate of mass data acquired by a wifi probe provided in the embodiment of the present invention includes the following steps:
S101: judge whether each piece of data is a new online record or belongs to an existing online record.
S102: periodically take the key2 online records from Redis and write them into the track table of Elasticsearch.
S103: after the cross-day data written in S102 is taken out of the cross-day Kafka topic, write the data into the cross-day Elasticsearch index, delete the corresponding Redis data, and delete the record originally stored in the track index.
Further, step S101 specifically includes:
Look up the key1 value in Redis for the piece of data and check whether it is found: if key1 is not found, go to execution point 1; if key1 is found, go to execution point 2.
Execution point 1: key1 is not found, so the data is a new online record, and the following steps are executed:
Step 1: insert this piece of data into the track table of Elasticsearch (in practice it is written to Kafka and then written from Kafka to Elasticsearch).
Step 2: insert the key1 value (the acquisition time and the online time are both set to the acquisition time of the current data).
Step 3: end.
Execution point 2: key1 is found; compute the difference between the current acquisition time and the acquisition time stored in key1, and judge the time difference:
First case: the time difference exceeds 30 minutes, so the data is a new online record, and the steps of execution point 1 are executed.
Second case: the time difference does not exceed 30 minutes, and the following steps are executed:
Step 1: update the key1 value (the acquisition time becomes the current acquisition time; the online time is unchanged).
Step 2: insert the key2 value (the value is this piece of data plus the online time from key1).
Step 3: end.
Further, step S102 specifically includes:
Every 30 minutes, take all current online records from Redis and judge, from the acquisition time (leaving time) and the online time in each record, whether the record is cross-day data; send the record to the cross-day Kafka topic or the non-cross-day Kafka topic accordingly. Then consume the non-cross-day Kafka topic, write its records into the track table and the cross-day table of Elasticsearch, and delete the corresponding Redis data after the write succeeds.
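The cross-day check and routing in step S102 can be sketched as below. This is an illustrative sketch only: a dictionary stands in for the cached Redis online records, two lists stand in for the cross-day and non-cross-day Kafka topics, and the names `is_cross_day`, `flush`, `online`, and `collect` are hypothetical. The sketch assumes that "cross-day" means the online time and the leaving (acquisition) time fall on different calendar dates.

```python
from datetime import datetime


def is_cross_day(record):
    """True when online time and leaving (acquisition) time are on different dates."""
    return record["online"].date() != record["collect"].date()


def flush(cache, cross_day_topic, same_day_topic):
    """Route every cached online record to a topic; return the routed keys."""
    routed = []
    for k, record in list(cache.items()):
        target = cross_day_topic if is_cross_day(record) else same_day_topic
        target.append(record)
        routed.append(k)
    return routed


cache = {
    "a": {"online": datetime(2019, 1, 1, 23, 50), "collect": datetime(2019, 1, 2, 0, 10)},
    "b": {"online": datetime(2019, 1, 2, 9, 0), "collect": datetime(2019, 1, 2, 9, 20)},
}
cross, same = [], []
done = flush(cache, cross, same)
# The text deletes the cached data only after the downstream write succeeds;
# here the write is assumed to have succeeded.
for k in done:
    del cache[k]
print(len(cross), "cross-day,", len(same), "same-day")
```

In a real deployment this function would run on a 30-minute timer, and the deletion would be tied to the acknowledgement of the Elasticsearch write rather than performed unconditionally.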
The application of the principles of the present invention will now be described in further detail with reference to specific embodiments.
Examples
The track table of Elasticsearch generates one index table per day, for example track table_20190101, track table_20190102, track table_20190103, and so on; each record is stored in the index table that corresponds to the date of its acquisition time in FIG. 5.
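Choosing the daily index from the acquisition time is a one-line date format. The sketch below assumes a hypothetical prefix `track_`; the real index prefix is not given in the source, only the date suffix pattern.

```python
from datetime import datetime


def track_index(collect_time, prefix="track_"):
    """Return the daily index name for a record's acquisition time."""
    # prefix is a hypothetical name; only the yyyymmdd suffix comes from the text.
    return prefix + collect_time.strftime("%Y%m%d")


name = track_index(datetime(2019, 1, 2, 8, 30))
print(name)  # track_20190102
```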
The deduplication is divided into 3 tasks:
task 1: as shown in fig. 2.
[ task description ]
Judge whether each piece of data is a new online record or belongs to an existing online record.
[ implementation process ]
Two types of keys are stored in Redis:
key1 stores the current latest acquisition time and the online time.
key2 stores the latest piece of data plus the online time of the data.
Redis stores the keys as follows:
[Figure BDA0002134596790000061: Redis key layout; image not reproduced]
Storm consumes Wi-Fi track data from Kafka in real time and processes each piece of data as follows.
Look up the key1 value in Redis for the piece of data and check whether it is found: if key1 is not found, go to execution point 1; if key1 is found, go to execution point 2.
Execution point 1: key1 is not found, so the data is a new online record, and the following steps are executed:
Step 1: insert this piece of data into the track table of Elasticsearch (in practice it is written to Kafka and then written from Kafka to Elasticsearch).
Step 2: insert the key1 value (the acquisition time and the online time are both set to the acquisition time of the current data).
Step 3: end.
Execution point 2: key1 is found; compute the difference between the current acquisition time and the acquisition time stored in key1, and judge the time difference:
First case: the time difference exceeds 30 minutes, so the data is a new online record, and the steps of execution point 1 are executed.
Second case: the time difference does not exceed 30 minutes, and the following steps are executed:
Step 1: update the key1 value (the acquisition time becomes the current acquisition time; the online time is unchanged).
Step 2: insert the key2 value (the value is this piece of data plus the online time from key1).
Step 3: end.
Task 2: as shown in fig. 3.
[ task description ]:
Periodically take the key2 online records from Redis and write them into the track table of Elasticsearch (in practice they are written to Kafka and then from Kafka to Elasticsearch).
[ implementation process ]:
Every 30 minutes, take all current online records from Redis and judge, from the acquisition time (leaving time) and the online time in each record, whether the record is cross-day data; send the record to the cross-day Kafka topic or the non-cross-day Kafka topic accordingly. Then consume the non-cross-day Kafka topic, write its records into the track table and the cross-day table of Elasticsearch, and delete the corresponding Redis data after the write succeeds.
Task 3: as shown in fig. 4.
[ task description ]:
Some terminal devices stay online for a long time, such as fixed devices or devices left at home for long periods. Because the acquisition time and the online time then fall on different days (cross-day), one online record is stored every day, and duplicate data can appear in collisions and tracks.
[ implementation process ]:
After the cross-day data written by task 2 is taken out of the cross-day Kafka topic, write the data into the cross-day Elasticsearch index, delete the corresponding Redis data, and also delete the record originally stored in the track index (the index to delete from is determined from the day before the acquisition time).
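Task 3 can be sketched with in-memory stand-ins. This is a hedged illustration, not the patented code: nested dictionaries stand in for Elasticsearch indices, a dictionary stands in for Redis, and the names `cross_day`, `track_`, `previous_day_index`, `move_cross_day`, and the document id format are all hypothetical. The only rule taken from the text is that the stale track record is looked up in the index for the day before the acquisition time.

```python
from datetime import datetime, timedelta


def previous_day_index(collect_time, prefix="track_"):
    """Index name for the day before the acquisition time (prefix is hypothetical)."""
    return prefix + (collect_time - timedelta(days=1)).strftime("%Y%m%d")


def move_cross_day(record, doc_id, indices, cache):
    """Write the record to the cross-day index and drop the stale copies."""
    indices.setdefault("cross_day", {})[doc_id] = record
    cache.pop(doc_id, None)                 # delete the corresponding cached entry
    old = previous_day_index(record["collect"])
    indices.get(old, {}).pop(doc_id, None)  # delete the original track record
    return old


rec = {"online": datetime(2019, 1, 1, 23, 50), "collect": datetime(2019, 1, 2, 0, 10)}
doc_id = "aa:bb|probe-1"
indices = {"track_20190101": {doc_id: {"online": rec["online"]}}}
cache = {doc_id: rec}
old = move_cross_day(rec, doc_id, indices, cache)
print("removed from", old)
```

After the call, the record exists only in the cross-day index, so a long-lived terminal no longer leaves one track record per day.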
The invention is further described below in conjunction with the description of the relevant data.
Currently, in a practical project with 3800 wifi probes connected, 300 million records are collected per day (Fig. 10), and after the method is applied an average of 50 million records enter Elasticsearch per day (Fig. 11).
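As a quick check of what these figures imply, the following arithmetic takes the daily volumes as 300 million collected records (Fig. 10) and 50 million stored records (Fig. 11); the variable names are chosen here for illustration.

```python
raw_per_day = 300_000_000     # records collected per day (Fig. 10)
stored_per_day = 50_000_000   # records entering Elasticsearch per day (Fig. 11)

reduction = 1 - stored_per_day / raw_per_day
ratio = raw_per_day / stored_per_day
print(f"{reduction:.1%} fewer records, {ratio:.0f}x smaller")
# prints "83.3% fewer records, 6x smaller"
```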
In the above embodiments, the implementation may be realized in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in whole or in part as a computer program product, the product includes one or more computer instructions. When the computer instructions are loaded or executed on a computer, the flows or functions according to the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless means (e.g., infrared, radio, microwave). The computer-readable storage medium can be any available medium that a computer can access, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (SSD)), among others.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (4)

1. A deduplication method for mass data acquired by a wifi probe, characterized in that two keys are stored in Redis:
key1 stores the current latest acquisition time and the online time;
key2 stores the latest piece of data plus the online time of the data;
the deduplication method for mass data acquired by the wifi probe comprises the following steps:
step one, judging whether each piece of data is an online record;
step two, taking the key2 online records from Redis and writing them into a track table of Elasticsearch;
step three, after the cross-day data written in step two is taken out of the cross-day Kafka topic, writing the data into a cross-day Elasticsearch index, deleting the corresponding Redis data, and at the same time deleting the record originally in the track index;
acquiring a key1 value from Redis according to the data, judging whether the key1 value is acquired, and proceeding to execution point 1 if it is not acquired, or to execution point 2 if it is acquired;
at execution point 1, key1 is not acquired and the data is a new online record, and the following steps are specifically executed:
step 1: inserting the data into the track table of Elasticsearch;
step 2: inserting a key1 value, wherein the acquisition time and the online time are both the acquisition time of the current data;
at execution point 2, the following is executed:
key1 is acquired; computing the difference between the current acquisition time and the acquisition time stored in key1, and judging the time difference;
if the time difference exceeds 30 minutes, the data is a new online record, and the steps of execution point 1 are executed;
if the time difference does not exceed 30 minutes, the following steps are executed:
step I: updating the key1 value, wherein the acquisition time is the current acquisition time and the online time is unchanged;
step II: inserting a key2 value, which is the data plus the online time from key1;
step two specifically comprises:
taking all current online records from Redis once every 30 minutes, judging whether each record is cross-day data according to the acquisition time and the online time in the record, sending the record to a cross-day Kafka topic or a non-cross-day Kafka topic accordingly, then consuming the non-cross-day Kafka topic, writing its records into the Elasticsearch track table and cross-day table, and deleting the corresponding Redis data after the execution succeeds.
2. A deduplication control system for mass data acquired by a wifi probe implementing the deduplication method for mass data acquired by a wifi probe of claim 1.
3. An information data processing terminal for implementing the duplication elimination method for mass data acquired by the wifi probe as claimed in claim 1.
4. A computer-readable storage medium comprising instructions that, when executed on a computer, cause the computer to perform the deduplication method of claim 1 for mass data acquired by a wifi probe.
CN201910649217.8A 2019-07-18 2019-07-18 Mass data duplication removing method and system for wifi probe acquisition Active CN110389946B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910649217.8A CN110389946B (en) 2019-07-18 2019-07-18 Mass data duplication removing method and system for wifi probe acquisition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910649217.8A CN110389946B (en) 2019-07-18 2019-07-18 Mass data duplication removing method and system for wifi probe acquisition

Publications (2)

Publication Number Publication Date
CN110389946A CN110389946A (en) 2019-10-29
CN110389946B CN110389946B (en) 2023-01-24

Family

ID=68285132

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910649217.8A Active CN110389946B (en) 2019-07-18 2019-07-18 Mass data duplication removing method and system for wifi probe acquisition

Country Status (1)

Country Link
CN (1) CN110389946B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103559302A (en) * 2013-11-14 2014-02-05 北京国双科技有限公司 Method, device and system for monitoring state of network media data
CN106844546A (en) * 2016-12-30 2017-06-13 江苏号百信息服务有限公司 Multi-data source positional information fusion method and system based on Spark clusters
CN108347698A (en) * 2018-02-07 2018-07-31 山东合天智汇信息技术有限公司 A kind of on-line off-line event trace analysis method, apparatus and system
CN108418821A (en) * 2018-03-06 2018-08-17 北京焦点新干线信息技术有限公司 Redis and Kafka-based high-concurrency scene processing method and device for online shopping system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10437635B2 (en) * 2016-02-10 2019-10-08 Salesforce.Com, Inc. Throttling events in entity lifecycle management

Also Published As

Publication number Publication date
CN110389946A (en) 2019-10-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant