CN112306998A

CN112306998A - Commission data duplicate removal method, device and server

Info

Publication number: CN112306998A
Application number: CN202011087893.XA
Authority: CN
Inventors: 何正威
Original assignee: Wuhan Zhongke Tongda High New Technology Co Ltd
Current assignee: Wuhan Zhongke Tongda High New Technology Co Ltd
Priority date: 2020-10-13
Filing date: 2020-10-13
Publication date: 2021-02-02
Anticipated expiration: 2040-10-13
Also published as: CN112306998B

Abstract

The application provides a commit data deduplication method, device and server. The method determines target data of a given data type, and the metadata of that target data, through a selection interface. It then updates the selection interface to generate a configuration interface, through which a deduplication index field and a deduplication time field are selected from the metadata and a filtering duration corresponding to the deduplication time field is set. Finally, based on data cached in a redis database, the received commit data is deduplicated according to the index data segment of its deduplication index field, the time data segment of its deduplication time field, and the filtering duration. Because the configuration interface for deduplication is generated dynamically from the initial selection interface, and the deduplication index field and the deduplication time field are selected dynamically through it, the flexibility and accuracy of deduplication processing are improved. Using a redis database increases the speed of deduplication processing, and the system is more extensible and more general.

Description

Commission data duplicate removal method, device and server

Technical Field

The present application relates to the field of transportation, and in particular to a commit data deduplication method, device and server.

Background

With the rapid development of the internet, big data technology is widely applied in many fields. In many big data application scenarios, a large amount of data must be continuously ingested from external sources within a short time, and this data contains many duplicates. For example, in crowded places such as shopping malls and stations, a person who stays near a wifi electronic fence for a period of time causes similar data to be collected many times. Likewise, in a traffic operation scenario, when a vehicle moves slowly or is stationary at a checkpoint, the data collector repeatedly uploads the passing record of that vehicle within a short time, so the big data platform stores many duplicate records. If duplicate data is stored without processing, the throughput and storage capacity of the system are wasted, and subsequent data use and data analysis are significantly affected.

At present, such data is usually deduplicated by an ETL (Extract-Transform-Load) tool during data extraction, or by an SQL statement. However, an ETL tool treats data as duplicate only when the records are completely identical; it cannot determine duplication based on the consistency of some key fields, or on the consistency of some key fields within a specified time period. An SQL statement can be used only in a relational database, lacks generality and extensibility, and the deduplication process has a considerable impact on database performance.

Therefore, the current data deduplication method has the technical problems of insufficient flexibility and accuracy, and poor expansibility and universality.

Disclosure of Invention

The embodiments of the application provide a commit data deduplication method, device and server, which aim to solve the technical problems that current data deduplication methods have insufficient flexibility and accuracy and poor extensibility and generality.

In a first aspect, an embodiment of the present application provides a commit data deduplication method, where the commit data deduplication method includes:

displaying a selection interface of target data, and receiving data type selection operation through the selection interface;

determining the target data according to the data type selection operation, and acquiring metadata corresponding to the target data;

updating the selection interface according to the metadata, and generating and displaying a configuration interface;

receiving a deduplication parameter configuration operation through the configuration interface, determining a deduplication index field and a deduplication time field from the metadata according to the deduplication parameter configuration operation, and setting a filtering duration corresponding to the deduplication time field;

receiving commit data, wherein the commit data comprises an index data segment of the deduplication index field and a time data segment of the deduplication time field;

based on the data cached in the redis database, performing deduplication processing on the commit data according to the index data segment, the time data segment and the filtering duration, and then caching the commit data in the redis database.

In some embodiments, determining the target data according to the data type selection operation, and acquiring metadata corresponding to the target data specifically includes:

selecting the data type according to the data type selection operation, and determining target data corresponding to the data type;

and acquiring the metadata corresponding to the target data according to the data type.

In some embodiments, updating the selection interface according to the metadata, and generating and displaying a configuration interface specifically includes:

displaying the metadata, and setting a field index box corresponding to the metadata, a field selection box corresponding to each metadata, and a field confirmation box;

and updating the selection interface, and generating and displaying the configuration interface.

In some embodiments, the determining the deduplication index field and the deduplication time field according to the metadata and setting the filtering duration corresponding to the deduplication time field specifically include:

searching the deduplication index field and the deduplication time field through the field index box;

selecting the deduplication index field and the deduplication time field through the field selection box;

confirming the deduplication index field and the deduplication time field through the field confirmation box;

and displaying the deduplication index field and the deduplication time field, and setting the filtering duration corresponding to the deduplication time field.

In some embodiments, the deduplication index field comprises at least one field of the metadata, and the deduplication time field comprises only one field of the metadata.
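As a non-limiting illustration, the configured deduplication parameters could be held in a small structure such as the one sketched below. The class name `DedupConfig`, the method `validate`, the field identifiers in `VEHICLE_FIELDS`, and the requirement that the filtering duration be positive are all assumptions made for this sketch and are not taken from the application.

```python
from dataclasses import dataclass
from typing import Set, Tuple


@dataclass(frozen=True)
class DedupConfig:
    """Deduplication parameters chosen on the configuration interface (hypothetical shape)."""
    index_fields: Tuple[str, ...]   # deduplication index field: one or more metadata fields
    time_field: str                 # deduplication time field: exactly one metadata field
    filter_seconds: int             # filtering duration tied to the time field

    def validate(self, metadata_fields: Set[str]) -> None:
        """Raise ValueError if the selection breaks the constraints described above."""
        if not self.index_fields:
            raise ValueError("at least one deduplication index field is required")
        unknown = (set(self.index_fields) | {self.time_field}) - metadata_fields
        if unknown:
            raise ValueError(f"fields not present in metadata: {sorted(unknown)}")
        # Assumption of this sketch: a non-positive filtering duration is rejected.
        if self.filter_seconds <= 0:
            raise ValueError("filtering duration must be positive")


# Illustrative metadata field identifiers for motor vehicle data (assumed names).
VEHICLE_FIELDS = {"DeviceID", "PlateNo", "PlateColor", "PassTime"}

config = DedupConfig(index_fields=("PlateNo",), time_field="PassTime", filter_seconds=30)
config.validate(VEHICLE_FIELDS)  # a valid selection passes without raising
```

Typing the time field as a single string, rather than a collection, is one simple way to reflect that only one metadata field may serve as the deduplication time field.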

In some embodiments, based on data cached in a redis database, according to the index data segment, the time data segment, and the filtering duration, performing deduplication processing on the commit data, and then caching the commit data in the redis database, specifically including:

judging whether the data cached in the redis database comprises the index data segment; if the data cached in the redis database does not comprise the index data segment, determining that the commit data is not duplicate data, and writing the commit data into the redis database;

if the data cached in the redis database comprises the index data segment, judging whether the difference between the time data segment and the time data segment of the last commit data cached in the redis database is within the filtering duration; if not, determining that the commit data is not duplicate data, and writing the commit data into the redis database; and if so, determining that the commit data is duplicate data and discarding it, and updating the time data segment of the last commit data cached in the redis database to the time data segment.

In some embodiments, the determining whether the data cached in the redis database includes the index data segment, and if the redis database does not include the index data segment, determining that the commit data is not duplicate data and writing the commit data into the redis database specifically includes:

generating a redis key name corresponding to the commit data from the deduplication index field and the index data segment;

generating a redis key value corresponding to the commit data from the time data segment;

combining the redis key name and the redis key value into a redis key-value pair corresponding to the commit data;

inserting the redis key-value pair into the redis database, and judging whether the data cached in the redis database comprises the redis key name;

and if the redis key name is not included, determining that the commit data is not duplicate data, and writing the redis key-value pair into the redis database.
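The key construction and first-occurrence check described above can be sketched as follows. This is a minimal illustration under stated assumptions: a plain Python dictionary stands in for the redis database, and the helper names `make_key_name` and `write_if_absent`, the `field=value` key layout, the `|` separator and the sample record are hypothetical. In an actual redis deployment, a comparable "store only if the key does not exist" step is available through the `SET` command with the `NX` option.

```python
from typing import Dict, Mapping, Sequence


def make_key_name(index_fields: Sequence[str], record: Mapping[str, str]) -> str:
    """Join each deduplication index field with its index data segment."""
    return "|".join(f"{field}={record[field]}" for field in index_fields)


def write_if_absent(cache: Dict[str, float], key_name: str, key_value: float) -> bool:
    """Store the key-value pair only when the key name is not cached yet.

    Returns True when the pair was written, i.e. the record is not duplicate data.
    """
    if key_name in cache:
        return False
    cache[key_name] = key_value
    return True


cache: Dict[str, float] = {}   # in-memory stand-in for the redis database
# Placeholder plate number; the time is an arbitrary value in epoch seconds.
record = {"PlateNo": "AXXXXX", "PassTime": "1599991200"}

key_name = make_key_name(["PlateNo"], record)
first = write_if_absent(cache, key_name, float(record["PassTime"]))       # key is new
second = write_if_absent(cache, key_name, float(record["PassTime"]) + 5)  # key already cached
```

When the key name is already cached, the decision falls through to the time comparison against the filtering duration, which this sketch does not cover.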

In some embodiments, the determining whether the difference between the time data segment and the time data segment of the last commit data cached in the redis database is within the filtering duration is performed, and if the difference is not within the filtering duration, determining that the commit data is not duplicated data, and writing the commit data into the redis database; if so, determining that the commit data is repeated data and discarded, and updating the time data segment of the last commit data cached in the redis database to the time data segment, specifically including:

judging whether the difference between the redis key value and the redis key value of the last commit data cached in the redis database is within the filtering duration;

if the difference is not within the filtering duration, determining that the commit data is not duplicate data, and writing the redis key-value pair into the redis database;

and if the difference is within the filtering duration, determining that the commit data is duplicate data and discarding it, and updating the redis key value corresponding to the last commit data cached in the redis database to the redis key value.

In a second aspect, an embodiment of the present application further provides a commit data deduplication device, where the commit data deduplication device includes:

the data type selection module is used for displaying a selection interface of target data and receiving data type selection operation through the selection interface;

the metadata acquisition module is used for determining the target data according to the data type selection operation and acquiring metadata corresponding to the target data;

the interface generation module is used for updating the selection interface according to the metadata and generating and displaying a configuration interface;

the configuration module is used for receiving a deduplication parameter configuration operation through the configuration interface, determining a deduplication index field and a deduplication time field from the metadata according to the deduplication parameter configuration operation, and setting the filtering duration corresponding to the deduplication time field;

a receiving module, configured to receive commit data, where the commit data includes an index data segment of the deduplication index field and a time data segment of the deduplication time field;

and the deduplication module is used for performing deduplication processing on the commit data according to the index data segment, the time data segment and the filtering duration, based on the data cached in the redis database, and then caching the commit data in the redis database.

In a third aspect, an embodiment of the present application further provides a server, where the server includes a processor, a memory, a bus, and a computer program that is stored in the memory and can run on the processor, wherein the processor and the memory communicate with each other through the bus; the processor, when executing the computer program, implements the commit data deduplication method as described above.

Beneficial effects: the embodiments of the application provide a commit data deduplication method, device and server. The commit data deduplication method determines target data of a given data type, and the metadata of that target data, through a selection interface. It then updates the selection interface to generate a configuration interface, through which a deduplication index field and a deduplication time field used for deduplication processing are selected from the metadata and a filtering duration corresponding to the deduplication time field is set. Finally, based on data cached in a redis database, received commit data of the same type as the target data is deduplicated according to the index data segment of its deduplication index field, the time data segment of its deduplication time field, and the filtering duration. Because the configuration interface for deduplication is generated dynamically from the initial selection interface, and the deduplication index field and the deduplication time field are selected dynamically through it, the flexibility and accuracy of deduplication processing are improved. In addition, a key-value redis database that supports highly concurrent access is used during deduplication processing, so duplicate data received at similar times is filtered out effectively, the speed of deduplication processing is improved, and the system is highly extensible and general.

Drawings

The technical solution and other advantages of the present application will become apparent from the detailed description of the embodiments of the present application with reference to the accompanying drawings.

Fig. 1 is a scene schematic diagram of a commit data deduplication method provided in an embodiment of the present application.

Fig. 2 is a schematic flowchart of a commit data deduplication method according to an embodiment of the present application.

Fig. 3 is a schematic diagram of a selection interface of a commit data deduplication method according to an embodiment of the present application.

Fig. 4 is a schematic diagram of a configuration interface of a commit data deduplication method according to an embodiment of the present application.

Fig. 5 is a schematic structural diagram of a commit data deduplication device according to an embodiment of the present application.

Fig. 6 is a schematic structural diagram of a server according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

In the embodiments of the present application, "for example", "example" and "such as" are used to mean "serving as an example, instance or illustration". Any embodiment described herein as "for example," "exemplary," and "such as" is not necessarily to be construed as preferred or advantageous over other embodiments. The following description is presented to enable any person skilled in the art to make and use the invention. In the following description, details are set forth for the purpose of explanation. It will be apparent to one of ordinary skill in the art that the present invention may be practiced without these specific details. In other instances, well-known structures and processes are not shown in detail to avoid obscuring the description of the invention with unnecessary detail. Thus, the present invention is not intended to be limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.

Embodiments of the present application provide a commit data deduplication method, a commit data deduplication device, and a server, which are described in detail below.

Referring to fig. 1, fig. 1 is a schematic view of a scenario of a commit data deduplication method according to an embodiment of the present application. The network architecture of the commit data deduplication method comprises a data collector 11, a server 12 and a redis database 13 connected in sequence. The data collector 11 collects real-time data and uploads the collected data to the server 12, the server 12 performs deduplication processing on the data, and the redis database 13 caches the intermediate data produced while the server 12 performs deduplication processing. There may be a plurality of data collectors 11, servers 12 and redis databases 13. It should be noted that, in this embodiment of the application, each redis database 13 temporarily caches the target data of only one data type for a period of time.

The execution subject of the embodiment of the present application is the server 12, which performs deduplication processing on the data collected by the data collector 11 and caches the data in the redis database 13. The redis database 13 is a key-value in-memory database. During data deduplication processing, its key-value pairs support highly concurrent access and atomic operations on the database, which keeps resource usage as low as possible in terms of both time and space.

The embodiment of the application can be applied to big data scenarios in which the data has time continuity; for example, in a traffic operation scenario, motor vehicle data has time continuity. The data collector at a checkpoint uploads the passing records of motor vehicles passing through the checkpoint to the server. When a motor vehicle moves slowly or is stationary at the checkpoint, the data collector uploads several passing records of that vehicle within a short time, and these records differ from one another only in passing time. If the first passing record of the motor vehicle within a given filtering duration is retained, and the passing records uploaded afterwards within that duration are discarded as duplicate data, and the same operation is repeated for each following filtering duration, then duplicate data is filtered out effectively and excessive redundant data is not stored in the server.


It should be noted that the scene schematic diagram of the commit data deduplication method shown in fig. 1 is merely an example, intended to illustrate the technical solution of the embodiment of the present application more clearly, and does not constitute a limitation on the technical solution provided by the embodiment of the present application.

It should be further noted that the following description of the embodiments is not intended to limit the preferred order of the embodiments.

Referring to fig. 2, fig. 2 is a schematic flowchart of a commit data deduplication method according to an embodiment of the present application. The commit data deduplication method includes:

S1, displaying a selection interface of the target data, and receiving a data type selection operation through the selection interface.

The target data refers to data source data that needs to be subjected to deduplication processing. For example, in a traffic operation scenario, the target data includes commit data such as motor vehicle data, non-motor vehicle data, portrait data, personnel data, wifi data and electricity data. These different types of target data are classified in advance, so that the target data can be selected, through the data classification, as the data source data of a given data type. It is to be understood that a template name, a manufacturer name, a classification name and the like may be defined when the data is classified, so that each data type is classified in detail.

The selection interface is an initial background interface used for determining a certain data type according to data type selection operation and then acquiring target data of the data type.

The data type selection operation is a selection operation for selecting a certain data type.

Specifically, fig. 3 is a schematic diagram of a selection interface of the commit data deduplication method provided in the embodiment of the application. As shown in fig. 3, the selection interface is displayed, and target data that has been classified in advance can be selected through it. The data type selection operation is thus received through the selection interface; that is, the target data of the data type selected by the data type selection operation is determined through the selection interface.

It should be noted that this step acquires target data that has been classified in advance. Data whose data type cannot be selected through the selection interface is not target data to be deduplicated in the embodiment of the present application.

S2, determining the target data according to the data type selection operation, and acquiring metadata corresponding to the target data.

The metadata is data that describes data, and mainly refers to information describing the attributes of the data. For motor vehicle data, for example, the corresponding metadata is the data describing the attributes of the motor vehicle, such as device name, device code, lane number, license plate type, license plate color, passing time and vehicle disappearance time. In other words, the metadata is the information describing the attributes of a data type, determined according to the data type of the target data.

It should be noted that, the target data subjected to data classification in advance is defined by the data type of the target data.

Specifically, as shown in fig. 3, a data type is selected by the data type selection operation, and the metadata corresponding to the target data of the data type is displayed on the selection interface while the target data of the data type is acquired.

S3, updating the selection interface according to the metadata, and generating and displaying a configuration interface.

Specifically, fig. 4 is a schematic view of a configuration interface of the commit data deduplication method provided in the embodiment of the present application. As shown in fig. 3 and fig. 4, after a data type is determined through the selection interface and the metadata corresponding to the target data of that data type is acquired, the selection interface is updated, so that the configuration interface is generated and displayed. That is, the configuration interface is the background operation interface obtained after the selection interface is updated.

S4, receiving a deduplication parameter configuration operation through the configuration interface, determining the deduplication index field and the deduplication time field from the metadata according to the deduplication parameter configuration operation, and setting the filtering duration corresponding to the deduplication time field.

It can be understood that, in order to perform deduplication processing on target data, a deduplication determination rule needs to be set for the target data, and whether the target data is duplicate data is determined according to that rule. The deduplication index field and the deduplication time field in the embodiment of the application are the key fields of the deduplication determination rule, and they can be stored in the redis database for a specified, relatively long time.

The deduplication index field is mainly a characteristic field that identifies the intrinsic attributes of the target data, and the deduplication time field is mainly a characteristic field that identifies the time attribute of the target data. The filtering duration is the duration that corresponds to the deduplication time field. For example, for motor vehicle data, the device name, device code, license plate number, license plate type, license plate color and the like can be used as the deduplication index field, and the passing time, the vehicle disappearance time and the like can be used as the deduplication time field. The filtering duration is a fixed time range used to judge, according to the deduplication time field, whether the target data is duplicate data.

Specifically, as shown in fig. 4, in the configuration interface, a deduplication index field and a deduplication time field serving as a deduplication determination rule are selected from the displayed metadata through deduplication parameter configuration operation, and a filtering duration corresponding to the deduplication time field is set.

S5, receiving commit data, wherein the commit data comprises an index data segment of the deduplication index field and a time data segment of the deduplication time field.

For example, for motor vehicle data, if the license plate number is used as the deduplication index field and the passing time is used as the deduplication time field, then the field value of the license plate number (the actual plate number, e.g. 鄂AXXXXX) is the index data segment, and the field value of the passing time (the actual passing time, e.g. 2020-09-13 10:00:00) is the time data segment. It should be noted that the time data segment is generally a time point, and all time points use a uniform timestamp format.

Here, the commit data refers to target data that includes an index data segment and a time data segment; for example, the commit data includes "鄂AXXXXX" and "2020-09-13 10:00:00".
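By way of illustration only, extracting the index data segment and the time data segment from one received record, and converting the time point to a uniform timestamp, might look like the following sketch. The function name `split_segments`, the field identifiers `PlateNo` and `PassTime`, the "YYYY-MM-DD HH:MM:SS" input format and the choice to interpret times as UTC are assumptions of this sketch rather than details given in the application.

```python
from datetime import datetime, timezone
from typing import Mapping, Tuple


def split_segments(
    record: Mapping[str, str], index_field: str, time_field: str
) -> Tuple[str, float]:
    """Return (index data segment, time data segment as epoch seconds)."""
    index_segment = record[index_field]
    # Assumption: times arrive as "YYYY-MM-DD HH:MM:SS" and are treated as UTC.
    parsed = datetime.strptime(record[time_field], "%Y-%m-%d %H:%M:%S")
    return index_segment, parsed.replace(tzinfo=timezone.utc).timestamp()


# "AXXXXX" is a placeholder plate number.
first = split_segments(
    {"PlateNo": "AXXXXX", "PassTime": "2020-09-13 10:00:00"}, "PlateNo", "PassTime"
)
later = split_segments(
    {"PlateNo": "AXXXXX", "PassTime": "2020-09-13 10:00:30"}, "PlateNo", "PassTime"
)
gap_seconds = later[1] - first[1]   # difference between the two time data segments: 30.0
```

Converting every time point to the same numeric representation makes the later comparison against the filtering duration a simple subtraction.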

Specifically, the commit data is received, and deduplication processing is then performed on the commit data according to its index data segment and time data segment.

It is understood that the data type of the commit data should be the same as the data type of the target data; that is, after the deduplication determination rule is configured for the target data of a given data type, commit data of that data type is subjected to deduplication processing.

S6, based on the data cached in the redis database, performing deduplication processing on the commit data according to the index data segment, the time data segment and the filtering duration, and then caching the commit data in the redis database.

Note that the deduplication determination rule in the embodiment of the present application is as follows: when the index data segments of two pieces of commit data are the same and the difference between their time data segments is within the filtering duration, the two pieces of commit data are judged to be duplicate data.

For example, for motor vehicle data, the license plate number is used as the deduplication index field and the passing time is used as the deduplication time field. If the motor vehicle with license plate number 鄂AXXXXX passes through a checkpoint between 2020-09-13 10:00:00 and 2020-09-13 10:01:00, and the filtering duration is set to 30 s, then according to the deduplication determination rule, 鄂AXXXXX is the index data segment, and a given moment between 2020-09-13 10:00:00 and 2020-09-13 10:01:00 is the time data segment.

Assume that commit data for a vehicle passing through the checkpoint at 2020-09-13 10:00 exists, and that multiple pieces of commit data are received between 2020-09-13 10:00:00 and 2020-09-13 10:00:30, each with index data segment 鄂AXXXXX and a time data segment between 2020-09-13 10:00:00 and 2020-09-13 10:00:30. The commit data whose time data segment is 2020-09-13 10:00:00 is written into the redis database, and the commit data with time data segments from 2020-09-13 10:00:01 to 2020-09-13 10:00:30 is discarded as duplicate data. Next, the commit data of 2020-09-13 10:00:31 differs by more than 30 s from the time data segment of the 2020-09-13 10:00:00 commit data cached in the redis database, so it is not duplicate data and is written into the redis database as new data. The commit data with time data segments from 2020-09-13 10:00:32 to 2020-09-13 10:01:00 is then discarded as duplicate data, and so on.
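The worked example above can be reproduced with the following minimal sketch. It is written under stated assumptions: a Python dictionary stands in for the redis database, records arrive in time order, times are expressed as seconds after 10:00:00, and the function name `deduplicate` and the flag `refresh_on_duplicate` are hypothetical. With the flag left off, the cached time changes only when new data is written, which matches the example. Elsewhere the description also mentions updating the cached time data segment when a duplicate is discarded; turning the flag on models that reading, under which a continuous stream keeps extending the filtering window.

```python
from typing import Dict, Iterable, List, Tuple


def deduplicate(
    records: Iterable[Tuple[str, int]],
    filter_seconds: int,
    refresh_on_duplicate: bool = False,
) -> List[Tuple[str, int]]:
    """Keep records whose time differs from the cached time by more than filter_seconds."""
    cache: Dict[str, int] = {}   # stand-in for redis: index data segment -> cached time
    kept: List[Tuple[str, int]] = []
    for index_segment, seconds in records:
        cached = cache.get(index_segment)
        if cached is None or seconds - cached > filter_seconds:
            cache[index_segment] = seconds   # new data: write it to the cache
            kept.append((index_segment, seconds))
        elif refresh_on_duplicate:
            cache[index_segment] = seconds   # duplicate: discard, but move the cached time
    return kept


# One record per second from 10:00:00 to 10:01:00, as seconds after 10:00:00.
# "AXXXXX" is a placeholder plate number.
stream = [("AXXXXX", s) for s in range(0, 61)]
kept = deduplicate(stream, filter_seconds=30)
kept_offsets = [s for _, s in kept]   # [0, 31]: the records at 10:00:00 and 10:00:31
```

Only the records at 10:00:00 and 10:00:31 survive, while every other record in the minute falls within 30 s of the most recently written one and is discarded.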

Specifically, commit data that has passed deduplication processing is cached in the redis database. Deduplicating the commit data according to the index data segment and the time data segment means comparing the index data segment and time data segment of the commit data with those cached in the redis database to judge whether the commit data is duplicate data, so that the commit data is either discarded or written into the redis database as new data.

It should be noted that, in the traffic operation scenario, commit data is received one piece at a time. Whether the currently received commit data exceeds the filtering duration is therefore determined by comparing its time data segment with the time data segment of the last commit data cached in the redis database for the same index data segment; that is, it is determined whether the difference between these two time data segments exceeds the filtering duration.

The commit data deduplication method provided by the embodiment of the application determines target data of a given data type, and the metadata of that target data, through a selection interface. It then updates the selection interface to generate a configuration interface, through which a deduplication index field and a deduplication time field used for deduplication processing are selected from the metadata and a filtering duration corresponding to the deduplication time field is set. Finally, based on data cached in a redis database, received commit data of the same type as the target data is deduplicated according to the index data segment of its deduplication index field, the time data segment of its deduplication time field, and the filtering duration. Because the configuration interface for deduplication is generated dynamically from the initial selection interface, and the deduplication index field and the deduplication time field are selected dynamically through it, the flexibility and accuracy of deduplication processing are improved. In addition, a key-value redis database that supports highly concurrent access is used during deduplication processing, so duplicate data received at similar times is filtered out effectively, the speed of deduplication processing is improved, and the system is highly extensible and general.

In one embodiment, step S2 specifically includes the following steps:

S21, selecting the data type according to the data type selection operation, and determining target data corresponding to the data type.

S22, acquiring the metadata corresponding to the target data according to the data type.

The selection interface is provided with selection controls for the data type. The data type selection operation comprises selecting a template name, a manufacturer, and a classification name; a separate selection control is provided for each of these, and each selection control is a drop-down box control.

Specifically, as shown in fig. 3, in the selection interface, after a certain data type is selected through the data type selection operation, it is determined that the deduplication processing is performed on the target data of the data type, and according to the data type, the selection interface displays all metadata corresponding to the target data of the data type. The metadata includes, among other things, a name, a field, and a type, such as "device code DeviceID string", "passing time string", and the like.
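Steps S21 and S22 can be sketched as a lookup from the selected data type to its metadata. The following is a minimal illustration only: the catalog contents, the key values, the field identifier `PassTime`, and the function name `metadata_for` are assumptions made for the example and do not come from the application.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class MetaField:
    name: str   # display name, e.g. "device code"
    field: str  # field identifier, e.g. "DeviceID"
    type: str   # value type, e.g. "string"


# Hypothetical catalog keyed by (template name, manufacturer, classification name).
CATALOG: Dict[Tuple[str, str, str], List[MetaField]] = {
    ("vehicle passing", "vendor A", "checkpoint"): [
        MetaField("device code", "DeviceID", "string"),
        MetaField("passing time", "PassTime", "string"),  # "PassTime" is assumed
    ],
}


def metadata_for(template: str, manufacturer: str, classification: str) -> List[MetaField]:
    """S21/S22: resolve the selected data type and return its metadata."""
    return list(CATALOG.get((template, manufacturer, classification), []))
```

An unknown data type simply yields an empty list here; a real selection interface would more likely prevent such a selection through its drop-down controls.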

In one embodiment, step S3 specifically includes the following steps:

S31, displaying the metadata, and setting a field index box corresponding to the metadata, a field selection box corresponding to each metadata, and a field confirmation box.

S32, updating the selection interface, and generating and displaying the configuration interface.

The field index box is a text box control, the field selection box is a selection box control, and the field confirmation box is a button control. As shown in fig. 3, the field confirmation box specifically includes a deduplication index field confirmation box and a deduplication time field confirmation box.

Specifically, as shown in fig. 3 and 4, a field index box for searching metadata, a field selection box for selecting each metadata, and a field confirmation box for confirming that the selected metadata is used as a deduplication index field and a deduplication time field are provided on a selection interface, and then the selection interface is updated, and display controls of the deduplication index field, the deduplication time field, and a filtering duration are generated, thereby dynamically generating a configuration interface for deduplication processing. Wherein, the display control is a text box control.

In addition, as shown in fig. 4, when the display controls of the deduplication index field, the deduplication time field, and the filtering duration of the configuration interface are generated, deletion controls for the deduplication index field and the deduplication time field are also generated. That is, the selected deduplication index field and deduplication time field can be deleted through the deletion controls, and the deduplication index field and the deduplication time field can then be selected again through the field index box, the field selection box, and the field confirmation box of the metadata. The deletion control is a button control.

In an embodiment, the determining the deduplication index field and the deduplication time field according to the metadata and setting the filtering duration corresponding to the deduplication time field in step S4 specifically include the following steps:

S41, searching the deduplication index field and the deduplication time field through the field index box control.

S42, selecting the deduplication index field and the deduplication time field through the field selection box control.

S43, confirming the deduplication index field and the deduplication time field through the field confirmation box control.

S44, displaying the deduplication index field and the deduplication time field, and setting the filtering duration corresponding to the deduplication time field.

Specifically, as shown in fig. 4, the deduplication index field and the deduplication time field are searched for through the field index box, selected through the field selection box, and confirmed through the field confirmation box. The deduplication index field is then displayed in the display control of the deduplication index field, the deduplication time field is displayed in the display control of the deduplication time field, and the filtering duration corresponding to the deduplication time field is set in the display control of the filtering duration. In other words, the deduplication index field and the deduplication time field are displayed on the configuration interface as the deduplication determination rule, and the filtering duration corresponding to the deduplication time field is set, so that the deduplication index field, the deduplication time field, and the filtering duration that make up the deduplication determination rule are confirmed.

It will be appreciated that if the deduplication index field and the deduplication time field are selected directly through the field selection box, the deduplication index field and the deduplication time field need not be searched through the field index box.

It should be noted that the deduplication index field includes at least one metadata field, and the deduplication time field includes only one metadata field. As shown in fig. 4, for the deduplication index field, a field of the first metadata is selected as the deduplication index field 1, a field of the second metadata is selected as the deduplication index field 2, and so on.
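The constraints on the configured rule (one or more index fields, exactly one time field, and a filtering duration) can be captured in a small validated structure. This is a sketch under assumptions: the names `DedupRule` and `make_rule`, the field identifiers used in the usage note, and the requirement that the filtering duration be positive are illustrative choices, not part of the application.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class DedupRule:
    index_fields: List[str]  # one or more metadata fields
    time_field: str          # exactly one metadata field
    filter_seconds: int      # filtering duration for the time field


def make_rule(metadata_fields: Sequence[str], index_fields: Sequence[str],
              time_field: str, filter_seconds: int) -> DedupRule:
    """Validate a configuration against the metadata and build the rule."""
    if not index_fields:
        raise ValueError("at least one deduplication index field is required")
    unknown = [f for f in [*index_fields, time_field] if f not in metadata_fields]
    if unknown:
        raise ValueError(f"fields not present in the metadata: {unknown}")
    if filter_seconds <= 0:  # assumption: a non-positive window is rejected
        raise ValueError("filtering duration must be positive")
    return DedupRule(list(index_fields), time_field, filter_seconds)
```

For instance, `make_rule(["PlateNo", "DeviceID", "PassTime"], ["PlateNo", "DeviceID"], "PassTime", 30)` would describe a rule with two index fields and a 30 s filtering duration, where all three field identifiers are hypothetical.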

In one embodiment, step S6 specifically includes the following steps:

S61, judging whether the data cached in the redis database includes the index data segment; if not, determining that the commit data is not duplicate data and writing the commit data into the redis database.

In the present application, intermediate data produced during deduplication processing is cached in the redis database, whose key-value pairs support highly concurrent access, so that the deduplication processing speed is improved.

Specifically, when the latest transaction data is received, the index data segment and the time data segment of the transaction data can be determined, firstly, whether the index data segment is included in the data cached in the redis database is judged, if the index data segment is not cached in the redis database, the transaction data is not repeated data, and the transaction data is written into the redis database as new data.

For example, if a vehicle having a license plate number of AXXXXXXX, is passing through the bayonet at 2020-09-1310:00:00 to 2020-09-1310:01:00, the filtering period is set to 30 s.

When receiving commission data of a motor vehicle with a license plate number of Eye AXXXXXXXX, 2020-09-1310:00:05 passing through a card slot, firstly judging whether the commission data comprising the Eye AXXXXXXXX exist in a redis database, if not, judging that the commission data is not repeated data, and writing the commission data into the redis database as new data.

S62, if the data cached in the redis database includes the index data segment, judging whether the difference between the time data segment and the time data segment of the last commit data cached in the redis database is within the filtering duration; if not, determining that the commit data is not duplicate data and writing the commit data into the redis database; if so, determining that the commit data is duplicate data, discarding it, and updating the time data segment of the last commit data cached in the redis database to the time data segment.

Specifically, after step S61, if the index data segment of the commit data is already cached in the redis database, it is further determined whether the time data segment of the commit data exceeds the filtering duration. The commit data corresponding to the index data segment, that is, the last commit data cached in the redis database, is looked up, and the time data segment of the newly received commit data is compared with the time data segment of that last commit data. If the difference between the two does not exceed the filtering duration, the commit data is determined to be duplicate data and is discarded. If the difference exceeds the filtering duration, the commit data is determined not to be duplicate data and is written into the redis database as new data; it thereby becomes the last commit data stored in the redis database and serves as the comparison object for the next commit data received.

For example, in the case where the above-mentioned step S61 is carried out, assuming that committee data with a license plate number of "EYE AXXXXXX", 2020-09-1310:00:00 "passed through the card slot exists, if the committee data including" EYE AXXXXXXXX "exists in the redis database, 2020-09-1310:00:05 and 2020-09-13 are further added

10:00:00, the difference time is 5s, and the filtering time length is not exceeded by 30s, so that the data of the committee of the Hubei AXXXXXXXX motor vehicle 2020-09-1310:00:05 passing through the checkpoint is removed as repeated data. Next, when receiving committee data of a motor vehicle with the license plate number of the jaw AXXXXXX, 2020-09-1310:00: 31 passing through the bayonet, the data is not duplicated data because the difference between 2020-09-1310:00: 31 and 2020-09-1310:00:00 is more than 30s, and is written into a redis database as new data, and the committee data passing through the bayonet after 2020-09-1310:00: 31 compares the time data segment with 2020-09-1310:00: 31 to judge whether the data is duplicated data.
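The decision made in steps S61 and S62 can be sketched as follows, with an in-memory dictionary standing in for the redis database. The function name `is_duplicate`, the timestamp format, and the `refresh_on_duplicate` switch are assumptions made for illustration. The switch exists because step S62 mentions updating the cached time data segment when a duplicate is found, whereas the worked example compares 10:00:31 against 10:00:00; the default below follows the worked example.

```python
from datetime import datetime, timedelta
from typing import Dict

FMT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp layout


def is_duplicate(cache: Dict[str, datetime], index_segment: str,
                 time_segment: str, filter_seconds: int,
                 refresh_on_duplicate: bool = False) -> bool:
    """Return True if the record is a duplicate; update `cache` as a side effect."""
    current = datetime.strptime(time_segment, FMT)
    last = cache.get(index_segment)
    if last is None:
        # S61: index data segment not cached, so the record is new.
        cache[index_segment] = current
        return False
    if abs(current - last) > timedelta(seconds=filter_seconds):
        # S62: difference exceeds the filtering duration, so the record is new.
        cache[index_segment] = current
        return False
    # S62: difference within the filtering duration, so the record is a duplicate.
    if refresh_on_duplicate:
        cache[index_segment] = current
    return True
```

With a 30 s filtering duration and the default setting, the records at 10:00:00, 10:00:05, and 10:00:31 are classified as new, duplicate, and new, matching the example above. With `refresh_on_duplicate=True`, the record at 10:00:31 would instead be compared against 10:00:05 and treated as a duplicate.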

In one embodiment, step S62 specifically includes the following steps:

S621, generating a redis key name corresponding to the commit data from the deduplication index field and the index data segment.

S622, generating a redis key value corresponding to the commit data from the time data segment.

For example, for a piece of committee data with an index data segment of jaw axxxxx and a time data segment of 2020-09-1310:00:05, "prefix _ license plate number field name _ jaw axxxxx" may be used as the redis key name of the piece of committee data, and "2020-09-1310: 00: 05" may be used as the redis key value of the piece of committee data.

It should be noted that, if a plurality of metadata fields are selected as the deduplication index fields, the field names and field values of the metadata fields are arranged according to a preset form to generate a final index data segment, where the preset form may be: the "field value of the field name | deduplication index field 1 of the prefix _ deduplication index field 1, the field value of the field name | deduplication index field 2 of the deduplication index field 2", for example, for a piece of delegation data including the index data segment "jaw axxxxx" and the index data segment "device name bxxxx", the redis key name corresponding to the piece of delegation data is given by "prefix _ lane no | jaw axxxxxxx, DeviceID | bxxxx". The preset form is only an example, and the embodiment of the present application is not limited.

And S623, combining the redis key name and the redis key value into a judging redis key value corresponding to the committee data.

For example, the redis key name "prefix _ license plate field name _ jaw axxxxx" and the redis key value "2020-09-1310: 00: 05" are combined into a key value pair (prefix _ license plate field name _ jaw axxxxx, 2020-09-1310:00: 05) as the re-determination redis key value corresponding to the piece of committee data.

S624, inserting the reds key value pairs into the reds database, and judging whether the data cached in the reds database includes the reds key name.

And S625, if the redis key name is not included, determining that the commit data is not repeated data, and writing the redis key value pair into the redis database.

For example, a key value pair (prefix _ license plate number field name _ jaw axxxxx, 2020-09-1310:00: 05) is inserted into a redis database, whether a key name "prefix _ license plate number field name _ jaw axxxxx" exists in the redis database is judged, if not, the commit data is determined not to be duplicated data, and the key value pair (prefix _ license plate number field name _ jaw axxxxx, 2020-09-1310:00: 05) is written into the redis database.

In one embodiment, step S63 specifically includes the following steps:

s631, judging whether the difference value between the redis key value and the redis key value of the last piece of committee data cached in the redis database is within the filtering duration.

S632, if the data is not in the filtering duration, determining that the committee data is not repeated data, and writing the redis key value pair into the redis database.

And S633, if the transaction data are within the filtering duration, determining that the transaction data are repeated data and discarded, and updating a redis key value corresponding to the last transaction data cached in the redis database into the redis key value.

For example, in the example of step S625, if the key name "prefix _ license plate number field name _ jaw axxxxx" exists in the redis database, the redis key value "2020-09-1310: 00: 05" is compared with the redis key value "2020-09-1310: 00: 00" of the last piece of commit data cached in the redis database, the difference time between the two values is 5S, and if the filtering time length is not 30S, the commit data with the index data field of jaw axxxxx and the time data field of 2020-09-1310:00:05 is removed as the repeated data.

Based on the above embodiments, a complete embodiment is provided below to describe in detail the implementation process of the commit data deduplication method:

First, a selection interface for target data is displayed. A template name, a manufacturer name, and a classification name are selected on the selection interface through a data type selection operation, so that target data of a certain data type is selected and it is determined that deduplication processing is to be performed on the target data of that data type. The selection interface then displays the metadata corresponding to the target data according to the data type, where the metadata includes the field, name, and type of each metadata item, and the selection interface is updated according to the metadata to generate and display a configuration interface.

Then, on the configuration interface, a deduplication index field and a deduplication time field are determined from the metadata, and a filtering duration corresponding to the deduplication time field is set. The deduplication index field, the deduplication time field, and the filtering duration serve as the deduplication parameters of a deduplication determination rule, which is as follows: when the index data segments of two pieces of commit data are the same and the difference between their time data segments is within the filtering duration, the two pieces of commit data are determined to be duplicate data.

Finally, commit data is received. The data type of the commit data is the same as that of the target data, and the commit data is real-time data. The commit data includes an index data segment of the deduplication index field (namely the field value of the deduplication index field) and a time data segment of the deduplication time field (namely the field value of the deduplication time field). The deduplication index field and the index data segment are used to generate the redis key name corresponding to the commit data, the time data segment is used to generate the redis key value, and the redis key name and the redis key value together form the redis key-value pair corresponding to the commit data. The redis key-value pair of the commit data is then compared with the data cached in the redis database, specifically: it is judged whether the redis key name of the commit data exists in the redis database; if not, the commit data is determined not to be duplicate data and is written into the redis database as new data; if it exists, it is further judged whether the difference between the redis key value of the commit data and the redis key value of the last commit data cached in the redis database is within the filtering duration. If the difference is within the filtering duration, the commit data is determined to be duplicate data and is discarded; if not, the commit data is determined not to be duplicate data and is written into the redis database as new data. These steps are repeated for each filtering duration period, so that the received commit data is deduplicated and then cached in the redis database.

When the data deduplication method provided by the embodiment of the application was verified in an actual project, before the method was used, the Wi-Fi electronic fence equipment of a certain city generated about 300 thousand data records every hour and 4000 thousand every day, which occupied a large share of the storage capacity of the original storage system. After the method was used, more than 50% of repeated data could be filtered out by reasonably configuring the deduplication index field, the deduplication time field, and the filtering duration. Moreover, because the deduplication process uses the key-value pairs of the redis database, which support highly concurrent access, deduplication is fast and the efficiency of data access is essentially unaffected.

On the basis of the method described in the foregoing embodiment, the present embodiment is further described from the perspective of a commit data deduplication device. Referring to fig. 5, fig. 5 is a schematic structural diagram of the commit data deduplication device provided in the embodiment of the present application, and the commit data deduplication device includes:

the data type selection module 51 is configured to display a selection interface of target data, and receive a data type selection operation through the selection interface;

a metadata obtaining module 52, configured to determine the target data according to the data type selection operation, and obtain metadata corresponding to the target data;

an interface generating module 53, configured to update the selection interface according to the metadata, and generate and display a configuration interface;

a configuration module 54, configured to receive a deduplication parameter configuration operation through the configuration interface, and determine a deduplication index field and a deduplication time field according to the deduplication parameter configuration operation and the metadata;

a receiving module 55, configured to receive transaction data, where the transaction data includes an index data segment of the deduplication index field and a time data segment of the deduplication time field;

a deduplication module 56, configured to deduplicate the commit data according to the index data segment and the time data segment, based on the data cached in the redis database, and then cache the commit data in the redis database.

In the commit data deduplication device provided by the embodiment of the application, target data of a certain data type and the metadata of the target data are determined through a selection interface, and the selection interface is then updated to generate a configuration interface. Through the configuration interface, a deduplication index field and a deduplication time field for deduplication processing are selected from the metadata, and a filtering duration corresponding to the deduplication time field is set. Finally, received commit data of the same type as the target data is deduplicated, based on data cached in a redis database, according to the index data segment of the deduplication index field, the time data segment of the deduplication time field, and the filtering duration. Because the configuration interface for deduplication is generated dynamically from the initial selection interface, and the deduplication index field and the deduplication time field are selected dynamically through it, the flexibility and accuracy of deduplication processing are improved. In addition, because deduplication uses a redis database, whose key-value pairs support highly concurrent access, duplicate data received at similar times can be filtered effectively and the deduplication processing speed is improved; the device also has strong system scalability and high universality.

In an embodiment, the metadata obtaining module 52 is further configured to select the data type according to the data type selecting operation, and determine target data corresponding to the data type; and acquiring the metadata corresponding to the target data according to the data type.

In one embodiment, the interface generating module 53 is further configured to display the metadata, set a field index box corresponding to the metadata, a field selection box corresponding to each metadata, and a field confirmation box; and updating the selection interface, and generating and displaying the configuration interface.

In one embodiment, the configuration module 54 is further configured to search the deduplication index field and the deduplication time field through the field index box; selecting the deduplication index field and the deduplication time field through the field selection box; confirming the deduplication index field and the deduplication time field through the field confirmation box; and displaying the deduplication index field and the deduplication time field, and setting the filtering duration corresponding to the deduplication time field.

In one embodiment, the deduplication module 56 is further configured to determine whether the data cached in the redis database includes the index data segment, and if the redis database does not include the index data segment, determine that the commit data is not duplicate data and write the commit data into the redis database; if the data cached in the redis database comprises the index data segment, judging whether a difference value between the time data segment and the time data segment of the last committee data cached in the redis database is within the filtering time length, if not, determining that the committee data is not repeated data, and writing the committee data into the redis database; and if so, determining that the transaction data is repeated data and discarded, and updating the time data segment of the last transaction data cached in the redis database into the time data segment.

In an embodiment, referring to fig. 6, fig. 6 is a schematic structural diagram of a server provided in the embodiment of the present application. The server includes a processor 601, a memory 602, and a bus 603, where the processor 601 and the memory 602 communicate with each other through the bus 603. The memory 602 stores program instructions that are executable by the processor 601 to perform the methods provided by the above method embodiments, including, for example: displaying a selection interface of target data, and receiving a data type selection operation through the selection interface; determining the target data according to the data type selection operation, and acquiring metadata corresponding to the target data; updating the selection interface according to the metadata, and generating and displaying a configuration interface; receiving a deduplication parameter configuration operation through the configuration interface, determining a deduplication index field and a deduplication time field from the metadata according to the deduplication parameter configuration operation, and setting a filtering duration corresponding to the deduplication time field; receiving commit data, where the commit data includes an index data segment of the deduplication index field and a time data segment of the deduplication time field; and, based on the data cached in the redis database, deduplicating the commit data according to the index data segment, the time data segment, and the filtering duration, and then caching the commit data in the redis database.

In one embodiment, the present application provides a non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the methods provided by the above method embodiments, including, for example: displaying a selection interface of target data, and receiving a data type selection operation through the selection interface; determining the target data according to the data type selection operation, and acquiring metadata corresponding to the target data; updating the selection interface according to the metadata, and generating and displaying a configuration interface; receiving a deduplication parameter configuration operation through the configuration interface, determining a deduplication index field and a deduplication time field from the metadata according to the deduplication parameter configuration operation, and setting a filtering duration corresponding to the deduplication time field; receiving commit data, where the commit data includes an index data segment of the deduplication index field and a time data segment of the deduplication time field; and, based on the data cached in the redis database, deduplicating the commit data according to the index data segment, the time data segment, and the filtering duration, and then caching the commit data in the redis database.

The computer storage medium may include: Read-Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.

Because the instructions stored in the computer storage medium can execute the steps in any method provided in the embodiments of the present application, the beneficial effects that can be achieved by any method provided in the embodiments of the present application can be achieved, for details, see the foregoing embodiments, and are not described herein again.

The commit data deduplication method, device, and server provided by the embodiments of the application have been described in detail above. Specific examples are used herein to explain the principle and implementation of the application, and the description of the embodiments is only intended to help in understanding the technical solution and core idea of the application. Those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and that such modifications or substitutions do not depart from the spirit and scope of the present disclosure as defined by the appended claims.

Claims

1. A method for de-duplicating data of a transaction, comprising:

2. The commit data deduplication method according to claim 1, wherein determining the target data according to the data type selection operation, and acquiring metadata corresponding to the target data specifically includes:

3. The transaction data deduplication method according to claim 1, wherein updating the selection interface according to the metadata, and generating and displaying a configuration interface specifically includes:

4. The method according to claim 3, wherein the determining the deduplication index field and the deduplication time field according to the metadata and setting the filtering duration corresponding to the deduplication time field specifically include:

5. The transaction data deduplication method of claim 4, wherein the deduplication index field comprises at least one field of the metadata, and wherein the deduplication time field comprises only one field of the metadata.

6. The commit data deduplication method according to claim 4, wherein based on data cached in a redis database, according to the index data segment, the time data segment, and the filtering duration, the commit data is deduplicated and cached in the redis database, and specifically includes:

7. The commit data deduplication method according to claim 6, wherein the determining whether the data cached in the redis database includes the index data segment, and if the redis database does not include the index data segment, determining that the commit data is not duplicate data and writing the commit data into the redis database specifically includes:

8. The commit data deduplication method of claim 7, wherein the determining determines whether a difference between the time data segment and a time data segment of a last commit data cached within the redis database is within the filtering duration, and if not within the filtering duration, determines that the commit data is not duplicate data, and writes the commit data into the redis database; if so, determining that the commit data is repeated data and discarded, and updating the time data segment of the last commit data cached in the redis database to the time data segment, specifically including:

9. An apparatus for de-duplicating data of a transaction, comprising:

10. A server, comprising: a processor, a memory, a bus, and a computer program that is stored on the memory and can run on the processor, wherein the processor and the memory communicate with each other through the bus; the processor, when executing the computer program, implements the commit data deduplication method as recited in any one of claims 1-8.