CN113282616A - Incremental time sequence data conflict detection method and device and storage medium

Incremental time sequence data conflict detection method and device and storage medium

Info

Publication number
CN113282616A
Authority
CN
China
Prior art keywords
data set
record
incremental
inverted index
target data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110547706.XA
Other languages
Chinese (zh)
Inventor
袁俊
魏庆波
任新宇
汪文涛
张少男
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Resource Power Technology Research Institute
Original Assignee
China Resource Power Technology Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Resource Power Technology Research Institute
Priority to CN202110547706.XA
Publication of CN113282616A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/2433 Query languages
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2453 Query optimisation
    • G06F16/24534 Query rewriting; Transformation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2474 Sequence data queries, e.g. querying versioned data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/258 Data format conversion from or to a database

Abstract

The application discloses an incremental time-series data conflict detection method and apparatus, an electronic device, and a storage medium. The method comprises the following steps: after the Spark platform acquires an input incremental data set, preprocessing the incremental data set to obtain a target data set; establishing an inverted index on the primary key attribute of the target data set, and merging that inverted index into the historical data inverted index; traversing each record of the target data set, and determining, in the merged inverted index, the records whose primary key attribute does not satisfy the comparison rule against that record; and performing conflict detection on each record of the target data set and its corresponding records that do not satisfy the comparison rule, using a preset detect operator, to obtain conflict information. Because only the records of the incremental data set (the target data set) are traversed to look up records in the merged inverted index whose primary key attribute does not satisfy the rule, the historical data is not re-checked as a whole, which improves the efficiency of data conflict detection and achieves efficient data anomaly detection.

Description

Incremental time sequence data conflict detection method and device and storage medium
Technical Field
The present application relates to the field of distributed data management technologies, and in particular to an incremental time-series data conflict detection method and apparatus, an electronic device, and a computer-readable storage medium.
Background
With the advent of the big data era, data quality has become increasingly important, and data cleaning technology is receiving wide attention. Because data is now generated and acquired through diverse and open channels, its quality is uneven, and some data contains errors and redundancy such as inconsistency, missing values, and conflicts.
Because big data is large in volume and diverse in type, the performance of data anomaly detection is limited. Existing big data cleaning systems, such as BigDansing, have the following drawback: when a data set receives an increment, the newly added data set is merged with the historical data set and anomaly detection is run on the whole. Since the historical data has already been checked, every new increment causes the historical data to be checked again. Because the data set is huge, this repeated detection wastes resources and lowers detection efficiency and performance.
Disclosure of Invention
The application aims to provide an incremental time-series data conflict detection method that achieves efficient data anomaly detection and improves the performance of the whole data cleaning process. The specific scheme is as follows:
In a first aspect, the present application discloses an incremental time-series data conflict detection method, including:
after the Spark platform acquires an input incremental data set, preprocessing the incremental data set to obtain a target data set;
establishing an inverted index on the primary key attribute of the target data set, and merging that inverted index into the historical data inverted index;
traversing each record of the target data set, and determining, in the merged inverted index, the records whose primary key attribute does not satisfy the comparison rule against that record;
and performing conflict detection on each record of the target data set and its corresponding records that do not satisfy the comparison rule, using a preset detect operator, to obtain conflict information.
Optionally, preprocessing the incremental data set to obtain a target data set includes:
slicing the incremental data set to generate RDD data, and converting the RDD data into data in a DataFrame format;
and extracting the DataFrame format data by using an SQL statement to obtain a target data set.
Optionally, traversing each record of the target data set and determining, in the merged inverted index, the records whose primary key attribute does not satisfy the comparison rule against that record includes:
when the comparison rule is an equality comparison rule, traversing each record in the target data set, and determining the records for which the attribute value of the primary key attribute is equal to the corresponding attribute value in the merged inverted index;
when the comparison rule is an inequality comparison rule, traversing each record in the target data set, and determining the records for which the attribute value of the primary key attribute is greater than or equal to the corresponding attribute value in the merged inverted index.
Optionally, after the Spark platform acquires the input incremental data set, the method further includes:
judging whether the table name and the name of the primary key attribute corresponding to the incremental data set are correct;
if not, correcting the erroneous table name and primary key attribute name;
and if so, preprocessing the incremental data set to obtain a target data set.
Optionally, before performing conflict detection on each record of the target data set and its corresponding records that do not satisfy the comparison rule by using a preset detect operator to obtain conflict information, the method further includes:
creating a BaseDetect class, and passing the BaseDetect class to the Spark platform through a setClass interface;
calling a method that passes parameters to the Spark platform according to the parameter requirements declared in the BaseDetect class, to generate the detect operator.
In a second aspect, the present application discloses an incremental time-series data conflict detection apparatus, including:
a preprocessing module, configured to preprocess the incremental data set to obtain a target data set after the Spark platform acquires the input incremental data set;
a merging module, configured to establish an inverted index on the primary key attribute of the target data set and merge that inverted index into the historical data inverted index;
a determining module, configured to traverse each record of the target data set and determine, in the merged inverted index, the records whose primary key attribute does not satisfy the comparison rule against that record;
and a detection module, configured to perform conflict detection on each record of the target data set and its corresponding records that do not satisfy the comparison rule by using a preset detect operator, to obtain conflict information.
Optionally, the preprocessing module includes:
a generating unit, configured to slice the incremental data set, generate RDD data, and convert the RDD data into data in a DataFrame format;
and an extraction unit, configured to extract the DataFrame format data by using an SQL statement to obtain a target data set.
Optionally, the determining module includes:
a first determining unit, configured to, when the comparison rule is an equality comparison rule, traverse each record in the target data set and determine the records for which the attribute value of the primary key attribute is equal to the corresponding attribute value in the merged inverted index;
a second determining unit, configured to, when the comparison rule is an inequality comparison rule, traverse each record in the target data set and determine the records for which the attribute value of the primary key attribute is greater than or equal to the corresponding attribute value in the merged inverted index.
In a third aspect, the present application discloses an electronic device, comprising:
a memory for storing a computer program;
and a processor for implementing the steps of the incremental time-series data conflict detection method described above when executing the computer program.
In a fourth aspect, the present application discloses a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the incremental time-series data conflict detection method described above.
The application provides an incremental time-series data conflict detection method, which comprises the following steps: after the Spark platform acquires an input incremental data set, preprocessing the incremental data set to obtain a target data set; establishing an inverted index on the primary key attribute of the target data set, and merging that inverted index into the historical data inverted index; traversing each record of the target data set, and determining, in the merged inverted index, the records whose primary key attribute does not satisfy the comparison rule against that record; and performing conflict detection on each record of the target data set and its corresponding records that do not satisfy the comparison rule, using a preset detect operator, to obtain conflict information.
In other words, the method preprocesses the newly added data set (the incremental data set) to obtain a target data set, builds an inverted index for the target data set, and merges it into the historical data inverted index. It then traverses the records of the target data set, looks up the records in the current inverted index whose primary key attribute does not satisfy the rule, and runs the detect operator on them to obtain conflict information. Because only the records of the incremental data set are traversed, the method avoids the drawback of the related art, in which the newly added data set must be merged with the historical data set and the whole must be re-checked, causing a large amount of repeated detection and wasted resources. The method therefore improves the efficiency of data conflict detection, achieves efficient data anomaly detection, and improves the performance of the whole data cleaning process. The application also provides an incremental time-series data conflict detection apparatus, an electronic device, and a computer-readable storage medium, which have the same beneficial effects and are not described again here.
Drawings
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are only embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of an incremental time-series data conflict detection method according to an embodiment of the present disclosure;
Fig. 2 is a schematic structural diagram of an incremental time-series data conflict detection apparatus according to an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
At present, data cleaning mainly focuses on the following four aspects: (1) cleaning data with abnormal attributes; (2) cleaning duplicate data; (3) cleaning domain-specific data; (4) cleaning domain-independent data. In general, data cleaning comprises three steps: (1) specifying quality constraint rules; (2) detecting abnormal data; (3) repairing the abnormal data. Typically, after the first step is performed, the cleaning process iterates the second and third steps until the data is completely "clean". Many cleaning tools have been built on this basis, such as IBM QualityStage, SAP BusinessObjects, Oracle Enterprise Data Quality, and Google Refine. However, these tools focus on data governed by low-level ETL (Extract, Transform, Load) rules and do not support more complex user-defined data quality constraint rules. They can therefore be applied only to certain database systems and fixed scenarios and cannot handle some common data quality problems. They also cannot be used for special cases, such as time, space, or displacement data, where whether the data is erroneous must be judged according to the specific situation. In addition, when a data set receives an increment, the newly added data set is merged with the historical data set and anomaly detection is run on the whole. This re-checks the historical data, and because the data set is huge, the repeated detection wastes resources and lowers detection efficiency and performance.
To address the above technical problem, this embodiment provides an incremental time-series data conflict detection method that achieves efficient data anomaly detection and improves the performance of the whole data cleaning process. Referring to Fig. 1, which is a flowchart of the incremental time-series data conflict detection method provided in this embodiment of the present application, the method specifically includes:
S101, after the Spark platform obtains the input incremental data set, preprocessing the incremental data set to obtain a target data set.
Spark is a distributed computing platform that builds on the MapReduce computing model; it is easy to use, general-purpose, fast, and supports multiple resource managers. The incremental data set is defined relative to the data set that the Spark platform first acquired as input. This embodiment does not limit the content or size of the incremental data set, which may be determined according to the actual situation. After the Spark platform acquires the input incremental data set, it preprocesses the incremental data set to obtain a target data set. This embodiment does not limit the specific preprocessing procedure, as long as it yields a data form suitable for processing on the distributed Spark platform. The target data set may be obtained by processing and converting the entire input data stream, or by extracting only the part of the data that is of interest, as needed.
In a specific embodiment, in order to effectively improve the detection efficiency and the detection performance, the preprocessing the incremental data set to obtain the target data set may include:
slicing the incremental data set to generate RDD data, and converting the RDD data into data in a DataFrame format;
and extracting the DataFrame format data by using the SQL statement to obtain a target data set.
In this embodiment, the input incremental data set is sliced to obtain RDD data, and the RDD data is then converted into DataFrame format so that subsequent data operations can be performed. SQL statements are used to extract (select and project) the DataFrame data, so that only the data attributes the user is interested in are extracted. This avoids running anomaly detection on the entire data stream and only afterwards selecting the part of the results that is of interest, which would perform a large amount of unnecessary detection and result in low efficiency and poor performance. Slicing also cuts the large data set into many small data sets before conversion to DataFrame format, so the data is processed in a distributed manner. Compared with the single-machine, full-data anomaly detection of the related art, this effectively improves the efficiency of data anomaly detection.
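The slice, convert, and extract sequence can be illustrated with a small stand-in that does not use Spark at all. The sketch below assumes an in-memory SQLite table in place of a DataFrame and plain Python lists in place of RDD partitions; the table name `increment`, the columns `id`, `a`, `b`, `note`, and the filter condition are all hypothetical and chosen only for illustration.

```python
import sqlite3

def slice_records(records, num_slices):
    """Split records into num_slices partitions (round-robin), mimicking slicing."""
    return [records[i::num_slices] for i in range(num_slices)]

def preprocess(records, num_slices=2):
    """Load the sliced records into a table, then select and project with SQL."""
    partitions = slice_records(records, num_slices)
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: id, key attribute a, attribute of interest b, unused note.
    conn.execute("CREATE TABLE increment (id INTEGER, a TEXT, b TEXT, note TEXT)")
    for part in partitions:
        conn.executemany("INSERT INTO increment VALUES (?, ?, ?, ?)", part)
    # Projection keeps only the attributes of interest; selection drops rows
    # without a key value.
    rows = conn.execute(
        "SELECT id, a, b FROM increment WHERE a IS NOT NULL ORDER BY id"
    ).fetchall()
    conn.close()
    return rows

incremental = [
    (1, "sensor-1", "10", "x"),
    (2, None, "11", "y"),
    (3, "sensor-2", "12", "z"),
]
target = preprocess(incremental)
```

In an actual deployment the same select-and-project step would be expressed against Spark's own DataFrame and SQL facilities; the point here is only that the target data set is a narrowed view of the increment.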
In a specific embodiment, in order to improve conflict detection efficiency and data utilization, after the Spark platform acquires the input incremental data set, the method further includes:
judging whether the table name and the name of the primary key attribute corresponding to the incremental data set are correct;
if not, correcting the erroneous table name and primary key attribute name;
and if so, preprocessing the incremental data set to obtain a target data set.
That is, after the Spark platform acquires the input incremental data set, this embodiment also checks the table name and the primary key attribute name corresponding to the incremental data set. If either is incorrect, it is corrected; if both are correct, the next operation is performed, namely preprocessing the incremental data set to obtain a target data set. Before data cleaning or data conflict detection can run, the table name and the primary key attribute name of the data set must be given correctly. Correcting erroneous names at this point therefore improves conflict detection efficiency and data utilization.
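The source does not say how an incorrect name is corrected. The following minimal sketch assumes one possible approach: compare the supplied names against a known catalog and substitute the closest match, refusing to guess when nothing is similar enough. The `catalog` contents, the function names, and the 0.8 similarity cutoff are hypothetical.

```python
import difflib

# Hypothetical catalog: table name -> attribute names known to the system.
catalog = {"meter_readings": ["device_id", "ts", "value"]}

def resolve_name(given, known_names):
    """Return the given name if known, otherwise the closest known name.

    Raises ValueError when no known name is close enough to correct safely.
    """
    if given in known_names:
        return given
    matches = difflib.get_close_matches(given, known_names, n=1, cutoff=0.8)
    if not matches:
        raise ValueError(f"unknown name: {given!r}")
    return matches[0]

def validate(table, key):
    """Check (and, if needed, correct) the table name and primary key name."""
    table = resolve_name(table, list(catalog))
    key = resolve_name(key, catalog[table])
    return table, key
```

For instance, `validate("meter_reading", "devise_id")` resolves both misspellings to the catalog entries, while a name with no close match is rejected rather than silently replaced.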
S102, establishing an inverted index for the primary key attributes in the target data set, and merging the inverted index into the historical data inverted index.
A primary key attribute is a field or set of fields that uniquely identifies a record. An inverted index is used when records must be looked up by the value of an attribute: each entry in the index table contains an attribute value and the addresses of the records that have that value. It is called inverted because, instead of the record determining the attribute value, the attribute value determines the location of the record. In this embodiment, after the inverted index is established on the primary key attribute of the target data set, it is merged into the historical data inverted index, so that conflict detection can subsequently be performed between the target data set and the merged inverted index. The historical data inverted index is the inverted index built for the data sets that arrived before the target data set.
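Building and merging the inverted index can be sketched with ordinary dictionaries. The example below assumes records are `(record id, A value, B value)` tuples and that postings are kept in memory; the function names are hypothetical and a distributed implementation would store and merge the index differently.

```python
from collections import defaultdict

def build_inverted_index(records):
    """Map each value of attribute A to the (record id, B value) pairs that have it."""
    index = defaultdict(list)
    for rec_id, a, b in records:
        index[a].append((rec_id, b))
    return index

def merge_into(history_index, new_index):
    """Merge new_index into history_index in place and return it."""
    for key, postings in new_index.items():
        history_index[key].extend(postings)
    return history_index

history = build_inverted_index([(1, "k1", "x"), (2, "k2", "y")])
increment = build_inverted_index([(3, "k1", "z")])
merged = merge_into(history, increment)
```

After the merge, looking up a key returns both historical and newly arrived records that share it, which is what the later detection step relies on.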
S103, traversing each record of the target data set, and determining, in the merged inverted index, the records whose primary key attribute does not satisfy the comparison rule against that record.
This embodiment does not limit the specific content of the comparison rule, which may be an equality comparison rule or an inequality comparison rule. Each record in the target data set is traversed, and the merged inverted index is searched for records that do not satisfy the comparison rule against it. If such records exist, the subsequent step is performed, that is, conflict detection is run on the record and the corresponding records that do not satisfy the comparison rule. Because only the records of the target data set are traversed and the current index is searched only for records whose primary key attribute does not satisfy the rule, the efficiency of data conflict detection is improved.
In a specific embodiment, traversing each record of the target data set and determining, in the merged inverted index, the records whose primary key attribute does not satisfy the comparison rule against that record may include:
when the comparison rule is an equality comparison rule, traversing each record in the target data set, and determining the records for which the attribute value of the primary key attribute is equal to the corresponding attribute value in the merged inverted index;
when the comparison rule is an inequality comparison rule, traversing each record in the target data set, and determining the records for which the attribute value of the primary key attribute is greater than or equal to the corresponding attribute value in the merged inverted index.
That is, when the comparison rule is an equality comparison rule, this embodiment queries for records whose primary key attribute value in the target data set equals the corresponding attribute value in the merged inverted index; when the comparison rule is an inequality comparison rule, it queries for records whose primary key attribute value in the target data set is greater than or equal to the corresponding attribute value in the merged inverted index. Because data quality constraint rules (the comparison rules) fall into two types, equality comparison and inequality comparison, the incremental case is handled separately for each type. An equality comparison rule states that, for an attribute A (the key attribute) and an attribute B (any attribute of interest to the user), when two tuples have equal values of A, their values of B must be equal (or, depending on the rule, must be unequal); otherwise the two tuples are considered anomalous under the rule. An inequality comparison rule covers two cases for attributes A and B: when Ai ≥ Aj, Bi ≥ Bj must hold; or when Ai ≥ Aj, Bi < Bj must hold. Without loss of generality, the rule can equally be stated as: when Ai ≤ Aj, Bi ≤ Bj must hold; or when Ai ≤ Aj, Bi > Bj must hold.
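The two rule types can be written as small predicates over pairs of tuples. This sketch assumes each tuple is `(A value, B value)`, takes the "equal A implies equal B" form of the equality rule, and takes the "Ai ≤ Aj implies Bi ≤ Bj" form of the inequality rule; the function names are hypothetical.

```python
def violates_equality_rule(t1, t2):
    """Rule assumed: equal A implies equal B. Returns True when the pair conflicts."""
    return t1[0] == t2[0] and t1[1] != t2[1]

def violates_inequality_rule(t1, t2):
    """Rule assumed: A_i <= A_j implies B_i <= B_j, checked in both directions."""
    (a1, b1), (a2, b2) = t1, t2
    return (a1 <= a2 and b1 > b2) or (a2 <= a1 and b2 > b1)
```

Under the inequality rule, two tuples with the same value of A conflict whenever their values of B differ, since the implication then applies in both directions.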
S104, performing conflict detection on each record of the target data set and its corresponding records that do not satisfy the comparison rule by using a preset detect operator, to obtain conflict information.
The preset detect operator is set according to the specific comparison rule. The conflict information in this embodiment is the data information corresponding to the records that do not satisfy the comparison rule. Before performing conflict detection on each record of the target data set and its corresponding records that do not satisfy the comparison rule by using the preset detect operator, the method may further include: creating a BaseDetect class, and passing the BaseDetect class to the Spark platform through a setClass interface; and calling a method that passes parameters to the Spark platform according to the parameter requirements declared in the BaseDetect class, to generate the detect operator.
That is, this embodiment predefines a BaseDetect class, in which the parameter types, the number of parameters, and the return value type of the detect operator are declared. To use it, the user first inherits from the BaseDetect class. Then, through the setClass() interface, the user passes the BaseDetect subclass and its detect operator to the corresponding part of the Spark platform. Once the program has obtained this information, it calls a method that supplies parameters to the detect operator in the required number and format, and the system performs conflict detection according to the detection logic provided by the user.
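The registration pattern described here can be sketched in a few lines. Only the names BaseDetect and setClass come from the source; the `Platform` class, the `set_class` and `run` methods, the `detect` signature, and the `EqualityDetect` subclass are hypothetical stand-ins, and the real interface on the Spark side may look quite different.

```python
from abc import ABC, abstractmethod

class BaseDetect(ABC):
    """Declares the signature that every user-supplied detect operator follows."""

    @abstractmethod
    def detect(self, record, candidate):
        """Return conflict information for the pair, or None if there is none."""

class Platform:
    """Hypothetical stand-in for the platform-side component."""

    def __init__(self):
        self._detect_cls = None

    def set_class(self, cls):
        # Mirrors the setClass interface: accept only BaseDetect subclasses.
        if not issubclass(cls, BaseDetect):
            raise TypeError("detect operator must inherit from BaseDetect")
        self._detect_cls = cls

    def run(self, pairs):
        """Apply the registered operator to (record, candidate) pairs."""
        detector = self._detect_cls()
        conflicts = []
        for record, candidate in pairs:
            info = detector.detect(record, candidate)
            if info is not None:
                conflicts.append(info)
        return conflicts

class EqualityDetect(BaseDetect):
    """User-defined operator: records are (id, a, b); equal a requires equal b."""

    def detect(self, record, candidate):
        if record[1] == candidate[1] and record[2] != candidate[2]:
            return (record[0], candidate[0])
        return None

platform = Platform()
platform.set_class(EqualityDetect)
conflicts = platform.run([
    ((3, "k1", "z"), (1, "k1", "x")),
    ((4, "k2", "y"), (2, "k2", "y")),
])
```

Keeping the detection logic behind a base class lets the platform stay unchanged while users swap in operators for different comparison rules.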
Based on the above technical solution, this embodiment traverses each record in the target data set, looks up the records in the current inverted index whose primary key attribute does not satisfy the rule, and then uses the detect operator to perform conflict detection and obtain conflict information. Because only the records of the incremental data set (the target data set) are traversed, conflicts between the incremental data set and the historical data are detected efficiently, a large amount of unnecessary repeated detection is avoided, and the performance of the whole data cleaning process is greatly improved.
A specific example is provided below. It covers anomalous conflict detection under equality comparison rules and under inequality comparison rules. The example addresses data anomaly detection over large data sets with data increments. It targets the problem that a big data anomaly detection framework (Ocad), when reading incremental data online through the Spark Streaming API, cannot detect data anomalies efficiently. When data arrives for detection, the framework selectively stores the key features of the data already checked, selects from the historical data only the records that may conflict with the data under detection, and generates conflict information according to the data quality constraint rules.
1. Anomaly conflict detection under equality comparison rules.
(1) For a data set that arrives for the first time, when the historical data is empty, the system first traverses that data set and builds an inverted index on attribute A; the key of the inverted index is the value of attribute A, and the value is the record id together with the value of attribute B;
(2) the system traverses the first data set again and, for each record r, finds the records in the inverted index that have the same value of attribute A as r, and uses the detect operator to judge whether a conflict exists;
(3) for subsequently arriving data, namely the incremental data set, the system builds an inverted index on attribute A and merges it into the inverted index of the historical data;
(4) the system then traverses the incremental data set again and, for each record r in it, finds the records in the inverted index that have the same value of attribute A as r, and uses the detect operator to judge whether a conflict exists.
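The four steps above can be sketched in plain Python as follows. This is a minimal illustration and not the patented implementation: the record layout (`id`, `A`, `B`), the helper names, and the rule inside `detect` (records with the same A must have the same B) are all assumptions chosen for the example, and Spark is left out so that the index logic stays visible.

```python
from collections import defaultdict

def build_index(records, key_attr="A", value_attr="B"):
    """Map each value of key_attr to a list of (record id, value_attr value)."""
    index = defaultdict(list)
    for rec in records:
        index[rec[key_attr]].append((rec["id"], rec[value_attr]))
    return index

def merge_index(history, increment):
    """Merge the increment's inverted index into the historical one, in place."""
    for key, postings in increment.items():
        history[key].extend(postings)

def detect(rec_id, rec_b, other_id, other_b):
    """Hypothetical detect operator: same A but different B counts as a conflict."""
    return rec_id != other_id and rec_b != other_b

def find_conflicts(records, index, key_attr="A", value_attr="B"):
    """Traverse only the given records and probe the index by equal key."""
    conflicts = set()
    for rec in records:
        for other_id, other_b in index.get(rec[key_attr], []):
            if detect(rec["id"], rec[value_attr], other_id, other_b):
                pair = (min(rec["id"], other_id), max(rec["id"], other_id))
                conflicts.add(pair)
    return conflicts

# Steps (1) and (2): the first data set arrives while the history is empty.
history = defaultdict(list)
first_batch = [
    {"id": 1, "A": "sensor-1", "B": 10},
    {"id": 2, "A": "sensor-1", "B": 10},
    {"id": 3, "A": "sensor-2", "B": 7},
]
merge_index(history, build_index(first_batch))
first_conflicts = find_conflicts(first_batch, history)

# Steps (3) and (4): an incremental data set arrives and is merged, then probed.
increment = [
    {"id": 4, "A": "sensor-2", "B": 9},
    {"id": 5, "A": "sensor-3", "B": 1},
]
merge_index(history, build_index(increment))
incremental_conflicts = find_conflicts(increment, history)
```

In this sketch the second pass touches only the two incremental records, yet record 4 is still compared against historical record 3 because both share the key `sensor-2`.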
2. Anomaly conflict detection under inequality comparison rules.
(1) For data that arrives for the first time, when the historical data set is empty, the system first traverses that data and builds an ordered inverted index on attribute A, that is, the keys of the inverted index are sorted in ascending order; the key of the inverted index is the value of attribute A, and the value is the record id together with the value of attribute B; for each key, the minimum and maximum of the values whose keys are less than or equal to that key are computed;
(2) the system traverses the first data again and, for each record r in it, finds the records in the inverted index that have the same value of attribute A as r; because the minimum and maximum values for keys less than or equal to each key are maintained, the detect method can judge whether the record has a conflict;
(3) for subsequently arriving data, namely the incremental data set, the system builds an inverted index on attribute A and merges it into the inverted index of the historical data;
(4) the system then traverses the subsequently arriving data again and, for each record r, finds the records in the inverted index that have the same value of attribute A as r, and uses the detect method to judge whether a conflict exists.
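The ordered index with running minimum and maximum values can be sketched as below. This is an illustration under stated assumptions rather than the method as claimed: the class `OrderedIndex`, its method names, and the example rule in `detect` (B must not decrease as A increases) are hypothetical, and the prefix bounds are simply recomputed after every merge for clarity.

```python
import bisect

class OrderedIndex:
    """Inverted index on A with sorted keys and prefix min/max of B."""

    def __init__(self):
        self.keys = []        # distinct A values, ascending
        self.postings = {}    # A value -> list of (record id, B value)
        self.prefix_min = []  # prefix_min[i]: min B over keys[0..i]
        self.prefix_max = []  # prefix_max[i]: max B over keys[0..i]

    def merge(self, records):
        """Merge a batch (first data set or increment) into the index."""
        for rec in records:
            key = rec["A"]
            if key not in self.postings:
                bisect.insort(self.keys, key)
                self.postings[key] = []
            self.postings[key].append((rec["id"], rec["B"]))
        self._recompute_prefixes()

    def _recompute_prefixes(self):
        self.prefix_min, self.prefix_max = [], []
        lo, hi = float("inf"), float("-inf")
        for key in self.keys:
            values = [b for _, b in self.postings[key]]
            lo, hi = min(lo, min(values)), max(hi, max(values))
            self.prefix_min.append(lo)
            self.prefix_max.append(hi)

    def bounds_up_to(self, key):
        """Return (min B, max B) over all records with A <= key, or None."""
        i = bisect.bisect_right(self.keys, key) - 1
        if i < 0:
            return None
        return self.prefix_min[i], self.prefix_max[i]

def detect(rec, bounds):
    """Hypothetical rule: some record with A <= rec.A has a larger B."""
    _, max_b = bounds
    return max_b > rec["B"]

def find_conflicts(records, index):
    """Traverse only the given batch and test each record against the bounds."""
    return [r["id"] for r in records if detect(r, index.bounds_up_to(r["A"]))]

index = OrderedIndex()
first_batch = [
    {"id": 1, "A": 1, "B": 100},
    {"id": 2, "A": 2, "B": 120},
    {"id": 3, "A": 3, "B": 110},
]
index.merge(first_batch)
first_conflicts = find_conflicts(first_batch, index)

increment = [{"id": 4, "A": 4, "B": 130}, {"id": 5, "A": 5, "B": 125}]
index.merge(increment)
incremental_conflicts = find_conflicts(increment, index)
```

Keeping the prefix bounds means each record needs one binary search and one comparison to decide whether any conflicting record exists, instead of a scan over every record with a smaller key.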
The detect operator uses a dynamic loading mechanism: the detect operator used to check the data is implemented by the user as part of the data conflict detection process, and it is loaded only when it is used while the system is running.
Based on this embodiment, online anomaly detection on incremental data sets is implemented on the distributed platform Spark. The method takes a form similar to the Map-Reduce framework, so it is easy to deploy and easy for users to pick up. In addition, users only need to provide a simple SQL query statement over the data and override the detect operator of a predefined class, which makes the method simple to use. Performing conflict detection between the newly added data set (the incremental data set) and the historical data yields more complete anomaly information, improves data quality, avoids a large amount of unnecessary repeated detection, and greatly improves the performance of the whole data cleaning process.
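One way to picture load-on-first-use is the Python sketch below. It is only an analogy for the mechanism described here: the wrapper class `LazyDetect`, the module name `user_rules`, and the rule inside the user function are hypothetical, and the user module is fabricated in memory so that the example is self-contained.

```python
import importlib
import sys
import types

class LazyDetect:
    """Stores where a user-defined detect operator lives; imports it on first call."""

    def __init__(self, module_name, attr_name):
        self.module_name = module_name
        self.attr_name = attr_name
        self._operator = None

    @property
    def loaded(self):
        return self._operator is not None

    def __call__(self, *args):
        if self._operator is None:
            # The import happens here, the first time the operator is used.
            module = importlib.import_module(self.module_name)
            self._operator = getattr(module, self.attr_name)
        return self._operator(*args)

# Stand-in for a module the user would normally ship as a separate file.
def _user_detect(left, right):
    return left["B"] != right["B"]

user_module = types.ModuleType("user_rules")
user_module.detect = _user_detect
sys.modules["user_rules"] = user_module

detect = LazyDetect("user_rules", "detect")
loaded_before = detect.loaded
has_conflict = detect({"B": 1}, {"B": 2})
loaded_after = detect.loaded
```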
Referring to fig. 2, fig. 2 is a schematic structural diagram of an incremental time series data conflict detection apparatus provided in an embodiment of the present application. The apparatus described below and the incremental time series data conflict detection method described above may be referred to in correspondence with each other. In some specific embodiments, the apparatus includes:
the preprocessing module 201 is configured to preprocess the incremental data set to obtain a target data set after the Spark platform obtains the input incremental data set;
the merging module 202 is configured to establish an inverted index for the primary key attribute in the target data set, and merge the inverted index into the inverted index of the historical data;
the determining module 203 is configured to traverse each record of the target data set and determine the records in the merged inverted index that do not satisfy the comparison rule with the corresponding key attribute in each tuple;
and the detection module 204 is configured to perform conflict detection between each record of the target data set and the corresponding records that do not satisfy the comparison rule, using a detect operator established in advance, to obtain conflict information.
In some specific embodiments, the preprocessing module 201 includes:
the generating unit is used for slicing the incremental data set to generate RDD data and converting the RDD data into DataFrame format data;
and the extraction unit is used for extracting the DataFrame format data by using the SQL statement to obtain a target data set.
In some specific embodiments, the determining module 203 includes:
a first determining unit, configured to traverse each record in the target data set when the comparison rule is an equality comparison rule, and determine the records whose primary key attribute value is equal to the corresponding attribute value in the current inverted index;
a second determining unit, configured to traverse each record in the target data set when the comparison rule is an inequality comparison rule, and determine the records whose primary key attribute value is greater than or equal to the corresponding attribute value in the current inverted index.
In some specific embodiments, the apparatus further comprises:
the judging module is used for judging whether the table name and the primary key attribute name corresponding to the incremental data set are correct;
and the correcting module is used for correcting the erroneous table name and primary key attribute name if they are not correct.
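The text does not say how a wrong name is corrected. Purely as an illustration, the sketch below checks a name against a known catalog and, when it is wrong, substitutes the closest known name found with fuzzy string matching from Python's `difflib`; this matching strategy, the function name `check_and_correct`, and the catalog contents are all assumptions and are not taken from the application.

```python
import difflib

def check_and_correct(name, known_names, cutoff=0.6):
    """Return name if it is known, else the closest known name (assumed strategy)."""
    if name in known_names:
        return name
    matches = difflib.get_close_matches(name, known_names, n=1, cutoff=cutoff)
    if not matches:
        raise ValueError(f"no close match for {name!r}")
    return matches[0]

# Hypothetical catalog: table name -> attribute names.
catalog = {"turbine_readings": ["turbine_id", "ts", "power_kw"]}

table = check_and_correct("turbine_reading", list(catalog))
primary_key = check_and_correct("turbin_id", catalog[table])
```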
In some specific embodiments, the apparatus further comprises:
the establishing module is used for creating a BaseDetect class and passing the BaseDetect class to the Spark platform through a setClass interface;
the generating module is used for calling a method and passing parameters to the Spark platform according to the parameter requirements of the BaseDetect class, to generate the detect operator.
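The division of work between these two modules can be pictured with the Python analogy below. Only the names BaseDetect and setClass come from the text; the shape of the class, the `DetectionJob` registry standing in for the platform side, the `set_class` and `make_operator` methods, and the tolerance rule are hypothetical.

```python
from abc import ABC, abstractmethod

class BaseDetect(ABC):
    """Predefined base class; users override detect()."""

    def __init__(self, **params):
        self.params = params

    @abstractmethod
    def detect(self, record, candidate):
        """Return True when record and candidate conflict."""

class ToleranceDetect(BaseDetect):
    """Example user rule: B values may differ by at most a tolerance."""

    def detect(self, record, candidate):
        return abs(record["B"] - candidate["B"]) > self.params["tolerance"]

class DetectionJob:
    """Hypothetical stand-in for the platform-side registration interface."""

    def __init__(self):
        self._detect_class = None

    def set_class(self, detect_class):
        if not issubclass(detect_class, BaseDetect):
            raise TypeError("detect class must extend BaseDetect")
        self._detect_class = detect_class

    def make_operator(self, **params):
        # Instantiate the registered class with the supplied parameters.
        return self._detect_class(**params).detect

job = DetectionJob()
job.set_class(ToleranceDetect)
detect = job.make_operator(tolerance=0.5)
small_gap = detect({"B": 10.0}, {"B": 10.2})
large_gap = detect({"B": 10.0}, {"B": 11.0})
```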
Since the embodiment of the incremental time series data conflict detection apparatus corresponds to the embodiment of the incremental time series data conflict detection method, please refer to the description of the method embodiment for details of the apparatus embodiment, which are not repeated here.
An electronic device provided in the embodiments of the present application is introduced below. The electronic device described below and the incremental time series data conflict detection method described above may be referred to in correspondence with each other.
The application also discloses an electronic device, including:
a memory for storing a computer program;
and a processor for implementing the steps of the incremental time series data conflict detection method described above when executing the computer program.
Since the embodiment of the electronic device corresponds to the embodiment of the incremental time series data conflict detection method, please refer to the description of the method embodiment for details of the electronic device embodiment, which are not repeated here.
A computer-readable storage medium provided by an embodiment of the present application is described below. The computer-readable storage medium described below and the incremental time series data conflict detection method described above may be referred to in correspondence with each other.
The application also discloses a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the steps of the incremental time series data conflict detection method described above.
Since the embodiment of the computer-readable storage medium corresponds to the embodiment of the incremental time series data conflict detection method, please refer to the description of the method embodiment for details of the computer-readable storage medium embodiment, which are not repeated here.
The embodiments in this specification are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Because the apparatus disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is brief, and the relevant details can be found in the description of the method.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The incremental time series data conflict detection method, the incremental time series data conflict detection apparatus, the electronic device and the computer-readable storage medium provided by the application are described in detail above. The principles and embodiments of the present application are explained herein using specific examples, which are provided only to help understand the method and the core idea of the present application. It should be noted that those skilled in the art may make several improvements and modifications to the present application without departing from its principle, and such improvements and modifications also fall within the scope of the claims of the present application.

Claims (10)

1. An incremental time series data conflict detection method, comprising:
after the Spark platform acquires an input incremental data set, preprocessing the incremental data set to obtain a target data set;
establishing an inverted index for the primary key attributes in the target data set, and merging the inverted index into the historical data inverted index;
traversing each record of the target data set, and determining the records in the merged inverted index that do not satisfy the comparison rule with the corresponding key attribute in each tuple;
and performing conflict detection on each record of the target data set and the corresponding record which does not meet the comparison rule by using a preset detect operator to obtain conflict information.
2. The incremental time series data conflict detection method of claim 1, wherein preprocessing the incremental data set to obtain a target data set comprises:
slicing the incremental data set to generate RDD data, and converting the RDD data into data in a DataFrame format;
and extracting the DataFrame format data by using an SQL statement to obtain a target data set.
3. The incremental time series data conflict detection method of claim 2, wherein traversing each record of the target data set and determining the records in the merged inverted index that do not satisfy the comparison rule with the corresponding key attribute in each tuple comprises:
when the comparison rule is an equality comparison rule, traversing each record in the target data set, and determining the records whose primary key attribute value is equal to the corresponding attribute value in the merged inverted index;
when the comparison rule is an inequality comparison rule, traversing each record in the target data set, and determining the records whose primary key attribute value is greater than or equal to the corresponding attribute value in the merged inverted index.
4. The incremental time series data conflict detection method according to claim 2, wherein after the Spark platform acquires the input incremental data set, the method further comprises:
judging whether the table name and the primary key attribute name corresponding to the incremental data set are correct;
if not, correcting the erroneous table name and primary key attribute name;
and if so, preprocessing the incremental data set to obtain a target data set.
5. The incremental time series data conflict detection method according to claim 2, wherein before performing conflict detection on each record of the target data set and a corresponding record which does not satisfy the comparison rule by using a pre-established detect operator, the method further comprises:
creating a BaseDetect class, and passing the BaseDetect class to the Spark platform through a setClass interface;
calling a method and passing parameters to the Spark platform according to the parameter requirements of the BaseDetect class, to generate the detect operator.
6. An incremental time series data conflict detection apparatus, comprising:
the preprocessing module is used for preprocessing the incremental data set to obtain a target data set after the Spark platform obtains the input incremental data set;
the merging module is used for establishing an inverted index for the primary key attributes in the target data set and merging the inverted index into the historical data inverted index;
the determining module is used for traversing each record of the target data set and determining the record which does not meet the comparison rule with the corresponding key attribute in each tuple in the merged inverted index;
and the detection module is used for carrying out conflict detection on each record of the target data set and the corresponding record which does not meet the comparison rule by utilizing a preset detect operator to obtain conflict information.
7. The incremental time series data conflict detection apparatus of claim 6, wherein the preprocessing module comprises:
the generating unit is used for slicing the incremental data set, generating RDD data and converting the RDD data into data in a DataFrame format;
and the extraction unit is used for extracting the DataFrame format data by using the SQL statement to obtain a target data set.
8. The incremental time series data conflict detection apparatus of claim 6, wherein the determining module comprises:
a first determining unit, configured to traverse each record in the target data set when the comparison rule is an equality comparison rule, and determine the records whose primary key attribute value is equal to the corresponding attribute value in the merged inverted index;
a second determining unit, configured to traverse each record in the target data set when the comparison rule is an inequality comparison rule, and determine the records whose primary key attribute value is greater than or equal to the corresponding attribute value in the merged inverted index.
9. An electronic device, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the incremental time series data conflict detection method according to any one of claims 1 to 5 when executing said computer program.
10. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the incremental time series data conflict detection method according to any one of claims 1 to 5.
CN202110547706.XA 2021-05-19 2021-05-19 Incremental time sequence data conflict detection method and device and storage medium Pending CN113282616A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110547706.XA CN113282616A (en) 2021-05-19 2021-05-19 Incremental time sequence data conflict detection method and device and storage medium

Publications (1)

Publication Number Publication Date
CN113282616A true CN113282616A (en) 2021-08-20

Family

ID=77280001

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110547706.XA Pending CN113282616A (en) 2021-05-19 2021-05-19 Incremental time sequence data conflict detection method and device and storage medium

Country Status (1)

Country Link
CN (1) CN113282616A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104794123A (en) * 2014-01-20 2015-07-22 阿里巴巴集团控股有限公司 Method and device for establishing NoSQL database index for semi-structured data
CN111190906A (en) * 2019-12-31 2020-05-22 全球能源互联网研究院有限公司 Method for detecting data abnormality of sensor network
CN112559514A (en) * 2019-09-25 2021-03-26 上海哔哩哔哩科技有限公司 Information processing method and system

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114943021A (en) * 2022-07-20 2022-08-26 之江实验室 TB-level incremental data screening method and device
US11789639B1 (en) 2022-07-20 2023-10-17 Zhejiang Lab Method and apparatus for screening TB-scale incremental data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210820