CN115544092A

CN115544092A - Data detection method, device, equipment and storage medium

Info

Publication number: CN115544092A
Application number: CN202211281984.6A
Authority: CN
Inventors: 钟志明; 张�浩; 韩森
Original assignee: WeBank Co Ltd
Current assignee: WeBank Co Ltd
Priority date: 2022-10-19
Filing date: 2022-10-19
Publication date: 2022-12-30

Abstract

The application provides a data detection method, device, equipment and storage medium. The method comprises: querying data of distributed application server nodes, and formatting the queried data to obtain data model objects conforming to a predefined data model; shunting the data model objects to corresponding channels for aggregation to obtain aggregated data sequence pairs, and collecting the data model objects associated with the aggregated data sequence pairs into a data stream; setting a dynamic data window range based on the start time at which each data model object enters the data stream and the end time of each data model object in the data stream; and processing the aggregated data sequence pairs in the data stream based on the set dynamic data window range to generate a detection result for the queried data, wherein the detection result comprises whether the queried data is consistent with expected data.

Description

Data detection method, device, equipment and storage medium

Technical Field

The embodiment of the application relates to the technical field of data processing of financial technology (Fintech), and relates to but is not limited to a data detection method, a device, equipment and a storage medium.

Background

With the development of computer technology, more and more technologies are being applied in the financial field, and the traditional financial industry is gradually shifting to financial technology (Fintech). However, the security and real-time requirements of the financial industry also place higher demands on these technologies.

In the field of financial technology, projects are currently mostly deployed under a distributed architecture: applications and database instances are distributed across multiple nodes and regions, and the nodes communicate with each other through corresponding services. Current data consistency detection schemes are applied to relational database management systems with, for example, a MySQL master-slave architecture. They rely on MySQL binary log (Binlog) synchronization: a checksum of each data block is generated on the master library, the operation is transmitted to the slave library through the Binlog and executed there, and the checksum of the same data block is calculated on the slave library. Such a scheme therefore depends on the Binlog synchronization of the MySQL master-slave architecture. When multiple database types coexist in a distributed cross-domain scenario, for example when MySQL, TiDB and Oracle relational database systems all exist in one distributed architecture design, the different database types cannot form a master-slave architecture, so detection under a distributed architecture with multiple database types cannot be supported. Consequently, the related art cannot support cross-domain data consistency detection across domains and database types.

Disclosure of Invention

The embodiment of the application provides a data detection method, a device, equipment and a storage medium, which are used for solving the problem that cross-domain data consistency detection cannot be supported under the condition of cross-domain and cross-database types in the prior art.

The technical scheme of the embodiment of the application is realized as follows:

the embodiment of the application provides a data detection method, which comprises the following steps:

querying data of the distributed application server nodes, and formatting the queried data to obtain a data model object which accords with a predefined data model;

distributing the data model objects to corresponding channels for aggregation to obtain aggregated data sequence pairs, and collecting the data model objects associated with the aggregated data sequence pairs into a data stream;

setting a dynamic data window range based on a start time at which each data model object enters the data stream and an end time of each data model object in the data stream;

processing the aggregated data sequence pair in the data stream based on the set dynamic data window range to generate a detection result for the queried data; wherein the detection result comprises whether the queried data is consistent with expected data.

A data detection apparatus, the apparatus comprising:

the preprocessing module is used for inquiring data of the distributed application server nodes and formatting the inquired data to obtain a data model object which accords with a predefined data model;

the aggregation module is used for distributing the data model objects to the corresponding channels for aggregation to obtain an aggregated data sequence pair, and collecting each data model object associated with the aggregated data sequence pair into a data stream;

the window setting module is used for setting a dynamic data window range based on the starting time of each data model object entering the data stream and the tail end time of each data model object in the data stream;

a detection result matching module, configured to process the aggregated data sequence pair in the data stream based on a set dynamic data window range, and generate a detection result for the queried data; wherein the detection result comprises whether the inquired data is consistent with expected data.

A data detection apparatus comprising:

a memory for storing executable instructions; and the processor is used for realizing the method when executing the executable instructions stored in the memory.

A computer readable storage medium having stored thereon executable instructions for causing a processor to perform the method described above when executed.

The embodiment of the application has the following beneficial effects:

querying data of a distributed application server node, and formatting the queried data to obtain a data model object which accords with a predefined data model; that is to say, the data detection method provided by the application can process data of distributed application server nodes, namely data of different data sources, format the data of the different data sources, and support cross-domain and cross-library data detection; further, the data model objects are distributed to corresponding channels for aggregation to obtain aggregated data sequence pairs, and the data model objects associated with the aggregated data sequence pairs are collected into a data stream; setting a dynamic data window range based on the starting time of each data model object entering the data stream and the tail end time of each data model object in the data stream; processing the aggregated data sequence pair in the data stream based on the set dynamic data window range to generate a detection result aiming at the inquired data; wherein, the detection result comprises whether the inquired data is consistent with the expected data; the data detection method provided by the application achieves the purposes of supporting cross-domain data consistency detection and improving detection efficiency under the condition of cross-domain and cross-database types.

Drawings

Fig. 1 is an alternative architecture diagram of a terminal provided in an embodiment of the present application;

Fig. 2 is a first schematic flowchart of a data detection method according to an embodiment of the present application;

Fig. 3 is a schematic flowchart of a data detection method according to an embodiment of the present application;

Fig. 4 is a schematic diagram of a data flow provided by an embodiment of the present application;

Fig. 5 is a schematic flowchart of a data detection method provided in the embodiment of the present application;

Fig. 6 is a schematic diagram of a time range defined by two time dimensions provided by an embodiment of the present application;

Fig. 7 is a diagram illustrating determining a maximum value within a query interval according to an embodiment of the present application;

Fig. 8 is a schematic diagram of streaming window calculation provided in an embodiment of the present application.

Detailed Description

To make the objectives, technical solutions and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings. The described embodiments should not be considered as limiting the present application, and all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of the present application.

In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict. Unless defined otherwise, all technical and scientific terms used in the examples of this application have the same meaning as commonly understood by one of ordinary skill in the art to which the examples of this application belong. The terminology used in the embodiments of the present application is for the purpose of describing the embodiments of the present application only and is not intended to be limiting of the present application.

An exemplary application of the data detection device provided in the embodiment of the present application is described below, and the data detection device provided in the embodiment of the present application may be implemented as a notebook computer, a tablet computer, a desktop computer, a mobile device (for example, a mobile phone, a portable music player, a personal digital assistant, a dedicated messaging device, a portable game device), an intelligent robot, or any other terminal with an on-screen display function, and may also be implemented as a server. Next, an exemplary application when the data detection apparatus is implemented as a terminal will be explained.

Referring to fig. 1, fig. 1 is a schematic structural diagram of a terminal 100 according to an embodiment of the present application, where the terminal 100 shown in fig. 1 includes: at least one processor 110, at least one network interface 120, a user interface 130, and memory 150. The various components in terminal 100 are coupled together by a bus system 140. It is understood that the bus system 140 is used to enable connected communication between these components. The bus system 140 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 140 in fig. 1.

The processor 110 may be an integrated circuit chip having signal processing capabilities, such as a general-purpose processor, a Digital Signal Processor (DSP), another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, where the general-purpose processor may be a microprocessor or any conventional processor.

The user interface 130 includes one or more output devices 131, including one or more speakers and/or one or more visual displays, that enable the presentation of media content. The user interface 130 also includes one or more input devices 132 including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.

The memory 150 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 150 optionally includes one or more storage devices physically located remotely from processor 110. The memory 150 includes volatile memory or nonvolatile memory, and may also include both volatile and nonvolatile memory. The nonvolatile Memory may be a Read Only Memory (ROM), and the volatile Memory may be a Random Access Memory (RAM). The memory 150 described in embodiments herein is intended to comprise any suitable type of memory. In some embodiments, memory 150 is capable of storing data to support various operations, examples of which include programs, modules, and data structures, or subsets or supersets thereof, as exemplified below.

An operating system 151 including system programs for processing various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and processing hardware-based tasks;

a network communication module 152 for reaching other computing devices via one or more (wired or wireless) network interfaces 120, exemplary network interfaces 120 including Bluetooth, Wireless Fidelity (WiFi), Universal Serial Bus (USB), and the like;

an input processing module 153 for detecting one or more user inputs or interactions from one of the one or more input devices 132 and translating the detected inputs or interactions.

In some embodiments, the apparatus provided in the embodiments of the present application may be implemented in software. Fig. 1 illustrates a data detection apparatus 154 stored in the memory 150; the data detection apparatus 154 may be a data detection apparatus in the terminal 100, may be software in the form of programs and plug-ins, and includes the following software modules: the preprocessing module 1541, the aggregation module 1542, the window setting module 1543, and the detection result matching module 1544. These modules are logical and may therefore be arbitrarily combined or further divided according to the functions implemented. The functions of the respective modules are explained below.

In other embodiments, the apparatus provided in the embodiments of the present application may be implemented in hardware. As an example, it may be a processor in the form of a hardware decoding processor that is programmed to perform the data detection method provided in the embodiments of the present application; for example, the processor in the form of a hardware decoding processor may be one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field-Programmable Gate Arrays (FPGAs), or other electronic components.

The current data consistency detection scheme is further explained here. In a scenario where a transaction link has strong data consistency requirements, the data consistency of the transaction link can generally be ensured through transactions. However, some abnormal scenarios cannot be guarded against by the program, such as application service downtime, manual intervention and pre-repair, Data Migration (DM) data synchronization conflicts, and master-slave synchronization inconsistency. These cause data inconsistency among Data Center Nodes (DCNs) in a production environment, and in turn data conflicts and data splitting in the production environment. Currently, the Percona Toolkit (PT), a set of advanced command-line tools, is commonly used in the industry to detect MySQL master-slave database consistency.

As described above, when multiple database types coexist in a distributed cross-domain scenario, for example when MySQL, TiDB and Oracle all exist in one distributed architecture design, detection under a distributed architecture with multiple database types cannot be supported, because the different database types cannot form a master-slave architecture.

In addition, during detection under the MySQL master-slave architecture, Binlog synchronization must be enabled, the structures of the tables being checked must be consistent, and only one table can be processed at a time. The reason is as follows: with Binlog-based synchronization, master-slave synchronization delay must be taken into account, and such delay is likely when the volume of synchronized data is large. The current data consistency detection scheme therefore has to control the synchronization rate and the amount of synchronized data, so that only a single table can be checked at a time: the table is divided into blocks of rows, and the master and slave node data of each block are checked. This directly reduces efficiency and leaves the scheme coupled to the underlying characteristics of the database.

Therefore, the present application provides a data detection method that realizes data consistency detection and improves detection efficiency across domains and database types.

The data detection method provided by the embodiment of the present application is described below in conjunction with an exemplary application and implementation of the terminal 100 provided by the embodiment of the present application. Referring to fig. 2, fig. 2 is an optional schematic flowchart of the data detection method provided in the embodiment of the present application, which is described in conjunction with the steps shown in fig. 2.

Step S201, data of the distributed application server nodes is queried, and the queried data is formatted to obtain a data model object conforming to a predefined data model.

Applicable scenarios of the data detection method provided by the present application include, but are not limited to, one or a combination of the following: heterogeneous database scenarios in which multiple data sources coexist, and distributed cross-database complex architecture scenarios.

In the embodiment of the application, a unified data model (MapMode) protocol is defined. It is used to uniformly map and encapsulate the data in the data set to be checked for consistency, which implements the formatting process. The parameters involved in defining MapMode include some or all of the following: data node, data type, data entity, data identity information, and data cursor. These parameters are also called attributes, so the corresponding parameter values are also called attribute values. The attribute value of the data node stores the DCN node information of the data segment, such as the DCN region to which the data belongs; the attribute value of the data type stores the type information of the data, such as an order type; the attribute value of the data entity stores the queried entity data, for example { field 1: data 1, field 2: data 2 }; the attribute value of the data identity information stores the primary key information of the queried data; and the attribute value of the data cursor stores the position information of the data cursor.

In some embodiments of the present application, taking a MapMode defined with a data node, a data type, a data entity, data identity information, and a data cursor as an example, the queried data of the distributed application server nodes is assembled into a MapMode data packet, whose parameters are shown in Table 1:

Attribute	Attribute name	Attribute value
dataNode	Data node	DCN region
dataType	Data type	Order type
dataEntity	Data entity	{ field 1: data 1, field 2: data 2 }
dataIdentity	Data identity information	Primary key information
dataVernier	Data cursor	1

TABLE 1 parameters included in MapMode packets
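As an illustration of Table 1, the MapMode packet can be sketched as a small Python data class. The attribute names in the packet (dataNode, dataType, dataEntity, dataIdentity, dataVernier) come from Table 1; the class layout, the snake_case field names, the to_packet helper and the example values are assumptions made for this sketch and are not part of the application.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class MapMode:
    """Unified data model; one instance wraps one queried record."""
    data_node: str               # DCN region the record was read from
    data_type: str               # type of the data, e.g. an order type
    data_entity: Dict[str, Any]  # queried fields, {field: value}
    data_identity: str           # primary key information of the record
    data_vernier: int            # position of the data cursor

    def to_packet(self) -> Dict[str, Any]:
        """Return the packet keyed by the attribute names of Table 1."""
        return {
            "dataNode": self.data_node,
            "dataType": self.data_type,
            "dataEntity": self.data_entity,
            "dataIdentity": self.data_identity,
            "dataVernier": self.data_vernier,
        }


# Placeholder values, chosen only to show the shape of the packet.
packet = MapMode(
    data_node="DCN-1",
    data_type="order",
    data_entity={"field1": "data1", "field2": "data2"},
    data_identity="pk-0001",
    data_vernier=1,
).to_packet()
```

Keeping the packet keys identical to Table 1 means records from different database types end up in one shape before any comparison takes place.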

In one possible data preprocessing scenario, when a client of the terminal queries data of a distributed application server node, it may query the data sequentially and incrementally, segment by segment, using a data cursor (dataVernier). The queried data is formatted, mapped and encapsulated into the MapMode object defined above, and the MapMode object is transmitted using a structured data storage format, for example the structured data serialization method Protocol Buffers (Protobuf). The client may be a data sentry (WatchDog) client integrated and deployed at the terminal. That is, in the present application, a WatchDog client may be integrated and deployed at each distributed application server node, and data detection processing may be built on an open-source framework, for example the Input/Output (IO) thread model of Netty, to improve the concurrency and speed of data transmission.
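The segmented, cursor-driven query described above can be sketched as follows. This is a minimal illustration only: it uses Python's built-in sqlite3 module with an in-memory table as a stand-in for one node's database, and the function name query_segments, the column layout and the segment size are assumptions rather than details taken from the application.

```python
import sqlite3
from typing import Iterator, List, Tuple

Row = Tuple[int, str]


def query_segments(
    conn: sqlite3.Connection, segment_size: int
) -> Iterator[Tuple[int, List[Row]]]:
    """Yield (cursor_position, rows) one segment at a time, in primary-key order."""
    last_id = 0  # highest primary key already read
    vernier = 0  # data cursor: index of the current segment, starting at 1
    while True:
        rows = conn.execute(
            "SELECT id, trans_status FROM order_info "
            "WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, segment_size),
        ).fetchall()
        if not rows:
            return
        vernier += 1
        last_id = rows[-1][0]
        yield vernier, rows


# Hypothetical in-memory table standing in for one node's database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE order_info (id INTEGER PRIMARY KEY, trans_status TEXT)")
conn.executemany(
    "INSERT INTO order_info (id, trans_status) VALUES (?, ?)",
    [(i, "SUCCESS") for i in range(1, 6)],
)
segments = list(query_segments(conn, segment_size=2))
```

Reading by "primary key greater than the last one seen" keeps each segment query bounded, so the sketch never loads a whole table at once.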

Illustratively, after acquisition and processing by the WatchDog client, the data is mapped to MapMode as follows:

dataNode stores the DCN node information of the data source segment;

dataType stores the type information of the data;

dataEntity stores the queried entity data information;

dataIdentity stores the primary key information of the queried data;

dataVernier stores the position information of the data cursor;

the following MapMode objects are obtained by MapMode data mapping after the acquisition and processing of the WatchDog client:

In the embodiment of the application, the client may be an application program running in the terminal, or may be a web application loaded in a web page.

Step S202, the data model objects are distributed to corresponding channels for aggregation, an aggregated data sequence pair is obtained, and the data model objects related to the aggregated data sequence pair are collected into a data stream.

In the embodiment of the application, the collected data set is serialized into the MapMode data model, the data model objects are shunted to corresponding channels and aggregated to obtain aggregated data sequence pairs, and the data model objects associated with the aggregated data sequence pairs are collected into a data stream. This allows the stream computation based on the dynamic data window range to work on an unbounded input data stream, raises the volume of data that can be admitted, enables large data flows to be processed, and improves detection efficiency.

Step S203 sets a dynamic data window range based on the start time of each data model object entering the data stream and the end time of each data model object in the data stream.

Here, the data window is the time span of the window over the data in the data stream.

In the embodiment of the application, the dynamic data window range is a data range determined by at least two time dimensions, namely the start time at which each data model object enters the data stream and the end time of each data model object in the data stream. Defining the data range with two time dimensions avoids the inaccuracy of a range defined by a single time dimension and makes the data range more accurate.
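The idea of bounding the window with two time dimensions can be sketched as below. This section does not give the exact rule for deriving the range, so the sketch assumes one simple rule: the window runs from the earliest entry time to the latest end time of the objects considered. The names StreamedObject, dynamic_window and in_window are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class StreamedObject:
    identity: str
    enter_time: float  # time the object entered the data stream
    end_time: float    # end time of the object in the data stream


def dynamic_window(objects: Iterable[StreamedObject]) -> Tuple[float, float]:
    """Return (window_start, window_end) derived from both time dimensions."""
    objs = list(objects)
    if not objs:
        raise ValueError("at least one object is required")
    window_start = min(o.enter_time for o in objs)  # assumed rule: earliest entry
    window_end = max(o.end_time for o in objs)      # assumed rule: latest end
    return window_start, window_end


def in_window(obj: StreamedObject, window: Tuple[float, float]) -> bool:
    """True when the object lies entirely inside the window."""
    start, end = window
    return start <= obj.enter_time and obj.end_time <= end


window = dynamic_window(
    [StreamedObject("a", 10.0, 12.0), StreamedObject("b", 11.0, 15.0)]
)
```

Because the bounds are recomputed from the objects themselves, the range moves with the data instead of being a fixed interval.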

Step S204, the aggregated data sequence pair in the data stream is processed based on the set dynamic data window range, and a detection result for the queried data is generated.

The detection result comprises whether the queried data is consistent with the expected data.

In the embodiment of the application, the aggregated data sequence pair in the data stream is processed based on the set dynamic data window range, and a detection result indicating whether the queried data is consistent with the expected data is generated. Further, depending on the detection result, event pattern matching (EventMode) can be triggered to send a notification of the data detection result.
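A minimal sketch of this comparison step follows. It assumes that the data of each node inside the window has already been reduced to one comparable value, and that the expected value is supplied by the caller; the function name detect, the notify callback (a stand-in for triggering an event notification) and the node names and values are all hypothetical.

```python
from typing import Callable, Dict, List, Optional


def detect(
    values_by_node: Dict[str, int],
    expected: int,
    notify: Optional[Callable[[List[str]], None]] = None,
) -> bool:
    """Return True when every node's value equals the expected value."""
    inconsistent = sorted(
        node for node, value in values_by_node.items() if value != expected
    )
    if inconsistent and notify is not None:
        notify(inconsistent)  # stand-in for triggering an event notification
    return not inconsistent


alerts: List[List[str]] = []
window_values = {"node-1": 101, "node-2": 101, "node-3": 57}  # placeholder values
consistent = detect(window_values, expected=101, notify=alerts.append)
```

Returning the boolean and reporting the offending nodes separately keeps the consistency check independent of how notifications are delivered.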

According to the data detection method, data of distributed application server nodes are inquired, and the inquired data are formatted, so that a data model object conforming to a predefined data model is obtained; that is to say, the data detection method provided by the application can process data of distributed application server nodes, namely data of different data sources, format the data of the different data sources, and support cross-domain and cross-library data detection; further, the data model objects are distributed to corresponding channels for aggregation to obtain aggregated data sequence pairs, and the data model objects associated with the aggregated data sequence pairs are collected into a data stream; setting a dynamic data window range based on the starting time of each data model object entering the data stream and the tail end time of each data model object in the data stream; processing the aggregated data sequence pair in the data stream based on the set dynamic data window range to generate a detection result aiming at the inquired data; the detection result comprises whether the inquired data is consistent with the expected data or not; the data detection method provided by the application supports cross-domain data consistency detection and improves detection efficiency under the condition of cross-domain and cross-database types.

In some embodiments of the present application, in step S202, the data model objects are shunted to the corresponding channels for aggregation, so as to obtain an aggregated data sequence pair, which may be implemented by the steps shown in fig. 3:

step S2021, obtain a data sequence pair of data model objects.

Wherein, the data sequence pair comprises a data entity characteristic sequence and a data information characteristic sequence.

Here, the data entity feature sequence includes the features of the data entity contained in the queried data after formatting; the data information feature sequence includes the features of, among others, the acquisition time of the queried data and the data node and data table contained after formatting.

Step S2022, determine a queue corresponding to the feature value included in the data entity feature sequence.

In the embodiment of the present application, identical feature values indicate identical data entities. Feature values and queues are in one-to-one correspondence; that is, data with the same feature value is always transmitted through the same queue, which improves the concurrency and speed of data transmission.

Step S2023, using the determined at least one queue, shunt the data model object to the corresponding channel for aggregation, so as to obtain an aggregated data sequence pair.

In the embodiment of the application, the collected data set is serialized into the MapMode data model, and the data then undergoes time-sequence processing and feature sequence pair extraction, which preserves the time order of the data and makes the same data set flow to the same queue channel. A channel selector then polls the data queues to decide which queue channel each data model object is streamed into for aggregation.
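The one-to-one correspondence between feature values and queues can be sketched with a dictionary of queues. This is an illustration only: the function name shunt and the integer feature values are assumptions, and the channel selector that polls the queues is not modeled here.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Iterable, Tuple


def shunt(objects: Iterable[Tuple[int, str]]) -> Dict[int, Deque[str]]:
    """Route (feature_value, payload) items so equal feature values share a queue."""
    channels: Dict[int, Deque[str]] = defaultdict(deque)
    for feature_value, payload in objects:
        channels[feature_value].append(payload)  # arrival order is kept per queue
    return channels


# Placeholder feature values: A, B and D share one value, C has another.
channels = shunt([(7, "A"), (7, "B"), (9, "C"), (7, "D")])
```

Since each queue only ever receives objects with one feature value, objects that should agree are already grouped when aggregation starts.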

In some embodiments of the present application, the step S2021 of obtaining the data sequence pair of the data model object may be implemented by the following steps:

A11, obtaining the data length, the multiplication coefficient, the modulus coefficient, and the encoding standard values to which the characters of the character string contained in the data model object are mapped.

In the embodiment of the application, whether two character strings are equal is determined by comparing the substrings to which the two character strings are mapped, rather than by comparing the two character strings directly, which improves comparison accuracy. Here, the substring to which a character string is mapped is embodied as a characteristic value, and the encoding standard values, the multiplication coefficient, the modulus coefficient and the data length are used to calculate that characteristic value.

A12, determining the characteristic value of the character string based on the multiplication coefficient, the modulus coefficient, the data length and the encoding standard values, and taking the characteristic values of all the character strings contained in the data model object as the data entity characteristic sequence of the data model object.

Here, the data entity characteristic sequence $h_i$ is used to judge differences in data entity characteristics.

In some embodiments, determining the characteristic value of the character string in A12 may be implemented by substituting the multiplication coefficient, the modulus coefficient, the data length and the encoding standard values into a hash function, so that the character string is hash-mapped to its characteristic value. Hash mapping of character strings means mapping different character strings to different numbers using a character string hash function.

In one possible implementation scenario, the characteristic value of the character string can be determined by the hash function shown in formula (1), so as to obtain the data entity characteristic sequence $h_i$.

Calculating $h_i$ in A11 to A12 can be achieved by the following formula (1),

where $m[j]$ is a character of the data entity dataEntity in MapMode; $p^{j}$ is the multiplication coefficient, where $p$ can be defined as a small value; mod is the modulus coefficient and can be defined as a large value; $n$ is the current data length; $i$ and $j$ are positive integers; and $\mathrm{idx}(m[j])$ is the encoding standard value to which the character is mapped, for example its ASCII code value, that is, each character $m[j]$ of dataEntity corresponds to its decimal value in the ASCII table. The ASCII table may be any ASCII code reference table in the related art, which is not specifically limited in the present application.

In one possible data consistency check scenario, referring to fig. 4, assume that the data of four distributed application server nodes A, B, C and D (that is, cross-domain, cross-library DCN node data) has been normalized into MapMode objects:

in the following, A, B, C and D each denote the MapMode object obtained from the data of a different DCN node,

A:{"dataNode":"AA0","dataType":"order","dataEntity":"{'table':'order_info','record':{'user_name':'zhangsan','seal_type':'2','trans_status':'SUCCESS'}}","dataIdentity":"2102240QD022000A96UN7M0LI0CXUDC0","dataVernier":"1"}

B:{"dataNode":"AJ0","dataType":"order","dataEntity":"{'table':'order_info','record':{'user_name':'zhangsan','seal_type':'2','trans_status':'SUCCESS'}}","dataIdentity":"2102240QD022000A96UN7M0LI0CXUDC0","dataVernier":"1"}

C:{"dataNode":"AK0","dataType":"order","dataEntity":"{'table':'order_info','record':{'user_name':'zhangsan','seal_type':'2','trans_status':'FAIL'}}","dataIdentity":"2102240QD022000A96UN7M0LI0CXUDC0","dataVernier":"1"}

D:{"dataNode":"AI0","dataType":"order","dataEntity":"{'table':'order_info','record':{'user_name':'zhangsan','seal_type':'2','trans_status':'SUCCESS'}}","dataIdentity":"2102240QD022000A96UN7M0LI0CXUDC0","dataVernier":"1"}

The data entities of the MapMode objects normalized from nodes A, B and D are consistent, whereas the MapMode object normalized from C, whose dataNode is AK0, has a changed field (trans_status is FAIL rather than SUCCESS). Further, the data entities of A, B, C and D are converted into uniform substrings, namely the characteristic values of the data entities.

Further, each character $m[j]$ in the data entity dataEntity is mapped by $\mathrm{idx}(m[j])$ to its decimal value in the ASCII table. Assuming $p = 3$, $\mathrm{mod} = 81001$ and $h[0] = 0$, the value calculated for each character of the string is accumulated and the modulus is then taken.

Here, taking a string in a as an example, the "{ 'SUCCESS' }" is calculated: a = ((0 × 3) ¹ +123)+(123×3 ² +39)+(1146×3 ³ +83)+(31025×3 ⁴ +85)+(h[j-1]×3 ^j +idx(m[j])))％81001。

Here, taking a string in C as an example, the "{ 'FAIL' }" is calculated: c = ((0 × 3) ¹ +123)+(123×3 ² +39)+(1146×3 ³ +70)+(31012×3 ⁴ +65)+(h[j-1]×3 ^j +idx(m[j])))％81001。

MapMode of ABCD is subjected to h _i () The hash operation yields an index sequence of [ a, b, c, d]Where a = b = d, that is, the data entities of the ABD correspond to the same eigenvalue.
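The feature value calculation above can be sketched in Python. This is a minimal sketch, not the applicant's implementation: it assumes that formula (1) is the recurrence $h[j] = h[j-1] \times p^j + idx(m[j])$ with $h[0] = 0$, as suggested by the intermediate values 123, 1146 and 31025 in the worked example, and that the final $h[n]$ reduced modulo mod is used as the feature value. The function name `entity_feature_value` and the shortened sample entities are hypothetical.

```python
def entity_feature_value(entity: str, p: int = 3, mod: int = 81001) -> int:
    """Feature value of a dataEntity string under the assumed recurrence."""
    h = 0  # h[0] = 0
    for j, ch in enumerate(entity, start=1):
        # h[j] = h[j-1] * p^j + idx(m[j]); ord() gives the ASCII decimal value
        # for ASCII characters. Reducing modulo `mod` at every step leaves the
        # final residue unchanged and keeps the numbers small.
        h = (h * pow(p, j, mod) + ord(ch)) % mod
    return h


# Shortened, hypothetical entities: A and B agree, C differs in one field.
entity_a = "{'trans_status':'SUCCESS'}"
entity_b = "{'trans_status':'SUCCESS'}"
entity_c = "{'trans_status':'FAIL'}"

a = entity_feature_value(entity_a)
b = entity_feature_value(entity_b)
c = entity_feature_value(entity_c)
print(a == b)  # True: identical entities always share a feature value
```

Identical entities always map to the same value, which is what allows A, B and D to be grouped together; distinct entities can in principle collide, so a larger mod lowers that risk.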

A13, acquiring a source database region, a data source data table and a data query time of the data model object, and taking the source database region, the data source data table and the data query time as the data information characteristic sequence of the data model object.

Wherein $d_c$ is the source database DCN region, $t_a$ is the data source data table (Table), and $w_t$ is the data query time, also called the data acquisition time WatchTime ($w_t$).

Here, the data information characteristic sequence $h_k$ is obtained by calculating $h(d_c + t_a + w_t)$.

Calculating $h_k$ from $h(d_c + t_a + w_t)$ can be implemented as follows: $d_c + t_a + w_t$ is treated as a character string, and $h_k$ is calculated by the above formula (1).

A14, forming a data sequence pair of the data model object based on the data entity characteristic sequence and the data information characteristic sequence.

Here, the data sequence pair $w_h$ is obtained from $h_i$ and $h_k$: $w_h = \langle h_i, h_k \rangle$.

Still taking as an example the MapMode objects A, B, C and D obtained from the data of different DCN nodes, the set of data sequence pairs $w_h$ calculated for A, B, C and D is as follows: [<a,a′>,<b,b′>,<c,c′>,<d,d′>];

where a, b and d are equal, assumed to be 1001, and c is 1002.

The final data sequence pairs $w_h$ are as follows:

<a{1001},a′{v:<d_c:AA0,t_a:order,w_t:20220401080100>}>,

<b{1001},b′{v:<d_c:AJ0,t_a:order,w_t:20220401080200>}>,

<c{1002},c′{v:<d_c:AK0,t_a:order,w_t:20220401080300>}>,

<d{1001},d′{v:<d_c:AI0,t_a:order,w_t:20220401080400>}>.

Further, as shown in Fig. 4, the client uses a selector (Selector) to select a queue (Queue) according to the data entity characteristic sequence $h_i$: since the feature values of A, B and D satisfy a = b = d, A, B and D select the same queue Q1, and C selects queue Q2. The data model objects are then shunted, in queue form, to the corresponding channels and sent to the data aggregation center (AggCenter).
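The queue selection can be illustrated with a short Python sketch. It assumes that one queue is kept per feature value and uses in-memory lists in place of real message queues; the function name `route_by_feature_value` and the dictionary layout of $h_k$ are hypothetical.

```python
from collections import defaultdict


def route_by_feature_value(pairs):
    """Group data sequence pairs <h_i, h_k> into one queue per feature value h_i."""
    queues = defaultdict(list)
    for h_i, h_k in pairs:
        queues[h_i].append((h_i, h_k))
    return dict(queues)


# Data sequence pairs of A, B, C and D from the example above.
pairs = [
    (1001, {"d_c": "AA0", "t_a": "order", "w_t": "20220401080100"}),
    (1001, {"d_c": "AJ0", "t_a": "order", "w_t": "20220401080200"}),
    (1002, {"d_c": "AK0", "t_a": "order", "w_t": "20220401080300"}),
    (1001, {"d_c": "AI0", "t_a": "order", "w_t": "20220401080400"}),
]

queues = route_by_feature_value(pairs)
q1_nodes = [h_k["d_c"] for _, h_k in queues[1001]]  # plays the role of Q1
q2_nodes = [h_k["d_c"] for _, h_k in queues[1002]]  # plays the role of Q2
```

Keying the queues by feature value means that objects with identical data entities always reach the same aggregation channel.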

In some embodiments of the present application, in step S2023, the data model object is shunted to the corresponding channel for aggregation by the determined at least one queue, so as to obtain an aggregated data sequence pair, which may be implemented by the following steps:

B21, creating a new temporary measurement block for the same data entity feature sequence, and marking the temporary measurement block with the aggregation time sequence and the same block color.

In the embodiment of the application, the AggCenter completes the aggregation processing of the reported MapMode data. The AggCenter uses an aggregator to process the feature-value dimension of data coming from the same queue and creates a new temporary measurement block for the same data entity feature sequence. The new block is dyed with a single block color, and the dyed block (AggBlock) is an independent granular unit. As shown in Fig. 4, AggBlocks of different colors in the aggregation center are represented by different patterns, and four example patterns are shown in Fig. 4. The block is responsible for executing state operations uniformly, which improves the processing efficiency of the source data. The block is also marked with a tag and stamped with the aggregation time AggTime ($a_t$), which is used in determining the dual time dimension range.

At this time, the AggBlock aggregated data sequence pair $a_h$ of A, B and D is $a_{abd}$:

[<a{1001},a′{v:{<d_c:AA0,t_a:order,w_t:20220401080100>,<tag:red,a_t:20220401080400>}}>,

<b{1001},b′{v:{<d_c:AJ0,t_a:order,w_t:20220401080200>,<tag:red,a_t:20220401080400>}}>,

<d{1001},d′{v:{<d_c:AI0,t_a:order,w_t:20220401080400>,<tag:red,a_t:20220401080400>}}>]

The aggregated data sequence pair $a_h$ of C is $a_c$:

[<c{1002},c′{v:{<d_c:AK0,t_a:order,w_t:20220401080300>,<tag:blue,a_t:20220401080500>}}>]

After aggregation, the data is dyed per AggBlock and internal data state flips are handled within the block, which improves data processing efficiency. Meanwhile, the data is collected into a data stream and continuously transmitted into it, forming an unbounded data set.

B22, extracting the query time sequence, the source database area and the data source data table from the data entity characteristic sequence of the data model object.

B23, obtaining the aggregated data sequence pair based on the characteristic value, the source database area, the data source data table, the query time sequence, the block color and the aggregation time sequence.

In the embodiment of the application, the AggTimes are connected in series to form a time axis. The acquisition time sequence $w_t$ is extracted from the data sequence pairs $w_h$ of A, B, C and D, and the aggregation time sequence $a_t$ is extracted from $a_h$, wherein

$w_t$ = [20220401080100, 20220401080200, 20220401080300, 20220401080400];

$a_t$ = [20220401080400, 20220401080400, 20220401080500, 20220401080400].
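Steps B21 to B23 can be sketched as follows. This is an illustrative sketch only: the `AggBlock` class layout, the `aggregate` helper and the use of plain dictionaries are assumptions, and the colors and aggregation times are taken from the example above.

```python
from dataclasses import dataclass, field


@dataclass
class AggBlock:
    """Temporary measurement block for one feature value (hypothetical layout)."""
    feature_value: int
    color: str
    agg_time: str
    members: list = field(default_factory=list)


def aggregate(queue, color, agg_time):
    """Dye every pair of one queue with the same color and aggregation time."""
    feature_value = queue[0][0]
    block = AggBlock(feature_value, color, agg_time)
    for h_i, h_k in queue:
        block.members.append({**h_k, "tag": color, "a_t": agg_time})
    return block


q1 = [
    (1001, {"d_c": "AA0", "t_a": "order", "w_t": "20220401080100"}),
    (1001, {"d_c": "AJ0", "t_a": "order", "w_t": "20220401080200"}),
    (1001, {"d_c": "AI0", "t_a": "order", "w_t": "20220401080400"}),
]
q2 = [(1002, {"d_c": "AK0", "t_a": "order", "w_t": "20220401080300"})]

a_abd = aggregate(q1, "red", "20220401080400")
a_c = aggregate(q2, "blue", "20220401080500")

# Extract both time sequences, ordered by acquisition time (A, B, C, D).
members = sorted(a_abd.members + a_c.members, key=lambda m: m["w_t"])
w_t = [m["w_t"] for m in members]
a_t = [m["a_t"] for m in members]
```

Sorting by acquisition time reproduces the A, B, C, D order of the two sequences listed above.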

In some embodiments of the present application, the start time is the data query time of each data model object, the end time is the aggregation time of each data model object, and step S203, setting a dynamic data window range based on the start time of each data model object entering the data stream and the end time of each data model object in the data stream, may be implemented by the steps shown in Fig. 5:

Step S2031, obtaining a set predetermined window and a delay window.

Wherein the delay window is the maximum delay time allowed for data in the data stream.

In the embodiments of the present application, the predetermined window WindowTime ($w_{it}$) and the delay window DelayTime ($d_{et}$) can be flexibly set according to actual requirements, where $w_{it}$ guarantees the reference range data and $d_{et}$ guarantees a window for data that arrives late.

Illustratively, WindowTime ($w_{it}$) is set to 1 minute, and DelayTime ($d_{et}$), which indicates the maximum delay time allowed by the window, is also set to 1 minute.

Step S2032, taking the time range covered from the minimum time node to the maximum time node among the data query times and the aggregation times as a dynamic adjustment window.

In the embodiment of the present application, the dynamic window time DynamicTime ($d_{yt}$) is the dynamic window range, and $d_{yt}$ is bounded by the dual time dimension $[w_t, a_t]$. The data query time, i.e., the data acquisition time $w_t$, records the start time at which each piece of data enters the data stream, and the aggregation time $a_t$ serves as the end time of each piece of data in the data stream. Determining $d_{yt}$ from two dimensions avoids the inaccurate data range of a single time dimension, so the dual time dimension limit is more accurate.

Wherein the minimum time node is $d_{min}$ and the maximum time node is $d_{max}$, so that $d_{yt} = len[d_{min}, d_{max}]$.

Here, based on the acquisition time sequence $w_t$ and the aggregation time sequence $a_t$ determined above, as shown in Fig. 6, the span from the minimum time node to the maximum time node on the intersecting time axes can be regarded as the dynamic window range, that is, the time range defined by the dual time dimension. The acquisition time sequence $w_t$ and the aggregation time sequence $a_t$ are combined into the dynamic window time sequence group $d_t = [t1, t2, t3, t4, t1′, t2′, t3′, t4′]$, and the union of the sequences is taken, giving $d_{yt} = len[d_{min}, d_{max}]$.

Step S2033, setting a dynamic data window range based on the predetermined window, the delay window, and the dynamic adjustment window.

In the embodiment of the present application, the dynamic data window range WindowScope ($w_s$) is determined based on the predetermined window, the delay window and the dynamic adjustment window.

Exemplarily, $w_s$ can be determined by the following calculation formula (2):

$w_s = w_{it} + d_{yt} + d_{et}$    Formula (2)

In the embodiment of the application, setting $w_s$ dynamically from multiple dimensions improves the accuracy with which the streaming data range is framed.

In some embodiments of the present application, before the time range covered by the minimum time node to the maximum time node in the data query time and the aggregation time is used as the dynamic adjustment window in step S2032, the minimum time node to the maximum time node may be determined through the following steps:

and retrieving the minimum time node and the maximum time node in the data query time and the aggregation time based on the sparse table.

In the embodiment of the application, once the acquisition time sequence $w_t$ and the aggregation time sequence $a_t$ have been obtained, they are combined into the dynamic window time sequence group $d_t = [t1, t2, t3, t4, t1′, t2′, t3′, t4′]$, and in the process of taking the union of the sequences, the minimum time node $d_{min}$ and the maximum time node $d_{max}$ among the data query times and the aggregation times can be retrieved based on the sparse table. Here, by finding the extreme values in the dynamic window time sequence with the sparse table doubling method, the maximum length of the dual time dimension range can be determined.

In one achievable extreme-value determination scenario, an array $t$ is built in the preprocessing stage. In $t[i][j]$, $i$ denotes the left endpoint and $j$ denotes a length of $2^j$; that is, $t[i][j]$ is the maximum of the $2^j$ consecutive numbers starting at $d_t[i]$. Because the number of elements is $2^j$, the range can be split in the middle into two parts of $2^{j-1}$ elements each. Thus $t[i][j]$ represents the maximum value in the interval $[i, i + 2^j - 1]$.

Here, the dynamic window time sequence $d_t$ determined above is used for illustration: $d_t$ = [20220401080100, 20220401080200, 20220401080300, 20220401080400, 20220401080400, 20220401080400, 20220401080500, 20220401080400]

Wherein $t[1][0]$ represents the maximum value of the $2^0 = 1$ number starting from the 1st number, which is the first value 20220401080100.

$t[1][1]$ represents the maximum value of the $2^1 = 2$ numbers starting from the 1st number: $t[1][1] = \max(20220401080100, 20220401080200) = 20220401080200$.

$t[1][2]$ represents the maximum value of the $2^2 = 4$ numbers starting from the 1st number: $t[1][2] = \max(20220401080100, 20220401080200, 20220401080300, 20220401080400) = 20220401080400$.

……

$t[i][0]$ represents the maximum value of 1 consecutive point starting from $i$, i.e., the maximum value in $[i, i]$;

$t[i][1]$ represents the maximum value of 2 consecutive points starting from $i$, i.e., the maximum value in $[i, i+1]$;

$t[i][2]$ represents the maximum value of 4 consecutive points starting from $i$, i.e., the maximum value in $[i, i+1, i+2, i+3]$;

$t[i][3]$ represents the maximum value of 8 consecutive points starting from $i$, i.e., the maximum value in $[i, i+1, i+2, i+3, \dots, i+7]$.

The state transition equation is expressed as the following calculation formula (3):

$t[i][j] = \max(t[i][j-1], t[i + 2^{j-1}][j-1])$    Formula (3)

Each $t[i][j]$ obtained in the above preprocessing is the extreme value of an interval of length $2^j$. Referring to Fig. 7, assume that the interval to be queried is $[l, r]$. Two sub-intervals of equal length are sought whose union contains the whole query interval. To ensure that the two sub-intervals can cover the whole query interval, the length of a single sub-interval must be no less than half the length of the query interval. In addition, the sub-interval length is a power of 2: a single sub-interval need not cover the query interval, but two such sub-intervals together do.

The interval maximum is queried as follows: the query interval is $[l, r]$ and its length is $r - l + 1$. With $k$ satisfying $2^k \le r - l + 1 < 2^{k+1}$, i.e., $k = \lfloor \log_2(r - l + 1) \rfloor$, the maximum time node $d_{max}$ is determined by formula (4):

$d_{max} = \max(t[l][k], t[r - 2^k + 1][k])$    Formula (4)

For example, for $d_t$ = [20220401080100, 20220401080200, 20220401080300, 20220401080400, 20220401080400, 20220401080400, 20220401080500, 20220401080400], taking the maximum value in the interval $[1, 8]$ gives $k = \log_2(8 - 1 + 1) = 3$, so $\max(t[1][3], t[8 - 2^3 + 1][3])$ yields the maximum value 20220401080500.

Similarly, applying formula (4) with min() in place of max() yields the minimum value $d_{min}$ = 20220401080100, giving $d_{yt} = len[20220401080100, 20220401080500]$. The dynamic adjustment window $d_{yt}$ of the time sequences of A, B, C and D is thus finally calculated to have a length of 4.
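The sparse table preprocessing of formula (3) and the query of formula (4) can be sketched in Python as follows. This is a minimal sketch under two stated deviations from the text: indices are 0-based, and the table is stored as `table[j][i]` rather than $t[i][j]$. The function names are hypothetical, and the minimum is obtained by building a second table with `min`.

```python
def build_sparse_table(values, combine=max):
    """table[j][i] holds the extreme value of the 2**j elements starting at i."""
    n = len(values)
    table = [list(values)]  # level 0: intervals of length 1
    j = 1
    while (1 << j) <= n:
        prev = table[j - 1]
        half = 1 << (j - 1)
        # Formula (3): merge two adjacent intervals of length 2**(j-1).
        table.append([combine(prev[i], prev[i + half])
                      for i in range(n - (1 << j) + 1)])
        j += 1
    return table


def query(table, l, r, combine=max):
    """Extreme value on the 0-based inclusive interval [l, r], as in formula (4)."""
    k = (r - l + 1).bit_length() - 1  # floor(log2(r - l + 1))
    return combine(table[k][l], table[k][r - (1 << k) + 1])


d_t = [20220401080100, 20220401080200, 20220401080300, 20220401080400,
       20220401080400, 20220401080400, 20220401080500, 20220401080400]

max_table = build_sparse_table(d_t, max)
min_table = build_sparse_table(d_t, min)
d_max = query(max_table, 0, len(d_t) - 1, max)
d_min = query(min_table, 0, len(d_t) - 1, min)
```

Preprocessing takes O(n log n) time, after which every interval query needs only two table lookups, which suits repeated window evaluations over a stream.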

Thus, the size of the dynamic data window range $w_s$ is: 1 + 4 + 1 = 6, i.e., 6 minutes.
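The window calculation of formula (2) can be checked with a short Python sketch. It assumes that the timestamps follow the yyyyMMddHHmmss layout seen in the example, and it uses the built-in `max` and `min` instead of a sparse table for brevity; the function name `window_scope` is hypothetical.

```python
from datetime import datetime, timedelta

FMT = "%Y%m%d%H%M%S"  # timestamp layout used in the example, e.g. 20220401080100


def window_scope(w_t, a_t, window_time, delay_time):
    """Return (d_yt, w_s) with w_s = w_it + d_yt + d_et, as in formula (2)."""
    stamps = [datetime.strptime(s, FMT) for s in (*w_t, *a_t)]
    d_yt = max(stamps) - min(stamps)  # span from d_min to d_max
    return d_yt, window_time + d_yt + delay_time


w_t = ["20220401080100", "20220401080200", "20220401080300", "20220401080400"]
a_t = ["20220401080400", "20220401080400", "20220401080500", "20220401080400"]

d_yt, w_s = window_scope(w_t, a_t,
                         window_time=timedelta(minutes=1),
                         delay_time=timedelta(minutes=1))
print(d_yt, w_s)  # 0:04:00 0:06:00
```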

In some embodiments of the present application, the expected data is a configuration threshold of the distributed application server node, and step S204 processes the aggregated data sequence pair in the data stream based on the set dynamic data window range to generate a detection result for the queried data, which may be implemented by the following steps:

First, the aggregated data sequence pairs within the dynamic data window range are operated on to screen out each color block and determine the data threshold within each color block.

Second, a characteristic quantity value of each color block is determined in the aggregated data sequence pairs within the dynamic data window range.

Wherein the characteristic quantity value indicates the number of data items having the same feature value.

Finally, a detection result is generated based on the data threshold value in each color block, the characteristic quantity value of each color block and the configuration threshold value.

In an achievable streaming window setting scenario, as shown in Fig. 8, the data task is processed by a trigger (WindowTrigger): the aggregated data sequence pairs $a_h$ within the dynamic data window range $w_s$ are operated on, each dyed block is screened out, and the data threshold within the block is calculated. Here, since the block data of $a_{abd}$ and $a_c$ have already been identified as common feature data, it is only necessary to count the characteristic quantity value of each block in the aggregated data sequence pairs $a_h$ and, according to that value, to dynamically query the associated DCN node quantity value. It is then judged whether the expected value is met, and according to the different expected results the event mode (EventMode) is triggered to notify the data detection result.

In one achievable data detection scenario, for A, B and D in the $a_{abd}$ sequence described above, the data threshold in the block is 3, the corresponding table type $t_a$ is order, and the DCN node list is [AA0, AJ0, AI0]. For C in the $a_c$ sequence, the data threshold in the block is 1, the corresponding table type $t_a$ is order, and the DCN node list is [AK0].

Further, given the obtained DCN node configuration threshold of 4, the data threshold 1 of C in the $a_c$ sequence is smaller than the data threshold 3 of A, B and D in the $a_{abd}$ sequence, so it is determined that a data difference occurs at node C. Therefore, in the data set of A, B, C and D, the data of A, B and D at the DCN nodes AA0, AJ0 and AI0 can be considered to share the same feature, with no data difference, whereas the data of C at node AK0 has changed, which triggers event mode processing.
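The comparison described above can be sketched in Python. This is an illustrative sketch rather than the applicant's rule: it assumes that the data is consistent only when a single block contains as many entries as the configured number of DCN nodes, and that the largest block serves as the reference against which deviating nodes are reported. The function name `detect` and the result layout are hypothetical.

```python
from collections import defaultdict


def detect(members, config_threshold):
    """Compare per-color block counts with the configured node count."""
    nodes_by_color = defaultdict(list)
    for m in members:
        nodes_by_color[m["tag"]].append(m["d_c"])
    counts = {color: len(nodes) for color, nodes in nodes_by_color.items()}
    reference = max(counts, key=counts.get)  # largest block (assumption)
    consistent = counts[reference] == config_threshold
    deviating = [n for color, nodes in nodes_by_color.items()
                 if color != reference for n in nodes]
    return {"counts": counts, "consistent": consistent,
            "deviating_nodes": deviating}


# Aggregated entries of A, B, C and D inside the dynamic data window.
members = [
    {"d_c": "AA0", "tag": "red"},
    {"d_c": "AJ0", "tag": "red"},
    {"d_c": "AK0", "tag": "blue"},
    {"d_c": "AI0", "tag": "red"},
]

result = detect(members, config_threshold=4)
if not result["consistent"]:
    # Stand-in for triggering EventMode processing.
    print("EventMode triggered for nodes:", result["deviating_nodes"])
```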

The embodiment of the application uses a real-time data sentinel to detect the condition of data sets and can process data sets from multiple data sources at the same time. Data aggregation guarantees the timeliness of streaming computation. The dynamic window calculation range is determined from the dual time dimension to improve data accuracy. If the data sequence pairs calculated within the streaming window are inconsistent with expectations, the trigger event mechanism is notified, so that abnormal online data is discovered, the scope of business impact is reduced, and the fault tolerance of the system is improved. The method adopts streaming computation: large-scale streaming data is analyzed in real time while it is constantly changing, potentially useful information is aggregated, and the results can also be sent to the next computing node.

Continuing with the exemplary structure of the data detection device 154 provided by the embodiments of the present application implemented as a software module, in some embodiments, as shown in fig. 1, the software module stored in the data detection device 154 of the memory 150 may be a data detection device in the terminal 100, including:

the preprocessing module 1541 is configured to query data of the distributed application server node, and format the queried data to obtain a data model object that conforms to a predefined data model;

the aggregation module 1542 is configured to distribute the data model objects to corresponding channels for aggregation, to obtain an aggregated data sequence pair, and collect each data model object associated with the aggregated data sequence pair into a data stream;

a window setting module 1543, configured to set a dynamic data window range based on a start time of each data model object entering the data stream and a tail time of each data model object in the data stream;

a detection result matching module 1544, configured to process the aggregated data sequence pair in the data stream based on the set dynamic data window range, and generate a detection result for the queried data; wherein, the detection result comprises whether the inquired data is consistent with the expected data.

In some embodiments of the present application, the preprocessing module 1541 is configured to obtain a data sequence pair of a data model object; wherein, the data sequence pair comprises a data entity characteristic sequence and a data information characteristic sequence;

a preprocessing module 1541, configured to determine a queue corresponding to a feature value included in a data entity feature sequence;

and an aggregation module 1542, configured to distribute the data model object to the corresponding channel for aggregation by using the determined at least one queue, so as to obtain an aggregated data sequence pair.

In some embodiments of the present application, the preprocessing module 1541 is configured to obtain a data length, a multiplication coefficient, a modulus coefficient, and a coding standard value mapped by a character string included in the data model object; determine the characteristic value of the character string based on the multiplication coefficient, the modulus coefficient, the data length and the coding standard value mapped by the character string, and take the characteristic values of all the character strings contained in the data model object as the data entity characteristic sequence of the data model object; acquire a source database region, a data source data table and a data query time of the data model object, and take the source database region, the data source data table and the data query time as the data information characteristic sequence of the data model object; and form a data sequence pair of the data model object based on the data entity characteristic sequence and the data information characteristic sequence.

In some embodiments of the present application, the same feature value corresponds to a queue, and the aggregating module 1542 is configured to create a new temporary metric block for the same data entity feature sequence, and mark the aggregation time sequence and the same block color for the temporary metric block;

the aggregation module 1542 is configured to extract a query time sequence, a source database region, and a data source data table from the data entity feature sequence of the data model object;

the aggregation module 1542 is configured to obtain an aggregated data sequence pair based on the feature value, the source database region, the data source data table, the query time sequence, the block color, and the aggregation time sequence.

In some embodiments of the present application, the start time is the data query time of each data model object, the end time is the aggregation time of each data model object, and the detection result matching module 1544 is configured to obtain a predetermined window and a delay window; take the time range covered from the minimum time node to the maximum time node among the data query times and the aggregation times as a dynamic adjustment window; and set a dynamic data window range based on the predetermined window, the delay window, and the dynamic adjustment window.

In some embodiments of the present application, the detection result matching module 1544 is configured to retrieve a minimum time node and a maximum time node from the data query time and the aggregation time based on the sparse table.

In some embodiments of the present application, the expected data is a configuration threshold of the distributed application server node, and the detection result matching module 1544 is configured to operate on the aggregated data sequence pairs within the set dynamic data window range to screen out each color block and determine a data threshold in each color block; determine a characteristic quantity value of each color block in the aggregated data sequence pairs within the dynamic data window range, wherein the characteristic quantity value represents the number of data items having the same characteristic value; and generate a detection result based on the data threshold in each color block, the characteristic quantity value of each color block and the configuration threshold.

The data detection device provided by the application queries data of distributed application server nodes and formats the queried data to obtain a data model object which accords with a predefined data model; that is to say, the data detection method provided by the application can process data of distributed application server nodes, namely data of different data sources, format the data of the different data sources, and support cross-domain and cross-library data detection; further, the data model objects are distributed to corresponding channels for aggregation to obtain aggregated data sequence pairs, and the data model objects associated with the aggregated data sequence pairs are collected into a data stream; setting a dynamic data window range based on the starting time of each data model object entering the data stream and the tail end time of each data model object in the data stream; processing the aggregated data sequence pair in the data stream based on the set dynamic data window range to generate a detection result aiming at the inquired data; wherein, the detection result comprises whether the inquired data is consistent with the expected data; the data detection method provided by the application achieves the purposes of supporting cross-domain data consistency detection and improving detection efficiency under the condition of cross-domain and cross-database types.

It should be noted that the description of the apparatus in the embodiment of the present application is similar to the description of the method embodiment, and has similar beneficial effects to the method embodiment, and therefore, the description is not repeated. For technical details not disclosed in the embodiments of the apparatus, reference is made to the description of the embodiments of the method of the present application for understanding.

Embodiments of the present application provide a storage medium having stored therein executable instructions, which when executed by a processor, will cause the processor to perform a method provided by embodiments of the present application, for example, the method as shown in fig. 2.

The computer-readable storage medium provided by the application queries data of distributed application server nodes and formats the queried data to obtain a data model object conforming to a predefined data model; that is to say, the data detection method provided by the application can process data of distributed application server nodes, namely data of different data sources, format the data of the different data sources, and support cross-domain and cross-library data detection; further, the data model objects are distributed to corresponding channels for aggregation to obtain aggregated data sequence pairs, and the data model objects associated with the aggregated data sequence pairs are collected into a data stream; setting a dynamic data window range based on the starting time of each data model object entering the data stream and the tail end time of each data model object in the data stream; processing the aggregated data sequence pair in the data stream based on the set dynamic data window range to generate a detection result aiming at the inquired data; the detection result comprises whether the inquired data is consistent with the expected data or not; the data detection method provided by the application achieves the purposes of supporting cross-domain data consistency detection and improving detection efficiency under the condition of cross-domain and cross-database types.

In some embodiments, the storage medium may be a computer-readable storage medium, such as a Ferroelectric Random Access Memory (FRAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a flash memory, a magnetic surface memory, an optical disc, or a Compact Disc Read-Only Memory (CD-ROM), among other memories; or may be various devices including one or any combination of the above memories.

In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.

By way of example, executable instructions may, but need not, correspond to files in a file system, and may be stored in a portion of a file that holds other programs or data, such as in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code). By way of example, executable instructions may be deployed to be executed on one computing device, or on multiple computing devices located at one site or distributed across multiple sites and interconnected by a communication network.

The above description is only an example of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims

1. A method of data detection, the method comprising:

inquiring data of the distributed application server nodes, and formatting the inquired data to obtain a data model object which accords with a predefined data model;

setting a dynamic data window range based on a start time of each data model object entering the data stream and an end time of each data model object in the data stream;

2. The method according to claim 1, wherein the splitting the data model objects into corresponding channels for aggregation to obtain an aggregated data sequence pair includes:

obtaining a data sequence pair of the data model object; wherein the data sequence pair comprises a data entity characteristic sequence and a data information characteristic sequence;

determining a queue corresponding to a characteristic value contained in the data entity characteristic sequence;

and shunting the data model objects to corresponding channels for aggregation by using the determined at least one queue to obtain the aggregated data sequence pair.

3. The method of claim 2, wherein obtaining the pair of data sequences of the data model object comprises:

obtaining the data length, the multiplication coefficient, the modulus coefficient and the coding standard value mapped by the character string contained in the data model object;

determining the characteristic value of the character string based on the multiplication coefficient, the modulus coefficient, the data length and the coding standard value mapped by the character string, and taking the characteristic values of all the character strings contained in the data model object as the data entity characteristic sequence of the data model object;

obtaining a source database region, a data source data table and data query time of the data model object, and taking the source database region, the data source data table and the data query time as the data information characteristic sequence of the data model object;

and forming a data sequence pair of the data model object based on the data entity characteristic sequence and the data information characteristic sequence.

4. The method according to claim 2, wherein the same feature value corresponds to a queue, and the splitting of the data model object into corresponding channels for aggregation with the determined at least one queue to obtain the aggregated data sequence pair includes:

creating a new temporary measurement block for the same data entity feature sequence, and marking the temporary measurement block with an aggregation time sequence and the same block color;

extracting a query time sequence, a source database region and a data source data table from the data entity characteristic sequence of the data model object;

and obtaining the aggregation data sequence pair based on the characteristic value, the source database area, the data source data table, the query time sequence, the block color and the aggregation time sequence.

5. The method of any of claims 1 to 4, wherein the starting time is a data query time of the respective data model objects, the ending time is an aggregation time of the respective data model objects, and the setting of the dynamic data window range based on the starting time of the respective data model objects into the data stream and the ending time of the respective data model objects in the data stream comprises:

obtaining a predetermined window and a delay window;

taking the time range covered by the minimum time node to the maximum time node in the data query time and the aggregation time as a dynamic adjustment window;

setting the dynamic data window range based on the predetermined window, the delay window, and the dynamic adjustment window.

6. The method of claim 5, wherein before the time range from the minimum time node to the maximum time node included in the data query time and the aggregation time is used as a dynamic adjustment window, the method further comprises:

retrieving the minimum time node and the maximum time node of the data query time and the aggregation time based on a sparse table.

7. The method according to any one of claims 1 to 4, wherein the expected data is a configuration threshold of the distributed application server node, and the processing the aggregated data sequence pair in the data stream based on the set dynamic data window range to generate a detection result for the queried data comprises:

computing the aggregated data sequence pairs within the dynamic data window range to screen out each color block and determine a data threshold value within each color block;

determining a characteristic quantity value of each color block in an aggregation data sequence within the dynamic data window; wherein the characteristic quantity value characterizes the number of data of the same characteristic value;

and generating the detection result based on the data threshold value in each color block, the characteristic quantity value of each color block and the configuration threshold value.

8. A data detection apparatus, characterized in that the apparatus comprises:

the aggregation module is used for shunting the data model objects to corresponding channels for aggregation to obtain an aggregated data sequence pair, and collecting each data model object associated with the aggregated data sequence pair into a data stream;

9. A data detection apparatus, comprising:

a memory for storing executable instructions; a processor for implementing the method of any one of claims 1 to 7 when executing executable instructions stored in the memory.

10. A computer-readable storage medium having stored thereon executable instructions for causing a processor, when executed, to implement the method of any one of claims 1 to 7.