CN117667908A - Data processing method and device and electronic equipment - Google Patents

Data processing method and device and electronic equipment

Info

Publication number
CN117667908A
Authority
CN
China
Prior art keywords
data
detection
data stream
quality
rule
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311640774.6A
Other languages
Chinese (zh)
Inventor
阮宜龙
罗杰
张云龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Telecom Corp Ltd
Original Assignee
China Telecom Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Telecom Corp Ltd filed Critical China Telecom Corp Ltd
Priority to CN202311640774.6A priority Critical patent/CN117667908A/en
Publication of CN117667908A publication Critical patent/CN117667908A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a data processing method and device and electronic equipment. The method comprises the following steps: acquiring data in an input data stream, wherein the data is data required for executing a task; performing quality detection on the data in the input data stream according to a preset detection rule to generate a quality index, wherein the quality index is used for representing the qualification degree of the data; and repairing the data with the quality index lower than a preset threshold value to obtain an output data stream, and storing the data in the output data stream into a storage medium. The method and the device solve the technical problem in the related art that real-time tasks are interrupted by data quality problems because the data is monitored only after being written into the storage medium.

Description

Data processing method and device and electronic equipment
Technical Field
The present invention relates to the field of data management, and in particular, to a data processing method, device and electronic equipment.
Background
Data quality can be monitored during real-time computation, for example by configuring the monitoring to run as a flink real-time computation job. However, real-time quality monitoring in the related art is mostly an after-the-fact strategy: monitoring is performed only after the data has been written into the sink, that is, data quality is checked only after the computation is completed. As a result, data quality problems cannot be discovered in time during the computation, and in the process of data circulation there is no well-defined configuration rule or control strategy for handling data with quality problems.
In view of the above problems, no effective solution has been proposed at present.
Disclosure of Invention
The embodiments of the present application provide a data processing method, a data processing device and electronic equipment, which at least solve the technical problem in the related art that real-time tasks are interrupted by data quality problems because data is monitored only after being written into a storage medium.
According to an aspect of the embodiments of the present application, a data processing method is provided, including: acquiring data in an input data stream, wherein the data is data required for executing a task; performing quality detection on the data in the input data stream according to a preset detection rule to generate a quality index, wherein the quality index is used for representing the qualification degree of the data; and repairing the data with the quality index lower than a preset threshold value to obtain an output data stream, and storing the data in the output data stream into a storage medium.
Optionally, before acquiring the data in the input data stream, the method further includes: determining a rule operator from a quality rule base in response to a configuration operation of a target object, and initializing the rule operator, wherein the rule operator includes a detection operator and a control operator; and combining and configuring the rule operators according to the preset detection rules to generate a topological graph, wherein the topological graph is used for representing the detection flow of the preset detection rules, and the topological graph is a directed acyclic graph.
Optionally, one detection operator is connected with another detection operator in the topological graph, and the control operator and the connected detection operator have the same detection object.
Optionally, performing quality detection on the data in the input data stream according to the preset detection rule includes: acquiring an identifier carried in the data that indicates whether the data is in an independent mode; when the identifier indicates that the data is in the independent mode, copying the data in the input data stream and performing quality detection on the copied data according to the preset detection rule; and when the identifier indicates that the data is in a non-independent mode, performing quality detection on the data in the input data stream directly according to the preset detection rule.
Optionally, after generating the quality index, the method further includes: writing the quality index into a message middleware, and, when the quality index is lower than the preset threshold value, performing splitting processing on the data in the input data stream according to a splitting strategy; and obtaining the detected data stream from the data on which the splitting processing has been performed.
Optionally, after obtaining the detected data stream, the method further includes: when the detected data stream contains an abnormal stream, executing a repair operation on the abnormal data in the abnormal stream according to a repair strategy to obtain repaired data; and when the detected data stream does not contain an abnormal stream, not executing the repair operation, and determining the data on which the repair operation has not been executed in the detected data stream as unrepaired data.
Optionally, the method further includes: merging the repaired data and the unrepaired data to obtain the output data stream.
According to another aspect of the embodiments of the present application, there is also provided a data processing apparatus, including: an acquisition module, configured to acquire data in an input data stream, wherein the data is data required for executing a task; a detection module, configured to perform quality detection on the data in the input data stream according to a preset detection rule and generate a quality index, wherein the quality index is used for representing the qualification degree of the data; and a repair module, configured to repair the data with the quality index lower than a preset threshold value to obtain an output data stream, and to store the data in the output data stream into a storage medium.
According to still another aspect of the embodiments of the present application, there is further provided an electronic device, including: a memory for storing program instructions; and a processor coupled to the memory and configured to execute the program instructions to perform the following functions: acquiring data in an input data stream, wherein the data is data required for executing a task; performing quality detection on the data in the input data stream according to a preset detection rule to generate a quality index, wherein the quality index is used for representing the qualification degree of the data; and repairing the data with the quality index lower than a preset threshold value to obtain an output data stream, and storing the data in the output data stream into a storage medium.
According to still another aspect of the embodiments of the present application, there is further provided a nonvolatile storage medium, where the nonvolatile storage medium includes a stored computer program, and a device in which the nonvolatile storage medium is located executes the above data processing method by running the computer program.
In the embodiments of the present application, data in an input data stream is acquired, wherein the data is data required for executing a task; quality detection is performed on the data in the input data stream according to a preset detection rule to generate a quality index, wherein the quality index is used for representing the qualification degree of the data; and the data with the quality index lower than a preset threshold value is repaired to obtain an output data stream, and the data in the output data stream is stored into a storage medium. In this way, data quality monitoring is embedded into the real-time computation process so that poor-quality data can be controlled or repaired in real time, which achieves the technical effect of handling data with quality problems before the data is written into the storage medium, and thereby solves the technical problem in the related art that real-time tasks are interrupted by data quality problems because data is monitored only after being written into a storage medium.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
fig. 1 is a hardware block diagram of a computer terminal for implementing a data processing method according to an embodiment of the present application;
FIG. 2 is a flow chart of a method of processing data according to an embodiment of the present application;
FIG. 3 is a flow chart of real-time quality detection of data according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a quality rule topology in a configuration operation according to an embodiment of the present application;
FIG. 5 is a topology diagram of a quality detection rule combination implementation in accordance with an embodiment of the present application;
FIG. 6 is a functional architecture diagram of a data processing according to an embodiment of the present application;
fig. 7 is a block diagram of a data processing apparatus according to an embodiment of the present application.
Detailed Description
In order to make the solutions of the present application better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be described below clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by one of ordinary skill in the art based on the embodiments herein without making any inventive effort shall fall within the scope of protection of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
For a better understanding of the embodiments of the present application, the following technical terms related to the embodiments of the present application are explained as follows:
flink: the open source real-time stream computing engine is used for carrying out stateful computation on borderless and bordered data streams and is the main stream real-time computing engine at present.
Unbounded data stream: a data stream whose size or amount of data is not limited during transmission; the data can flow continuously and there is no explicit end point. For example, real-time streaming of audio or video is an unbounded data stream.
Bounded data stream: a data stream whose size or amount of data is limited during transmission and which has a definite start and end point. For example, the data stream in a file transfer is a bounded data stream, because the size of the file is limited and the transmission ends after the file has been transferred.
flinkSQL: and the SQL calculation engine based on the flink is convenient for a user to directly operate the flink by using the SQL language.
UDF (User Defined Function): a user-defined function provided by flink, used to implement non-built-in row-level functions in flinkSQL. In the embodiments of the present application, it specifically refers to a Java method implementing the flinkSQL scalar function interface (ScalarFunction).
UDTF (User Defined Table Function): a user-defined table function provided by flink, used to implement non-built-in table-level join operations in flinkSQL. In the embodiments of the present application, it specifically refers to a Java method implementing the flinkSQL table function interface (TableFunction). A minimal sketch of both a UDF and a UDTF is given after this glossary.
State: and in the link state, temporarily storing data objects of calculation results in parallel tasks in the link running process.
Metrics: and (5) an index collection interface in the process of the link index and the link operation.
In the related art, real-time quality monitoring is mostly an after-the-fact strategy: monitoring is performed after the data has been written into the sink, that is, data quality is checked only after the data computation is completed, so quality problems cannot be discovered in time during the computation; moreover, in the process of data circulation there is no well-defined configuration rule or control strategy for handling data with quality problems. In order to solve these problems, the embodiments of the present application provide a data processing method, which may be executed in the computer terminal shown in fig. 1; the computer terminal is described in detail below.
The data processing method provided by the embodiment of the application can be executed in a mobile terminal, a computer terminal or a similar computing device. Fig. 1 shows a hardware block diagram of a computer terminal for implementing a data processing method. As shown in fig. 1, the computer terminal 10 may include one or more processors (shown as 102a, 102b, … …,102n in the figures) which may include, but are not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA, a memory 104 for storing data, and a transmission module 106 for communication functions. In addition, the method may further include: a display, an input/output interface (I/O interface), a Universal Serial Bus (USB) port (which may be included as one of the ports of the I/O interface), a network interface, a power supply, and/or a camera. It will be appreciated by those of ordinary skill in the art that the configuration shown in fig. 1 is merely illustrative and is not intended to limit the configuration of the electronic device described above. For example, the computer terminal 10 may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.
It should be noted that the one or more processors and/or other data processing circuits described above may be referred to herein generally as "data processing circuits". The data processing circuit may be embodied in whole or in part in software, hardware, firmware, or any combination thereof. Furthermore, the data processing circuit may be a single stand-alone processing module, or it may be incorporated, in whole or in part, into any of the other elements in the computer terminal 10. As referred to in the embodiments of the present application, the data processing circuit acts as a kind of processor control (for example, selection of a variable resistance termination path connected to an interface).
The memory 104 may be used to store software programs and modules of application software, such as program instructions/data storage devices corresponding to the data processing methods in the embodiments of the present application, and the processor executes the software programs and modules stored in the memory 104, thereby executing various functional applications and data processing, that is, implementing the data processing methods described above. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor, which may be connected to the computer terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission module 106 is used to receive or transmit data via a network. The specific examples of the network described above may include a wireless network provided by a communication provider of the computer terminal 10. In one example, the transmission module 106 includes a network adapter (Network Interface Controller, NIC) that can connect to other network devices through a base station to communicate with the internet. In one example, the transmission module 106 may be a Radio Frequency (RF) module for communicating with the internet wirelessly.
The display may be, for example, a touch screen type Liquid Crystal Display (LCD) that may enable a user to interact with a user interface of the computer terminal 10.
It should be noted here that, in some alternative embodiments, the computer terminal shown in fig. 1 may include hardware elements (including circuits), software elements (including computer code stored on a computer readable medium), or a combination of both hardware and software elements. It should be noted that fig. 1 is only one example of a specific example, and is intended to illustrate the types of components that may be present in the computer terminals described above.
In the above operating environment, the embodiments of the present application provide an embodiment of a data processing method. It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system, such as one executing a set of computer-executable instructions, and that although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from that described herein.
Fig. 2 is a flowchart of a method for processing data according to an embodiment of the present application, as shown in fig. 2, the method includes the following steps:
step S202, data in an input data stream is acquired, wherein the data is required for executing a task.
Step S204, quality detection is carried out on the data in the input data stream according to a preset detection rule, and a quality index is generated, wherein the quality index is used for representing the qualification degree of the data.
In step S204, the quality index is used to characterize the qualification degree of the data: for example, data whose quality index is above the preset threshold is qualified data, and data whose quality index is below the preset threshold is unqualified data.
Step S206, repairing the data with the quality index lower than the preset threshold value to obtain an output data stream, and storing the data in the output data stream into a storage medium.
In step S206, the data stream is stored into the storage medium only after the detection rules and the repair processing have been applied, so that data quality is monitored in real time. This better suits real-time data processing scenarios, since quality problems in the data can be discovered and handled in time.
Through the above steps S202 to S206, data quality monitoring is embedded into the real-time computation process so that poor-quality data can be controlled or repaired in real time, which achieves the technical effect of handling data with quality problems before the data is written into the storage medium, and thereby solves the technical problem in the related art that real-time tasks are interrupted by data quality problems because data is monitored only after being written into a storage medium. A detailed description follows.
In order to better understand the above data processing method, the embodiments of the present application further provide a flowchart of real-time quality detection of data, as shown in fig. 3, in which the data is processed through monitoring rule configuration, quality rule execution and control policy execution. Before real-time quality detection of the data, various rules, policies and services need to be designed, including monitoring rule design, control policy design, monitoring rule package design and index service design, specifically:
1. Monitoring rule design: processing is divided into an outer encapsulation and an inner implementation. The outer encapsulation is embodied as follows: the data quality rules are wrapped as flinkSQL UDF/UDTFs, so that they can act directly on the data stream (a flink DataStream or data source can be converted to and from a flinkSQL table or view). The inner implementation is as follows: the rule logic is implemented based on flink State and Metrics, and the data itself remains unchanged during processing.
2. Control policy design: the control policy also uses flinkSQL UDFs as the outer wrapper that implements the policy logic. A control policy is configured on a quality detection operator through a web page, and can also be configured on the topological graph independently. Control policies are mainly divided into splitting strategies and repair strategies. The splitting strategies include three options: split, no split, and ignore. Split: the poor-quality data is written into a side stream for subsequent processing, which is more efficient but can affect the data order. Ignore: the poor-quality data is ignored and is not repaired subsequently. No split: the poor-quality data is processed on the original data stream, without affecting the data order. Repair strategies include: no processing, early warning, deletion, replacement, interception, completion, and so on. A repair strategy is attached by adding a control operator on the topological graph; the access point of a control operator is behind a detection operator and has no subsequent topological connection, and one rule operator can be connected with multiple control operators. A control operator also requires control parameters to be configured, including: (1) the control object (row or column); (2) the control mode (no processing, early warning, deletion, replacement, interception, completion, and so on); and (3) the value (mean, median, 0, empty string, and so on). A configuration sketch is given after this list.
3. Monitoring rule package design: the topology of the monitoring logic is configured on the interface, and the input data stream is obtained through an SDK or flinkSQL Source. After back-end conversion processing, the DAG graph is restored, with the monitoring operators as vertices and the control operators as attributes; the quality rules and control policies are converted into flinkSQL UDF/UDTFs, which are then supplied to real-time data users, who then write the business code.
4. Index service design: the index service is divided into three parts: collection, calculation and application. It includes: a collector, which parses the values output from the data stream in the form of Metrics; a calculator, which computes summary indexes, combined indexes and other index values from the collected atomic real-time index values according to a preset process; a real-time large screen, which displays the quality indexes in real time according to a preset layout and triggers an early warning when a quality index value exceeds its threshold; and quality reports, which include indexes, trends, problem analysis and quality improvement suggestions.
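To make the control parameters in item 2 above more concrete, the following is a minimal sketch of how such a control-operator configuration might be modeled; the type and field names are assumptions and are not taken from the patent:

enum ControlMode { NONE, WARN, DELETE, REPLACE, INTERCEPT, FILL }

record ControlConfig(
        String controlObject,  // the row or column the operator acts on, e.g. "f1"
        ControlMode mode,      // how the poor-quality data is handled
        String value) {}       // fill value such as the mean, the median, "0" or an empty string

Under these assumptions, new ControlConfig("f1", ControlMode.FILL, "0") would correspond to the DQC_FILLNA(..., 0) control operator used in the SQL example later in this description.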
The following describes the steps S202 to S206 in detail with reference to fig. 3.
Optionally, before acquiring the data in the input data stream, the method further includes: determining a rule operator from a quality rule base in response to a configuration operation of a target object, and initializing the rule operator, wherein the rule operator includes a detection operator and a control operator; and combining and configuring the rule operators according to the preset detection rules to generate a topological graph, wherein the topological graph is used for representing the detection flow of the preset detection rules, and the topological graph is a directed acyclic graph.
Optionally, one detection operator is connected with another detection operator in the topological graph, and the control operator and the connected detection operator have the same detection object.
In the embodiments of the present application, unlike the after-the-fact monitoring in the related art, a monitoring rule package is formed by configuring data quality detection rules and configuring control policies before a data quality problem actually takes effect, and the package acts on a flink data stream, as shown in fig. 3. The data quality detection rules are the rule set used to detect data quality; a control policy is a preset processing strategy for abnormal data, which is configured on a quality detection operator through a web page and can also be configured on the topological graph independently.
The real-time data quality detection rules are attached to a table (timeliness verification is not handled by these rules; it is guaranteed by flink and obtained directly from flink latency-related Metrics). The rule table referred to here is not reproduced in this text.
In the above stage, the content configured by the target object on the web page includes:
1. the input source configuration: a data stream source is obtained from other parts of the system; it can be a topic message of a message middleware supported by flink, a data stream already processed by flink inside the system, or an output table of flinkSQL;
2. the configuration of the input stream data structure, which configures the structure of a single data entity of the input stream and comprises information such as column name, type, length, description and the like, wherein the input stream data structure can be directly obtained from a metadata management system if the input stream data structure is registered in the metadata management system;
3. configuring the detection topological graph: detection operators or prefabricated detection rule packages are dragged from the page, and the topological graph is constructed according to the logic to be detected (generally a linear topology; a DAG topology after the input stream is supported when there is a unique primary key);
4. verifying the topological graph, wherein the logical relationship among detection operators is verified in the construction process to ensure that the logic of the detection rule package is feasible;
5. rule operator configuration: a rule requires parameters to be configured for it to take effect, including the action object (columns, rows, field sets, etc.), the function (inclusion, exclusion, equality, greater than, regular matching, enumerated value matching, numbered rule matching, etc.) and the parameters (value ranges, enumerated values, lengths, regular expressions, etc.). A configuration sketch is given after this list.
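As an illustration of item 5 above, a detection-rule configuration could be modeled along the following lines; the names are assumptions rather than the patent's actual data structures:

import java.util.List;

enum RuleFunction { CONTAINS, EXCLUDES, EQUALS, GREATER_THAN, REGEX_MATCH, ENUM_MATCH, NUMBERED_RULE }

record RuleConfig(
        String actionObject,         // the column, row or field set the rule acts on
        RuleFunction function,       // the check to apply
        List<String> parameters) {}  // value range, enumerated values, length, regular expression, ...

For example, new RuleConfig("f1", RuleFunction.REGEX_MATCH, List.of("^[0-9]+$")) would describe a regular-expression check on column f1.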
Fig. 4 is a schematic diagram of a quality rule topology in a configuration operation according to an embodiment of the present application. As shown in fig. 4, during the configuration of a quality rule, in response to a configuration operation of the above-mentioned target object, the process is specifically as follows:
1. Detection operators are obtained from the configured quality detection rules, control operators are obtained from the configured control policies, and the rule operators used in the quality rule configuration are obtained from the detection operators and the control operators;
2. The rule operators are initialized and configured according to the preset detection rules, and the topological graph is generated from the rule configuration. In fig. 4, when a detection operator detects that data is null, a control operator performs filling processing, which yields a simple topological graph. It should be noted that the topological graph is a directed acyclic graph used to represent the detection flow of the preset detection rules, in which one detection operator is connected with another detection operator, and a control operator acts on the same detection object as the detection operator it is connected to. In fig. 4, for example, the detection operator that checks whether data is null and the control operator that fills in the data act on the same detection object, which may be identified by an ID or an object name.
3. An operator set, namely a quality detection rule package, is formed according to the generated topological graph.
In the embodiments of the present application, the implementation of the quality detection rule package in the above procedure is further explained: for both the quality detection rules and the control policies, a UDF is used for a column and a UDTF is used for a row. The data flow is approximately as follows: 1. the data enters the UDF/UDTF; 2. according to the logic to be implemented, Metrics are used for the output result and State is used as the intermediate result cache; 3. a MetricReporterFactory is implemented based on Kafka, and the indexes are reported directly to the Kafka middleware. A sketch of such a detection UDF is given below.
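The following is a minimal sketch, not taken from the patent, of what a column-level detection UDF built in this style might look like: it passes the data through unchanged and records its detection result only through Metrics counters. The class name and metric names are assumptions; shipping the metrics out of the job would additionally require a metric reporter such as the Kafka-based MetricReporterFactory mentioned above.

import org.apache.flink.metrics.Counter;
import org.apache.flink.table.functions.FunctionContext;
import org.apache.flink.table.functions.ScalarFunction;

public class NullCountUdf extends ScalarFunction {
    private transient Counter nullCounter;
    private transient Counter totalCounter;

    @Override
    public void open(FunctionContext context) throws Exception {
        // Register the counters; a configured metric reporter ships them out of the job.
        nullCounter = context.getMetricGroup().counter("dqm_null_count");
        totalCounter = context.getMetricGroup().counter("dqm_total_count");
    }

    // The value is returned unchanged: only the metrics carry the quality result.
    public String eval(String value) {
        totalCounter.inc();
        if (value == null) {
            nullCounter.inc();
        }
        return value;
    }
}

A null-percentage rule such as DQM_NULL_PRECENT in the example below could be derived from the ratio of these two counters on the metrics side.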
The combination of quality detection rules is realized as follows: the DAG graph configured on the interface is restored at the back end; each quality detection rule is a vertex on the DAG graph, and the control policy is an attribute of the vertex. On the DAG graph, the quality detection operators and the control (policy) operators can be used as UDFs acting directly on columns or as UDTFs acting on rows, since the structure of the data itself remains unchanged during the transfer. This is explained below with reference to fig. 5.
Fig. 5 is a schematic diagram of a topology implemented by a combination of quality detection rules according to an embodiment of the present application. When the check is to compute the percentage of null values in column f1, to delete incoming data when the average value falls outside (95, 99), and then to compute whether the standard deviation and the coefficient of variation are within a preset range, the topology is as shown in fig. 5. Several detection operators are set, and each detection operator corresponds to one detection rule: for example, the first detection operator determines the percentage of null values contained in the data, the second detection operator calculates the average value, the third detection operator calculates the standard deviation, and the fourth detection operator calculates the coefficient of variation. Each detection operator contains a corresponding detection rule, specifically: the detection object (row/column), the detection object name, the detection operator and its parameters, and so on. When the detection result of a detection operator does not meet the requirement, the data is repaired by the control operator acting on that detection operator; each control operator contains a corresponding control policy, specifically: the control object (row/column), the control object name, the control operator and its parameters, and so on.
After the topological graph is generated, an SQL statement needs to be generated from it. Specifically, the topological graph is restored to a DAG structure, and the DAG graph is built in triplet form (in the node types and relations, the letter M abbreviates detector and the letter C abbreviates controller). According to the user's operations, the front end adds a node to the node list whenever a node is created and adds a relation to the relation list whenever a connecting line is created. A detector can only be connected with another detector, and a controller can only be connected with a detector that has the same monitored object. The front end then sends a request carrying the node list and the relation list to the back end (the pseudocode of this request appears as a figure in the original document and is not reproduced here).
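A minimal sketch of what such a request payload might look like, assuming hypothetical Node, Relation and RulePackageRequest types (none of these names come from the patent), is as follows:

import java.util.List;

record Node(String id, String type, String function, List<String> parameters) {}
record Relation(String fromId, String toId) {}
record RulePackageRequest(List<Node> nodes, List<Relation> relations) {}

class RequestExample {
    static RulePackageRequest example() {
        return new RulePackageRequest(
                List.of(
                        new Node("m1", "M", "DQM_NULL_PRECENT", List.of("f1")), // detector node
                        new Node("c1", "C", "DQC_FILLNA", List.of("0"))),       // controller on the same object
                List.of(new Relation("m1", "c1")));                             // controller attached to the detector
    }
}

Here the type field follows the M/C convention described above, while the function and parameters fields correspond to the node 'Function' and 'Parameters' attributes discussed in the next paragraph.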
The back-end program checks whether a loop exists according to the node list and the relation list returned by the front end; if a loop exists it reports an error, and if not it assembles the SQL according to the keywords. The 'Function' attribute of a node is globally unique and names a UDF or UDTF built into the system in advance, and its parameters are passed in order through the 'Parameters' attribute. A graph is constructed from the relation list, the graph is regrouped according to the detection object type and the detection object, and the SQL is assembled according to the grouped relation order. A column-type detection object calls a built-in UDF of the system, and a row-type detection object calls a built-in UDTF of the system. The generated SQL pseudocode fragment is then as follows:
SELECT *,
       DQM_COEFFICIENT_OF_VARIATIONT(
           DQM_STDDEV_POP(
               DQC_DEL(
                   DQM_AVG(
                       DQC_FILLNA(
                           DQM_NULL_PRECENT(f1), 0)), 95, 99)))
FROM source_table
It should be noted that this example covers only one column and is a linear topology; when there are multiple detection columns or row objects, the SQL is generated in a similar manner, and only the back-end logic that assembles the SQL based on the node type keywords differs.
Optionally, performing quality detection on the data in the input data stream according to the preset detection rule includes: acquiring an identifier carried in the data that indicates whether the data is in an independent mode; when the identifier indicates that the data is in the independent mode, copying the data in the input data stream and performing quality detection on the copied data according to the preset detection rule; and when the identifier indicates that the data is in a non-independent mode, performing quality detection on the data in the input data stream directly according to the preset detection rule.
In the embodiments of the present application, as shown in fig. 3, the data in the input data stream is the data required for executing a task. The data stream enters the rule package formed by configuring the data quality detection rules and the control policies, and the data in the input data stream flows through each operator according to the generated topological graph. Whether the data is in the independent mode is judged: when the data in the input stream is in the independent mode, the data in the input data stream is copied, and the copied data is either used directly as data in the output stream or subjected to quality detection according to the preset detection rule to generate a quality index, where the preset detection rule is used to monitor the data in the input stream so as to obtain the data entities to be checked according to the check rules; when the data in the input stream is in the non-independent mode, quality detection is performed on the data in the input data stream according to the preset detection rule and a quality index is generated. The quality index is used for representing the qualification degree of the data: data whose quality index is above the preset threshold is qualified data, and data whose quality index is below the preset threshold is unqualified data.
Optionally, after generating the quality index, the method further includes: writing the quality index into a message middleware, and, when the quality index is lower than the preset threshold value, performing splitting processing on the data in the input data stream according to a splitting strategy; and obtaining the detected data stream from the data on which the splitting processing has been performed.
In the embodiments of the present application, as shown in fig. 3, for data whose quality index is above the preset threshold (qualified data), no processing is required, and the data entities, that is, the detected data stream, are obtained from the data stream; for data whose quality index is below the preset threshold (unqualified data), a splitting strategy is executed to obtain the data stream, and the data entities, that is, the detected data stream, are obtained. The splitting strategies include: split, in which the poor-quality data is written into a side stream for subsequent processing, which is more efficient but can affect the data order; ignore, in which the poor-quality data is ignored and not repaired subsequently; and no split, in which the poor-quality data is processed on the original data stream without affecting the data order.
Optionally, after obtaining the detected data stream, the method further includes: when the detected data stream contains an abnormal stream, executing a repair operation on the abnormal data in the abnormal stream according to a repair strategy to obtain repaired data; and when the detected data stream does not contain an abnormal stream, not executing the repair operation, and determining the data on which the repair operation has not been executed in the detected data stream as unrepaired data.
Optionally, the method further includes: merging the repaired data and the unrepaired data to obtain the output data stream.
In the embodiments of the present application, as shown in fig. 3, the detected data stream may contain abnormal data, so a repair strategy is further configured for the detected data stream, where the repair strategy is used to repair the abnormal data. When abnormal data exists in the data stream, the abnormal data is repaired according to the repair strategy to obtain repaired data; when no abnormal data exists in the data stream, the repair strategy is not executed and the data stream is determined to be unrepaired data. The repaired data and the unrepaired data are then merged to obtain the final output stream data (a DataStream-level sketch of this splitting and merging is given below).
The repairing strategy comprises the operations of no processing, early warning, deleting, replacing, intercepting, completing and the like.
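One common way to realize the split strategy and the final merge at the DataStream level is via flink side outputs: poor-quality records are diverted to a side stream, repaired, and then united back with the untouched records. The following is an illustrative sketch only; in the embodiments described here the strategies are wrapped as flinkSQL UDF/UDTFs, and the quality check and repair logic shown are placeholders.

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

public class SplitAndMerge {
    public static DataStream<String> apply(DataStream<String> input) {
        final OutputTag<String> badTag = new OutputTag<String>("poor-quality") {};

        SingleOutputStreamOperator<String> good = input.process(
                new ProcessFunction<String, String>() {
                    @Override
                    public void processElement(String value, Context ctx, Collector<String> out) {
                        if (value.isEmpty()) {            // stand-in quality check
                            ctx.output(badTag, value);    // split: route the record to the side stream
                        } else {
                            out.collect(value);           // qualified data stays on the main stream
                        }
                    }
                });

        // Repair the side stream (here simply replacing the value) and merge it back.
        DataStream<String> repaired = good.getSideOutput(badTag)
                .map(value -> "N/A")
                .returns(Types.STRING);

        return good.union(repaired); // output stream to be written to the sink
    }
}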
Fig. 6 is a functional architecture diagram of data processing according to an embodiment of the present application, involving a configuration service, a management service, a Flink service and an index service, as shown in fig. 6. Specifically: the configuration service is used for configuring the monitoring rule package, where the topology configuration is formed from the quality rule configuration and the control policy configuration; the management service covers generating the monitoring rule package, writing the subsequent business logic, and compiling and submitting the code, where generating the monitoring rule package includes generating the quality rules and control policies and then generating the topology; the Flink service covers rule execution, policy execution and calculation execution; and the index service covers index collection, index calculation and index application, where the index application includes a real-time large screen and quality reports: the real-time large screen displays the quality indexes in real time according to a preset layout and triggers an early warning when a quality index value exceeds its threshold, and the quality reports include indexes, trends, problem analysis and quality improvement suggestions.
In the embodiments of the present application, data in an input data stream is acquired, wherein the data is data required for executing a task; quality detection is performed on the data in the input data stream according to a preset detection rule to generate a quality index, wherein the quality index is used for representing the qualification degree of the data; and the data with the quality index lower than a preset threshold value is repaired to obtain an output data stream, and the data in the output data stream is stored into a storage medium. In this way, data quality monitoring is embedded into the real-time computation process so that poor-quality data can be controlled or repaired in real time, which achieves the technical effect of handling data with quality problems before the data is written into the storage medium, and thereby solves the technical problem in the related art that real-time tasks are interrupted by data quality problems because data is monitored only after being written into a storage medium.
According to the embodiment of the application, a data processing device is provided, and it should be noted that the data processing device of the embodiment of the application may be used to execute the data processing method provided by the embodiment of the application. The following describes a data processing apparatus provided in an embodiment of the present application.
Fig. 7 is a block diagram of a data processing apparatus according to an embodiment of the present application. As shown in fig. 7, the apparatus includes:
an obtaining module 70, configured to obtain data in an input data stream, where the data is data required for executing a task;
the detection module 72 is configured to perform quality detection on data in the input data stream according to a preset detection rule, and generate a quality index, where the quality index is used to characterize the qualification degree of the data;
and a repair module 74, configured to repair the data with the quality index lower than a preset threshold value to obtain an output data stream, and to store the data in the output data stream into a storage medium.
The acquisition module 70, the detection module 72 and the repair module 74 in the data processing apparatus embed data quality monitoring into the real-time computation process so that poor-quality data can be controlled or repaired in real time, which achieves the technical effect of handling data with quality problems before the data is written into the storage medium, and thereby solves the technical problem in the related art that real-time tasks are interrupted by data quality problems because data is monitored only after being written into a storage medium.
In the data processing apparatus provided in the embodiments of the present application, the detection module 72 is further configured to acquire an identifier carried in the data that indicates whether the data is in an independent mode; to copy the data in the input data stream and perform quality detection on the copied data according to the preset detection rule when the identifier indicates that the data is in the independent mode; and to perform quality detection on the data in the input data stream according to the preset detection rule when the identifier indicates that the data is in a non-independent mode.
In the data processing apparatus provided in the embodiments of the present application, the repair module 74 is further configured to execute a repair operation on the abnormal data in the abnormal stream according to the repair strategy to obtain repaired data when the detected data stream contains an abnormal stream, and, when the detected data stream does not contain an abnormal stream, not to execute the repair operation and to determine the data on which the repair operation has not been executed in the detected data stream as unrepaired data.
In the data processing apparatus provided in this embodiment of the present application, the repair module 74 is further configured to perform a merging operation on repaired data and unrepaired data to obtain an output data stream.
The embodiments of the present application also provide electronic equipment, which includes: a memory for storing program instructions; and a processor coupled to the memory for executing program instructions that perform the following functions: acquiring data in an input data stream, wherein the data is data required for executing a task; performing quality detection on the data in the input data stream according to a preset detection rule to generate a quality index, wherein the quality index is used for representing the qualification degree of the data; and repairing the data with the quality index lower than a preset threshold value to obtain an output data stream, and storing the data in the output data stream into a storage medium.
It should be noted that, the electronic device is configured to execute the data processing method shown in fig. 2, so the explanation of the data processing method is also applicable to the electronic device, and will not be repeated here.
The embodiments of the present application also provide a nonvolatile storage medium which includes a stored computer program, wherein a device in which the nonvolatile storage medium is located executes the above data processing method by running the computer program.
It should be noted that, since the above-mentioned nonvolatile storage medium is used for executing the data processing method shown in fig. 2, the explanation of the data processing method is also applicable to the above-mentioned nonvolatile storage medium, and will not be repeated here.
The foregoing embodiment numbers of the present application are merely for description and do not represent that one embodiment is better or worse than another.
In the foregoing embodiments of the present application, the descriptions of the embodiments are emphasized, and for a portion of this disclosure that is not described in detail in this embodiment, reference is made to the related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed technology content may be implemented in other manners. The above-described embodiments of the apparatus are merely exemplary, and the division of the units, for example, may be a logic function division, and may be implemented in another manner, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some interfaces, units or modules, or may be in electrical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, including several instructions to cause a computer device (which may be a personal computer, a server or a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a removable hard disk, a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The foregoing is merely a preferred embodiment of the present application and it should be noted that modifications and adaptations to those skilled in the art may be made without departing from the principles of the present application and are intended to be comprehended within the scope of the present application.

Claims (10)

1. A method of processing data, comprising:
acquiring data in an input data stream, wherein the data is data required for executing a task;
performing quality detection on the data in the input data stream according to a preset detection rule to generate a quality index, wherein the quality index is used for representing the qualification degree of the data;
repairing the data lower than a preset threshold in the quality index to obtain an output data stream, and storing the data in the output data stream into a storage medium.
2. The method of claim 1, wherein prior to acquiring the data in the input data stream, the method further comprises:
determining a rule operator from a quality rule base in response to configuration operation of a target object, and initializing the rule operator, wherein the rule operator comprises a detection operator and a control operator;
and combining and configuring the rule operators according to the preset detection rules to generate a topological graph, wherein the topological graph is used for representing the detection flow of the preset detection rules, and the topological graph is a directed acyclic graph.
3. The method of claim 2, wherein one detection operator is connected to another detection operator in the topology map, and the control operator has the same detection object as the connected detection operator.
4. The method of claim 1, wherein quality detecting the data in the input data stream according to a preset detection rule comprises:
acquiring an identifier which is carried in the data and used for indicating whether the data is in an independent mode;
copying data in the input data stream when the identifier characterizes the data to be in an independent mode, and carrying out quality detection on the copied data according to the preset detection rule;
and when the identifier characterizes the data to be in a non-independent mode, performing quality detection on the data in the input data stream according to the preset detection rule.
5. The method of claim 1, wherein after generating the quality indicator, the method further comprises:
writing the quality index into a message middleware, and carrying out splitting treatment on the data in the input data stream through a splitting strategy under the condition that the quality index is lower than the preset threshold value;
and obtaining the detected data stream according to the data after the splitting process is executed.
6. The method of claim 5, wherein after obtaining the detected data stream, the method further comprises:
under the condition that the detected data stream contains an abnormal stream, executing repair operation on the abnormal data in the abnormal stream according to a repair strategy to obtain repaired data;
and under the condition that the detected data stream does not contain the abnormal stream, not executing the repair operation, and determining the data which does not execute the repair operation in the detected data stream as unrepaired data.
7. The method of claim 6, wherein the method further comprises:
and merging the repaired data and the unrepaired data to obtain the output data stream.
8. A data processing apparatus, comprising:
the acquisition module is used for acquiring data in an input data stream, wherein the data is data required for executing a task;
the detection module is used for carrying out quality detection on the data in the input data stream according to a preset detection rule to generate a quality index, wherein the quality index is used for representing the qualification degree of the data;
and the repair module is used for performing repair processing on the data lower than the preset threshold value in the quality index to obtain an output data stream, and storing the data in the output data stream into a storage medium.
9. An electronic device, comprising:
a memory for storing program instructions;
a processor, coupled to the memory, for executing program instructions that perform the following functions: acquiring data in an input data stream, wherein the data is data required for executing a task; performing quality detection on the data in the input data stream according to a preset detection rule to generate a quality index, wherein the quality index is used for representing the qualification degree of the data; repairing the data lower than a preset threshold in the quality index to obtain an output data stream, and storing the data in the output data stream into a storage medium.
10. A non-volatile storage medium, characterized in that the non-volatile storage medium comprises a stored computer program, wherein the device in which the non-volatile storage medium is located performs the method of processing data according to any one of claims 1 to 7 by running the computer program.
CN202311640774.6A 2023-12-01 2023-12-01 Data processing method and device and electronic equipment Pending CN117667908A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311640774.6A CN117667908A (en) 2023-12-01 2023-12-01 Data processing method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311640774.6A CN117667908A (en) 2023-12-01 2023-12-01 Data processing method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN117667908A true CN117667908A (en) 2024-03-08

Family

ID=90085799

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311640774.6A Pending CN117667908A (en) 2023-12-01 2023-12-01 Data processing method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN117667908A (en)

Similar Documents

Publication Publication Date Title
US10459780B2 (en) Automatic application repair by network device agent
CN103176974B (en) The method and apparatus of access path in optimization data storehouse
CN107924360B (en) Diagnostic framework in a computing system
US9654580B2 (en) Proxy-based web application monitoring through script instrumentation
US11310140B2 (en) Mitigating failure in request handling
CN112737800B (en) Service node fault positioning method, call chain generating method and server
CN110908910B (en) Block chain-based test monitoring method and device and readable storage medium
CN112711496A (en) Log information full link tracking method and device, computer equipment and storage medium
CN106126419A (en) The adjustment method of a kind of application program and device
CN107885634B (en) Method and device for processing abnormal information in monitoring
US20240039782A1 (en) Computer network troubleshooting and diagnostics using metadata
US11860858B1 (en) Decoding distributed ledger transaction records
CN113239114A (en) Data storage method, data storage device, storage medium and electronic device
WO2012088761A1 (en) Data analysis-based security information exchange monitoring system and method
US11914495B1 (en) Evaluating machine and process performance in distributed system
CN116578911A (en) Data processing method, device, electronic equipment and computer storage medium
CN117667908A (en) Data processing method and device and electronic equipment
CN111737351A (en) Transaction management method and device for distributed management system
US10621033B1 (en) System for generating dataflow lineage information in a data network
US11658889B1 (en) Computer network architecture mapping using metadata
US20160041892A1 (en) System for discovering bugs using interval algebra query language
CN114281549A (en) Data processing method and device
WO2017143986A1 (en) Method and device for determining resource indicator
CN112491601B (en) Traffic topology generation method and device, storage medium and electronic equipment
CN110825609A (en) Service testing method, device and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination