CN112148762A - Statistical method and device for real-time data stream - Google Patents

Publication number
CN112148762A
Authority
CN
China
Prior art keywords
path
data
file
data stream
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910573816.6A
Other languages
Chinese (zh)
Inventor
韩路
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xi'an Jingxundi Supply Chain Technology Co ltd
Original Assignee
Xi'an Jingxundi Supply Chain Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xi'an Jingxundi Supply Chain Technology Co ltd filed Critical Xi'an Jingxundi Supply Chain Technology Co ltd
Priority to CN201910573816.6A priority Critical patent/CN112148762A/en
Publication of CN112148762A publication Critical patent/CN112148762A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24568 Data stream processing; Continuous queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries

Abstract

The invention discloses a statistical method and a statistical device for real-time data streams, and relates to the technical field of computers. One embodiment of the method comprises: registering a virtual table of the data stream according to the read data stream; loading a preset SQL statement file and running it on the virtual table of the data stream; and storing the real-time statistical result obtained from the run. This embodiment overcomes the technical problem that processing is complex when a stream computing framework handles multiple data streams using various API operators, and achieves the technical effect that statistics over real-time data streams in a business scenario can be implemented by writing SQL statements instead of complex transformation functions.

Description

Statistical method and device for real-time data stream
Technical Field
The invention relates to the technical field of computers, in particular to a statistical method and a statistical device for real-time data streams.
Background
Big data real-time processing technology is developing rapidly, and real-time processing engines are increasingly used in production environments with abundant user data. This fits the trend toward fine-grained merchant operations and helps merchants make more accurate decisions. In business processing, multiple data streams often need to be fed into a streaming data computing framework, where, depending on the business logic scenario, they are filtered, have fields converted, are aggregated along different dimensions, and have statistical results output. To ensure that data is neither lost nor duplicated during real-time statistics, the checkpoint mechanism of the streaming data computing framework allows a job to recover from the previous checkpoint after a failure and restart, so that data is not lost.
In the process of implementing the invention, the inventor finds that at least the following problems exist in the prior art:
1. In actual business, multiple kinds of business logic need to be applied to several data streams at the same time, but current streaming data computing frameworks provide step-by-step processing based on various API operators, which makes the processing complex.
2. The checkpoint mechanism ensures that data is not lost when a failure occurs, but does not ensure that data is not duplicated when it is read back in.
Disclosure of Invention
In view of this, embodiments of the present invention provide a statistical method and device for real-time data streams, which can solve the problem that processing is complex when current streaming data computing frameworks apply business logic to multiple data streams using various API operators.
To achieve the above object, according to an aspect of an embodiment of the present invention, there is provided a statistical method for a real-time data stream, including: registering a virtual table of the data stream according to the read data stream; loading and operating a preset SQL statement file on the virtual table of the data stream; and storing the real-time statistical result obtained after operation.
Optionally, before registering the virtual table of the data stream according to the read data stream, the method further includes: taking a checkpoint file storage path with the latest generation time and carrying a set identifier as a first path, and taking a result data file storage path with the latest generation time and carrying the set identifier as a second path; determining a latest checkpoint file according to the generation time of the first path and the generation time of the second path; reading a data stream from the latest checkpoint file.
Optionally, determining a latest checkpoint file according to the generation time of the first path and the generation time of the second path includes: if the generation time of the second path is later than that of the first path, deleting a result data file storage path of which the generation time is later than that of the first path, and taking the checkpoint file of the first path as a latest checkpoint file; if the generation time of the second path is earlier than that of the first path, deleting the checkpoint file storage path of which the generation time is later than that of the second path, and taking the checkpoint file corresponding to the checkpoint file storage path of which the rest generation time is latest as the latest checkpoint file; and if the generation time of the second path is equal to the generation time of the first path, taking the checkpoint file of the first path as a latest checkpoint file.
Optionally, registering a virtual table of the data stream according to the read data stream includes: presetting an SQL statement file in a configuration file according to business logic; and determining the number and the name of the virtual table of the data stream according to the read category and the number of the data stream, and registering the virtual table of the data stream.
Optionally, storing the real-time statistical result obtained after the operation includes: the real-time statistical result obtained after loading and operating the preset SQL statement file comprises result data and offset data; storing the result data, generating a result data file, judging whether the storage is successful, if so, setting a set identifier in a storage path of the result data file, and if not, ending the flow abnormally; and storing the offset data, generating a check point file, judging whether the storage is successful, if so, setting the set identifier in a storage path of the check point file, and if not, ending the flow abnormally.
Optionally, after storing the real-time statistical result obtained after the operation, the method further includes: modifying the table name and/or date conversion function in the preset SQL statement file according to the off-line environment data; generating an off-line index result based on the off-line environment data by using the modified SQL statement file; and performing data comparison verification on the real-time statistical result and the off-line index result to determine the accuracy of the real-time statistical result.
According to another aspect of the embodiments of the present invention, there is provided a device for statistics of real-time data streams, including: a registered virtual table module to: registering a virtual table of the data stream according to the read data stream; an execute statement module to: loading and operating a preset SQL statement file on the virtual table of the data stream; a data storage module to: and storing the real-time statistical result obtained after operation.
Optionally, the apparatus further comprises a reading module configured to: taking a checkpoint file storage path with the latest generation time and carrying a set identifier as a first path, and taking a result data file storage path with the latest generation time and carrying the set identifier as a second path; determining a latest checkpoint file according to the generation time of the first path and the generation time of the second path; reading a data stream from the latest checkpoint file.
Optionally, the reading module is further configured to: if the generation time of the second path is later than that of the first path, deleting a result data file storage path of which the generation time is later than that of the first path, and taking the checkpoint file of the first path as a latest checkpoint file; if the generation time of the second path is earlier than that of the first path, deleting the checkpoint file storage path of which the generation time is later than that of the second path, and taking the checkpoint file corresponding to the checkpoint file storage path of which the rest generation time is latest as the latest checkpoint file; and if the generation time of the second path is equal to the generation time of the first path, taking the checkpoint file of the first path as a latest checkpoint file.
Optionally, the register virtual table module is further configured to: presetting an SQL statement file in a configuration file according to business logic; and determining the number and the name of the virtual table of the data stream according to the read category and the number of the data stream, and registering the virtual table of the data stream.
Optionally, the data storage module is further configured to: the real-time statistical result obtained after loading and operating the preset SQL statement file comprises result data and offset data; storing the result data, generating a result data file, judging whether the storage is successful, if so, setting a set identifier in a storage path of the result data file, and if not, ending the flow abnormally; and storing the offset data, generating a check point file, judging whether the storage is successful, if so, setting the set identifier in a storage path of the check point file, and if not, ending the flow abnormally.
Optionally, the apparatus further comprises a data verification module configured to: modifying the table name and/or date conversion function in the preset SQL statement file according to the off-line environment data; generating an off-line index result based on the off-line environment data by using the modified SQL statement file; and performing data comparison verification on the real-time statistical result and the off-line index result to determine the accuracy of the real-time statistical result.
According to still another aspect of an embodiment of the present invention, there is provided an electronic apparatus including: one or more processors; a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the statistical method for real-time data streams as set forth in the foregoing embodiments.
According to a further aspect of the embodiments of the present invention, there is provided a computer readable medium, on which a computer program is stored, which when executed by a processor, implements the statistical method of real-time data streams as set forth in the foregoing embodiments.
One embodiment of the above invention has the following advantages or benefits: because an SQL statement file that meets the business requirements is loaded and run directly on the virtual tables of the registered data streams, the technical problem that processing is complex when a stream computing framework handles multiple data streams using various API operators is overcome, and statistics over real-time data streams in a business scenario can be implemented by writing SQL statements instead of complex transformation functions.
Further effects of the optional implementations mentioned above are described below in connection with the specific embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a schematic diagram of a basic flow of a statistical method of real-time data flow according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a flow of executing an SQL statement according to an embodiment of the invention;
FIG. 3 is a schematic diagram of a preferred flow of a statistical method of real-time data flow according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the basic modules of a statistical apparatus of real-time data streams according to an embodiment of the present invention;
FIG. 5 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;
FIG. 6 is a schematic block diagram of a computer system suitable for use in implementing a terminal device or server of an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Spark Streaming is a streaming data computing framework that is widely used for real-time processing in combination with the open-source stream processing platform Kafka. In practice, it is common for 2 or 3 Kafka data streams to be fed into the Spark Streaming framework, where, depending on the business scenario, they are filtered, have fields converted, are aggregated along different dimensions, and have statistical results output. Real-time statistics based on Spark Streaming can rely on the checkpoint mechanism that Spark Streaming provides to ensure that data is not lost: after a failure and restart, processing resumes from the previous checkpoint. However, this mechanism only ensures that data is not lost; it does not ensure that data is not processed repeatedly, so it does not meet the exactly-once processing requirement expected of real-time data processing. In the prior art, the relevant business logic is implemented by applying various operators of the Spark Streaming API (such as foreachRDD, transform and reduceByKey) to the incoming data streams for data conversion and other processing.
To address the problem that the checkpoint mechanism cannot prevent data from being processed repeatedly, the prior art applies the repartition operator (repartition) of the open-source Spark API before the output operator (action), so that the data of multiple partitions is merged into a single partition; that partition can then be written using the transaction function of a database, which ensures that no data is duplicated during processing. Alternatively, the output result data may include the offset data (also referred to as checkpoint offset data), so that committing the result and committing the offset are completed in one operation and no data is lost or processed repeatedly. On failure recovery, the offset data in the last committed result can be used.
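The second prior-art approach, committing the result and the offset in one operation, can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for the transactional database; the table names, column names and the `commit_batch` helper are assumptions made for illustration and do not come from the patent.

```python
import sqlite3

def make_store():
    """Create an in-memory database with one results table and one offsets table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE results(ad_id TEXT, clicks INTEGER)")
    conn.execute(
        "CREATE TABLE offsets("
        "topic TEXT, part INTEGER, next_offset INTEGER CHECK(next_offset >= 0), "
        "PRIMARY KEY(topic, part))"
    )
    return conn

def commit_batch(conn, rows, topic, partition, next_offset):
    """Write the result rows and the consumed offset in a single transaction."""
    with conn:  # commits if the block succeeds, rolls back if it raises
        conn.executemany("INSERT INTO results(ad_id, clicks) VALUES (?, ?)", rows)
        conn.execute(
            "INSERT OR REPLACE INTO offsets(topic, part, next_offset) VALUES (?, ?, ?)",
            (topic, partition, next_offset),
        )
```

Because both writes share one transaction, a failure while saving the offset also discards the result rows of that batch, so a restart from the stored offset does not produce duplicates. The cost, as the description notes next, is that result data and offset data become tightly coupled.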
However, the code framework of the existing solutions is not very general: different Spark operator code has to be written for each kind of business processing logic, and the implementation is relatively complex. Problems are also hard to troubleshoot when verifying the difference (DIFF) between real-time data and offline data. Restricting the action to a single partition through the repartition parameter reduces the concurrency of the program and increases the processing latency within each batch interval, which hurts performance especially when the data volume is large. Writing checkpoint offset data and result data together couples the two tightly, which makes it inconvenient to use the result data on its own later.
FIG. 1 is a schematic diagram of a basic flow of a statistical method of a real-time data stream according to an embodiment of the present invention. As shown in FIG. 1, an embodiment of the present invention provides a statistical method for a real-time data stream, including:
Step S101: registering a virtual table of the data stream according to the read data stream;
Step S102: loading and running a preset SQL (Structured Query Language) statement file on the virtual table of the data stream;
Step S103: storing the real-time statistical result obtained after the run.
Specifically, an SQL statement file is written, according to the business logic, in a configuration file external to the Spark Streaming processing program. After Spark Streaming reads multiple data streams, a virtual table is registered for each of them, the SQL statement file is loaded and run directly on the virtual tables, and the calculation result is generated and stored. When a new business logic requirement arises, the corresponding statistical function can be implemented quickly by writing a new SQL statement file, without having to consider whether other Spark operator code needs to be modified. For example, a logistics online advertising statistics service needs to count metrics such as the exposure of logistics online advertisements and the click rate of users. The Kafka topics (Kafka data streams) for exposures and clicks of the logistics advertisements are consumed, an exposure virtual table (impression_table) and a click virtual table (click_table) are registered, and a business SQL statement of the following form is written: SELECT exposure, click rate FROM exposure table UNION click table WHERE (illegal clicks and exposures are filtered out). Executing this SQL statement processes the exposure and click data in real time and generates report data on the real-time exposure and click rate of users.
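The register-then-query pattern of steps S101 to S103 can be sketched as follows. This is a simplified illustration, not the patented implementation: Python's built-in sqlite3 stands in for Spark SQL, one in-memory table stands in for each virtual table, and the column names (`ad_id`, `is_valid`), the SQL text and the helper names are assumptions. Counting impressions and clicks per advertisement is one possible reading of the metrics in the example above.

```python
import sqlite3

# Hypothetical contents of the externally configured SQL statement file.
SQL_FILE_TEXT = """
SELECT ad_id,
       SUM(event = 'impression') AS impressions,
       SUM(event = 'click')      AS clicks
FROM (
    SELECT ad_id, 'impression' AS event, is_valid FROM impression_table
    UNION ALL
    SELECT ad_id, 'click' AS event, is_valid FROM click_table
)
WHERE is_valid = 1
GROUP BY ad_id
ORDER BY ad_id
"""

def register_virtual_table(conn, name, rows):
    """Expose one micro-batch of a data stream as a queryable table."""
    conn.execute(f"CREATE TABLE {name}(ad_id TEXT, is_valid INTEGER)")
    conn.executemany(f"INSERT INTO {name} VALUES (?, ?)", rows)

def run_batch(streams, sql_text):
    """Register one table per stream, then run the configured SQL on them."""
    conn = sqlite3.connect(":memory:")
    for name, rows in streams.items():
        register_virtual_table(conn, name, rows)
    return conn.execute(sql_text).fetchall()

if __name__ == "__main__":
    batch = {
        "impression_table": [("ad1", 1), ("ad1", 1), ("ad1", 0), ("ad2", 1)],
        "click_table": [("ad1", 1), ("ad2", 0)],
    }
    print(run_batch(batch, SQL_FILE_TEXT))
```

In this sketch a new metric only requires a different `SQL_FILE_TEXT`; the registration and execution code stays unchanged, which is the property the embodiment aims for.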
The embodiment of the invention adopts the technical means of directly loading and operating the SQL statement file meeting the business requirements on the virtual table of the registered data stream, thereby overcoming the technical problem of complex processing process when a data stream computing framework processes a plurality of data streams based on various API (Application Programming Interface) operators, and further achieving the technical effect of replacing complex conversion computing functions by writing SQL statements to realize the statistics of real-time data streams in a business scene.
In this embodiment of the present invention, step S101 registers a virtual table of the data stream according to the read data stream, including: presetting an SQL statement file in a configuration file according to business logic; and determining the number and the name of the virtual table of the data stream according to the read category and the number of the data stream, and registering the virtual table of the data stream.
Before step S101 registers the virtual table of the data stream according to the read data stream, the embodiment of the present invention further includes: taking a checkpoint file storage path with the latest generation time and carrying a set identifier (the set identifier can include but is not limited to a SUCCESS identifier) as a first path, and taking a result data file storage path with the latest generation time and carrying the set identifier as a second path; determining a latest checkpoint file according to the generation time of the first path and the generation time of the second path; reading a data stream from the latest checkpoint file.
The embodiment of the invention determines the latest checkpoint file according to the generation time of the first path and the generation time of the second path and reads the data stream from that checkpoint file, thereby achieving exactly-once processing of the data: when a failure and restart occur, data is neither lost nor duplicated.
Based on the foregoing embodiment, in the embodiment of the present invention, determining the latest checkpoint file according to the generation time of the first path and the generation time of the second path may include: if the generation time of the second path is later than that of the first path, deleting a result data file storage path of which the generation time is later than that of the first path, and taking the checkpoint file of the first path as a latest checkpoint file; if the generation time of the second path is earlier than that of the first path, deleting the checkpoint file storage path of which the generation time is later than that of the second path, and taking the checkpoint file corresponding to the checkpoint file storage path of which the rest generation time is latest as the latest checkpoint file; and if the generation time of the second path is equal to the generation time of the first path, taking the checkpoint file of the first path as a latest checkpoint file.
The embodiment of the invention determines the latest checkpoint file by comparing the generation time of the first path with the generation time of the second path, so that data is processed exactly once in the subsequent steps. When a failure and restart occur, data is read and processed starting from the latest checkpoint file, so that data is neither lost nor duplicated and real-time processing of the data stream is achieved.
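The three-way comparison described above can be written as a small decision function. The following is a minimal sketch under stated assumptions: generation times are given as comparable numbers, both lists contain only paths that already carry the set identifier, and both lists are non-empty (the first start of a job is not covered). The function name `reconcile` and its return shape are hypothetical.

```python
def reconcile(checkpoint_times, result_times):
    """Decide where to resume and which paths to delete.

    Returns (resume_checkpoint_time, checkpoint_times_to_delete,
    result_times_to_delete).
    """
    p1 = max(checkpoint_times)  # first path: latest checkpoint
    p2 = max(result_times)      # second path: latest result data
    if p2 > p1:
        # Result data was written but its checkpoint was not:
        # drop the results newer than the first path.
        return p1, [], sorted(t for t in result_times if t > p1)
    if p2 < p1:
        # Result data is missing: drop the checkpoints newer than the
        # second path and resume from the latest remaining checkpoint.
        stale = sorted(t for t in checkpoint_times if t > p2)
        remaining = [t for t in checkpoint_times if t <= p2]
        return max(remaining, default=None), stale, []
    # Both paths belong to the same batch: nothing to clean up.
    return p1, [], []
```

For example, with checkpoints at times 100 and 200 and result data at times 100, 200 and 300, the function resumes from checkpoint 200 and marks result path 300 for deletion, so the batch that produced it is recomputed rather than duplicated.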
Step S103 in the embodiment of the present invention stores the real-time statistical result obtained after the operation, including: the real-time statistical result obtained after loading and operating the preset SQL statement file comprises result data and offset data; storing the result data, generating a result data file, judging whether the storage is successful, if so, setting a set identifier in a storage path of the result data file, and if not, ending the flow abnormally; and storing the offset data, generating a check point file, judging whether the storage is successful, if so, setting the set identifier in a storage path of the check point file, and if not, ending the flow abnormally.
The embodiment of the invention stores the result data and the offset data separately: the offset data generated after each execution of the SQL statements is stored on its own, and on the next run the Kafka topic data is read starting directly from the result data and offset data stored last time. This ensures that no data is duplicated, solves the problem of data duplication or data loss after a restart caused by various abnormal conditions, and ensures the accuracy of the data. Specifically, the checkpoint data and the result data may be stored separately in HDFS (Hadoop Distributed File System), which both ensures the stability of the data and allows the result data to be used independently.
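The storage order described above (result data first, then offset data, each marked only after a successful write) can be sketched as follows. The local file system stands in for HDFS, and the directory layout, the file names and the `_SUCCESS` marker name are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path

MARKER = "_SUCCESS"  # assumed name of the set identifier

def write_with_marker(directory, filename, payload):
    """Write the payload, then drop the marker only after the write succeeded.

    Any exception propagates, which corresponds to ending the flow abnormally.
    Reusing an existing batch directory raises FileExistsError on purpose.
    """
    directory.mkdir(parents=True, exist_ok=False)
    (directory / filename).write_text(json.dumps(payload))
    (directory / MARKER).touch()

def store_batch(root, batch_time, result_rows, offsets):
    """Store result data first, then offset (checkpoint) data, in separate paths."""
    write_with_marker(root / "data" / str(batch_time), "part-0.json", result_rows)
    write_with_marker(root / "checkpoint" / str(batch_time), "offsets.json", offsets)

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        store_batch(root, 1000, [{"ad_id": "ad1", "clicks": 3}], {"click-0": 42})
        print((root / "data" / "1000" / MARKER).exists())
        print((root / "checkpoint" / "1000" / MARKER).exists())
```

Keeping the two outputs in separate directories means the result files can be consumed without parsing any offset information, while the markers let a restarted job tell complete batches from partial ones.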
After the step S103 stores the real-time statistical result obtained after the operation, the embodiment of the present invention further includes: modifying the table name and/or date conversion function in the preset SQL statement file according to the off-line environment data; generating an off-line index result based on the off-line environment data by using the modified SQL statement file; and performing data comparison verification on the real-time statistical result and the off-line index result to determine the accuracy of the real-time statistical result.
The accuracy of metrics produced by real-time data processing is usually verified by comparing them against the offline metric results, and this verification can reuse the SQL statements of the above embodiments. Because the real-time business logic and the offline business logic are the same, the SQL statements written for the real-time virtual tables can be run directly in the offline environment after modifying only the table names and a few functions, producing offline metric result data. This makes it convenient to compare the real-time result data with the offline result data and simplifies troubleshooting of DIFFs against the offline data.
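The adaptation and comparison steps can be sketched as two small helpers. This is an illustration only: the table name `dw.click_log` and the function names `rt_date` and `to_date` are hypothetical, and whole-identifier text replacement is a simplifying assumption (it would also rewrite matching names inside string literals, and it assumes no new name contains another old name).

```python
import re

def to_offline_sql(sql_text, table_map, function_map):
    """Rename tables and functions as whole identifiers, leaving the rest untouched."""
    for old, new in {**table_map, **function_map}.items():
        sql_text = re.sub(rf"\b{re.escape(old)}\b", new, sql_text)
    return sql_text

def diff_results(realtime, offline):
    """Return the keys whose metric differs between the two result sets."""
    keys = set(realtime) | set(offline)
    return sorted(k for k in keys if realtime.get(k) != offline.get(k))

if __name__ == "__main__":
    realtime_sql = "SELECT rt_date(ts), COUNT(*) FROM click_table GROUP BY rt_date(ts)"
    print(to_offline_sql(realtime_sql, {"click_table": "dw.click_log"}, {"rt_date": "to_date"}))
    print(diff_results({"ad1": 3, "ad2": 5}, {"ad1": 3, "ad2": 4, "ad3": 1}))
```

An empty list from `diff_results` indicates that the real-time and offline metrics agree; any returned key points to the dimension value that needs troubleshooting.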
FIG. 2 is a schematic diagram of a flow of executing an SQL statement according to an embodiment of the invention. As shown in FIG. 2, after Spark Streaming reads multiple data streams starting from the latest checkpoint, a virtual table is registered for each data stream, the SQL statement file is loaded and run directly on the virtual tables, and the generated incremental result data (i.e., the result data) and the checkpoint data are stored separately in HDFS. Registering multiple data sources as multiple virtual tables and expressing the business logic as filter conditions of SQL statements solves the problems of merging multiple data sources and of step-by-step processing of business logic. Executing SQL statements that meet the business requirements on the registered virtual tables handles in one step what would otherwise require calling multiple data sources step by step and applying multiple filter conditions step by step. The user writes SQL statements that meet the business requirements in advance in the configuration file, and the number and names of the registered virtual tables are determined by the categories and number of the input Kafka data streams. New businesses and new requirements can be supported simply by writing the corresponding SQL statements, without modifying the code of the framework itself.
FIG. 3 is a schematic diagram of a preferred flow of a statistical method of real-time data flow according to an embodiment of the present invention. As shown in FIG. 3, when the program starts, it looks up the most recent path P1 carrying the SUCCESS identifier among the HDFS checkpoint files, and the most recent path P2 carrying the SUCCESS identifier among the HDFS data files. It then compares the time of the data path P2 with the time of the checkpoint path P1. If the time of P2 is later than the time of P1, the program terminated abnormally after writing the latest batch of result data files and before the checkpoint file was written successfully; the program must delete all data paths whose time is later than the time of the checkpoint path P1, so that no data is duplicated when the program resumes from the last checkpoint. If the time of P2 is equal to the time of P1, both the data path and the checkpoint path of the last batch were written successfully and no special handling is needed. If the time of P2 is earlier than the time of P1, the data in the most recent batch data paths has been deleted abnormally; the program must delete the checkpoint data files whose checkpoint path time is later than the data path time, so that on startup the program pulls the abnormally lost data again. After these cases have been handled, data is read from the latest checkpoint file and then parsed and processed. After the business processing, result data is generated and stored in HDFS, and the program checks whether the write succeeded: if it did, the SUCCESS identifier is written; if not, the program exits abnormally.
After the result data has been written successfully, the current checkpoint data is written to HDFS and the program checks whether the write succeeded: if it did, the SUCCESS identifier is written; if not, the program exits abnormally.
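The lookup of P1 and P2 at program start can be sketched as a directory scan. The sketch below uses the local file system in place of HDFS and assumes that each batch is stored in a directory named after its numeric batch time and that the SUCCESS identifier is a marker file called `_SUCCESS`; both conventions, and the function name, are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

def latest_success_path(base, marker="_SUCCESS"):
    """Return the newest batch directory under base that contains the marker, or None."""
    candidates = [
        p for p in base.iterdir()
        if p.is_dir() and p.name.isdigit() and (p / marker).exists()
    ]
    return max(candidates, key=lambda p: int(p.name), default=None)

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for name, completed in [("100", True), ("200", True), ("300", False)]:
            (root / name).mkdir()
            if completed:
                (root / name / "_SUCCESS").touch()
        # Batch 300 has no marker, so batch 200 is the latest complete one.
        print(latest_success_path(root).name)
```

Calling this once on the checkpoint directory and once on the data directory yields P1 and P2; directories without the marker are ignored because their write did not finish.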
FIG. 4 is a schematic diagram of basic modules of a real-time data flow statistical device according to an embodiment of the present invention. As shown in FIG. 4, an embodiment of the present invention provides an apparatus 400 for statistics of real-time data streams, including:
a registered virtual table module 401 for: registering a virtual table of the data stream according to the read data stream;
an execute statement module 402 for: loading and operating a preset SQL statement file on the virtual table of the data stream;
a data storage module 403 for: and storing the real-time statistical result obtained after operation.
The embodiment of the invention adopts the technical means of directly loading and operating the SQL statement file meeting the business requirements on the virtual table of the registered data stream, thereby overcoming the technical problem that the processing process is complex when the data stream computing framework processes a plurality of data streams based on various API operators, and further achieving the technical effect of replacing a complex conversion computing function by writing the SQL statement to realize the statistics of the real-time data stream in the business scene.
In an embodiment of the present invention, the apparatus further includes a reading module, configured to: taking a checkpoint file storage path with the latest generation time and carrying a set identifier as a first path, and taking a result data file storage path with the latest generation time and carrying the set identifier as a second path; determining a latest checkpoint file according to the generation time of the first path and the generation time of the second path; reading a data stream from the latest checkpoint file.
The embodiment of the invention determines the latest checkpoint file from the generation time of the first path and the generation time of the second path, and reads the data stream from that checkpoint file. Data is therefore processed exactly once: it is neither lost nor duplicated when a failure and restart occur.
In the embodiment of the present invention, the reading module is further configured to: if the generation time of the second path is later than that of the first path, delete every result data file storage path whose generation time is later than that of the first path, and take the checkpoint file of the first path as the latest checkpoint file; if the generation time of the second path is earlier than that of the first path, delete every checkpoint file storage path whose generation time is later than that of the second path, and take the checkpoint file of the remaining checkpoint file storage path with the latest generation time as the latest checkpoint file; and if the generation time of the second path is equal to the generation time of the first path, take the checkpoint file of the first path as the latest checkpoint file.
The embodiment of the invention determines the latest checkpoint file by comparing the generation time of the first path with the generation time of the second path, so that data is processed exactly once in the subsequent steps. When a failure and restart occur, the data can be read and processed from the latest checkpoint file, so it is neither lost nor duplicated, and real-time processing of the data stream is achieved.
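The three cases described above can be sketched as a small pure function. This is an illustrative sketch only: the function name `reconcile` and the use of integer generation times in place of real storage paths are assumptions, and the empty-list branch is an edge case the source does not discuss.

```python
from typing import List, Optional, Tuple


def reconcile(
    checkpoint_times: List[int], result_times: List[int]
) -> Tuple[Optional[int], List[int], List[int]]:
    """Pick the checkpoint to resume from.

    Both inputs hold generation times of paths that already carry the
    set identifier. Returns (chosen checkpoint time, surviving checkpoint
    times, surviving result times).
    """
    if not checkpoint_times or not result_times:
        # Not covered by the source: report that no checkpoint is usable.
        return None, sorted(checkpoint_times), sorted(result_times)

    first = max(checkpoint_times)   # "first path": newest marked checkpoint
    second = max(result_times)      # "second path": newest marked result

    if second > first:
        # Results ran ahead of checkpoints: drop the newer result paths.
        kept_results = [t for t in result_times if t <= first]
        return first, sorted(checkpoint_times), sorted(kept_results)

    if second < first:
        # Checkpoints ran ahead of results: drop the newer checkpoint paths
        # and resume from the newest one that remains.
        kept_checkpoints = [t for t in checkpoint_times if t <= second]
        chosen = max(kept_checkpoints) if kept_checkpoints else None
        return chosen, sorted(kept_checkpoints), sorted(result_times)

    # Equal generation times: both sides are consistent.
    return first, sorted(checkpoint_times), sorted(result_times)
```

In a real deployment the two lists would come from listing the checkpoint and result directories, and the "drop" steps would delete the corresponding paths rather than filter a list.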
In this embodiment of the present invention, the registered virtual table module 401 is further configured to: preset an SQL statement file in a configuration file according to the business logic; and determine the number and names of the virtual tables of the data streams according to the categories and number of the data streams read, and register the virtual tables of the data streams.
According to the embodiment of the invention, writing SQL statements can replace complex transformation functions for handling real-time data stream statistics in a business scenario, which greatly improves processing efficiency.
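As a rough illustration of registering one virtual table per stream category and then running a preset SQL file over them, the sketch below uses Python's built-in `sqlite3` as a stand-in engine. The source does not name its SQL engine, so this is not the patent's implementation; the `<category>_stream` naming scheme, the function names, and the sample schemas are all assumptions made for the example.

```python
import sqlite3
from typing import Dict, List, Tuple


def virtual_table_names(categories: List[str]) -> Dict[str, str]:
    """One virtual table per stream category (assumed naming scheme)."""
    return {category: f"{category}_stream" for category in categories}


def register_and_run(
    batches: Dict[str, List[Tuple]],
    schemas: Dict[str, str],
    sql_text: str,
) -> List[List[Tuple]]:
    """Load each stream batch into a temp table, then run every statement
    of the preset SQL text and collect the rows each one returns."""
    conn = sqlite3.connect(":memory:")
    names = virtual_table_names(list(batches))
    for category, rows in batches.items():
        table = names[category]
        conn.execute(f"CREATE TEMP TABLE {table} ({schemas[category]})")
        if rows:
            placeholders = ",".join("?" * len(rows[0]))
            conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
    results = []
    # Naive split on ';' is enough for a sketch without string literals containing ';'.
    for statement in (s.strip() for s in sql_text.split(";")):
        if statement:
            results.append(conn.execute(statement).fetchall())
    conn.close()
    return results
```

With two hypothetical streams, `orders` and `shipments`, a single SQL join over `orders_stream` and `shipments_stream` expresses what would otherwise need several chained stream operators.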
In this embodiment of the present invention, the real-time statistical result obtained after loading and running the preset SQL statement file comprises result data and offset data, and the data storage module 403 is further configured to: store the result data, generate a result data file, and determine whether the storage succeeded; if so, set a set identifier in the storage path of the result data file, and if not, end the flow abnormally; and store the offset data, generate a checkpoint file, and determine whether the storage succeeded; if so, set the set identifier in the storage path of the checkpoint file, and if not, end the flow abnormally.
The offset data generated each time the SQL statements are executed is stored separately, and on the next run the Kafka topic data is recovered and read directly from the most recently stored result data and offset data. This ensures that data is not duplicated, resolves the duplicated or lost data that various exceptions could otherwise cause after a restart, and ensures the accuracy of the data. Specifically, the checkpoint data and the result data may be stored separately in HDFS (Hadoop Distributed File System), which both keeps the data stable and allows the result data to be used independently.
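The store-then-mark sequence can be sketched as follows. A local directory stands in for HDFS here, and the marker name `_SUCCESS`, the `result/<time>` and `checkpoint/<time>` layout, and the file names are assumptions for illustration; the source only states that a set (SUCCESS) identifier is placed in the storage path after a successful write.

```python
import json
from pathlib import Path
from typing import Dict, List

MARKER = "_SUCCESS"  # assumed marker file name


def _write_with_marker(directory: Path, filename: str, payload: str) -> None:
    """Write the data file, then the marker. If the data write raises,
    the marker is never created and the exception ends the run."""
    directory.mkdir(parents=True, exist_ok=True)
    (directory / filename).write_text(payload, encoding="utf-8")
    (directory / MARKER).touch()


def store_batch(
    base: Path, batch_time: int, rows: List[dict], offsets: Dict[str, int]
) -> None:
    """Store result data first, then offset (checkpoint) data."""
    _write_with_marker(
        base / "result" / str(batch_time), "part-00000.json", json.dumps(rows)
    )
    _write_with_marker(
        base / "checkpoint" / str(batch_time), "offsets.json", json.dumps(offsets)
    )


def marked_times(base: Path, kind: str) -> List[int]:
    """Generation times of paths under `kind` that carry the marker."""
    root = base / kind
    if not root.exists():
        return []
    return sorted(int(p.name) for p in root.iterdir() if (p / MARKER).exists())
```

On restart, `marked_times(base, "checkpoint")` and `marked_times(base, "result")` give the two lists of generation times from which the first and second paths are chosen; directories without the marker are ignored as incomplete writes.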
In an embodiment of the present invention, the apparatus further includes a data verification module, configured to: modifying the table name and/or date conversion function in the preset SQL statement file according to the off-line environment data; generating an off-line index result based on the off-line environment data by using the modified SQL statement file; and performing data comparison verification on the real-time statistical result and the off-line index result to determine the accuracy of the real-time statistical result.
The accuracy of data indicators produced by real-time processing is usually verified by comparing them with the offline indicator results, and the SQL statements used in the above embodiments can be reused for this check. Because the real-time and offline business logic are the same, the SQL statements written against the real-time virtual tables can run directly in the offline environment once the table names and a few functions are modified, producing offline indicator result data. This makes comparing (diffing) the real-time result data with the offline result data simple and convenient.
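A minimal sketch of this check is shown below: the real-time SQL is rewritten by swapping table and function names, run against an offline table, and the two result sets are diffed. The in-memory `sqlite3` database stands in for the offline environment, and the names `orders_stream`, `orders_offline`, and `to_day` are hypothetical; none of them come from the source.

```python
import re
import sqlite3
from typing import Dict, Tuple

# Hypothetical real-time statement; `to_day` is an assumed date conversion function.
REALTIME_SQL = (
    "SELECT to_day(created) AS day, COUNT(*) AS n "
    "FROM orders_stream GROUP BY to_day(created)"
)


def to_offline_sql(sql: str, renames: Dict[str, str]) -> str:
    """Replace whole-word table or function names with their offline equivalents."""
    for old, new in renames.items():
        sql = re.sub(r"\b" + re.escape(old) + r"\b", new, sql)
    return sql


def diff_results(realtime: Dict[str, int], offline: Dict[str, int]) -> Dict[str, Tuple]:
    """Return {key: (realtime value, offline value)} for every mismatch."""
    keys = set(realtime) | set(offline)
    return {
        k: (realtime.get(k), offline.get(k))
        for k in keys
        if realtime.get(k) != offline.get(k)
    }


def run_offline(conn: sqlite3.Connection, sql: str) -> Dict[str, int]:
    """Run a two-column (key, value) query and return it as a dict."""
    return dict(conn.execute(sql).fetchall())
```

An empty dictionary from `diff_results` means the real-time and offline indicators agree; any entry pinpoints the key where they diverge.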
Fig. 5 illustrates an exemplary system architecture 500 to which the statistical method or device of the real-time data stream of the embodiments of the present invention may be applied.
As shown in fig. 5, the system architecture 500 may include terminal devices 501, 502, 503, a network 504, and a server 505. The network 504 serves to provide a medium for communication links between the terminal devices 501, 502, 503 and the server 505. Network 504 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 501, 502, 503 to interact with a server 505 over a network 504 to receive or send messages or the like. The terminal devices 501, 502, 503 may have various communication client applications installed thereon, such as a shopping application, a web browser application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like.
The terminal devices 501, 502, 503 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 505 may be a server that provides various services, such as a background management server that supports shopping websites browsed by users using the terminal devices 501, 502, 503. The background management server can analyze and process the received data such as the product information inquiry request and feed back the processing result to the terminal equipment.
It should be noted that the statistical method for the real-time data stream provided by the embodiment of the present invention is generally executed by the server 505, and accordingly, the statistical apparatus for the real-time data stream is generally disposed in the server 505.
It should be understood that the number of terminal devices, networks, and servers in fig. 5 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
The invention also provides an electronic device and a computer readable medium according to the embodiment of the invention.
The electronic device of the embodiment of the invention comprises: one or more processors; a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the statistical method for real-time data streams as set forth in the foregoing embodiments.
The computer-readable medium of the present invention has stored thereon a computer program, which when executed by a processor implements the statistical method of real-time data streams as proposed in the previous embodiments.
Referring now to FIG. 6, a block diagram of a computer system 600 suitable for use with a terminal device implementing an embodiment of the invention is shown. The terminal device shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 6, the computer system 600 includes a central processing unit (CPU) 601 that can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage section 608 into a random access memory (RAM) 603. The RAM 603 also stores various programs and data necessary for the operation of the system 600. The CPU 601, ROM 602, and RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
The following components are connected to the I/O interface 605: an input section 606 including a keyboard, a mouse, and the like; an output section 607 including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), and a speaker; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card or a modem. The communication section 609 performs communication processing via a network such as the Internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 610 as needed, so that a computer program read from it can be installed into the storage section 608 as needed.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 609, and/or installed from the removable medium 611. The computer program performs the above-described functions defined in the system of the present invention when executed by the Central Processing Unit (CPU) 601.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented in software or in hardware. The described modules may also be provided in a processor, which may be described as: a processor comprising a registered virtual table module, an execute statement module, and a data storage module. The names of these modules do not in some cases limit the modules themselves; for example, the registered virtual table module may also be described as a "module for registering a virtual table of a data stream according to the read data stream".
As another aspect, the present invention also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments or may exist separately without being incorporated into the apparatus. The computer-readable medium carries one or more programs which, when executed by a device, cause the device to perform: registering a virtual table of the data stream according to the read data stream; loading and running a preset SQL statement file on the virtual table of the data stream; and storing the real-time statistical result obtained after running.
The statistical method for real-time data streams according to the embodiment of the invention loads and runs an SQL statement file that meets the business requirements directly on the registered virtual table of the data stream. This overcomes the technical problem that processing is complex when a data stream computing framework handles multiple data streams with various API operators, and achieves the technical effect that writing SQL statements replaces complex transformation functions for real-time data stream statistics in a business scenario.
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (14)

1. A method for statistical analysis of real-time data streams, comprising:
registering a virtual table of the data stream according to the read data stream;
loading and operating a preset SQL statement file on the virtual table of the data stream;
and storing the real-time statistical result obtained after operation.
2. The method of claim 1, wherein prior to registering the virtual table of the data stream from the read data stream, the method further comprises:
taking a checkpoint file storage path with the latest generation time and carrying a set identifier as a first path, and taking a result data file storage path with the latest generation time and carrying the set identifier as a second path;
determining a latest checkpoint file according to the generation time of the first path and the generation time of the second path;
reading a data stream from the latest checkpoint file.
3. The method of claim 2, wherein determining the latest checkpoint file based on the generation time of the first path and the generation time of the second path comprises:
if the generation time of the second path is later than that of the first path, deleting a result data file storage path of which the generation time is later than that of the first path, and taking the checkpoint file of the first path as a latest checkpoint file;
if the generation time of the second path is earlier than that of the first path, deleting the checkpoint file storage path of which the generation time is later than that of the second path, and taking the checkpoint file corresponding to the remaining checkpoint file storage path with the latest generation time as the latest checkpoint file;
and if the generation time of the second path is equal to the generation time of the first path, taking the checkpoint file of the first path as a latest checkpoint file.
4. The method of claim 1, wherein registering the virtual table of the data stream according to the read data stream comprises:
presetting an SQL statement file in a configuration file according to business logic;
and determining the number and the name of the virtual table of the data stream according to the read category and the number of the data stream, and registering the virtual table of the data stream.
5. The method of claim 1, wherein storing the real-time statistics obtained after the run comprises:
the real-time statistical result obtained after loading and operating the preset SQL statement file comprises result data and offset data;
storing the result data, generating a result data file, judging whether the storage is successful, if so, setting a set identifier in a storage path of the result data file, and if not, ending the flow abnormally;
and storing the offset data, generating a checkpoint file, judging whether the storage is successful, if so, setting the set identifier in a storage path of the checkpoint file, and if not, ending the flow abnormally.
6. The method of claim 1, wherein after storing the real-time statistics obtained after the running, the method further comprises:
modifying the table name and/or date conversion function in the preset SQL statement file according to the off-line environment data;
generating an off-line index result based on the off-line environment data by using the modified SQL statement file;
and performing data comparison verification on the real-time statistical result and the off-line index result to determine the accuracy of the real-time statistical result.
7. A device for real-time data stream statistics, comprising:
a registered virtual table module to: registering a virtual table of the data stream according to the read data stream;
an execute statement module to: loading and operating a preset SQL statement file on the virtual table of the data stream;
a data storage module to: and storing the real-time statistical result obtained after operation.
8. The apparatus of claim 7, further comprising a reading module to:
taking a checkpoint file storage path with the latest generation time and carrying a set identifier as a first path, and taking a result data file storage path with the latest generation time and carrying the set identifier as a second path;
determining a latest checkpoint file according to the generation time of the first path and the generation time of the second path;
reading a data stream from the latest checkpoint file.
9. The apparatus of claim 8, wherein the reading module is further configured to:
if the generation time of the second path is later than that of the first path, deleting a result data file storage path of which the generation time is later than that of the first path, and taking the checkpoint file of the first path as a latest checkpoint file;
if the generation time of the second path is earlier than that of the first path, deleting the checkpoint file storage path of which the generation time is later than that of the second path, and taking the checkpoint file corresponding to the remaining checkpoint file storage path with the latest generation time as the latest checkpoint file;
and if the generation time of the second path is equal to the generation time of the first path, taking the checkpoint file of the first path as a latest checkpoint file.
10. The apparatus of claim 7, wherein the registered virtual table module is further configured to:
presetting an SQL statement file in a configuration file according to business logic;
and determining the number and the name of the virtual table of the data stream according to the read category and the number of the data stream, and registering the virtual table of the data stream.
11. The apparatus of claim 7, wherein the data storage module is further configured to:
the real-time statistical result obtained after loading and operating the preset SQL statement file comprises result data and offset data;
storing the result data, generating a result data file, judging whether the storage is successful, if so, setting a set identifier in a storage path of the result data file, and if not, ending the flow abnormally;
and storing the offset data, generating a checkpoint file, judging whether the storage is successful, if so, setting the set identifier in a storage path of the checkpoint file, and if not, ending the flow abnormally.
12. The apparatus of claim 7, further comprising a data validation module to:
modifying the table name and/or date conversion function in the preset SQL statement file according to the off-line environment data;
generating an off-line index result based on the off-line environment data by using the modified SQL statement file;
and performing data comparison verification on the real-time statistical result and the off-line index result to determine the accuracy of the real-time statistical result.
13. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs which,
when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-6.
14. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-6.
CN201910573816.6A 2019-06-28 2019-06-28 Statistical method and device for real-time data stream Pending CN112148762A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910573816.6A CN112148762A (en) 2019-06-28 2019-06-28 Statistical method and device for real-time data stream

Publications (1)

Publication Number Publication Date
CN112148762A (en) 2020-12-29

Family

ID=73870113

Country Status (1)

Country Link
CN (1) CN112148762A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106293892A (en) * 2015-06-26 2017-01-04 阿里巴巴集团控股有限公司 Distributed stream calculates system, method and apparatus
CN107092598A (en) * 2016-02-17 2017-08-25 阿里巴巴集团控股有限公司 The management method and device of data storage location information
CN108984547A (en) * 2017-05-31 2018-12-11 北京京东尚科信息技术有限公司 The method and apparatus of data processing
CN107368517A (en) * 2017-06-02 2017-11-21 上海恺英网络科技有限公司 A kind of method and apparatus of high amount of traffic inquiry
CN107577717A (en) * 2017-08-09 2018-01-12 阿里巴巴集团控股有限公司 A kind of processing method, device and server for ensureing data consistency
CN109189835A (en) * 2018-08-21 2019-01-11 北京京东尚科信息技术有限公司 The method and apparatus of the wide table of data are generated in real time

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113342542A (en) * 2021-05-12 2021-09-03 北京百度网讯科技有限公司 Service processing method, device, equipment and computer storage medium
CN113342542B (en) * 2021-05-12 2024-03-22 北京百度网讯科技有限公司 Service processing method, device, equipment and computer storage medium
CN113407600A (en) * 2021-08-18 2021-09-17 浩鲸云计算科技股份有限公司 Enhanced real-time calculation method for dynamically synchronizing multi-source large table data in real time

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination