CN112148762A

CN112148762A - Statistical method and device for real-time data stream

Info

Publication number: CN112148762A
Application number: CN201910573816.6A
Authority: CN
Inventors: 韩路
Original assignee: Xi'an Jingxundi Supply Chain Technology Co ltd
Current assignee: Xi'an Jingxundi Supply Chain Technology Co ltd
Priority date: 2019-06-28
Filing date: 2019-06-28
Publication date: 2020-12-29

Abstract

The invention discloses a statistical method and a statistical device for real-time data streams, and relates to the technical field of computers. One embodiment of the method comprises: registering a virtual table of the data stream according to the read data stream; loading and running a preset SQL statement file on the virtual table of the data stream; and storing the real-time statistical result obtained after running. This embodiment overcomes the technical problem that processing is complex when a data stream computing framework handles a plurality of data streams through various API operators, and achieves the technical effect that statistics on real-time data streams in a business scenario can be implemented by writing SQL statements in place of complex conversion and computation functions.

Description

Statistical method and device for real-time data stream

Technical Field

The invention relates to the technical field of computers, in particular to a statistical method and a statistical device for real-time data streams.

Background

Big data real-time processing technology is developing rapidly, and real-time processing engines are increasingly used in production environments with abundant user data. This fits the trend toward fine-grained merchant operations and helps merchants make more accurate decisions. In business processing, a plurality of data streams often need to be fed into a streaming data computing framework, where, depending on the business logic scenario, the data streams are filtered, their fields are converted, they are aggregated along different dimensions, and the statistical results are output. To ensure that data is neither lost nor duplicated during real-time statistics, the checkpoint mechanism of the streaming data computing framework is used: after a failure and restart, processing can recover from the previous checkpoint, so that data is not lost.

In the process of implementing the invention, the inventor finds that at least the following problems exist in the prior art:

1. In actual business, various kinds of business logic need to be applied to a plurality of data streams at the same time, but current streaming data computing frameworks only provide step-by-step processing based on various API operators, so the processing is complex.

2. The checkpoint mechanism ensures that data is not lost when a failure occurs, but does not ensure non-duplication when data is accessed.

Disclosure of Invention

In view of this, embodiments of the present invention provide a method and an apparatus for real-time data stream statistics, which can solve the problem that when a current stream data calculation framework processes a plurality of data streams in business logic based on various API operators, a processing process is complex.

To achieve the above object, according to an aspect of an embodiment of the present invention, there is provided a statistical method for a real-time data stream, including: registering a virtual table of the data stream according to the read data stream; loading and operating a preset SQL statement file on the virtual table of the data stream; and storing the real-time statistical result obtained after operation.

Optionally, before registering the virtual table of the data stream according to the read data stream, the method further includes: taking a checkpoint file storage path with the latest generation time and carrying a set identifier as a first path, and taking a result data file storage path with the latest generation time and carrying the set identifier as a second path; determining a latest checkpoint file according to the generation time of the first path and the generation time of the second path; reading a data stream from the latest checkpoint file.

Optionally, determining a latest checkpoint file according to the generation time of the first path and the generation time of the second path includes: if the generation time of the second path is later than that of the first path, deleting a result data file storage path of which the generation time is later than that of the first path, and taking the checkpoint file of the first path as a latest checkpoint file; if the generation time of the second path is earlier than that of the first path, deleting the checkpoint file storage path of which the generation time is later than that of the second path, and taking the checkpoint file corresponding to the checkpoint file storage path of which the rest generation time is latest as the latest checkpoint file; and if the generation time of the second path is equal to the generation time of the first path, taking the checkpoint file of the first path as a latest checkpoint file.

Optionally, registering a virtual table of the data stream according to the read data stream includes: presetting an SQL statement file in a configuration file according to business logic; and determining the number and the name of the virtual table of the data stream according to the read category and the number of the data stream, and registering the virtual table of the data stream.

Optionally, storing the real-time statistical result obtained after the operation includes: the real-time statistical result obtained after loading and operating the preset SQL statement file comprises result data and offset data; storing the result data, generating a result data file, judging whether the storage is successful, if so, setting a set identifier in a storage path of the result data file, and if not, ending the flow abnormally; and storing the offset data, generating a check point file, judging whether the storage is successful, if so, setting the set identifier in a storage path of the check point file, and if not, ending the flow abnormally.

Optionally, after storing the real-time statistical result obtained after the operation, the method further includes: modifying the table name and/or date conversion function in the preset SQL statement file according to the off-line environment data; generating an off-line index result based on the off-line environment data by using the modified SQL statement file; and performing data comparison verification on the real-time statistical result and the off-line index result to determine the accuracy of the real-time statistical result.

According to another aspect of the embodiments of the present invention, there is provided a device for statistics of real-time data streams, including: a registered virtual table module to: registering a virtual table of the data stream according to the read data stream; an execute statement module to: loading and operating a preset SQL statement file on the virtual table of the data stream; a data storage module to: and storing the real-time statistical result obtained after operation.

Optionally, the apparatus further comprises a reading module configured to: taking a checkpoint file storage path with the latest generation time and carrying a set identifier as a first path, and taking a result data file storage path with the latest generation time and carrying the set identifier as a second path; determining a latest checkpoint file according to the generation time of the first path and the generation time of the second path; reading a data stream from the latest checkpoint file.

Optionally, the reading module is further configured to: if the generation time of the second path is later than that of the first path, deleting a result data file storage path of which the generation time is later than that of the first path, and taking the checkpoint file of the first path as a latest checkpoint file; if the generation time of the second path is earlier than that of the first path, deleting the checkpoint file storage path of which the generation time is later than that of the second path, and taking the checkpoint file corresponding to the checkpoint file storage path of which the rest generation time is latest as the latest checkpoint file; and if the generation time of the second path is equal to the generation time of the first path, taking the checkpoint file of the first path as a latest checkpoint file.

Optionally, the register virtual table module is further configured to: presetting an SQL statement file in a configuration file according to business logic; and determining the number and the name of the virtual table of the data stream according to the read category and the number of the data stream, and registering the virtual table of the data stream.

Optionally, the data storage module is further configured to: the real-time statistical result obtained after loading and operating the preset SQL statement file comprises result data and offset data; storing the result data, generating a result data file, judging whether the storage is successful, if so, setting a set identifier in a storage path of the result data file, and if not, ending the flow abnormally; and storing the offset data, generating a check point file, judging whether the storage is successful, if so, setting the set identifier in a storage path of the check point file, and if not, ending the flow abnormally.

Optionally, the apparatus further comprises a data verification module configured to: modifying the table name and/or date conversion function in the preset SQL statement file according to the off-line environment data; generating an off-line index result based on the off-line environment data by using the modified SQL statement file; and performing data comparison verification on the real-time statistical result and the off-line index result to determine the accuracy of the real-time statistical result.

According to still another aspect of an embodiment of the present invention, there is provided an electronic apparatus including: one or more processors; a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the statistical method for real-time data streams as set forth in the foregoing embodiments.

According to a further aspect of the embodiments of the present invention, there is provided a computer readable medium, on which a computer program is stored, which when executed by a processor, implements the statistical method of real-time data streams as set forth in the foregoing embodiments.

One embodiment of the above invention has the following advantages or benefits: because the technical means of directly loading and operating the SQL statement file meeting the business requirements on the virtual table of the registered data stream is adopted, the technical problem that the processing process is complex when a data stream computing framework processes a plurality of data streams based on various API operators is solved, and the technical effect of realizing the statistics of the real-time data stream in the business scene by writing the SQL statement can be achieved by replacing a complex conversion computing function.

Further effects of the above-mentioned non-conventional alternatives will be described below in connection with the embodiments.

Drawings

The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:

FIG. 1 is a schematic diagram of a basic flow of a statistical method of real-time data flow according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a flow of executing an SQL statement according to an embodiment of the invention;

FIG. 3 is a schematic diagram of a preferred flow of a statistical method of real-time data flow according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of the basic modules of a statistical apparatus of real-time data streams according to an embodiment of the present invention;

FIG. 5 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;

FIG. 6 is a schematic block diagram of a computer system suitable for use in implementing a terminal device or server of an embodiment of the invention.

Detailed Description

Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

Spark Streaming is a streaming data computing framework that is widely used for real-time processing in combination with the open-source stream processing platform Kafka. In practice, two or three Kafka data streams often need to be fed into the Spark Streaming framework, where, depending on the business scenario, the data streams are filtered, their fields are converted, they are aggregated along different dimensions, and the statistical results are output. Real-time statistics based on Spark Streaming can rely on the checkpoint mechanism that the framework provides to ensure that data is not lost: after a failure and restart, processing can recover from the previous checkpoint. However, this mechanism only guarantees that data is not lost; it does not guarantee that data is not processed repeatedly, so it does not meet the exactly-once processing expected of real-time data processing. In the prior art, the relevant business logic is implemented by applying various operators of the Spark Streaming API (such as foreachRDD, transform, and reduceByKey) to the accessed data streams to perform data conversion and other processing operations.

To address the problem that the checkpoint mechanism cannot ensure that data is not duplicated when it is accessed, the prior art applies the repartition operator of the open-source Spark API to the output operator (action), so that multiple partitions of data are merged into a single partition. That partition can then be written using the transaction function of a database, which ensures that data is not duplicated during processing. Alternatively, the output result data may include offset data (also referred to as checkpoint offset data), so that committing the result and committing the offset are completed in one operation, and no data is lost or processed repeatedly. When recovering from a failure, the offset data in the last committed result can be used.

However, the code framework of the existing solutions is not very general: Spark operator code specific to each piece of business processing logic has to be written for each business case, and the logic is complex to implement. Problems are also hard to troubleshoot when verifying the difference (DIFF) between real-time data and offline data. Setting the repartition parameter so that the action has only one partition reduces the concurrency of the program and lengthens the processing delay within each data batch, which hurts program performance especially when the data volume is large. Writing checkpoint offset data and result data together tightly couples the two, which makes it inconvenient to use the result data on its own later.

Fig. 1 is a schematic diagram of a basic flow of a statistical method of a real-time data stream according to an embodiment of the present invention. As shown in fig. 1, an embodiment of the present invention provides a statistical method for a real-time data stream, including:

Step S101: registering a virtual table of the data stream according to the read data stream;

Step S102: loading and running a preset SQL (Structured Query Language) statement file on the virtual table of the data stream;

Step S103: storing the real-time statistical result obtained after running.

Specifically, an SQL statement file is written in an external configuration file of the Spark Streaming processing program according to the business logic. After Spark Streaming reads a plurality of data streams, a virtual table is registered for each data stream, the written SQL statement file is loaded and run directly on the virtual tables, and a calculation result is generated and stored. When a new business logic requirement arises, the corresponding statistical function can be implemented quickly by writing a new SQL statement file, without having to consider whether other Spark operator code needs to be modified. For example, in a logistics online advertisement statistics service, indexes such as the exposure amount of logistics online advertisements and the click rate of users need to be counted. The Kafka topics (Kafka data streams) for logistics advertisement exposure and clicks are accessed, an exposure virtual table (impression_table) and a click virtual table (click_table) are registered, and a business SQL statement is written of the form: select the exposure amount and the click rate from the exposure table UNION the click table, with a WHERE clause that filters out illegal clicks and exposures. Executing this SQL statement processes the exposure and click data in real time and generates report data on real-time exposure and user click rate.
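The register-then-run-SQL pattern described above can be sketched as follows. This is only an illustration under stated assumptions: Python's built-in sqlite3 module stands in for Spark SQL virtual tables, a Python string stands in for the preset SQL statement file, and the column names (ad_id, valid) and helper names are hypothetical. Only the table names impression_table and click_table come from the text.

```python
import sqlite3

# Hypothetical contents of the preset SQL statement file. In the described
# method this text would be kept in an external configuration file.
PRESET_SQL = """
SELECT ad_id,
       SUM(CASE WHEN kind = 'impression' THEN 1 ELSE 0 END) AS impressions,
       SUM(CASE WHEN kind = 'click' THEN 1 ELSE 0 END) AS clicks
FROM (
    SELECT ad_id, 'impression' AS kind FROM impression_table WHERE valid = 1
    UNION ALL
    SELECT ad_id, 'click' AS kind FROM click_table WHERE valid = 1
)
GROUP BY ad_id
ORDER BY ad_id
"""


def register_virtual_tables(conn, streams):
    """Register one table per stream; the stream name becomes the table name.

    Table names are assumed to come from trusted configuration, because they
    are interpolated into the statement text.
    """
    for name, rows in streams.items():
        conn.execute(f"CREATE TABLE {name} (ad_id TEXT, valid INTEGER)")
        conn.executemany(f"INSERT INTO {name} VALUES (?, ?)", rows)


def compute_batch(streams, sql_text=PRESET_SQL):
    """Register the tables for one batch, run the preset SQL, return the rows."""
    conn = sqlite3.connect(":memory:")
    try:
        register_virtual_tables(conn, streams)
        return conn.execute(sql_text).fetchall()
    finally:
        conn.close()


# One micro-batch of made-up records: (ad_id, valid flag).
batch = {
    "impression_table": [("a1", 1), ("a1", 1), ("a2", 1), ("a2", 0)],
    "click_table": [("a1", 1), ("a2", 0)],
}
result = compute_batch(batch)  # one row per ad: (ad_id, impressions, clicks)
```

In this sketch a new statistic would only require a different SQL text passed as sql_text; the registration and execution code stays unchanged, which is the property the embodiment relies on.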

The embodiment of the invention adopts the technical means of directly loading and operating the SQL statement file meeting the business requirements on the virtual table of the registered data stream, thereby overcoming the technical problem of complex processing process when a data stream computing framework processes a plurality of data streams based on various API (Application Programming Interface) operators, and further achieving the technical effect of replacing complex conversion computing functions by writing SQL statements to realize the statistics of real-time data streams in a business scene.

In this embodiment of the present invention, step S101 registers a virtual table of the data stream according to the read data stream, including: presetting an SQL statement file in a configuration file according to business logic; and determining the number and the name of the virtual table of the data stream according to the read category and the number of the data stream, and registering the virtual table of the data stream.

Before step S101 registers the virtual table of the data stream according to the read data stream, the embodiment of the present invention further includes: taking a checkpoint file storage path with the latest generation time and carrying a set identifier (the set identifier can include but is not limited to a SUCCESS identifier) as a first path, and taking a result data file storage path with the latest generation time and carrying the set identifier as a second path; determining a latest checkpoint file according to the generation time of the first path and the generation time of the second path; reading a data stream from the latest checkpoint file.

The embodiment of the invention determines the latest checkpoint file according to the generation time of the first path and the generation time of the second path and reads the data stream from that checkpoint file, thereby achieving exactly-once processing of data: data is neither lost nor duplicated when a failure and restart occur.

Based on the foregoing embodiment, in the embodiment of the present invention, determining the latest checkpoint file according to the generation time of the first path and the generation time of the second path may include: if the generation time of the second path is later than that of the first path, deleting a result data file storage path of which the generation time is later than that of the first path, and taking the checkpoint file of the first path as a latest checkpoint file; if the generation time of the second path is earlier than that of the first path, deleting the checkpoint file storage path of which the generation time is later than that of the second path, and taking the checkpoint file corresponding to the checkpoint file storage path of which the rest generation time is latest as the latest checkpoint file; and if the generation time of the second path is equal to the generation time of the first path, taking the checkpoint file of the first path as a latest checkpoint file.

The embodiment of the invention determines the latest checkpoint file by comparing the generation time of the first path with the generation time of the second path, so that data is processed exactly once in the subsequent steps. When a failure and restart occur, data can be read and processed from the latest checkpoint file, so that data is neither lost nor duplicated, and real-time processing of the data stream is achieved.
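The three cases above can be sketched as a small pure function. This is a minimal sketch, assuming generation times are comparable integers and that both lists contain only paths carrying the set identifier; the function name and return shape are hypothetical, and the first start of the program (no paths yet) is not covered by the text and is rejected here.

```python
def reconcile(checkpoint_times, result_times):
    """Pick the checkpoint to resume from and list the paths to delete.

    Returns (resume_time, checkpoint_times_to_delete, result_times_to_delete).
    """
    if not checkpoint_times or not result_times:
        # Assumption: an initial start without prior paths is handled elsewhere.
        raise ValueError("no marked checkpoint or result path to compare")

    first = max(checkpoint_times)   # first path: latest marked checkpoint path
    second = max(result_times)      # second path: latest marked result path

    if second > first:
        # Results were written but their checkpoint was not: drop those results.
        stale_results = sorted(t for t in result_times if t > first)
        return first, [], stale_results

    if second < first:
        # Results are missing for later checkpoints: drop those checkpoints
        # and resume from the latest remaining one.
        stale_checkpoints = sorted(t for t in checkpoint_times if t > second)
        remaining = [t for t in checkpoint_times if t <= second]
        if not remaining:
            raise ValueError("no checkpoint at or before the latest result")
        return max(remaining), stale_checkpoints, []

    # Both sides agree: nothing to clean up.
    return first, [], []
```

For example, with checkpoints at times 1, 2, 3 and results at times 1, 2, 3, 4, the sketch resumes from checkpoint 3 and marks result path 4 for deletion, so the batch is recomputed once rather than counted twice.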

Step S103 in the embodiment of the present invention stores the real-time statistical result obtained after the operation, including: the real-time statistical result obtained after loading and operating the preset SQL statement file comprises result data and offset data; storing the result data, generating a result data file, judging whether the storage is successful, if so, setting a set identifier in a storage path of the result data file, and if not, ending the flow abnormally; and storing the offset data, generating a check point file, judging whether the storage is successful, if so, setting the set identifier in a storage path of the check point file, and if not, ending the flow abnormally.

The embodiment of the invention stores the result data and the offset data separately: the offset data generated after each SQL statement execution is stored on its own, and on the next run the Kafka topic data is recovered and read directly from the most recently stored result data and offset data. This ensures that no data is duplicated, solves the problem of data duplication or data loss after a restart that various abnormalities may cause, and ensures the accuracy of the data. Specifically, the checkpoint data and the result data may be stored separately in HDFS (Hadoop Distributed File System), which not only ensures the stability of the data but also allows the result data to be used independently.
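The storage order described above (result data first, then offset data, each followed by its own identifier) can be sketched as follows. This is an illustration only: a local directory stands in for HDFS, and the marker file name _SUCCESS, the file names, the directory layout by batch time, and the function names are all assumptions rather than details given in the text.

```python
import json
import tempfile
from pathlib import Path

MARKER = "_SUCCESS"  # assumed name of the set identifier file


def write_with_marker(root, batch_time, filename, payload):
    """Write one payload under root/<batch_time>/ and then drop the marker.

    Any exception raised while writing propagates to the caller, which
    corresponds to the flow ending abnormally before the marker is set.
    """
    batch_dir = Path(root) / str(batch_time)
    batch_dir.mkdir(parents=True, exist_ok=True)
    (batch_dir / filename).write_text(json.dumps(payload))
    (batch_dir / MARKER).touch()
    return batch_dir


def commit_batch(result_root, checkpoint_root, batch_time, result_data, offsets):
    """Store result data first, then the offset (checkpoint) data."""
    write_with_marker(result_root, batch_time, "part-0.json", result_data)
    write_with_marker(checkpoint_root, batch_time, "offsets.json", offsets)


def committed_times(root):
    """List batch times whose storage path carries the marker."""
    return sorted(int(p.parent.name) for p in Path(root).glob(f"*/{MARKER}"))


# Usage with made-up data: one batch at time 1000.
with tempfile.TemporaryDirectory() as tmp:
    results = Path(tmp) / "results"
    checkpoints = Path(tmp) / "checkpoints"
    commit_batch(results, checkpoints, 1000,
                 [["a1", 2, 1]], {"ad_impression": {"0": 42}})
    demo_result_times = committed_times(results)
    demo_checkpoint_times = committed_times(checkpoints)
```

Because the two writes are separate, a crash between them leaves a marked result path without a matching checkpoint path, which is the situation the startup comparison of the two generation times is meant to detect.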

After the step S103 stores the real-time statistical result obtained after the operation, the embodiment of the present invention further includes: modifying the table name and/or date conversion function in the preset SQL statement file according to the off-line environment data; generating an off-line index result based on the off-line environment data by using the modified SQL statement file; and performing data comparison verification on the real-time statistical result and the off-line index result to determine the accuracy of the real-time statistical result.

The accuracy of data indexes produced by real-time processing is usually verified by comparing them against the corresponding offline index results, and this verification can reuse the SQL statements from the above embodiments. Because the real-time business logic and the offline business logic are the same, the SQL statements written for the real-time virtual tables can run directly in the offline environment after only the table names and a few individual functions are modified, producing offline index result data. This makes comparing the real-time result data with the offline result data convenient and simplifies troubleshooting any DIFF against the offline data.
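The verification step can be sketched as a rewrite of the real-time SQL text followed by a key-by-key comparison. This is a rough sketch under assumptions: plain string replacement is used for brevity (a real rewrite would need to respect SQL syntax), and the table name dw.ad_impression_daily, the functions rt_date and to_date, and the helper names are hypothetical placeholders.

```python
def to_offline_sql(realtime_sql, table_map, function_map):
    """Swap real-time table names and date functions for offline ones."""
    sql = realtime_sql
    for old, new in {**table_map, **function_map}.items():
        sql = sql.replace(old, new)
    return sql


def diff_results(realtime, offline):
    """Return {key: (realtime_value, offline_value)} for every mismatch."""
    keys = set(realtime) | set(offline)
    return {
        k: (realtime.get(k), offline.get(k))
        for k in sorted(keys)
        if realtime.get(k) != offline.get(k)
    }


realtime_sql = (
    "SELECT ad_id, COUNT(*) FROM impression_table "
    "WHERE dt = rt_date(ts) GROUP BY ad_id"
)
offline_sql = to_offline_sql(
    realtime_sql,
    table_map={"impression_table": "dw.ad_impression_daily"},
    function_map={"rt_date(": "to_date("},
)

# Made-up index values keyed by ad_id; a2 disagrees between the two runs.
mismatches = diff_results({"a1": 2, "a2": 1}, {"a1": 2, "a2": 3})
```

An empty result from diff_results would indicate that the real-time and offline indexes agree for every key in the compared window.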

FIG. 2 is a schematic diagram of a flow of executing an SQL statement according to an embodiment of the invention. As shown in FIG. 2, after Spark Streaming reads a plurality of data streams from the latest checkpoint, a virtual table is registered for each data stream, the written SQL statement file is loaded and run directly on the virtual tables, and the generated incremental result data (i.e., the result data) and the checkpoint data are stored separately in HDFS. Registering a plurality of data sources as a plurality of virtual tables and expressing the business logic as filtering conditions in SQL statements solves the problems of multi-source fusion and step-by-step processing of business logic. Executing SQL statements that meet the business requirements on the registered virtual tables handles in a single step what would otherwise require calling a plurality of data sources step by step and applying a plurality of filtering conditions step by step. The user writes the SQL statements that meet the business requirements in advance in the configuration file, and the number and names of the registered virtual tables are determined by the categories and number of the input Kafka data streams. New business and new requirements can be implemented simply by writing the corresponding SQL statements, without modifying the code of the framework itself.

FIG. 3 is a schematic diagram of a preferred flow of a statistical method of real-time data flow according to an embodiment of the present invention. As shown in FIG. 3, the program starts and looks up the most recent path P1 carrying the SUCCESS identifier among the HDFS checkpoint files, and the most recent path P2 carrying the SUCCESS identifier among the HDFS data files. It then compares the time of the data path P2 with the time of the checkpoint path P1.

If the time of the data path P2 is later than the time of the checkpoint path P1, the program terminated abnormally after the latest batch of result data files was written, so the checkpoint file was not written successfully. The program must delete all data paths whose time is later than the time of the checkpoint path P1, so that no data is duplicated after the program restarts from the last checkpoint.

If the time of the data path P2 is equal to the time of the checkpoint path P1, both the data path and the checkpoint path of the last batch were written successfully, and no special handling is needed.

If the time of the data path P2 is earlier than the time of the checkpoint path P1, the data in the most recent batch of data paths has been deleted abnormally. The program must delete the checkpoint data files whose checkpoint path time is later than the data path time, so that on startup the program pulls the abnormally lost data again.

After these cases have been handled, the program reads data from the latest checkpoint file and parses and processes it. After the business processing, result data is generated and stored in HDFS, and the program checks whether the write succeeded: if so, it writes the SUCCESS identifier; if not, it exits abnormally. After the result data has been written successfully, the current checkpoint data is written to HDFS and checked in the same way: if the write succeeded, the SUCCESS identifier is written; if not, the program exits abnormally.

Fig. 4 is a schematic diagram of basic modules of a real-time data flow statistical device according to an embodiment of the present invention. As shown in fig. 4, an embodiment of the present invention provides an apparatus 400 for statistics of real-time data streams, including:

a registered virtual table module 401 for: registering a virtual table of the data stream according to the read data stream;

an execute statement module 402 for: loading and operating a preset SQL statement file on the virtual table of the data stream;

a data storage module 403 for: and storing the real-time statistical result obtained after operation.

The embodiment of the invention adopts the technical means of directly loading and operating the SQL statement file meeting the business requirements on the virtual table of the registered data stream, thereby overcoming the technical problem that the processing process is complex when the data stream computing framework processes a plurality of data streams based on various API operators, and further achieving the technical effect of replacing a complex conversion computing function by writing the SQL statement to realize the statistics of the real-time data stream in the business scene.

In an embodiment of the present invention, the apparatus further includes a reading module, configured to: taking a checkpoint file storage path with the latest generation time and carrying a set identifier as a first path, and taking a result data file storage path with the latest generation time and carrying the set identifier as a second path; determining a latest checkpoint file according to the generation time of the first path and the generation time of the second path; reading a data stream from the latest checkpoint file.

In the embodiment of the present invention, the reading module is further configured to: if the generation time of the second path is later than that of the first path, deleting a result data file storage path of which the generation time is later than that of the first path, and taking the checkpoint file of the first path as a latest checkpoint file; if the generation time of the second path is earlier than that of the first path, deleting the checkpoint file storage path of which the generation time is later than that of the second path, and taking the checkpoint file corresponding to the checkpoint file storage path of which the rest generation time is latest as the latest checkpoint file; and if the generation time of the second path is equal to the generation time of the first path, taking the checkpoint file of the first path as a latest checkpoint file.

In this embodiment of the present invention, the registered virtual table module 401 is further configured to: presetting an SQL statement file in a configuration file according to business logic; and determining the number and the name of the virtual table of the data stream according to the read category and the number of the data stream, and registering the virtual table of the data stream.

According to the embodiment of the invention, writing SQL statements can replace complex conversion and computation functions for handling statistics on real-time data streams in a business scenario, which greatly improves processing efficiency.

In this embodiment of the present invention, the data storage module 403 is further configured to: the real-time statistical result obtained after loading and operating the preset SQL statement file comprises result data and offset data; storing the result data, generating a result data file, judging whether the storage is successful, if so, setting a set identifier in a storage path of the result data file, and if not, ending the flow abnormally; and storing the offset data, generating a check point file, judging whether the storage is successful, if so, setting the set identifier in a storage path of the check point file, and if not, ending the flow abnormally.

The offset data generated each time the SQL statements are executed is stored separately, and on the next run the Kafka topic data is recovered and read directly from the most recently stored result data and offset data. This ensures that data is not repeated, solves the problem of repeated or lost data that various abnormalities may cause after a restart, and ensures the accuracy of the data. Specifically, the checkpoint data and the result data may be stored separately in HDFS (Hadoop Distributed File System), which not only ensures the stability of the data but also allows the result data to be used independently.
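The store-then-mark sequence described above can be sketched in Python. This is a hedged illustration rather than the embodiment itself: a local temporary directory stands in for HDFS, the marker file name `_SUCCESS`, the directory layout, the file names, and all function names are assumptions, and payloads are assumed to be JSON-serializable.

```python
import json
import tempfile
from pathlib import Path
from typing import List

MARKER = "_SUCCESS"  # hypothetical file name for the "set identifier"


def store_with_marker(directory: Path, filename: str, payload) -> None:
    """Write the payload, verify it, and only then drop the marker file."""
    directory.mkdir(parents=True, exist_ok=True)
    target = directory / filename
    target.write_text(json.dumps(payload))
    if json.loads(target.read_text()) != payload:
        # "End the flow abnormally": no marker is written for this path.
        raise RuntimeError(f"storage failed for {target}")
    (directory / MARKER).touch()


def store_batch(root: Path, batch_time: int, result_rows, offsets) -> None:
    """Store result data first, then offset data, each under its own path."""
    store_with_marker(root / "result" / str(batch_time), "part-0.json", result_rows)
    store_with_marker(root / "checkpoint" / str(batch_time), "offsets.json", offsets)


def marked_times(root: Path, kind: str) -> List[int]:
    """List generation times of paths of one kind that carry the marker."""
    base = root / kind
    if not base.is_dir():
        return []
    return sorted(int(p.name) for p in base.iterdir() if (p / MARKER).exists())


root = Path(tempfile.mkdtemp())
store_batch(root, 1700000000, [["a", 5]], {"orders-0": 42})
```

Because the marker is written only after a successful store, a restart can ignore any path without it, which is what lets the result data and the offset data be reconciled against each other.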

In an embodiment of the present invention, the apparatus further includes a data verification module configured to: modify the table name and/or date conversion function in the preset SQL statement file according to the offline environment data; generate an offline index result based on the offline environment data by using the modified SQL statement file; and compare the real-time statistical result with the offline index result to verify the accuracy of the real-time statistical result.
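A minimal sketch of this verification step is given below. It rests on several assumptions: SQLite plays the role of the offline environment, the real-time statement uses a hypothetical table `orders_stream` and a hypothetical date conversion function `to_day`, the adaptation is done by plain string replacement, and the real-time result is supplied as a literal list rather than produced by a running job.

```python
import sqlite3
from typing import Dict, List, Tuple

# Hypothetical real-time statement; table and function names are illustrative.
REALTIME_SQL = (
    "SELECT to_day(ts) AS day, SUM(quantity) AS total "
    "FROM orders_stream GROUP BY to_day(ts) ORDER BY day"
)


def adapt_sql(sql: str, replacements: Dict[str, str]) -> str:
    """Swap table names and date conversion calls for their offline forms."""
    for old, new in replacements.items():
        sql = sql.replace(old, new)
    return sql


def diff_results(
    realtime: List[Tuple[str, int]], offline: List[Tuple[str, int]]
) -> Dict[str, tuple]:
    """Map each key whose values disagree to (real-time value, offline value)."""
    rt, off = dict(realtime), dict(offline)
    return {
        key: (rt.get(key), off.get(key))
        for key in sorted(rt.keys() | off.keys())
        if rt.get(key) != off.get(key)
    }


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders_offline (ts INTEGER, quantity INTEGER)")
conn.executemany(
    "INSERT INTO orders_offline VALUES (?, ?)",
    [(0, 2), (3600, 3), (86400, 1)],
)

offline_sql = adapt_sql(
    REALTIME_SQL,
    {"orders_stream": "orders_offline", "to_day(ts)": "date(ts, 'unixepoch')"},
)
offline_result = conn.execute(offline_sql).fetchall()

# Stand-in for the stored real-time statistical result.
realtime_result = [("1970-01-01", 5), ("1970-01-02", 1)]
mismatches = diff_results(realtime_result, offline_result)
```

An empty `mismatches` dictionary indicates that the two results agree for every key; any entry points to a key that needs investigation.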

Fig. 5 illustrates an exemplary system architecture 500 to which the statistical method or device of the real-time data stream of the embodiments of the present invention may be applied.

As shown in fig. 5, the system architecture 500 may include terminal devices 501, 502, 503, a network 504, and a server 505. The network 504 serves to provide a medium for communication links between the terminal devices 501, 502, 503 and the server 505. Network 504 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.

The user may use the terminal devices 501, 502, 503 to interact with a server 505 over a network 504 to receive or send messages or the like. The terminal devices 501, 502, 503 may have various communication client applications installed thereon, such as a shopping application, a web browser application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like.

The terminal devices 501, 502, 503 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.

The server 505 may be a server that provides various services, such as a background management server that supports shopping websites browsed by users using the terminal devices 501, 502, 503. The background management server can analyze and process the received data such as the product information inquiry request and feed back the processing result to the terminal devices.

It should be noted that the statistical method for the real-time data stream provided by the embodiment of the present invention is generally executed by the server 505, and accordingly, the statistical apparatus for the real-time data stream is generally disposed in the server 505.

It should be understood that the number of terminal devices, networks, and servers in fig. 5 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

The invention also provides an electronic device and a computer readable medium according to the embodiment of the invention.

The electronic device of the embodiment of the invention comprises: one or more processors; a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the statistical method for real-time data streams as set forth in the foregoing embodiments.

The computer-readable medium of the present invention has stored thereon a computer program, which when executed by a processor implements the statistical method of real-time data streams as proposed in the previous embodiments.

Referring now to FIG. 6, a block diagram of a computer system 600 suitable for use with a terminal device implementing an embodiment of the invention is shown. The terminal device shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.

As shown in fig. 6, the computer system 600 includes a Central Processing Unit (CPU)601 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. In the RAM603, various programs and data necessary for the operation of the system 600 are also stored. The CPU601, ROM602, and RAM603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.

The following components are connected to the I/O interface 605: an input section 606 including a keyboard, a mouse, and the like; an output section 607 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card, a modem, or the like. The communication section 609 performs communication processing via a network such as the internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 610 as necessary, so that a computer program read out therefrom is installed in the storage section 608 as necessary.

In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 609, and/or installed from the removable medium 611. The computer program performs the above-described functions defined in the system of the present invention when executed by the Central Processing Unit (CPU) 601.

It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor comprising a registered virtual table module, an execution statement module, and a data storage module. The names of these modules do not in some cases constitute a limitation on the modules themselves; for example, the registered virtual table module may also be described as a "module for registering a virtual table of a data stream according to the read data stream".

As another aspect, the present invention also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments, or may exist separately without being incorporated into the apparatus. The computer-readable medium carries one or more programs which, when executed by a device, cause the device to perform the following: registering a virtual table of the data stream according to the read data stream; loading and operating a preset SQL statement file on the virtual table of the data stream; and storing the real-time statistical result obtained after operation.

According to the statistical method of the real-time data flow, the technical means that SQL statement files meeting business requirements are directly loaded and operated on the virtual table of the registered data flow is adopted, so that the technical problem that a data flow calculation framework is complex in processing process when a plurality of data flows are processed based on various API operators is solved, and the technical effect of realizing the statistics of the real-time data flow in a business scene by writing SQL statements to replace complex conversion calculation functions is achieved.

The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method for statistical analysis of real-time data streams, comprising:

registering a virtual table of the data stream according to the read data stream;

loading and operating a preset SQL statement file on the virtual table of the data stream;

and storing the real-time statistical result obtained after operation.

2. The method of claim 1, wherein prior to registering the virtual table of the data stream from the read data stream, the method further comprises:

taking a checkpoint file storage path with the latest generation time and carrying a set identifier as a first path, and taking a result data file storage path with the latest generation time and carrying the set identifier as a second path;

determining a latest checkpoint file according to the generation time of the first path and the generation time of the second path;

reading a data stream from the latest checkpoint file.

3. The method of claim 2, wherein determining the latest checkpoint file based on the generation time of the first path and the generation time of the second path comprises:

if the generation time of the second path is later than that of the first path, deleting a result data file storage path of which the generation time is later than that of the first path, and taking the checkpoint file of the first path as a latest checkpoint file;

if the generation time of the second path is earlier than that of the first path, deleting the checkpoint file storage path of which the generation time is later than that of the second path, and taking the checkpoint file corresponding to the checkpoint file storage path of which the rest generation time is latest as the latest checkpoint file;

and if the generation time of the second path is equal to the generation time of the first path, taking the checkpoint file of the first path as a latest checkpoint file.

4. The method of claim 1, wherein registering the virtual table of the data stream according to the read data stream comprises:

presetting an SQL statement file in a configuration file according to business logic;

and determining the number and the name of the virtual table of the data stream according to the read category and the number of the data stream, and registering the virtual table of the data stream.

5. The method of claim 1, wherein storing the real-time statistics obtained after the run comprises:

the real-time statistical result obtained after loading and operating the preset SQL statement file comprises result data and offset data;

storing the result data, generating a result data file, judging whether the storage is successful, if so, setting a set identifier in a storage path of the result data file, and if not, ending the flow abnormally;

and storing the offset data, generating a check point file, judging whether the storage is successful, if so, setting the set identifier in a storage path of the check point file, and if not, ending the flow abnormally.

6. The method of claim 1, wherein after storing the real-time statistics obtained after the running, the method further comprises:

modifying the table name and/or date conversion function in the preset SQL statement file according to the off-line environment data;

generating an off-line index result based on the off-line environment data by using the modified SQL statement file;

and performing data comparison verification on the real-time statistical result and the off-line index result to determine the accuracy of the real-time statistical result.

7. A device for real-time data stream statistics, comprising:

a registered virtual table module to: registering a virtual table of the data stream according to the read data stream;

an execute statement module to: loading and operating a preset SQL statement file on the virtual table of the data stream;

a data storage module to: and storing the real-time statistical result obtained after operation.

8. The apparatus of claim 7, further comprising a reading module to:

reading a data stream from the latest checkpoint file.

9. The apparatus of claim 8, wherein the reading module is further configured to:

10. The apparatus of claim 7, wherein the register virtual table module is further configured to:

11. The apparatus of claim 7, wherein the data storage module is further configured to:

12. The apparatus of claim 7, further comprising a data validation module to:

13. An electronic device, comprising:

one or more processors;

a storage device for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-6.

14. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-6.