CN117312761B - Method and device for calculating data fragment processing time - Google Patents

Method and device for calculating data fragment processing time

Info

Publication number
CN117312761B
Authority
CN
China
Prior art keywords
data
segment
data segment
processing
identification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311597815.8A
Other languages
Chinese (zh)
Other versions
CN117312761A (en)
Inventor
孟江华
姜栋琛
崔文辉
李磊
王致茹
陈群
刘海龙
董鸿毅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kmerit Suzhou Information Science & Technology Co ltd
Taicang Yangtze River Delta Research Institute of Northwestern Polytechnical University
Original Assignee
Kmerit Suzhou Information Science & Technology Co ltd
Taicang Yangtze River Delta Research Institute of Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kmerit Suzhou Information Science & Technology Co ltd and Taicang Yangtze River Delta Research Institute of Northwestern Polytechnical University
Priority to CN202311597815.8A
Publication of CN117312761A
Application granted
Publication of CN117312761B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Software Systems (AREA)
  • Pure & Applied Mathematics (AREA)
  • Operations Research (AREA)
  • Algebra (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Fuzzy Systems (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a method and a device for calculating processing time of a data fragment, and relates to the technical field of data processing. The method comprises the following steps: receiving a data segment, and identifying the processing starting time of the data segment; performing subtask division on the data preparation process of each data segment, and performing data preparation based on each subtask; determining end identification data of each subtask in response to completion of data preparation, and taking other data as non-end identification data; acquiring end identification data and non-end identification data of the target data fragment by adopting a statistical operator, and respectively calculating an end data quantity and a calculated data quantity; acquiring the processing start time of the target data segment; the processing time of the target data segment is calculated based on the end data amount and the calculated data amount of the target data segment, and the processing start time of the target data segment. The embodiment reduces the complexity of processing time calculation and solves the problem of clock deviation of different nodes in the processing time calculation process.

Description

Method and device for calculating data fragment processing time
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a method and an apparatus for calculating processing time of a data segment.
Background
Flink is a framework focused on stream processing. It offers a high degree of parallelism and fault tolerance, processes continuously generated data streams with low latency, and allows real-time processing and analysis of data, which makes it suitable for applications requiring fast response. In real-time stream data processing applications, the processing time of a data segment needs to be known in real time in order to monitor task state and processing performance, so that data calculation results can be displayed conveniently, performance can be optimized, and so on.
Because streaming data processing runs in a distributed environment, data may be processed on multiple nodes in parallel, which makes calculating the total processing time complex; in addition, clock deviation exists between different nodes, which makes the measured processing time inaccurate. It is therefore difficult to obtain the start and end times of a task, and collecting statistics on the processing time of a data segment in Flink remains difficult.
In view of the foregoing, there is a need for a method that can calculate the processing time of a data segment in Flink stream processing.
Disclosure of Invention
In view of this, the embodiment of the invention provides a method and a device for calculating the processing time of a data segment, which reduce the complexity of the calculation process of the processing time of the data segment and improve the accuracy of the calculation of the processing time.
To achieve the above object, according to one aspect of the embodiments of the present invention, there is provided a data fragment processing time calculation method.
The method for calculating the data fragment processing time comprises the following steps:
receiving a data segment, and identifying the processing starting time of the data segment;
performing subtask division on the data preparation process of each data segment based on the parallelism of the source operator, and performing data preparation based on each subtask;
determining end identification data of each subtask in response to completion of data preparation, and taking other data except the end identification data as non-end identification data; the end identification data comprises the preparation data quantity of the current subtask and the parallel subtask quantity;
acquiring end identification data and non-end identification data of a target data segment by adopting a statistic operator, and respectively calculating the end data quantity and calculated data quantity of the target data segment;
acquiring the processing start time of the target data segment by adopting the statistical operator according to the processing start time of the data segment;
the processing time of the target data segment is calculated based on the end data amount and the calculated data amount of the target data segment, and the processing start time of the target data segment.
Optionally, after the receiving the data segment and identifying the processing start time of the data segment, the method further includes:
and acquiring the data segment identification of the data segment.
Optionally, determining end identification data of each subtask in response to completion of data preparation, and taking other data except the end identification data as non-end identification data; comprising the following steps:
obtaining preparation data and corresponding preparation data quantity of each subtask in response to the completion of data preparation;
labeling each piece of preparation data with the data fragment identification;
acquiring the number of parallel subtasks of the data segment based on the subtask division result;
the number of parallel subtasks, the prepared data amount of the subtasks and the data fragment identification are used as the end identification data of the corresponding subtasks;
all the preparation data except the above-mentioned end identification data and the data fragment identification of each preparation data are used as non-end identification data.
Optionally, after determining the end identification data and the non-end identification data of each of the subtasks in response to completion of data preparation, the method further comprises: performing data processing on the non-ending identification data;
The data processing of the non-ending identification data includes:
performing data processing on the prepared data;
and, in response to completion of the processing of the prepared data, deleting the processed prepared data and retaining the data fragment identification of each piece of processed prepared data.
Optionally, the acquiring end identification data and non-end identification data of the target data segment by using a statistics operator, calculating an end data amount and a calculated data amount of the target data segment respectively, includes:
identifying the data fragment identification in the non-ending identification data by adopting a statistic operator;
acquiring non-ending identification data of which the data segment identification is a target data segment identification;
identifying the data segment identification in the end identification data by adopting the statistic operator;
acquiring end identification data of which the data segment identification is a target data segment identification;
and respectively calculating the end data quantity and the calculated data quantity of the target data segment based on the end identification data and the non-end identification data of which the data segment identification is the target data segment identification.
Optionally, the calculating the end data amount and the calculated data amount of the target data segment based on the end identification data and the non-end identification data of the target data segment identification, respectively, includes:
Setting the initial values of the ending data quantity and the calculated data quantity of the target data segment to be 0;
each time the statistics operator obtains one end identification data of the target data segment, the end data quantity is increased; and each time the statistic operator acquires a piece of non-ending identification data of the target data segment, the calculated data quantity is increased.
Optionally, the calculating the processing time of the target data segment based on the ending data amount and the calculated data amount of the target data segment, and the processing start time of the target data segment includes:
identifying a processing end time of the target data segment in response to the end data amount of the target data segment being equal to the number of parallel subtasks of the target data segment and the calculated data amount of the target data segment being equal to the sum of the prepared data amounts of all subtasks of the target data segment;
the processing time of the target data segment is calculated based on the processing start time of the target data segment and the processing end time of the target data segment.
To achieve the above object, according to still another aspect of the embodiments of the present invention, there is provided a data fragment processing time calculation apparatus.
The device for calculating the data fragment processing time in the embodiment of the invention comprises the following components:
a receiving module for receiving a data segment, and identifying a processing start time of the data segment;
the preparation module is used for dividing subtasks of the data preparation process of each data fragment based on the parallelism of the source operator and preparing data based on each subtask;
a determining module, configured to determine end identification data of each subtask in response to completion of data preparation, and use other data except the end identification data as non-end identification data; the end identification data comprises the preparation data quantity of the current subtask and the parallel subtask quantity;
the first calculation module is used for acquiring end identification data and non-end identification data of the target data segment by adopting a statistical operator and respectively calculating the end data quantity and the calculated data quantity of the target data segment;
the acquisition module is used for acquiring the processing start time of the target data segment by adopting the statistical operator according to the processing start time of the data segment;
and a second calculation module for calculating the processing time of the target data segment based on the end data amount and the calculated data amount of the target data segment and the processing start time of the target data segment.
To achieve the above object, according to still another aspect of the embodiments of the present invention, there is provided an electronic device for data fragment processing time calculation.
The electronic equipment for calculating the data fragment processing time comprises: one or more processors; and the storage device is used for storing one or more programs, and when the one or more programs are executed by the one or more processors, the one or more processors are enabled to realize the data fragment processing time calculation method.
To achieve the above object, according to still another aspect of the embodiments of the present invention, there is provided a computer-readable storage medium.
A computer-readable storage medium of an embodiment of the present invention has stored thereon a computer program which, when executed by a processor, implements a data fragment processing time calculation method of an embodiment of the present invention.
One embodiment of the above invention has the following advantages or benefits: the end identification data of each subtask in the data preparation process is determined, and the non-end identification data is determined accordingly; a statistical operator then calculates the end data amount and the calculated data amount of a data fragment from the end identification data and the non-end identification data, and the processing time of the data fragment is calculated from these results and the processing start time of the data fragment, which reduces the complexity of the processing time calculation. Because the same statistical operator calculates the processing time of all data fragments, the problem of clock deviation between different nodes in the processing time calculation is effectively solved, inaccurate results caused by calculating processing time on different nodes are avoided, and the accuracy of processing time statistics for data fragments in a distributed environment is improved.
Further effects of the above non-conventional optional implementations are described below in connection with the specific embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a flow chart of a method for calculating a data segment processing time according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method of determining ending identification data and non-ending identification data, according to an embodiment of the invention;
FIG. 3 is a flow chart of a method for calculating an ending data amount and a calculated data amount of a target data segment according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the main modules of a data fragment processing time calculation apparatus according to an embodiment of the present invention;
FIG. 5 is an exemplary system architecture diagram in which embodiments of the present invention may be applied;
FIG. 6 is a schematic diagram of a computer system suitable for use in implementing an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention will now be described with reference to the accompanying drawings, in which various details of the embodiments of the present invention are included to facilitate understanding, and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
It should be noted that the embodiments of the present invention and the technical features in the embodiments may be combined with each other provided they do not conflict.
In the technical scheme of the invention, the acquisition, collection, updating, analysis, processing, use, transmission, storage and other handling of users' personal information all comply with the relevant laws and regulations, serve lawful purposes, and do not violate public order and good customs. Necessary measures are taken to protect users' personal information, to prevent illegal access to users' personal information data, and to safeguard users' personal information security, network security and national security.
Fig. 1 is a schematic diagram of main steps of a data segment processing time calculation method according to an embodiment of the present invention.
As shown in fig. 1, the method for calculating the data segment processing time according to the embodiment of the present invention mainly includes the following steps:
step S101, receiving data fragments, and identifying the processing start time of the data fragments;
step S102, dividing subtasks of the data preparation process of each data segment based on the parallelism of the source operator, and preparing data based on each subtask;
step S103, determining the end identification data of each subtask in response to the completion of data preparation, and taking other data except the end identification data as non-end identification data; the end identification data comprises the preparation data quantity of the current subtask and the parallel subtask quantity;
Step S104, acquiring end identification data and non-end identification data of the target data segment by adopting a statistical operator, and respectively calculating the end data quantity and the calculated data quantity of the target data segment;
step S105, acquiring the processing start time of the target data segment by adopting the statistical operator according to the processing start time of the data segment;
step S106, calculating the processing time of the target data segment based on the end data amount and the calculated data amount of the target data segment and the processing start time of the target data segment.
Specifically, when the stream processing service receives, through a data channel, a data segment that requires processing, the current receiving time is taken as the processing start time of the data segment and is recorded as such. Each data segment has a specific data segment identification (data segment ID) that distinguishes it from other data segments; the data segment ID is set by the program that submits the data segment. As an example, the data channel may be a Kafka message queue.
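A minimal sketch of this receiving step is given below. The names `ReceivedSegment` and `receive_segment`, the millisecond resolution, and the injectable `clock` parameter are assumptions made for illustration; the data channel itself (for example a Kafka consumer) is left out.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ReceivedSegment:
    segment_id: str      # set by the submitting program, e.g. "0001"
    payload: bytes       # raw content of the data segment
    start_time_ms: int   # receive time, used as the processing start time


def receive_segment(segment_id: str, payload: bytes,
                    clock: Callable[[], float] = time.time) -> ReceivedSegment:
    """Stamp the segment with the time at which the service received it."""
    return ReceivedSegment(segment_id, payload, int(clock() * 1000))


# A fixed clock makes the example reproducible.
segment = receive_segment("0001", b"raw bytes", clock=lambda: 1700000000.5)
```

Passing the clock in as a parameter is only a convenience for testing; in a running service the default wall clock would be used.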
A data segment whose processing start time has been identified enters the data preparation phase. Because the stream processing service runs in a distributed environment, the data preparation process of each data segment can be regarded as a data preparation task (job). The job is divided into a corresponding number of subtasks according to the parallelism of the source operator, and each subtask performs its data preparation on one node of the distributed cluster.
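The division into subtasks can be pictured with the following sketch. In a real stream processing framework the runtime performs this distribution; the round-robin rule and the function name `split_into_subtasks` are assumptions chosen only to make the idea concrete.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_subtasks(records: Sequence[T], parallelism: int) -> List[List[T]]:
    """Distribute records round-robin over `parallelism` subtasks."""
    if parallelism < 1:
        raise ValueError("parallelism must be at least 1")
    subtasks: List[List[T]] = [[] for _ in range(parallelism)]
    for index, record in enumerate(records):
        subtasks[index % parallelism].append(record)
    return subtasks


parts = split_into_subtasks(["r0", "r1", "r2", "r3", "r4"], parallelism=2)
# parts[0] holds r0, r2, r4; parts[1] holds r1, r3
```

Whatever the distribution rule, the number of subtasks equals the parallelism, and the per-subtask amounts add up to the total amount of prepared data, which is what the later statistics step relies on.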
When the data preparation is completed, end identification data (EOS) of each sub-task is determined, and all data except the end identification data is further taken as non-end identification data.
After entering the data processing stage, the processing of the prepared data of each data segment is treated as a data processing job. Its subtask division may differ from that of the data preparation process; the subtask divisions of different stages need not be consistent.
After the data processing stage is completed, the statistics stage begins, in which a statistical operator acquires the end identification data and non-end identification data of the target data segment. In the statistics stage, the end identification data and non-end identification data of all data segments received by the stream processing service can be sent to the same statistical operator for processing time statistics. The target data segment is determined according to user requirements and can be any of the data segments received by the stream processing service.
In an alternative embodiment, after the receiving the data segment and identifying the processing start time of the data segment, the method further includes:
and acquiring the data segment identification of the data segment.
In particular, each data segment has a particular data segment identification, and different data segment identifications are used to identify different data segments. The data fragment identification is defined by the user/application that sent the data fragment.
In an alternative embodiment, as shown in fig. 2, the end identification data of each subtask may be determined by the following steps in response to completion of data preparation, and other data than the end identification data may be used as non-end identification data:
step S201, in response to the completion of data preparation, obtaining the preparation data and the corresponding preparation data amount of each subtask;
step S202, labeling data segment identifiers for each piece of prepared data;
step S203, obtaining the number of parallel subtasks of the data segment based on the subtask division result;
step S204, the number of parallel subtasks, the prepared data amount of the subtasks and the data fragment identification are used as the corresponding end identification data of the subtasks;
step S205, all the preparation data except the above-described end identification data and the data fragment identification of each preparation data are regarded as non-end identification data.
After the data preparation phase, the preparation data and the corresponding preparation data amount of each subtask are available. For each data segment, the number of parallel subtasks is obtained from the subtask division result of the data preparation stage. The number of parallel subtasks, the prepared data amount of the subtask, and the data segment identification together form the end identification data of each subtask of the data segment. The end identification data of each data preparation subtask is transmitted along with the preparation data to the next stage, namely the data processing stage, where the end identification data itself is not processed.
For each data segment, all the preparation data are gathered, and each piece of preparation data is labeled with the data segment identification. All the preparation data other than the end identification data, together with the data segment identification of each piece of preparation data, serve as the non-end identification data of the corresponding data segment. The labeling is needed because the processing time calculation for all data segments can be sent to one node of the distributed cluster and handled by the same statistics operator; the preparation data of each data segment must therefore carry the corresponding data segment identification so that the non-end identification data of different data segments can be distinguished and accurately identified.
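Steps S201 to S205 can be sketched as follows for a single subtask. The record types `DataRecord` and `EosRecord` and the helper `finish_subtask` are hypothetical names introduced for this illustration; only their fields come from the description above.

```python
from dataclasses import dataclass
from typing import Any, List, Tuple


@dataclass(frozen=True)
class DataRecord:            # non-end identification data
    segment_id: str
    payload: Any


@dataclass(frozen=True)
class EosRecord:             # end identification data (EOS)
    segment_id: str
    parallel_subtasks: int   # number of parallel subtasks of the segment
    prepared_count: int      # amount of data prepared by this subtask


def finish_subtask(segment_id: str, prepared: List[Any],
                   parallel_subtasks: int) -> Tuple[List[DataRecord], EosRecord]:
    """Label the prepared data with the segment ID and build the subtask's EOS."""
    data = [DataRecord(segment_id, item) for item in prepared]
    eos = EosRecord(segment_id, parallel_subtasks, len(prepared))
    return data, eos


data, eos = finish_subtask("0001", ["a", "b", "c"], parallel_subtasks=2)
```

Each subtask emits exactly one `EosRecord`, so a segment prepared with parallelism 2 produces two of them, each carrying its own `prepared_count`.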
In an alternative embodiment, after determining the end identification data of each of the subtasks in response to the data preparation being completed and taking the other data than the end identification data as the non-end identification data, the method further includes: performing data processing on the non-ending identification data;
the data processing of the non-ending identification data includes:
performing data processing on the prepared data;
and, in response to completion of the processing of the prepared data, deleting the processed prepared data and retaining the data fragment identification of each piece of processed prepared data.
Specifically, after entering the data processing stage, the prepared data in the non-end identification data is processed according to the business logic. Because the processed prepared data is no longer needed in the statistics stage that calculates the processing time, each piece of prepared data can be deleted from the non-end identification data once its processing is complete, with only the corresponding data segment identification retained. This reduces serialization and deserialization time during data transmission and lowers the load.
For the end identification data, no processing is performed in the data processing stage.
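The handling of non-end identification data in the processing stage might look like the sketch below. `process_and_strip` and the `business_logic` callback are illustrative assumptions; representing the deleted payload as `None` is one possible choice, not something the description prescribes.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class DataRecord:
    segment_id: str
    payload: Optional[Any]


def process_and_strip(record: DataRecord,
                      business_logic: Callable[[Any], None]) -> DataRecord:
    """Run the business logic, then forward only the segment ID."""
    business_logic(record.payload)
    # The payload is no longer needed for the statistics stage, so drop it.
    return DataRecord(record.segment_id, payload=None)


seen = []  # stands in for whatever the business logic does with the data
stripped = process_and_strip(DataRecord("0001", {"value": 42}), seen.append)
```

The forwarded record is small and of fixed shape regardless of the original payload size, which is where the saving in serialization cost comes from.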
In an alternative embodiment, as shown in fig. 3, the step of obtaining the end identification data and the non-end identification data of the target data segment by using a statistics operator to calculate the end data amount and the calculated data amount of the target data segment, respectively, may include the following steps:
step S301, adopting a statistical operator to identify the data segment identification in the non-ending identification data;
step S302, acquiring non-ending identification data of which the data segment identification is a target data segment identification;
step S303, adopting the statistical operator to identify the data segment identification in the end identification data;
Step S304, end identification data of which the data segment identification is a target data segment identification is obtained;
step S305, calculating the end data amount and the calculated data amount of the target data segment based on the end identification data and the non-end identification data of the target data segment identification.
After the data processing stage, the end identification data and the processed non-end identification data of each data segment are sent to one node of the distributed cluster, and the processing time of each data segment is calculated and counted by the same statistic operator.
Specifically, a target data segment is determined from all the data segments, and the target data segment identification is obtained. The end identification data and non-end identification data whose data segment identification is the target data segment identification are then identified in order to calculate the end data amount and the calculated data amount of the target data segment.
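Selecting the records of the target data segment (steps S301 to S304) amounts to filtering on the data segment identification. The following sketch assumes a simplified `Record` type with an `is_eos` flag; both are hypothetical and stand in for the two kinds of identification data.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Record(NamedTuple):
    segment_id: str
    is_eos: bool   # True for end identification data, False otherwise


def select_target(records: Iterable[Record],
                  target_id: str) -> Tuple[List[Record], List[Record]]:
    """Return (end records, non-end records) that belong to the target segment."""
    eos: List[Record] = []
    data: List[Record] = []
    for record in records:
        if record.segment_id != target_id:
            continue  # belongs to another data segment
        (eos if record.is_eos else data).append(record)
    return eos, data


stream = [Record("0001", False), Record("0002", False),
          Record("0001", True), Record("0002", True), Record("0001", False)]
target_eos, target_data = select_target(stream, "0001")
```

The stream is traversed once, so the same function also works when the records arrive from an iterator that cannot be replayed.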
In an optional embodiment, the calculating the end data amount and the calculated data amount of the target data segment based on the end identification data and the non-end identification data of the target data segment, respectively, includes:
setting the initial values of the ending data quantity and the calculated data quantity of the target data segment to be 0;
Each time the statistics operator obtains one end identification data of the target data segment, the end data quantity is increased; and each time the statistic operator acquires a piece of non-ending identification data of the target data segment, the calculated data quantity is increased.
In the statistics operator, for each data segment whose processing time is to be calculated, an end data amount (EOS-count) and a calculated data amount (DATA-count) are maintained, both initialized to 0.
Each time the statistics operator obtains one piece of end identification data of the target data segment, the end data amount is increased by 1; each time the statistics operator obtains one piece of non-end identification data of the target data segment, the calculated data amount is increased by 1.
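The two counters can be kept per data segment as in this sketch. The `Counters` class, the module-level `counters` table and the `count` function are assumptions for illustration; in a stream processing framework this state would normally live in the operator's keyed state.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import DefaultDict


@dataclass
class Counters:
    eos_count: int = 0    # end data amount (EOS-count), starts at 0
    data_count: int = 0   # calculated data amount (DATA-count), starts at 0


counters: DefaultDict[str, Counters] = defaultdict(Counters)


def count(segment_id: str, is_eos: bool) -> None:
    """Increase the matching counter of the segment by 1."""
    entry = counters[segment_id]
    if is_eos:
        entry.eos_count += 1
    else:
        entry.data_count += 1


for segment_id, is_eos in [("0001", False), ("0002", False),
                           ("0001", True), ("0001", False)]:
    count(segment_id, is_eos)
```

After the loop, segment "0001" has seen one end record and two non-end records, while segment "0002" has seen a single non-end record.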
In an alternative embodiment, the calculating the processing time of the target data segment based on the end data amount of the target data segment and the calculated data amount, and the processing start time of the target data segment includes:
identifying a processing end time of the target data segment in response to the end data amount of the target data segment being equal to the number of parallel subtasks of the target data segment and the calculated data amount of the target data segment being equal to the sum of the prepared data amounts of all subtasks of the target data segment;
The processing time of the target data segment is calculated based on the processing start time of the target data segment and the processing end time of the target data segment.
Specifically, when the end data amount of the target data segment is equal to the number of parallel subtasks recorded in its end identification data, this indicates that the statistics operator has acquired all the end identification data of the target data segment; when the calculated data amount of the target data segment is equal to the sum of the prepared data amounts of all the subtasks, this indicates that the statistics operator has acquired all the non-end identification data of the target data segment. Therefore, when the end data amount of the target data segment is equal to the number of parallel subtasks of the target data segment and the calculated data amount of the target data segment is equal to the sum of the prepared data amounts of all the subtasks of the target data segment, the data processing of the target data segment has ended, and the current time is identified as the processing end time of the target data segment.
Further, when the statistics operator obtains end identification data, the data of the target data segment can be distinguished from the data of other data segments according to the data segment identification in the end identification data. Similarly, when the statistics operator obtains non-end identification data, which at this point contains only the data segment identification, the data of the target data segment is distinguished from the data of other data segments according to that data segment identification.
Then, the processing start time is subtracted from the processing end time of the target data segment, and the difference is the processing time of the target data segment.
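The completion condition and the resulting processing time can be written compactly. The symbols are introduced here only for illustration and do not appear in the original text: $P$ is the number of parallel subtasks, $n_i$ is the prepared data amount of subtask $i$, $C_{\mathrm{EOS}}$ and $C_{\mathrm{DATA}}$ are the end data amount and the calculated data amount, and $t_{\mathrm{start}}$, $t_{\mathrm{end}}$ are the processing start and end times.

```latex
\begin{aligned}
&\text{processing has ended when}\quad
  C_{\mathrm{EOS}} = P
  \quad\text{and}\quad
  C_{\mathrm{DATA}} = \sum_{i=1}^{P} n_i, \\
&T = t_{\mathrm{end}} - t_{\mathrm{start}}.
\end{aligned}
```

Here $t_{\mathrm{end}}$ is the time at which both equalities first hold, and $T$ is the processing time of the target data segment.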
The following describes the calculation of the processing time of the data segment in detail according to an embodiment:
the data segment identified as 0001 is the target data segment. The application acts as the sender of the target data fragment.
In the service scheduling stage, the application program transmits the target data segment to the stream processing service through the data channel, and the time when the stream processing service receives the target data segment is taken as the processing starting time of the target data segment.
In the data preparation stage, the stream processing service running in a distributed environment divides the data preparation process of the target data segment into a number of subtasks equal to the parallelism of the source operator; with a source operator parallelism of 3, 3 subtasks are obtained. When data preparation is completed, the prepared data amount of each subtask is obtained: 300 for subtask 1, 360 for subtask 2, and 320 for subtask 3. The number of parallel subtasks, obtained from the subtask division result, is 3. Therefore, (3, 300, 0001) is taken as the end identification data of subtask 1, (3, 360, 0001) as the end identification data of subtask 2, and (3, 320, 0001) as the end identification data of subtask 3.
According to the data preparation result of each subtask, 980 pieces of preparation data of the target data segment can be obtained. And respectively labeling the target data segment identification for each piece of preparation data of the target data segment so as to distinguish the data of the target data segment from the data of other data segments. The preparation data and the corresponding target data fragment identification are used as non-ending identification data.
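The output of the data preparation stage in this example can be sketched as follows. The tuple layouts and the helper names `make_end_records` and `make_data_records` are assumptions made for illustration, and the `row-i` payloads are placeholders for real preparation data.

```python
from typing import List, Tuple

# Assumed layouts (not prescribed by the source text):
EndRecord = Tuple[int, int, str]  # (parallel subtasks, prepared amount, segment id)
DataRecord = Tuple[str, str]      # (segment id, one piece of preparation data)


def make_end_records(segment_id: str, prepared_amounts: List[int]) -> List[EndRecord]:
    """Build one piece of end identification data per subtask."""
    parallel_subtasks = len(prepared_amounts)
    return [(parallel_subtasks, amount, segment_id) for amount in prepared_amounts]


def make_data_records(segment_id: str, prepared: List[str]) -> List[DataRecord]:
    """Label every piece of preparation data with the data segment identification."""
    return [(segment_id, item) for item in prepared]


amounts = [300, 360, 320]  # prepared data amounts of subtasks 1 to 3
end_records = make_end_records("0001", amounts)

# Placeholder payloads standing in for the 980 pieces of preparation data.
prepared = [f"row-{i}" for i in range(sum(amounts))]
data_records = make_data_records("0001", prepared)

print(end_records)        # [(3, 300, '0001'), (3, 360, '0001'), (3, 320, '0001')]
print(len(data_records))  # 980
```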
In the data processing stage, the end identification data and the non-end identification data are transmitted through the data channel to the nodes that perform data processing. Because the statistics operator in the statistics stage can count all the preparation data of the data preparation stage and the number of parallel subtasks of the data preparation process, the number of subtasks of the data processing process may differ from the number of subtasks of the data preparation process. In this stage, the 3 pieces of end identification data are not processed. After the preparation data in the non-end identification data has been processed, the processed preparation data is not needed in the statistics stage, so it can be deleted, and only the target data segment identification corresponding to each piece of preparation data is retained.
In the statistics stage, the end data amount and the calculated data amount of the target data segment are calculated according to the end identification data and the non-end identification data of the target data segment. The initial values of the end data amount and the calculated data amount of the target data segment are 0. Each time the statistics operator receives a piece of end identification data of the target data segment, 1 is added to the end data amount; each time the statistics operator receives a piece of non-end identification data of the target data segment, 1 is added to the calculated data amount. When the end data amount is equal to 3 and the calculated data amount is equal to 980, the statistics operator has received all the end identification data and non-end identification data of the target data segment, and the current time is identified as the processing end time of the target data segment.
The processing start time is subtracted from the processing end time of the target data segment to obtain the processing time of the target data segment. At this point, the processing time calculation of the target data segment whose data segment identification is 0001 is complete.
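The behaviour of the data processing stage described above can be sketched as follows. The record types, `process_payload`, and `processing_stage` are hypothetical names; `process_payload` is only a stand-in for whatever business processing the stream processing service performs.

```python
from typing import Iterable, Iterator, NamedTuple, Union


class EndRecord(NamedTuple):
    parallel_subtasks: int
    prepared_amount: int
    segment_id: str


class DataRecord(NamedTuple):
    segment_id: str
    payload: str


def process_payload(payload: str) -> None:
    """Hypothetical stand-in for the real data processing."""
    _ = payload.upper()


def processing_stage(
    records: Iterable[Union[EndRecord, DataRecord]],
) -> Iterator[Union[EndRecord, str]]:
    for record in records:
        if isinstance(record, EndRecord):
            # End identification data is forwarded without being processed.
            yield record
        else:
            process_payload(record.payload)
            # The processed payload is dropped; only the segment id moves on.
            yield record.segment_id


stream = [
    DataRecord("0001", "a"),
    DataRecord("0001", "b"),
    EndRecord(2, 2, "0001"),
]
out = list(processing_stage(stream))
print(out)
```

Because each forwarded item still carries the data segment identification, the statistics stage can count it regardless of how many subtasks the processing stage used.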
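The whole statistics stage for segment 0001 can be sketched end to end as below. This is an illustrative sketch under stated assumptions, not the patented implementation: the class and method names are hypothetical, timestamps are passed in by the caller as plain numbers, and the start time 100.0 and end time 112.5 are made-up values.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class SegmentState:
    start_time: float
    eos_count: int = 0                       # end data amount
    data_count: int = 0                      # calculated data amount
    parallel_subtasks: Optional[int] = None  # learned from end identification data
    prepared_total: int = 0                  # sum of prepared amounts seen so far


class StatisticsOperator:
    """Single operator that tracks every segment, so one clock is used throughout."""

    def __init__(self) -> None:
        self.segments: Dict[str, SegmentState] = {}

    def register_start(self, segment_id: str, start_time: float) -> None:
        self.segments[segment_id] = SegmentState(start_time=start_time)

    def on_end_record(self, segment_id: str, parallel_subtasks: int,
                      prepared_amount: int, now: float) -> Optional[float]:
        state = self.segments[segment_id]
        state.eos_count += 1
        state.parallel_subtasks = parallel_subtasks
        state.prepared_total += prepared_amount
        return self._processing_time_if_done(state, now)

    def on_data_record(self, segment_id: str, now: float) -> Optional[float]:
        state = self.segments[segment_id]
        state.data_count += 1
        return self._processing_time_if_done(state, now)

    @staticmethod
    def _processing_time_if_done(state: SegmentState, now: float) -> Optional[float]:
        done = (state.parallel_subtasks is not None
                and state.eos_count == state.parallel_subtasks
                and state.data_count == state.prepared_total)
        # Processing time = processing end time - processing start time.
        return now - state.start_time if done else None


op = StatisticsOperator()
op.register_start("0001", start_time=100.0)
early = [op.on_end_record("0001", 3, n, now=101.0) for n in (300, 360, 320)]
result = None
for _ in range(980):
    result = op.on_data_record("0001", now=112.5)
print(early)   # [None, None, None]
print(result)  # 12.5
```

The end records are fed first here only to show that arrival order does not matter; the segment is reported as finished when both counters reach their targets.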
According to the data segment processing time calculation method, it can be seen that the end identification data of each subtask in the data preparation process is determined, the non-end identification data is further determined, and the statistical operator is adopted to calculate the end data quantity and the calculation data quantity of the data segment according to the end identification data and the non-end identification data, so that the processing time of the data segment is calculated according to the calculation result and the processing start time of the data segment, the complexity of the processing time calculation is further reduced, the same statistical operator is used for calculating the processing time of all the data segments, the clock deviation problem of different nodes in the processing time calculation process is effectively solved, the problem of inaccurate results caused by processing time calculation on different nodes is avoided, and the accuracy of the processing time of the statistical data segment in the distributed environment is improved.
The end identification data comprises the parallel subtask number of the preparation stage, the preparation data amount of the current subtask and the data segment identification, the non-end identification data before data processing comprises all preparation data of the preparation stage corresponding to the data segment and the data segment identification of each preparation data, and the end data amount and the calculation data amount are calculated based on the parallel subtask number of the preparation stage and the data segment identification corresponding to the preparation data when the statistics operator acquires the end identification data and the processed non-end identification data, so that the subtask number of the data processing stage after the preparation stage does not need to be consistent with the subtask number of the preparation stage, and the flexibility of each stage of the stream processing service is improved.
By labeling each piece of preparation data with the data segment identification and including the data segment identification in each piece of end identification data, the data segments can be distinguished by their data segment identifications in the statistics stage, which avoids statistical errors caused by mixing up data. In addition, because of the labeled data segment identifications, a data segment does not need to wait until the previous data segment has been processed, which greatly improves processing efficiency.
By dividing the stream processing service into four phases, namely a service scheduling phase, a data preparation phase, a data processing phase and a statistics phase, the loose coupling of the stream processing service phase is realized, and each phase can independently execute the task of the phase after receiving the data transmitted by the data channel, so that the applicability of the stream processing service is improved, and the practicability is enhanced.
Fig. 4 is a schematic diagram of main modules of a data fragment processing time calculation apparatus according to an embodiment of the present invention.
As shown in fig. 4, the data fragment processing time calculation apparatus 400 according to the embodiment of the present invention includes:
a receiving module 401, configured to receive a data segment, and identify a processing start time of the data segment;
a preparation module 402, configured to divide subtasks for the data preparation process of each data segment based on the parallelism of the source operator, and perform data preparation based on each subtask;
A determining module 403, configured to determine end identification data of each subtask in response to completion of data preparation, and use other data except the end identification data as non-end identification data; the end identification data comprises the preparation data quantity of the current subtask and the parallel subtask quantity;
a first calculation module 404, configured to obtain end identification data and non-end identification data of a target data segment by using a statistics operator, and calculate an end data amount and a calculated data amount of the target data segment respectively;
an obtaining module 405, configured to obtain a processing start time of a target data segment by using the statistics operator according to the processing start time of the data segment;
a second calculation module 406, configured to calculate a processing time of the target data segment based on the end data amount and the calculated data amount of the target data segment, and a processing start time of the target data segment.
In an alternative embodiment of the present invention, the apparatus further includes: the identification acquisition module is used for acquiring the data fragment identification of the data fragment.
In an alternative embodiment of the present invention, the determining module 403 is further configured to: obtain the preparation data and the corresponding prepared data amount of each subtask in response to the completion of data preparation; label each piece of preparation data with the data fragment identification; acquire the number of parallel subtasks of the data segment based on the subtask division result; take the number of parallel subtasks, the prepared data amount of the subtask and the data fragment identification as the end identification data of the corresponding subtask; and take all the preparation data, other than the above-mentioned end identification data, together with the data fragment identification of each piece of preparation data, as the non-end identification data.
In an alternative embodiment of the present invention, the apparatus further includes: and the processing module is used for carrying out data processing on the non-ending identification data.
The processing module is further configured to: perform data processing on the preparation data; and, in response to completion of the processing of the preparation data, delete the processed preparation data and retain the data fragment identification of each piece of processed preparation data.
In an alternative embodiment of the present invention, the first computing module 404 is further configured to: identifying the data fragment identification in the non-ending identification data by adopting a statistic operator; acquiring non-ending identification data of which the data segment identification is a target data segment identification; identifying the data segment identification in the end identification data by adopting the statistic operator; acquiring end identification data of which the data segment identification is a target data segment identification; and respectively calculating the end data quantity and the calculated data quantity of the target data segment based on the end identification data and the non-end identification data of which the data segment identification is the target data segment identification.
In an optional embodiment of the present invention, the calculating the end data amount and the calculated data amount of the target data segment based on the end identification data and the non-end identification data of the target data segment identification, respectively, includes: setting the initial values of the ending data quantity and the calculated data quantity of the target data segment to be 0; each time the statistics operator obtains one end identification data of the target data segment, the end data quantity is increased; and each time the statistic operator acquires a piece of non-ending identification data of the target data segment, the calculated data quantity is increased.
In an alternative embodiment of the present invention, the second computing module 406 is further configured to: identifying a processing end time of the target data segment in response to the end data amount of the target data segment being equal to the number of parallel subtasks of the target data segment and the calculated data amount of the target data segment being equal to the sum of the prepared data amounts of all subtasks of the target data segment; the processing time of the target data segment is calculated based on the processing start time of the target data segment and the processing end time of the target data segment.
According to the data segment processing time calculation device, it can be seen that the end identification data of each subtask in the data preparation process is determined, the non-end identification data is further determined, and the statistical operator is adopted to calculate the end data quantity and the calculation data quantity of the data segment according to the end identification data and the non-end identification data, so that the processing time of the data segment is calculated according to the calculation result and the processing start time of the data segment, the complexity of the processing time calculation is further reduced, the same statistical operator is used for calculating the processing time of all the data segments, the clock deviation problem of different nodes in the processing time calculation process is effectively solved, the problem of inaccurate results caused by the processing time calculation on different nodes is avoided, and the accuracy of the processing time of the statistical data segment in the distributed environment is improved.
Fig. 5 illustrates an exemplary system architecture 500 to which the data fragment processing time calculation method or the data fragment processing time calculation apparatus of the embodiment of the present invention may be applied.
As shown in fig. 5, the system architecture 500 may include terminal devices 501, 502, 503, a network 504, and a server 505. The network 504 is used as a medium to provide communication links between the terminal devices 501, 502, 503 and the server 505. The network 504 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
A user may interact with the server 505 via the network 504 using the terminal devices 501, 502, 503 to receive or transmit data or the like. Various communication client applications, such as shopping class applications, web browser applications, search class applications, instant messaging tools, mailbox clients, social platform software, etc., may be installed on the terminal devices 501, 502, 503.
The terminal devices 501, 502, 503 may be a variety of electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 505 may be a server providing various services, such as a background management server that calculates processing time for data segments transmitted by users using the terminal devices 501, 502, 503. The background management server may analyze and process the acquired data such as the data fragment, and feed back the processing result (for example, processing time) to the terminal device.
It should be noted that, the method for calculating the processing time of the data segment according to the embodiment of the present invention is generally executed by the server 505, and accordingly, the device for calculating the processing time of the data segment is generally disposed in the server 505.
It should be understood that the number of terminal devices, networks and servers in fig. 5 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 6, there is illustrated a schematic diagram of a computer system 600 suitable for use in implementing an embodiment of the present invention. The terminal device shown in fig. 6 is only an example, and should not impose any limitation on the functions and the scope of use of the embodiment of the present invention.
As shown in fig. 6, the computer system 600 includes a Central Processing Unit (CPU) 601, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the system 600 are also stored. The CPU 601, ROM 602, and RAM 603 are connected to each other through a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
The following components are connected to the I/O interface 605: an input section 606 including a keyboard, a mouse, and the like; an output section 607 including a Cathode Ray Tube (CRT) or Liquid Crystal Display (LCD), a speaker, and the like; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card or a modem. The communication section 609 performs communication processing via a network such as the internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 610 as needed, so that a computer program read from it can be installed into the storage section 608 as needed.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication portion 609, and/or installed from the removable medium 611. The above-described functions defined in the system of the present invention are performed when the computer program is executed by a Central Processing Unit (CPU) 601.
The computer readable medium shown in the present invention may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules involved in the embodiments of the present invention may be implemented in software or in hardware. The described modules may also be provided in a processor, for example, as: a processor includes a receiving module, a preparing module, a determining module, a first computing module, an obtaining module, and a second computing module. The names of these modules do not constitute a limitation on the module itself in some cases, and for example, the receiving module may also be described as "a module that receives a data fragment, identifies the processing start time of the above-mentioned data fragment".
As another aspect, the present invention also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments, or may exist alone without being assembled into the apparatus. The computer-readable medium carries one or more programs which, when executed by a device, cause the device to perform the following: receiving a data segment, and identifying the processing start time of the data segment; performing subtask division on the data preparation process of each data segment based on the parallelism of the source operator, and performing data preparation based on each subtask; determining end identification data of each subtask in response to completion of data preparation, and taking other data except the end identification data as non-end identification data, wherein the end identification data comprises the prepared data amount of the current subtask and the number of parallel subtasks; acquiring end identification data and non-end identification data of a target data segment by using a statistics operator, and respectively calculating the end data amount and the calculated data amount of the target data segment; acquiring the processing start time of the target data segment by using the statistics operator according to the processing start time of the data segment; and calculating the processing time of the target data segment based on the end data amount and the calculated data amount of the target data segment and the processing start time of the target data segment.
According to the technical scheme, the end identification data of each subtask in the data preparation process is determined, the non-end identification data is further determined, and the statistical operator is adopted to calculate the end data quantity and the calculation data quantity of the data fragments according to the end identification data and the non-end identification data, so that the processing time of the data fragments is calculated according to the calculation result and the processing start time of the data fragments, the complexity of the processing time calculation is further reduced, the same statistical operator is used for calculating the processing time of all the data fragments, the clock deviation problem of different nodes in the processing time calculation process is effectively solved, the problem of inaccurate results caused by the fact that the processing time calculation is carried out on different nodes is avoided, and the accuracy of the processing time of the statistical data fragments in the distributed environment is improved.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives can occur depending upon design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims (9)

1. A method for calculating a processing time of a data segment, comprising:
receiving a data segment, and identifying the processing starting time of the data segment;
performing subtask division on the data preparation process of each data segment based on the parallelism of the source operator, and performing data preparation based on each subtask;
determining end identification data of each subtask in response to completion of data preparation, and taking other data except the end identification data as non-end identification data; the end identification data comprises the preparation data quantity of the current subtask and the parallel subtask quantity;
acquiring end identification data and non-end identification data of a target data segment by adopting a statistic operator, and respectively calculating the end data quantity and calculated data quantity of the target data segment;
acquiring the processing start time of the target data segment by adopting the statistical operator according to the processing start time of the data segment;
calculating the processing time of the target data segment based on the ending data amount and the calculated data amount of the target data segment and the processing start time of the target data segment;
wherein the calculating the processing time of the target data segment based on the end data amount and the calculated data amount of the target data segment, and the processing start time of the target data segment includes:
Identifying a processing end time of the target data segment in response to an end data amount of the target data segment being equal to a number of parallel subtasks of the target data segment and a calculated data amount of the target data segment being equal to a sum of prepared data amounts of all subtasks of the target data segment;
the processing time of the target data segment is calculated based on the processing start time of the target data segment and the processing end time of the target data segment.
2. The data fragment processing time calculation method according to claim 1, wherein after the receiving a data segment and identifying the processing start time of the data segment, the method further comprises:
and acquiring the data segment identification of the data segment.
3. The method according to claim 2, wherein said determining end identification data of each of said subtasks in response to completion of data preparation and taking other data than end identification data as non-end identification data comprises:
obtaining preparation data and corresponding preparation data quantity of each subtask in response to data preparation completion;
labeling each piece of preparation data with a data fragment identifier;
acquiring the number of parallel subtasks of the data segment based on the subtask division result;
the number of parallel subtasks, the prepared data amount of the subtasks and the data fragment identification are used as the end identification data of the corresponding subtasks;
taking all the preparation data, other than the end identification data, together with the data fragment identification of each piece of preparation data, as the non-end identification data.
4. A data fragment processing time calculation method according to claim 3, wherein after said determining end identification data of each of said subtasks in response to completion of data preparation and taking other data than the end identification data as non-end identification data, said method further comprises: performing data processing on the non-ending identification data;
the data processing of the non-ending identification data comprises the following steps:
performing data processing on the prepared data;
and deleting the prepared data after the processing of the prepared data is completed, and reserving the data fragment identification of each piece of prepared data after the processing of the prepared data is completed.
5. The method for calculating a processing time of a data segment according to claim 2, wherein the obtaining end identification data and non-end identification data of a target data segment using a statistics operator, respectively calculating an end data amount and a calculation data amount of the target data segment, comprises:
Identifying the data fragment identification in the non-ending identification data by adopting a statistic operator;
acquiring non-ending identification data of which the data segment identification is a target data segment identification;
identifying a data fragment identifier in the end identifier data by adopting the statistic operator;
acquiring end identification data of which the data segment identification is a target data segment identification;
and respectively calculating the end data quantity and the calculated data quantity of the target data segment based on the end identification data and the non-end identification data of which the data segment identification is the target data segment identification.
6. The method according to claim 5, wherein respectively calculating the end data amount and the calculated data amount of the target data segment based on the end identification data and the non-end identification data of the target data segment comprises:
setting the initial values of the end data amount and the calculated data amount of the target data segment to 0; and
incrementing the end data amount each time the statistics operator acquires one piece of end identification data of the target data segment, and incrementing the calculated data amount each time the statistics operator acquires one piece of non-end identification data of the target data segment.
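The counting described in claims 5 and 6 can be sketched as follows. This is an illustration only: the class `FragmentCounters`, its field names, and the boolean `is_end_marker` flag are assumptions introduced here, and the patent does not specify how a record is recognized as end identification data.

```python
from dataclasses import dataclass


@dataclass
class FragmentCounters:
    """Per-fragment counters kept by a hypothetical statistics operator."""
    target_fragment_id: str
    end_count: int = 0         # end data amount, initial value 0
    calculated_count: int = 0  # calculated data amount, initial value 0

    def observe(self, fragment_id: str, is_end_marker: bool) -> None:
        # Records that belong to other fragments are ignored.
        if fragment_id != self.target_fragment_id:
            return
        if is_end_marker:
            self.end_count += 1
        else:
            self.calculated_count += 1
```

For example, feeding the operator two non-end records and one end record for fragment `"a"`, plus one record for another fragment, leaves `end_count` at 1 and `calculated_count` at 2.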
7. A data fragment processing time calculation apparatus, comprising:
a receiving module, configured to receive a data segment and identify a processing start time of the data segment;
a preparation module, configured to divide the data preparation process of each data segment into subtasks based on the parallelism of a source operator, and to prepare data based on each subtask;
a determining module, configured to determine end identification data of each subtask in response to completion of data preparation, and to take data other than the end identification data as non-end identification data, wherein the end identification data comprises the prepared data amount of the current subtask and the number of parallel subtasks;
a first calculation module, configured to acquire end identification data and non-end identification data of a target data segment by using a statistics operator, and to respectively calculate an end data amount and a calculated data amount of the target data segment;
an acquisition module, configured to acquire, by using the statistics operator, the processing start time of the target data segment according to the processing start time of the data segment; and
a second calculation module, configured to calculate a processing time of the target data segment based on the end data amount and the calculated data amount of the target data segment and the processing start time of the target data segment;
wherein the second calculation module is further configured to:
identify a processing end time of the target data segment in response to the end data amount of the target data segment being equal to the number of parallel subtasks of the target data segment and the calculated data amount of the target data segment being equal to the sum of the prepared data amounts of all subtasks of the target data segment; and
calculate the processing time of the target data segment based on the processing start time of the target data segment and the processing end time of the target data segment.
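The completion condition used by the second calculation module might be expressed as in the sketch below. It assumes timestamps are floating-point seconds and that the moment the condition is first observed (`now`) is taken as the processing end time; the function name and all parameter names are hypothetical.

```python
from typing import Optional, Sequence


def processing_time(
    start_time: float,
    now: float,
    end_count: int,
    calculated_count: int,
    parallel_subtasks: int,
    prepared_amounts: Sequence[int],
) -> Optional[float]:
    """Return the elapsed time once the fragment is complete, else None."""
    # Every parallel subtask has emitted its end identification data.
    all_subtasks_ended = end_count == parallel_subtasks
    # Every prepared record reported by the subtasks has been counted.
    all_records_counted = calculated_count == sum(prepared_amounts)
    if all_subtasks_ended and all_records_counted:
        # `now` is treated as the processing end time (an assumption).
        return now - start_time
    return None
```

With two subtasks that prepared 3 and 2 records, the function returns a duration only after both end markers and all 5 records have been counted; before that it returns `None`.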
8. An electronic device for data segment processing time calculation, comprising:
one or more processors; and
a storage device storing one or more programs which,
when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-6.
9. A computer-readable medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method according to any one of claims 1-6.
CN202311597815.8A 2023-11-28 2023-11-28 Method and device for calculating data fragment processing time Active CN117312761B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311597815.8A CN117312761B (en) 2023-11-28 2023-11-28 Method and device for calculating data fragment processing time

Publications (2)

Publication Number Publication Date
CN117312761A (en) 2023-12-29
CN117312761B (en) 2024-03-05

Family

ID=89250167

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311597815.8A Active CN117312761B (en) 2023-11-28 2023-11-28 Method and device for calculating data fragment processing time

Country Status (1)

Country Link
CN (1) CN117312761B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112181797A (en) * 2020-10-28 2021-01-05 武汉悦学帮网络技术有限公司 Software platform operation time-consuming calculation method and device, storage medium and equipment
CN113176937A (en) * 2021-05-21 2021-07-27 北京字节跳动网络技术有限公司 Task processing method and device and electronic equipment
CN115129460A (en) * 2021-03-25 2022-09-30 上海寒武纪信息科技有限公司 Method and device for acquiring operator hardware time, computer equipment and storage medium
CN115982148A (en) * 2023-01-10 2023-04-18 中国建设银行股份有限公司 Database table processing method and device, electronic equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10942979B2 (en) * 2018-08-29 2021-03-09 International Business Machines Corporation Collaborative creation of content snippets

Also Published As

Publication number Publication date
CN117312761A (en) 2023-12-29

Similar Documents

Publication Publication Date Title
CN109145023B (en) Method and apparatus for processing data
CN109002395B (en) Code coverage rate management method and device
CN110719215B (en) Flow information acquisition method and device of virtual network
US9251227B2 (en) Intelligently provisioning cloud information services
CN111478781B (en) Message broadcasting method and device
CN108764866B (en) Method and equipment for allocating resources and drawing resources
CN113127225A (en) Method, device and system for scheduling data processing tasks
CN110245014B (en) Data processing method and device
CN108696554B (en) Load balancing method and device
CN113778499B (en) Method, apparatus, device and computer readable medium for publishing services
CN110795328A (en) Interface testing method and device
CN117312761B (en) Method and device for calculating data fragment processing time
CN117076250A (en) Data processing method and device
CN115248735A (en) Log data output control method, device, equipment and storage medium
CN114049065A (en) Data processing method, device and system
CN109087097B (en) Method and device for updating same identifier of chain code
CN113778844A (en) Automatic performance testing method and device
CN113572704A (en) Information processing method, production end, consumption end and server
CN112749204A (en) Method and device for reading data
CN113114612B (en) Determination method and device for distributed system call chain
CN111786801A (en) Method and device for charging based on data flow
CN115309612B (en) Method and device for monitoring data
CN112783665B (en) Interface compensation method and device
CN110262756B (en) Method and device for caching data
CN110120958B (en) Task pushing method and device based on crowdsourcing mode

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant