CN110716959A - Streaming data processing method and device, electronic equipment and storage medium - Google Patents

Streaming data processing method and device, electronic equipment and storage medium

Info

Publication number
CN110716959A
CN110716959A
Authority
CN
China
Prior art keywords
data
pointer
storage medium
fragment
streaming
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910953104.7A
Other languages
Chinese (zh)
Inventor
谢维柱
邢越
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201910953104.7A priority Critical patent/CN110716959A/en
Publication of CN110716959A publication Critical patent/CN110716959A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 - Querying
    • G06F16/245 - Query processing
    • G06F16/2455 - Query execution
    • G06F16/24568 - Data stream processing; Continuous queries
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 - Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 - Addressing or allocation; Relocation
    • G06F12/0223 - User address space allocation, e.g. contiguous or non-contiguous base addressing
    • G06F12/023 - Free address space management
    • G06F12/0238 - Memory management in non-volatile memory, e.g. resistive RAM or ferroelectric memory
    • G06F12/0246 - Memory management in non-volatile memory, e.g. resistive RAM or ferroelectric memory in block erasable memory, e.g. flash memory
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 - Indexing; Data structures therefor; Storage structures
    • G06F16/2291 - User-Defined Types; Storage management thereof

Abstract

The application discloses a streaming data processing method and apparatus, an electronic device, and a storage medium, relating to the technical field of data processing. The specific implementation scheme is as follows: first, the running state of the instance corresponding to each data fragment in a first data storage container is obtained, where the first data storage container includes a plurality of data fragments and the cache data of a data issuing component is stored in the container at data-fragment granularity; next, overflow data fragments are determined from the running states, an overflow data fragment being a data fragment whose corresponding instance is in an abnormal running state; finally, the data in the overflow data fragments is cached in a storage medium. By caching the data fragments of instances with abnormal running states in the storage medium, the method processes abnormal and normal data fragments separately, so that a short-term abnormality of an operator instance at some level does not reduce the overall processing capacity.

Description

Streaming data processing method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a streaming data processing method and apparatus, an electronic device, and a storage medium.
Background
With the growth of server data processing capacity, stream computing is widely applied in large-scale distributed computing scenarios such as information feeds, search and database building, retrieval, and billing.
In stream computing, when the rate at which the system receives data briefly exceeds the rate at which it processes data, the system normally applies back pressure (back pressure) handling. Back pressure can arise for two reasons: 1. traffic surges beyond the processing capacity of the system; 2. a short-term abnormality of an operator instance at some level reduces processing capacity. At present, the prior art does not distinguish these two situations when handling back pressure and treats both uniformly in a back-pressure mode.
However, if the processing capacity is reduced by a short-term abnormality of an operator instance at some level, directly applying back pressure cannot isolate normal instances from abnormal ones, which wastes both streaming computing capacity and memory.
Disclosure of Invention
The application provides a streaming data processing method and apparatus, an electronic device, and a storage medium, to solve the problem of the overall processing capacity being reduced by a short-term abnormality of an operator instance at some level.
In a first aspect, the present application provides a streaming data processing method, including:
acquiring the running state of an instance corresponding to each data fragment in a first data storage container, wherein the first data storage container comprises a plurality of data fragments, and cache data of a data issuing component is stored in the first data storage container according to the granularity of the data fragments;
determining an overflow data fragment according to the running state, wherein the overflow data fragment is a data fragment corresponding to an instance of which the running state is an abnormal state;
and caching the data in the overflow data fragments into a storage medium.
In this embodiment, the running state of the instance corresponding to each data fragment in the first data storage container is obtained; the overflow data fragments, i.e. the fragments whose corresponding instances are in an abnormal running state, are then determined from those states; and the data in the overflow data fragments is cached in the storage medium. Data in the first data storage container can thus be overflowed at data-fragment granularity, so that the fragments of abnormal instances are cached in the storage medium, abnormal and normal data fragments are processed separately, and a short-term abnormality of an operator instance at some level no longer reduces the overall processing capacity.
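The three steps of the first aspect can be sketched as a minimal Python example. All names here (RunState, spill_abnormal_fragments, the dict-based container and medium) are illustrative assumptions, not the patent's actual implementation:

```python
from enum import Enum

class RunState(Enum):
    NORMAL = 0
    ABNORMAL = 1

def spill_abnormal_fragments(container, instance_state, medium):
    """Spill fragments whose corresponding instance is in an abnormal state.

    container: dict mapping fragment id -> list of buffered records
    instance_state: dict mapping fragment id -> RunState
    medium: dict standing in for landing files on the storage medium
    """
    # Step 1: obtain the running state of the instance behind each fragment.
    # Step 2: the overflow fragments are those whose instance is abnormal.
    overflow_ids = [fid for fid in container
                    if instance_state.get(fid) == RunState.ABNORMAL]
    # Step 3: cache the overflow fragments' data in the storage medium.
    for fid in overflow_ids:
        medium[fid] = container[fid]
        container[fid] = []  # fragment stays, but its in-memory buffer empties
    return overflow_ids
```

Normal fragments keep their data in memory, so only the abnormal fragment's instance pays the cost of the slower medium.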
In one possible design, the caching the data in the overflow data slice into a storage medium includes:
caching the data in the data units into a landing data file of the storage medium, wherein the overflow data fragment includes at least one data unit and each data unit stores its data according to a preset data structure.
In one possible design, the preset data structure includes a data identifier, a data recording unit, and landing file meta information, where the landing file meta information includes storage location information and file information;
the data identifier is the unique identifier of each data unit;
the data recording unit stores the data of the data unit, and once that data has been cached in a landing data file of the storage medium, the data unit becomes an empty unit;
the storage location information is the location information of the landing data file;
the file information is the location of the data unit's data within the landing data file.
In this embodiment, defining the preset data structure establishes the relationship between the identifier in the data fragment and the data stored in the landing data file, which facilitates subsequent data loading.
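The preset data structure can be made concrete with a short sketch. The class names SpillInfo and DataUnit, the triple layout of the identifier, and the spill method are illustrative assumptions layered on the fields the patent names (data identifier, data recording unit, landing file meta information with location and file information):

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class SpillInfo:
    offset: int   # storage location information within the landing data file
    file: str     # file information: which landing data file holds the record

@dataclass
class DataUnit:
    unique_id: tuple              # data identifier, e.g. (operator, key group, seq)
    record: Optional[Any]         # data recording unit; None once spilled
    spill_info: Optional[SpillInfo] = None

    def spill(self, file: str, offset: int):
        """Move the record to a landing file and leave the unit empty."""
        self.spill_info = SpillInfo(offset=offset, file=file)
        spilled, self.record = self.record, None  # unit becomes an empty unit
        return spilled
```

The identifier and spill_info stay in memory, so the unit can later locate its record in the landing file without holding the payload itself.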
In one possible design, after caching the data in the overflow data fragment in the storage medium, the method further includes:
acquiring a pointer resetting instruction;
resetting the identifier pointers of the data fragment according to the pointer reset instruction, wherein the identifier pointers include a first pointer and a second pointer: the first pointer points to the tail of the previously pulled data, and the second pointer points to the tail of the currently pulled data.
In this embodiment, after the pointer reset instruction is acquired, the first pointer and the second pointer may be reset to perform data reading again.
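The two-pointer design above can be sketched in a few lines. The class name FragmentCursor and the batch-size interface are hypothetical; only the last/next pointer roles come from the text:

```python
class FragmentCursor:
    """Tracks pulled data in a fragment with two pointers.

    `last` points at the tail of the previously pulled batch;
    `next` points at the tail of the batch currently being pulled.
    """
    def __init__(self):
        self.last = 0
        self.next = 0

    def pull(self, batch_size):
        """Advance both pointers and return the half-open range of this pull."""
        self.last = self.next
        self.next += batch_size
        return (self.last, self.next)

    def reset(self):
        """On a pointer-reset (SYNC) instruction, rewind so the last
        batch is read again."""
        self.next = self.last
```

After reset, the next pull re-covers the previous batch, which is exactly the "perform data reading again" behavior described above.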
In one possible design, the determining overflow data slices according to the operating state includes:
acquiring the occurrence frequency of the pointer resetting instruction;
if the occurrence frequency is less than a first frequency threshold, determining the data fragment corresponding to the pulled data as the overflow data fragment once the storage capacity proportion of the cache data in the first data storage container exceeds a first proportion threshold.
In one possible design, after the obtaining the occurrence frequency of the pointer reset instruction, the method further includes:
and if the occurrence frequency is greater than a second frequency threshold, cancelling the caching of the data in the overflow data fragment in the storage medium.
In one possible design, the storage medium is a solid state disk and/or a mechanical hard disk.
In one possible design, the streaming data processing method further includes:
processing first data in the data fragment;
and if the first data is cached in the storage medium, sending confirmation information, wherein the confirmation information is used for indicating an upstream node to recycle the first data.
In one possible design, the streaming data processing method further includes:
processing the first data in the data fragment to generate second data;
and if the second data is determined to be consumed by the downstream node, sending confirmation information, wherein the confirmation information is used for indicating the upstream node to recycle the first data.
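The two acknowledgement designs above differ only in what triggers the confirmation that lets the upstream node recycle the first data. A hedged sketch, where the function name should_ack and the mode labels "strong"/"weak" are assumptions borrowed from the feedback modes described later in the description:

```python
def should_ack(mode, cached_in_medium=False, consumed_downstream=False):
    """Decide when to send the acknowledgement that lets the upstream
    node recycle the first data.

    "strong": acknowledge once the first data is cached in the storage medium.
    "weak":   acknowledge once the second data generated from it has been
              consumed by the downstream node.
    """
    if mode == "strong":
        return cached_in_medium
    if mode == "weak":
        return consumed_downstream
    raise ValueError(f"unknown feedback mode: {mode}")
```

In the strong mode the spilled copy itself is durable enough to release upstream buffers early; in the weak mode release waits for downstream consumption.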
In a second aspect, the present application further provides a streaming data processing apparatus, including:
an obtaining module, configured to obtain the running state of the instance corresponding to each data fragment in a first data storage container, where the first data storage container includes a plurality of data fragments, and cache data of a data issuing component is stored in the first data storage container at data-fragment granularity;
the processing module is used for determining an overflow data fragment according to the running state, wherein the overflow data fragment is a data fragment corresponding to an instance of which the running state is an abnormal state;
and the cache module is used for caching the data in the overflow data fragments into a storage medium.
In one possible design, the cache module is specifically configured to:
caching the data in the data units into a landing data file of the storage medium, wherein the overflow data fragment includes at least one data unit and each data unit stores its data according to a preset data structure.
In one possible design, the preset data structure includes a data identifier, a data recording unit, and landing file meta information, where the landing file meta information includes storage location information and file information;
the data identifier is the unique identifier of each data unit;
the data recording unit stores the data of the data unit, and once that data has been cached in a landing data file of the storage medium, the data unit becomes an empty unit;
the storage location information is the location information of the landing data file;
the file information is the location of the data unit's data within the landing data file.
In one possible design, the obtaining module is further configured to obtain a pointer reset instruction;
the processing module is further configured to reset the identifier pointers of the data fragment according to the pointer reset instruction, where the identifier pointers include a first pointer and a second pointer: the first pointer points to the tail of the previously pulled data, and the second pointer points to the tail of the currently pulled data.
In one possible design, the processing module is specifically configured to:
acquiring the occurrence frequency of the pointer resetting instruction;
if the occurrence frequency is less than a first frequency threshold, determine the data fragment corresponding to the pulled data as the overflow data fragment once the storage capacity proportion of the cache data in the first data storage container exceeds a first proportion threshold.
In a possible design, the processing module is further configured to cancel caching of data in the overflow data slice in the storage medium if the occurrence frequency is greater than a second frequency threshold.
In one possible design, the storage medium is a solid state disk and/or a mechanical hard disk.
In one possible design, the processing module is specifically configured to:
processing first data in the data fragment;
and if the first data is cached in the storage medium, sending confirmation information, wherein the confirmation information is used for indicating an upstream node to recycle the first data.
In one possible design, the processing module is specifically configured to:
processing the first data in the data fragment to generate second data;
and if the second data is determined to be consumed by the downstream node, sending confirmation information, wherein the confirmation information is used for indicating the upstream node to recycle the first data.
In a third aspect, the present application further provides an electronic device, including:
a processor; and
a memory for storing a computer program for the processor;
wherein the processor is configured to implement any one of the possible methods of the first aspect by executing the computer program.
In a fourth aspect, the present application also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements any one of the possible methods of the first aspect.
One embodiment in the above application has the following advantages or benefits:
the running state of the instance corresponding to each data fragment in the first data storage container is obtained; the overflow data fragments corresponding to instances whose running state is abnormal are then determined from those states; and the data in the overflow data fragments is cached in the storage medium. Data in the first data storage container can thus be overflowed at data-fragment granularity, the fragments of abnormal instances are cached in the storage medium, abnormal and normal data fragments are processed separately, and a short-term abnormality of an operator instance at some level no longer reduces the overall processing capacity.
Other effects of the above-described alternative will be described below with reference to specific embodiments.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
fig. 1 is a schematic application scenario diagram of a streaming data processing method according to a first embodiment of the present application;
FIG. 2 is a diagram illustrating a limited capacity blocking queue relationship provided in a second embodiment of the present application;
fig. 3 is a schematic flow chart of a streaming data processing method according to a third embodiment of the present application;
FIG. 4 is a diagram illustrating a data mapping relationship between data slices in the embodiment shown in FIG. 3;
fig. 5 is a schematic flow chart of a streaming data processing method according to a fourth embodiment of the present application;
FIG. 6 is a data fragmentation granularity overflow diagram in the embodiment shown in FIG. 5;
fig. 7 is a schematic structural diagram of a streaming data processing apparatus according to a fifth embodiment of the present application;
fig. 8 is a block diagram of an electronic device for implementing the streaming data processing method according to the embodiment of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments to aid understanding; these details are to be considered exemplary only. Those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the application. Descriptions of well-known functions and constructions are omitted from the following description for clarity and conciseness.
In existing stream computing, when the rate at which the system receives data briefly exceeds the rate at which it processes data, the system typically applies back-pressure handling. Back pressure can arise for two reasons: 1. traffic surges beyond the processing capacity of the system; 2. a short-term abnormality of an operator instance at some level reduces processing capacity. The prior art, however, does not distinguish these two situations and treats both in a back-pressure mode. If the cause is a short-term operator-instance abnormality, directly applying back pressure cannot isolate normal instances from abnormal ones, which wastes both streaming computing capacity and memory. In view of these problems, embodiments of the present application provide a streaming data processing method, described in detail below through several specific implementations.
Fig. 1 is a schematic application scenario diagram of a streaming data processing method according to a first embodiment of the present application. As shown in fig. 1, the streaming data processing method according to the present embodiment may be applied to a streaming computing system, where the streaming computing system may include a plurality of nodes. For example, a streaming computing system may include: an upstream data node 110, a current compute node 120, and a downstream data node 130.
Fig. 2 is a diagram illustrating a limited-capacity blocking queue relationship according to a second embodiment of the present application. As shown in fig. 2, in the current computing node 120, a data pull receiving component (Transporter in, Tpin for short) pulls (fetch) data from the data delivery service component (Transporter out, Tpout for short) of the upstream data node 110 through a Remote Procedure Call (RPC). The pulled data is stored in the data storage structure of the data pull receiving component (Tpin stream data).
Then, the data is calculated and processed at the current computing node 120, and the data output to the downstream data node 130 is stored in the data delivery service component data storage structure (Tpout stream data), so as to provide data service for the downstream data node 130.
It should be noted that the data in Tpout stream data is stored by data fragment (Kg), for example: out-kg-1, out-kg-2, through out-kg-M. In addition, the existing data fragment queue (key group queue) in Tpout stream data can be configured as an overflow data fragment queue (spillable queue): when it is undesirable to cache too much data in memory (buffer) and waste memory, the data can be cached in a larger-capacity storage medium, for example a Solid State Disk (SSD) or a Hard Disk Drive (HDD). The data of Tpout stream data can therefore overflow at data-fragment granularity. The data fragments corresponding to instances in an abnormal running state are cached in the storage medium, so that a short-term abnormality of an operator instance at some level does not reduce the overall processing capacity.
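The spillable queue idea above, a bounded in-memory queue that sends overflow to a larger medium, can be sketched as follows. The class name SpillableQueue matches the patent's term, but the list-backed "backing" store and the refill-on-pop policy are illustrative assumptions; a real implementation would serialize records to an SSD/HDD landing file:

```python
from collections import deque

class SpillableQueue:
    """Bounded in-memory queue that spills overflow to a backing store."""
    def __init__(self, capacity, backing):
        self.capacity = capacity
        self.memory = deque()
        self.backing = backing           # spilled records, in arrival order

    def push(self, record):
        if len(self.memory) < self.capacity:
            self.memory.append(record)
        else:
            self.backing.append(record)  # overflow goes to the storage medium

    def pop(self):
        if self.memory:
            record = self.memory.popleft()
        else:
            return self.backing.pop(0) if self.backing else None
        # refill memory from the backing store to preserve FIFO order
        if self.backing:
            self.memory.append(self.backing.pop(0))
        return record
```

Producers never block on a full memory buffer; they just pay disk cost for the overflow portion, which is the trade the spillable queue makes instead of back pressure.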
With continued reference to FIG. 2, the downstream compute node obtains an instance through the component for pulling a single upstream instance (Tp-source), while the data processing central control component (Task Main) coordinates user business logic, data acquisition, data output, state storage, and so on. The state management component (State manager) restores and backs up the system state. The blocking queue storing deduplication marks (Dedup queue), the blocking queue storing incremental system state (Journal queue), and the blocking queue storing data processing results (Production queue) all support data push and pop operations. Further, Spillout or Spillable indicates that the queue capacity is limited and that, if the amount of data exceeds it, some data is stored in another medium; that is, the queue has data overflow capability to a local disk.
Fig. 3 is a schematic flow chart of a streaming data processing method according to a third embodiment of the present application. As shown in fig. 3, the streaming data processing method provided in this embodiment includes:
s101, acquiring the running state of the instance corresponding to each data fragment in the first data storage container.
In this step, the running state of the instance corresponding to each data fragment in the first data storage container may be obtained, where the first data storage container includes a plurality of data fragments and the cache data of the data issuing component is stored in the container at data-fragment granularity. With continued reference to fig. 2, the first data storage container in this step may be the Tpout stream data described above.
And S102, determining the overflow data fragments according to the running state.
After the running states of the instances corresponding to the respective data fragments are determined, the overflow data fragments can be determined from those states, where an overflow data fragment is a data fragment corresponding to an instance whose running state is abnormal. It is worth noting that this embodiment does not limit the specific form of the abnormal state: an abnormal state can be understood as any behavior that occupies Tpout stream data space.
S103, caching the data in the overflow data fragments into a storage medium.
Optionally, caching the data in the overflow data fragment in the storage medium can be understood as caching the data in its data units into a landing data file of the storage medium, where the overflow data fragment includes at least one data unit and each data unit stores its data according to a preset data structure.
Fig. 4 is a schematic diagram of a data mapping relationship of the data slice in the embodiment shown in fig. 3. As shown in fig. 4, the preset data structure may include a data identifier (Unique ID), a data recording unit (Record), and landing file meta information (spill info), where the landing file meta information (spill info) includes storage location information (offset) and file information (file).
Specifically, the data identifier is the unique identifier of each data unit; referring to the example (x, x, x) in fig. 4, it may correspond to (operator sequence number, key group sequence number, data sequence number). The data recording unit stores the data of the data unit, and when that data is cached in a landing data file of the storage medium, the data unit is set as an empty unit. The storage location information is the location information of the landing data file, and the file information is the location of the data unit's data within the landing data file.
In this embodiment, the running state of the instance corresponding to each data fragment in the first data storage container is obtained; the overflow data fragments corresponding to instances whose running state is abnormal are then determined from those states; and the data in the overflow data fragments is cached in the storage medium. Data in the first data storage container can thus be overflowed at data-fragment granularity, the fragments of abnormal instances are cached in the storage medium, abnormal and normal data fragments are processed separately, and a short-term abnormality of an operator instance at some level no longer reduces the overall processing capacity.
On the basis of the above embodiments, the storage medium may be a solid state disk and/or a mechanical hard disk. In addition, the acknowledgement (ACK) that allows the upstream node to recycle data may be sent in several ways.
In one possible implementation, the first data in the data fragment may be processed first, and if the first data is cached in the storage medium, the acknowledgement information is sent, where the acknowledgement information is used to instruct the upstream node to recycle the first data.
The above acknowledgement feedback mode may be defined as a strong data processing result feedback mode (strong processing mode): the data processing result already exists in a database (a table, for example a key-value database) in the form of incremental system state (journal), so when the processing process exits and restarts due to an exception (failover), the stored data is scanned out (scan) and used directly, and the data cached in the local disk needs no recovery.
In another possible implementation, the first data in the data fragment may be processed first to generate second data, and when it is determined that the second data has been fully consumed by the downstream node, the acknowledgement information may be sent, where the acknowledgement information instructs the upstream node to recycle the first data.
The above acknowledgement feedback mode may be defined as a weak data processing result feedback mode (weak production mode), and when failover occurs, data buffered in the local disk does not need to be recovered by means of upstream playback.
For at-most-once (At-most-once) semantics, when the Tpout stream data is full, the previous data is dropped before new data is pushed (push), which preserves the at-most-once guarantee.
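The drop-before-push behavior for at-most-once delivery can be shown in a few lines; the function name and list-based queue are illustrative assumptions:

```python
def push_at_most_once(queue, capacity, record):
    """Push a record into a bounded queue with at-most-once semantics:
    when the queue is full, drop the oldest record instead of blocking
    the producer, so each record is delivered at most once."""
    if len(queue) >= capacity:
        queue.pop(0)      # drop (cancel) previous data to make room
    queue.append(record)
```

This trades completeness for liveness: the producer never waits, at the cost of possibly losing the dropped records.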
Fig. 5 is a schematic flow chart of a streaming data processing method according to a fourth embodiment of the present application. As shown in fig. 5, the streaming data processing method provided in this embodiment includes:
s201, acquiring the running state of the corresponding instance of each data fragment in the first data storage container.
And S202, determining the overflow data fragments according to the running state.
And S203, caching the data in the overflow data fragments into a storage medium.
It should be noted that the implementations of S201 to S203 are similar to those of S101 to S103 in the embodiment shown in fig. 3 and are not described again in this embodiment.
S204, acquiring a pointer resetting instruction.
In this step, when the downstream restarts or network jitter occurs, a synchronization (SYNC) instruction is sent to feed back the pointer position.
And S205, resetting the identification pointer of the data fragment according to the pointer resetting instruction.
In this step, fig. 6 is a schematic diagram of data-fragment-granularity overflow in the embodiment shown in fig. 5. Referring to fig. 6, after SYNC is received, the identifier pointers of the data fragments are reset according to the pointer reset instruction. The identifier pointers include a first pointer (last) and a second pointer (next): the first pointer points to the tail of the previously pulled data, and the second pointer points to the tail of the currently pulled data.
In addition, the occurrence frequency of the pointer reset instruction may be obtained; if that frequency is below the first frequency threshold, then once the storage capacity proportion of the cache data in the first data storage container exceeds the first proportion threshold, the data fragment corresponding to the pulled data is determined to be an overflow data fragment. Specifically, for an application running stably online, the restart rate is low and SYNC should be extremely rare; if the SYNC frequency is extremely low and the capacity reaches 80%, the data fragment corresponding to the pulled data is determined to be an overflow data fragment.
However, if the SYNC frequency is too high, frequent reads and writes would occur; in this case, caching of data in the overflow data fragment into the storage medium needs to be cancelled. Specifically, when the occurrence frequency is greater than the second frequency threshold, caching of the data in the overflow data fragment into the storage medium is cancelled.
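The frequency-based spill decision described in the two paragraphs above can be sketched as a single function. The concrete threshold values are illustrative assumptions; the text only requires a first (low) and second (high) frequency threshold plus a capacity proportion threshold (80% in the example):

```python
def overflow_decision(sync_freq, capacity_ratio,
                      low_freq_threshold=0.01,    # first frequency threshold
                      high_freq_threshold=0.1,    # second frequency threshold
                      capacity_threshold=0.8):    # first proportion threshold
    """Decide whether the pulled data's fragment should spill to the
    storage medium, based on how often SYNC (pointer reset) occurs and
    how full the first data storage container is."""
    if sync_freq > high_freq_threshold:
        # Frequent SYNC means frequent re-reads: spilling would cause
        # constant disk I/O, so cancel spilling entirely.
        return "cancel-spill"
    if sync_freq < low_freq_threshold and capacity_ratio > capacity_threshold:
        # Stable application, cache nearly full: spill the fragment.
        return "spill"
    return "keep-in-memory"
```

The two frequency thresholds carve the behavior into three regimes, so a briefly jittery downstream does not immediately flip the system between spilling and not spilling.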
After SYNC is received (covering both downstream restart and network jitter), the cursor is reset and the data needed in memory is read back according to the spillout info; a small number of data records can be read at a time and cached in memory.
If the Tpout spillout holds too much data and the blocking queue storing data processing results (the Production queue) is limited, a consuming thread for the data sink's production_notification_queue may be started before the data in the table is scanned; the state management component blocks if the production notification queue is full.
Fig. 7 is a schematic structural diagram of a streaming data processing apparatus according to a fifth embodiment of the present application. As shown in fig. 7, an embodiment provides a streaming data processing apparatus 300, including:
an obtaining module 301, configured to obtain an operating state of an instance corresponding to each data fragment in a first data storage container, where the first data storage container includes multiple data fragments, and cache data of a data issuing component is stored in the first data storage container according to a data fragment granularity;
a processing module 302, configured to determine an overflow data fragment according to the running state, where the overflow data fragment is a data fragment corresponding to an instance in which the running state is an abnormal state;
a cache module 303, configured to cache the data in the overflow data fragment into a storage medium.
In a possible design, the cache module 303 is specifically configured to:
cache the data in the data units into a landing data file of the storage medium, where the overflow data fragment includes at least one data unit, and the data units store data according to a preset data structure.
In one possible design, the preset data structure includes a data identifier, a data recording unit, and landing file meta information, where the landing file meta information includes storage location information and file information;
the data identifier is the unique identifier of each data unit;
the data recording unit is used to store the data of the data unit, and once the data of the data unit has been cached in a landing data file of the storage medium, the data unit becomes an empty unit;
the storage location information is the location information of the landing data file;
the file information is the location of the data unit's data within the landing data file.
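A minimal sketch of the preset data structure described in this design; all field and method names are illustrative assumptions, not identifiers from the patent:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DataUnit:
    """One unit of data inside an overflow data fragment."""
    data_id: str                       # unique identifier of the unit
    record: Optional[bytes] = None     # in-memory payload; None once spilled
    file_path: Optional[str] = None    # storage location of the landing file
    file_offset: Optional[int] = None  # position of the data inside that file
    file_length: Optional[int] = None

    def spill(self, path, offset):
        """Record the landing-file meta information and empty the unit."""
        length = len(self.record)
        self.file_path, self.file_offset, self.file_length = path, offset, length
        self.record = None             # the unit becomes an empty unit
        return length
```

After `spill()`, only the meta information remains in memory; the payload can later be re-read from `file_path` at `file_offset`.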
In a possible design, the obtaining module 301 is further configured to obtain a pointer resetting instruction;
the processing module is further configured to reset an identifier pointer of the data fragment according to the pointer resetting instruction, where the identifier pointer includes a first pointer and a second pointer, the first pointer points to the tail of the previously pulled data, and the second pointer points to the tail of the currently pulled data.
In one possible design, the processing module 302 is specifically configured to:
acquiring the occurrence frequency of the pointer resetting instruction;
if the occurrence frequency is less than a first frequency threshold, determining the data fragment corresponding to the pulled data as the overflow data fragment when the storage capacity proportion of the cache data in the first data storage container exceeds a first proportion threshold.
In a possible design, the processing module 302 is further configured to cancel caching the data in the overflow data fragment into the storage medium if the occurrence frequency is greater than a second frequency threshold.
In one possible design, the storage medium is a solid state disk and/or a mechanical hard disk.
In one possible design, the processing module 302 is specifically configured to:
processing first data in the data fragment;
and if the first data is cached in the storage medium, sending confirmation information, wherein the confirmation information is used for indicating an upstream node to recycle the first data.
In one possible design, the processing module 302 is specifically configured to:
processing the first data in the data fragment to generate second data;
and if the second data is determined to be consumed by the downstream node, sending confirmation information, wherein the confirmation information is used for indicating the upstream node to recycle the first data.
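The acknowledgement paths in the two designs above can be sketched together; function names and the `transform` placeholder are illustrative assumptions, not the patent's exact flow:

```python
def transform(data):
    # Placeholder for the operator instance's real processing logic.
    return data.upper()

def process_and_ack(first_data, spilled, downstream_consumed, send_ack):
    """Process first data and acknowledge upstream when it is safe.

    The upstream node may recycle `first_data` either once it has been
    cached in the storage medium, or once the second data derived from
    it has been consumed by the downstream node.
    """
    second_data = transform(first_data)       # process the first data
    if spilled or downstream_consumed:
        send_ack(first_data)                  # upstream may recycle it
    return second_data
```

In both variants the acknowledgement is sent only after the data is durable somewhere else, which is what allows the upstream buffer to be reclaimed safely.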
It should be noted that the streaming data processing apparatus provided in the embodiment shown in fig. 7 can be used to execute the method provided in any of the above embodiments, and the specific implementation manner and the technical effect are similar, and are not described again here.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
Fig. 8 is a block diagram of an electronic device for implementing the streaming data processing method according to the embodiment of the present application. As shown in FIG. 8, the electronic device is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 8, the electronic device includes: one or more processors 401, a memory 402, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing a portion of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 8, one processor 401 is taken as an example.
Memory 402 is a non-transitory computer readable storage medium as provided herein. Wherein the memory stores instructions executable by at least one processor to cause the at least one processor to perform the streaming data processing method provided herein. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to perform the streaming data processing method provided by the present application.
The memory 402, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules (e.g., the acquisition module 301, the processing module 302, and the cache module 303 shown in fig. 7) corresponding to the streaming data processing method in the embodiment of the present application. The processor 401 executes various functional applications of the server and data processing by running non-transitory software programs, instructions, and modules stored in the memory 402, that is, implements the streaming data processing method in the above-described method embodiments.
The memory 402 may include a storage program area and a storage data area, where the storage program area may store an operating system and an application program required for at least one function, and the storage data area may store data created through use of the electronic device provided according to the embodiments of the present application, and the like. Further, the memory 402 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 402 may optionally include memory located remotely from the processor 401, which may be connected to the electronic device of the embodiments of the present application via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device provided by the embodiment of the application may further include: an input device 403 and an output device 404. The processor 401, the memory 402, the input device 403 and the output device 404 may be connected by a bus or other means, and fig. 8 illustrates an example of a connection by a bus.
The input device 403 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device provided by embodiments of the present application, such as a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, a joystick, and other input devices. The output devices 404 may include a display device, auxiliary lighting devices (e.g., LEDs), and haptic feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, ASICs (application-specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
According to the technical solutions of the embodiments of the present application, the running state of the instance corresponding to each data fragment in the first data storage structure is acquired, the overflow data fragments, namely those corresponding to instances whose running state is abnormal, are determined according to the running states, and the data in the overflow data fragments is cached into the storage medium. The data in the first data storage structure can therefore be spilled at data-fragment granularity: the data fragments of instances in an abnormal running state are cached into the storage medium, abnormal and normal data fragments are processed separately, and a short-lived abnormality of an operator instance at one level is prevented from degrading the overall processing capacity.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, and the present application is not limited in this respect, as long as the desired results of the technical solutions disclosed in the present application can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (20)

1. A streaming data processing method, comprising:
acquiring the running state of an instance corresponding to each data fragment in a first data storage structure, wherein the first data storage structure comprises a plurality of data fragments, and cache data of a data issuing component is stored in the first data storage structure according to the granularity of the data fragments;
determining an overflow data fragment according to the running state, wherein the overflow data fragment is a data fragment corresponding to an instance of which the running state is an abnormal state;
and caching the data in the overflow data fragments into a storage medium.
2. The streaming data processing method according to claim 1, wherein the caching the data in the overflow data fragment into a storage medium comprises:
caching the data in the data units into a landing data file of the storage medium, wherein the overflow data fragment comprises at least one data unit, and the data units store data according to a preset data structure.
3. The streaming data processing method according to claim 2, wherein the preset data structure includes a data identifier, a data recording unit, and landing file meta information, the landing file meta information including storage location information and file information;
the data identifier is a unique identifier of each unit of data;
the data recording unit is used for storing data in the unit data, and when the data in the data unit is cached in a landing data file of the storage medium, the data unit is an empty unit;
the storage position information is the position information of the landing data file;
the file information is the position information of the data in the unit data in the landing data file.
4. The streaming data processing method according to claim 3, further comprising, after the caching the data in the overflow data fragment into a storage medium:
acquiring a pointer resetting instruction;
and resetting the identification pointer of the data fragment according to the pointer resetting instruction, wherein the identification pointer comprises a first pointer and a second pointer, the first pointer is used for pointing to the tail of the last pulled data, and the second pointer is used for pointing to the tail of the current pulled data.
5. The streaming data processing method of claim 4, wherein the determining an overflow data fragment according to the running state comprises:
acquiring the occurrence frequency of the pointer resetting instruction;
if the occurrence frequency is less than a first frequency threshold, determining the data fragment corresponding to the pulled data as the overflow data fragment when the storage capacity proportion of the cache data in the first data storage structure exceeds a first proportion threshold.
6. The streaming data processing method according to claim 5, further comprising, after the obtaining the frequency of occurrence of the pointer reset instruction:
and if the occurrence frequency is greater than a second frequency threshold, cancelling the caching of the data in the overflow data fragment in the storage medium.
7. The streaming data processing method according to any one of claims 1-6, wherein the storage medium is a solid state disk and/or a mechanical hard disk.
8. The streaming data processing method according to any one of claims 1 to 6, further comprising:
processing first data in the data fragment;
and if the first data is cached in the storage medium, sending confirmation information, wherein the confirmation information is used for indicating an upstream node to recycle the first data.
9. The streaming data processing method according to any one of claims 1 to 6, further comprising:
processing the first data in the data fragment to generate second data;
and if the second data is determined to be consumed by the downstream node, sending confirmation information, wherein the confirmation information is used for indicating the upstream node to recycle the first data.
10. A streaming data processing apparatus, comprising:
the system comprises an acquisition module, a storage module and a processing module, wherein the acquisition module is used for acquiring the running state of an instance corresponding to each data fragment in a first data storage structure, the first data storage structure comprises a plurality of data fragments, and cache data of a data issuing assembly is stored in the first data storage structure according to the granularity of the data fragments;
the processing module is used for determining an overflow data fragment according to the running state, wherein the overflow data fragment is a data fragment corresponding to an instance of which the running state is an abnormal state;
and the cache module is used for caching the data in the overflow data fragments into a storage medium.
11. The streaming data processing apparatus according to claim 10, wherein the cache module is specifically configured to:
cache the data in the data units into a landing data file of the storage medium, wherein the overflow data fragment comprises at least one data unit, and the data units store data according to a preset data structure.
12. The streaming data processing apparatus of claim 11, wherein the preset data structure includes a data identifier, a data recording unit, and landing file meta information, the landing file meta information including storage location information and file information;
the data identifier is a unique identifier of each unit of data;
the data recording unit is used for storing data in the unit data, and when the data in the data unit is cached in a landing data file of the storage medium, the data unit is an empty unit;
the storage position information is the position information of the landing data file;
the file information is the position information of the data in the unit data in the landing data file.
13. The streaming data processing apparatus of claim 12, wherein the fetch module is further configured to fetch a pointer reset instruction;
the processing module is further configured to reset an identifier pointer of the data segment according to the pointer resetting instruction, where the identifier pointer includes a first pointer and a second pointer, the first pointer is used to point to the tail of the last pulled data, and the second pointer is used to point to the tail of the current pulled data.
14. The streaming data processing apparatus according to claim 13, wherein the processing module is specifically configured to:
acquiring the occurrence frequency of the pointer resetting instruction;
if the occurrence frequency is less than a first frequency threshold, determine the data fragment corresponding to the pulled data as the overflow data fragment when the storage capacity proportion of the cached data in the first data storage structure exceeds a first proportion threshold.
15. The streaming data processing apparatus of claim 14, wherein the processing module is further configured to cancel caching the data in the overflow data fragment into the storage medium if the occurrence frequency is greater than a second frequency threshold.
16. The streaming data processing apparatus according to any one of claims 10-15, wherein the storage medium is a solid state disk and/or a mechanical hard disk.
17. The streaming data processing apparatus according to any one of claims 10 to 15, wherein the processing module is specifically configured to:
processing first data in the data fragment;
and if the first data is cached in the storage medium, sending confirmation information, wherein the confirmation information is used for indicating an upstream node to recycle the first data.
18. The streaming data processing apparatus according to any one of claims 10 to 15, wherein the processing module is specifically configured to:
processing the first data in the data fragment to generate second data;
and if the second data is determined to be consumed by the downstream node, sending confirmation information, wherein the confirmation information is used for indicating the upstream node to recycle the first data.
19. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein the content of the first and second substances,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9.
20. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-9.
CN201910953104.7A 2019-10-09 2019-10-09 Streaming data processing method and device, electronic equipment and storage medium Pending CN110716959A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910953104.7A CN110716959A (en) 2019-10-09 2019-10-09 Streaming data processing method and device, electronic equipment and storage medium


Publications (1)

Publication Number Publication Date
CN110716959A true CN110716959A (en) 2020-01-21

Family

ID=69212331

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910953104.7A Pending CN110716959A (en) 2019-10-09 2019-10-09 Streaming data processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110716959A (en)


Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111666077A (en) * 2020-04-13 2020-09-15 北京百度网讯科技有限公司 Operator processing method and device, electronic equipment and storage medium
CN111930748A (en) * 2020-08-07 2020-11-13 北京百度网讯科技有限公司 Data tracking method, device, equipment and storage medium for streaming computing system
CN111930748B (en) * 2020-08-07 2023-08-08 北京百度网讯科技有限公司 Method, device, equipment and storage medium for tracking data of streaming computing system
CN112019605A (en) * 2020-08-13 2020-12-01 上海哔哩哔哩科技有限公司 Data distribution method and system of data stream
CN112019605B (en) * 2020-08-13 2023-05-09 上海哔哩哔哩科技有限公司 Data distribution method and system for data stream
CN112860719A (en) * 2021-02-05 2021-05-28 北京百度网讯科技有限公司 Data processing method and device and electronic equipment
CN112860719B (en) * 2021-02-05 2023-09-29 北京百度网讯科技有限公司 Data processing method and device and electronic equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination