CN107818012B - Data processing method and device and electronic equipment - Google Patents


Info

Publication number
CN107818012B
CN107818012B (application CN201610818710.4A)
Authority
CN
China
Prior art keywords
task
data
data reading
partition
reading
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610818710.4A
Other languages
Chinese (zh)
Other versions
CN107818012A (en)
Inventor
刘峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN201610818710.4A priority Critical patent/CN107818012B/en
Publication of CN107818012A publication Critical patent/CN107818012A/en
Application granted granted Critical
Publication of CN107818012B publication Critical patent/CN107818012B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F 9/5038 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F 9/4806 Task transfer initiation or dispatching
    • G06F 9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F 9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2209/00 Indexing scheme relating to G06F 9/00
    • G06F 2209/50 Indexing scheme relating to G06F 9/50
    • G06F 2209/5021 Priority

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The application provides a data processing method, a data processing apparatus, and an electronic device. The data processing method includes: putting a data reading task generated by a partition into a task queue; and, when the number of read threads in a thread pool has not reached a predetermined upper limit, extracting a data reading task from the task queue, establishing a read thread according to the extracted task, and putting the read thread into the thread pool, where the thread pool stores read threads that occupy processing resources in turn. With the method and apparatus, the efficiency of reading partition data can be improved when there are many partitions and the throughput differences between partitions are large.

Description

Data processing method and device and electronic equipment
Technical Field
The present invention relates to the field of computers, and in particular, to a data processing method and apparatus, and an electronic device.
Background
At present, the data source (or data transit point) of a cloud-computing big-data module is typically implemented with Kafka or a similar product such as MetaQ or LogHub. Kafka is a high-throughput distributed publish-subscribe messaging system in which the party that publishes messages is called the producer and the party that subscribes to messages is called the consumer. MetaQ is a high-performance, highly available, scalable distributed message middleware. LogHub is a log service that provides Kafka-like functionality.
Such a data source has two characteristics: it is divided into a plurality of partitions, and each partition can be consumed by only one thread, usually a read thread; these two characteristics allow higher throughput to be supported. A partition is the basic unit of distributed processing in Kafka and similar products, with the following properties: a partition holds data in first-in-first-out order; only one thread can consume a given partition; every piece of data has an offset (cursor) record; and every read returns, along with the data, the information of the current read position.
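The partition properties above (first-in-first-out data, a cursor returned with every read) can be sketched with a toy Python model; the `Partition` class and its methods are illustrative assumptions, not Kafka's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Partition:
    """Toy model of a Kafka-like partition: FIFO records read by offset."""
    records: list = field(default_factory=list)

    def append(self, record):
        # Records are held in first-in-first-out order.
        self.records.append(record)

    def read(self, offset, max_count=10):
        """Return up to max_count records starting at `offset`, together with
        the next offset, i.e. the 'current read position' returned by every read."""
        batch = self.records[offset:offset + max_count]
        return batch, offset + len(batch)

p = Partition()
for i in range(5):
    p.append(f"msg-{i}")
batch, cursor = p.read(0, max_count=3)    # reads msg-0..msg-2, cursor becomes 3
batch2, cursor2 = p.read(cursor)          # the single consumer resumes at the cursor
```

Because only one thread consumes a partition, the cursor returned by each read is all the state the next read needs.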
In big data processing scenarios, Kafka serves many services and holds many partitions (shards). Because different services generate different amounts of data, the frequency and volume of data produced by each partition differ. In this situation, massive amounts of data must be processed with reasonable resources; otherwise a great deal of resources is wasted.
Each partition generates a data reading task corresponding to that partition; each data reading task corresponds to a read thread, and the read thread, when executed by the CPU, reads the data of the corresponding partition. For reading the data of a plurality of partitions, the related art includes the following three modes.
Mode one: multi-threaded, independent, non-stop processing.
As shown in FIG. 1, in this mode multiple read threads contend for limited CPU resources, and each read thread makes non-stop cyclic read attempts. The mode has the following three characteristics:
1. Each read thread attempts to read at every moment, whether or not there is data to be read.
2. In the extreme case where a read thread can never read available data, the thread is equivalent to a dead loop and consumes at least the entire user time of one CPU.
3. If a partition produces a small amount of data on average, the corresponding read thread reads only a little data each time, but the number of reads is large.
All three characteristics cause both the server responsible for reading data and the Kafka server to stay in a high-load state, wasting processing resources and reducing data throughput. In this case, the user consumes far more data-processing servers than are theoretically needed.
Mode two: cyclic reading with added Sleep.
In this mode, as shown in FIG. 2, the multi-threaded reads are generally organized as a queue, and the main thread performs a Sleep action to release its occupation of the CPU. All read threads cycle at intervals, with a pause between two cycles.
In this mode, if a partition with a small data volume is allocated CPU time for its read thread during each loop execution, CPU resources are wasted because there is no more data to read; meanwhile, a partition with a large data volume may not be allocated enough CPU time, so its data cannot be read completely and accumulates.
Mode three: Sleep under multithreading to release CPU resources.
This mode, shown in FIG. 3, combines the two modes above: multiple read threads contend for limited CPU resources, and each read thread has a fixed Sleep cycle.
In this mode, when there are too many read threads, the CPU is severely preempted. In addition, switching the CPU to another thread requires a context switch, that is, saving the running environment of the current thread and restoring the running environment of the thread being switched to. When there are too many read threads, the frequency of CPU context switches becomes very high, and a large amount of computing resources is consumed by context switching.
Disclosure of Invention
The application provides a data processing method, a data processing apparatus, and an electronic device, which can improve the efficiency of reading partition data when there are many partitions and the throughput differences between partitions are large.
The technical scheme is as follows.
A method of data processing, comprising:
putting a data reading task generated by a partition into a task queue; and
when the number of read threads in a thread pool has not reached a predetermined upper limit, extracting a data reading task from the task queue, establishing a read thread according to the extracted task, and putting the read thread into the thread pool, wherein the thread pool is used for storing read threads that occupy processing resources in turn.
Optionally, putting the data reading task generated by the partition into the task queue includes:
generating a timestamp for the data reading task generated by the partition, the timestamp indicating the moment at which execution of the data reading task should start, and putting the data reading task carrying the timestamp into the task queue;
and extracting the data reading task from the task queue includes:
extracting from the task queue a data reading task whose timestamp indicates a time prior to or equal to the extraction time.
Optionally, generating the timestamp for the data reading task generated by the partition includes:
for the data reading task generated by the partition, adding a determined delay length to the current time to obtain a predicted execution time, and using information representing the predicted execution time as the timestamp of the data reading task; the delay length is determined according to the amount of data read by the last data reading task of the partition, and the larger the amount of data, the shorter the delay length.
Optionally, determining the delay length according to the amount of data read by the last data reading task of the partition includes:
determining the delay length according to the interval to which the amount of data read by the last data reading task of the partition belongs and a preset correspondence between delay lengths and intervals of data volume.
Optionally, in the task queue, the data reading tasks are sorted from earliest to latest according to the times indicated by the timestamps they carry;
and putting the data reading task generated by the partition into the task queue further includes:
putting the data reading task into the corresponding position in the task queue according to the time indicated by the timestamp carried by the data reading task.
Optionally, the task queue is a priority queue, and the timestamp carried by a data reading task serves as the priority of that data reading task; the earlier the time indicated by the timestamp, the higher the priority.
Optionally, the predetermined upper limit of the number of threads in the thread pool is twice the number of CPUs for executing the read threads in the thread pool.
Optionally, the data processing method further includes:
and executing the reading threads in the thread pool in turn.
A data processing apparatus comprising:
a queue management module, configured to put a data reading task generated by a partition into a task queue; and
an extraction module, configured to, when the number of read threads in a thread pool has not reached a predetermined upper limit, extract a data reading task from the task queue, establish a read thread according to the extracted task, and put the read thread into the thread pool, wherein the thread pool is used for storing read threads that occupy processing resources in turn.
Optionally, the queue management module putting the data reading task generated by the partition into the task queue includes:
the queue management module generating a timestamp for the data reading task generated by the partition, the timestamp indicating the moment at which execution of the data reading task should start, and putting the data reading task carrying the timestamp into the task queue;
and the extraction module extracting the data reading task from the task queue includes:
the extraction module extracting from the task queue a data reading task whose timestamp indicates a time prior to or equal to the extraction time.
Optionally, the queue management module generating the timestamp for the data reading task generated by the partition includes:
the queue management module, for the data reading task generated by the partition, adding a determined delay length to the current time to obtain a predicted execution time, and using information representing the predicted execution time as the timestamp of the data reading task; the delay length is determined according to the amount of data read by the last data reading task of the partition, and the larger the amount of data, the shorter the delay length.
Optionally, determining the delay length according to the amount of data read by the last data reading task of the partition includes:
determining the delay length according to the interval to which the amount of data read by the last data reading task of the partition belongs and a preset correspondence between delay lengths and intervals of data volume.
Optionally, in the task queue, the data reading tasks are sorted from earliest to latest according to the times indicated by the timestamps they carry;
and the queue management module putting the data reading task generated by the partition into the task queue further includes:
the queue management module putting the data reading task into the corresponding position in the task queue according to the time indicated by the timestamp carried by the data reading task.
Optionally, the task queue is a priority queue, and the timestamp carried by a data reading task serves as the priority of that data reading task; the earlier the time indicated by the timestamp, the higher the priority.
Optionally, the predetermined upper limit of the number of threads in the thread pool is twice the number of CPUs for executing the read threads in the thread pool.
Optionally, the data processing apparatus further includes:
and the reading module is used for executing the reading threads in the thread pool in turn.
An electronic device for data processing, comprising: a memory and a processor;
the memory is configured to store a program for data processing; and the program for data processing, when read and executed by the processor, performs the following operations:
putting a data reading task generated by a partition into a task queue; and
when the number of read threads in a thread pool has not reached a predetermined upper limit, extracting a data reading task from the task queue, establishing a read thread according to the extracted task, and putting the read thread into the thread pool, wherein the thread pool is used for storing read threads that occupy processing resources in turn.
The application includes the following advantages:
In at least one embodiment of the application, when there are many partitions and the throughput differences between partitions are large, the data generated by the partitions can be read efficiently and CPU resources can be used reasonably. On one hand, the number of threads preempting CPU resources is controlled through the task queue: the data reading task of a partition does not occupy CPU resources by itself; it occupies CPU resources only after a thread has been established for it and placed in the thread pool. Because the number of threads in the thread pool is smaller than the number of partitions, large-scale CPU preemption conflicts do not occur and the frequency of CPU context switches is reduced, so the resources the CPU spends on context switching decrease and processing efficiency improves. On the other hand, because fewer threads occupy the CPU in turn, the time the CPU allocates to each thread can be longer, so the CPU can read more data from a partition each time. Furthermore, data reading tasks are extracted from the task queue to replenish the threads in the thread pool, so CPU resources can be used as fully as possible without waste.
In one implementation of the embodiments, a timestamp is added to each data reading task in the task queue, and tasks are extracted from the task queue according to their timestamps; by adjusting the timestamp of a data reading task, it can be ensured that tasks use the thread pool in order of service urgency. Optionally, the timestamp of the data reading task generated by a partition is determined according to the amount of data read from that partition last time, so that intelligent learning of the consumption data (i.e., the amount of data read) can be realized and the retry period, that is, the period with which a thread is established from the partition's data reading task and put into the thread pool, can be adjusted adaptively. Optionally, the data reading tasks are ordered in the task queue by the times indicated by their timestamps, so that when extracting tasks from the task queue it is not necessary to check the timestamps of all data reading tasks in the queue.
Of course, it is not necessary for any product to achieve all of the above-described advantages at the same time for the practice of the present application.
Drawings
FIG. 1 is a schematic diagram of the mode of multi-threaded, independent, non-stop processing in the related art;
FIG. 2 is a schematic diagram of the mode of cyclic reading with added Sleep in the related art;
FIG. 3 is a schematic diagram of the mode of Sleep under multithreading to release CPU resources in the related art;
FIG. 4 is a flowchart of a data processing method according to the first embodiment;
FIG. 5 is a schematic diagram of an implementation of an example of the first embodiment;
FIG. 6 is a diagram illustrating extraction of tasks in an example of the first embodiment;
FIG. 7 is a diagram illustrating the addition of tasks to a queue in an example of the first embodiment;
FIG. 8 is a schematic diagram of a data processing apparatus according to the second embodiment.
Detailed Description
The technical solutions of the present application will be described in more detail below with reference to the accompanying drawings and embodiments.
It should be noted that, if there is no conflict, the embodiments and the features of the embodiments can be combined with each other, all within the scope of protection of the present application. Additionally, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from that shown here.
In one configuration, a computing device for data processing may include one or more processors (CPUs), an input/output interface, a network interface, and a memory.
The memory may include a volatile memory in a computer-readable medium, a random access memory (RAM), and/or a non-volatile memory, such as a read-only memory (ROM) or a flash memory (flash RAM). The memory is an example of a computer-readable medium. The memory may include module 1, module 2, ..., module N (N being an integer greater than 2).
Computer-readable media include permanent and non-permanent, removable and non-removable media, which can implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media such as modulated data signals and carrier waves.
In an embodiment, as shown in fig. 4, a data processing method includes steps S110 to S120:
S110: putting a data reading task generated by a partition into a task queue;
S120: when the number of read threads in a thread pool has not reached a predetermined upper limit, extracting a data reading task from the task queue, establishing a read thread according to the extracted task, and putting the read thread into the thread pool, wherein the thread pool is used for storing read threads that occupy processing resources in turn.
In this embodiment, when there are many partitions and the throughput differences between partitions are large, the data generated by the partitions can be read efficiently and CPU resources can be used reasonably. On one hand, the number of threads preempting CPU resources is controlled through the task queue: the data reading task of a partition does not occupy CPU resources by itself, but only after a thread has been established for it and placed in the thread pool. In a big data processing scenario the number of partitions is very large, and the number of threads in the thread pool is necessarily smaller than the number of partitions, so large-scale CPU preemption conflicts do not occur and the frequency of CPU context switches is reduced; thus the resources the CPU spends on context switching decrease and processing efficiency improves. On the other hand, because fewer threads occupy the CPU in turn, the time the CPU allocates to each thread can be longer, so the CPU can read more data from a partition each time. Furthermore, data reading tasks are extracted from the task queue to replenish the threads in the thread pool, so CPU resources can be used as fully as possible without waste.
In this embodiment, the thread pool is a form of multi-threaded processing, and the read threads in the thread pool may all be background threads. Each read thread may use the default stack size, run at the default priority, and be in a multi-threaded unit. If a read thread is idle in managed code (e.g., waiting for an event), the thread pool inserts another helper thread to keep all processors busy. If all read threads in the thread pool remain busy while the queue still contains pending work, the thread pool creates another helper thread after a period of time, but the number of read threads never exceeds the predetermined upper limit. Read threads beyond the predetermined upper limit wait in the queue until other read threads in the thread pool complete.
In this embodiment, the read threads in the thread pool may occupy processing resources in turn by way of preemption, polling, and the like; for example, time slices may be allocated to different read threads, and the read threads may occupy processing resources in the allocated time slices.
In this embodiment, the processing resources occupied by the reading threads in the thread pool in turn may refer to CPU resources used for executing the reading threads in the thread pool.
In this embodiment, each partition generates data reading tasks. Since a partition can be consumed by only one thread, a new data reading task is not generated until the partition's previous data reading task has completed. A data reading task generated by a partition is first put into the task queue; after it is extracted, a read thread is established accordingly and put into the thread pool, and that read thread reads the data of the corresponding partition. For example, a read thread established according to the data reading task of partition A reads the data of partition A; after the read thread finishes executing, the data reading task of partition A is complete, and if partition A still has data to read, a new data reading task is generated and put into the task queue.
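The single-task-in-flight rule above can be sketched as follows; `on_read_complete` and the tuple task format are hypothetical names chosen for illustration, not terminology from the embodiment:

```python
import queue

task_queue = queue.Queue()

def on_read_complete(partition_id, more_data):
    # A partition regenerates its data reading task only after the previous
    # task completes, so at most one task per partition is ever in flight.
    if more_data:
        task_queue.put(("read", partition_id))

task_queue.put(("read", "A"))           # partition A's first data reading task
task = task_queue.get()                 # extracted; a read thread is built from it
on_read_complete("A", more_data=True)   # read thread finished, A still has data
```

This is what keeps the queue's size bounded by the number of partitions.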
In one implementation, the data processing method may be executed by a computing device (which may be, but is not limited to, a server) used to read the data of the partitions. The computing device allocates part of its processing resources to execute steps S110 to S120 above; that part of the processing resources can be regarded as a central control system. All or part of the remaining processing resources are used to execute the read threads in the thread pool.
In another implementation, an independent computing device, separate from the computing device used to read the data of the partitions, may perform steps S110 to S120 above and maintain the thread pool on behalf of the computing device that reads the partitions' data, while the latter executes the read threads in the thread pool.
In one implementation, data reading tasks may be extracted from the task queue periodically; after extraction, when the number of read threads in the thread pool is insufficient (i.e., has not reached the predetermined upper limit), the extracted data reading task is converted into a read thread and placed into the thread pool, that is, the established read thread is started.
In another implementation, the data reading task is extracted from the task queue, and a read thread established and placed into the thread pool, whenever the number of read threads in the thread pool is insufficient. The two approaches may also be combined: tasks are extracted periodically in the ordinary case, and if the number of read threads in the thread pool becomes insufficient before the extraction period arrives, extraction is performed immediately.
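A minimal sketch of this extraction step, using Python's standard `queue` and `threading` modules; `refill_pool`, `MAX_THREADS`, and `read_data` are illustrative names under assumed semantics, not part of the patent:

```python
import queue
import threading

MAX_THREADS = 8          # the predetermined upper limit
task_queue = queue.Queue()
pool = []                # the thread pool: currently established read threads

def read_data(task):
    pass                 # placeholder: read the corresponding partition's data

def refill_pool():
    """Extract tasks and establish read threads until the pool is full."""
    pool[:] = [t for t in pool if t.is_alive()]      # drop completed threads
    while len(pool) < MAX_THREADS and not task_queue.empty():
        task = task_queue.get()
        t = threading.Thread(target=read_data, args=(task,), daemon=True)
        pool.append(t)
        t.start()

for i in range(12):                  # twelve partitions each generate a task
    task_queue.put(i)
refill_pool()                        # only MAX_THREADS tasks become threads
```

In the periodic variant, `refill_pool` would run on a timer; in the combined variant, it would also be invoked whenever a read thread completes.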
In one implementation, the predetermined upper limit on the number of threads in the thread pool may be twice the number of CPUs for executing the threads in the thread pool.
For example, if 4 CPUs are used to execute the read threads in the thread pool, the predetermined upper limit on the number of threads in the thread pool is 8; that is, there may be at most 8 read threads.
In other implementations, the predetermined upper limit on the number of threads in the thread pool may be set to other values, determined empirically or experimentally. Typically, the predetermined upper limit is smaller than the number of partitions from which data is to be read.
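The 2x-CPU rule can be expressed directly; `pool_upper_limit` is an illustrative helper, and `ThreadPoolExecutor` is one possible (assumed) pool backend rather than the patent's own mechanism:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def pool_upper_limit(cpu_count):
    # Predetermined upper limit: twice the number of CPUs that execute
    # the read threads in the thread pool.
    return 2 * cpu_count

limit = pool_upper_limit(os.cpu_count() or 1)
executor = ThreadPoolExecutor(max_workers=limit)   # one possible pool backend
```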
In one implementation, putting the data reading task generated by the partition into the task queue may include:
generating a timestamp for the data reading task generated by the partition, the timestamp indicating the moment at which execution of the data reading task should start, and putting the data reading task carrying the timestamp into the task queue;
and extracting the data reading task from the task queue includes:
extracting from the task queue a data reading task whose timestamp indicates a time prior to or equal to the extraction time.
In this implementation, the extraction time may be, but is not limited to, the time at which extraction of data reading tasks from the task queue starts. The time indicated by the timestamp being equal to the extraction time means the indicated time is exactly the extraction time; the time indicated by the timestamp being prior to the extraction time means the indicated time is before the extraction time. For example, if the extraction time is 6:18:55 on a certain day, a timestamp indicating 6:18:55 on the same day is equal to the extraction time, while timestamps indicating 6:18:54 or 6:17:12 on the same day are prior to the extraction time.
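One way to realize "extract every task whose timestamp is prior to or equal to the extraction time" is a min-heap keyed on the timestamp; this is an assumed implementation sketch, with `put_task` and `extract_due` as invented names:

```python
import heapq

task_heap = []   # (execute_at, partition_id), ordered by earliest timestamp

def put_task(partition_id, execute_at):
    # The timestamp indicates the moment execution of the task should start.
    heapq.heappush(task_heap, (execute_at, partition_id))

def extract_due(now):
    """Pop every task whose timestamp is prior to or equal to `now`."""
    due = []
    while task_heap and task_heap[0][0] <= now:
        due.append(heapq.heappop(task_heap)[1])
    return due

put_task("A", 100.0)
put_task("B", 105.0)
put_task("C", 99.0)
ready = extract_due(now=100.0)   # C (99.0) and A (100.0) are due; B is not
```

A heap also gives the ordered-queue behavior described below: only the earliest timestamp ever needs to be inspected.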
In this implementation, the timestamp of a data reading task can be adjusted to ensure that tasks use the thread pool in order of service urgency. For example, for a data reading task generated by a partition whose service is urgent, a timestamp indicating an earlier time is generated.
In other implementations, the order of extracting data reading tasks may be determined in other ways. For example, tasks may be extracted according to a first-in-first-out rule; or, before a data reading task is put into the task queue, a priority determined by the urgency of the partition's service may be attached to it, with more urgent services receiving higher priorities, and tasks are then extracted from the queue in order of priority from high to low.
In an alternative of this implementation, generating the time stamp for the data reading task generated by the partition may include:
for the data reading task generated by the partition, the current time is added with the determined delay length to obtain the predicted execution time, and the information representing the predicted execution time is used as the timestamp of the data reading task; the delay length may be determined according to the data size read by the last data reading task of the partition (i.e., the total data size read in the execution process of the reading thread established by the data reading task); the larger the amount of data, the shorter the delay length.
In this alternative, the current time may refer to, but is not limited to, a time when the data reading task is added to the task queue, a time when the data reading task is received, a time when a timestamp is generated for the data reading task, or the like. The information indicating the expected execution time may be a numerical value, a numerical sequence, or the time itself.
In this alternative, when the amount of data last read from partition A was small, the next data reading task generated by partition A receives a timestamp indicating a relatively late time; that is, reading from partition A again is deferred for a while. When the amount of data last read from partition A was large, the next data reading task generated by partition A receives a timestamp indicating an earlier time; that is, partition A is read again as soon as possible.
In this alternative, how early the time indicated by a partition's task timestamp is depends on the amount of data read from the partition last time. The scheme thus learns from the consumption data (i.e., the amount of data read) to predict future waiting time: according to the amount read last time, the next retry period, that is, the period before a thread is established for the partition's data reading task and put into the thread pool, is lengthened or shortened as appropriate.
In this alternative, a correspondence between intervals of the read data amount and delay lengths may be established in advance; the interval into which the amount of data read by the last task falls is determined, and the delay length corresponding to that interval is used as the determined delay length.
Determining the delay length according to the amount of data read by the last data reading task of the partition may include:
determining the delay length according to the interval to which the amount of data read by the last data reading task of the partition belongs and the preset correspondence between delay lengths and data-amount intervals.
In other alternatives, the timestamp may be determined in another manner, for example according to the priority of the data reading task, the expected data volume, and the like.
In this alternative, for the first task of a partition, information indicating the time at which the task is enqueued may be used as its timestamp; alternatively, a predicted execution time may be obtained by adding a predetermined delay length to that time, and information indicating the predicted execution time used as the timestamp. In an alternative of this implementation, the data reading tasks in the task queue may be sorted from earliest to latest according to the times indicated by the timestamps they carry; that is, the earlier the time indicated by the timestamp, the closer the task is to the front of the queue.
The step of putting the data reading task generated by the partition into the task queue may further include:
and according to the time indicated by the timestamp carried by the data reading task, putting the data reading task into a corresponding position in a task queue.
For example, if the time indicated by the timestamp of data reading task T1 is 3:15:20 on a certain day, and the time indicated by the timestamp of data reading task T2 is 3:15:28 on the same day, then T2 is ordered after T1 in the task queue. If the time indicated by the timestamp of a data reading task T3 to be enqueued is 3:15:23 on the same day, then T3 is placed in the task queue after T1 and before T2.
In this alternative, when data reading tasks are extracted from the task queue, the first one or several tasks in the queue are extracted, and there is no need to check the timestamps of all tasks in the queue. The timestamps are checked in queue order from front to back; as soon as one task's timestamp indicates a time later than the extraction time, the timestamps of the tasks ordered after it are not checked.
In this alternative, if the timestamps of several data reading tasks in the task queue happen to indicate the same time, those tasks may be ordered by the order in which they joined the task queue, or by other criteria.
In this alternative, the task queue may be regarded as a priority queue, with the timestamp carried by a data reading task serving as its priority: the earlier the time indicated by the timestamp, the higher the priority. A priority queue behaves such that the highest-priority element is at the front and is extracted first; here, the data reading task with the earliest timestamp is at the front of the priority queue and is extracted first.
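A minimal sketch of such a timestamp-ordered priority queue, assuming Python's standard `heapq` module (the class and method names are illustrative, not from the patent). A running sequence number breaks ties between equal timestamps in order of joining the queue, as described above:

```python
import heapq
import itertools

class TimestampPriorityQueue:
    """Earlier timestamp = higher priority; equal timestamps keep FIFO order."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-break: order of joining the queue

    def put(self, timestamp, task):
        heapq.heappush(self._heap, (timestamp, next(self._counter), task))

    def extract_due(self, extraction_time):
        """Pop every task whose timestamp is prior to or equal to extraction_time,
        stopping as soon as the front task's timestamp is later than that time."""
        due = []
        while self._heap and self._heap[0][0] <= extraction_time:
            due.append(heapq.heappop(self._heap)[2])
        return due
```

For instance, enqueuing T1 with timestamp 20, T2 with 28, and T3 with 23 (as in the T1/T2/T3 example above, using seconds within the minute) yields the extraction order T1, T3, T2.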
An example of this embodiment is shown in fig. 5. It may be applied to complex big-data processing scenarios, for example a single server that needs to read data from more than 3000 partitions, each with a different data volume. In this example, a central control system in the server executes steps S110 to S120.
In this example, the data reading tasks of all partitions are placed in one priority queue, in which at most one data reading task exists per partition at any time. Each data reading task carries a timestamp indicating the time at which it should start to be executed. In the priority queue, the timestamp carried by a data reading task serves as its priority: the earlier the time indicated by the timestamp, the higher the priority of the data reading task.
The central control system is responsible for checking the priority queue, extracting data reading tasks from it in order, establishing new reading threads for the extracted tasks, and placing them into the thread pool; the CPU executes the reading threads in the thread pool in turn. The number of reading threads in the thread pool has an upper limit; finished reading threads are deleted from the pool, and a new reading thread can be put in or started only when the number of threads in the pool has not reached the upper limit. For example, if the current thread pool can accept a thread, the central control system can establish a new reading thread from the data reading task at the front of the priority queue and put it into the pool.
The central control system can periodically extract the data reading tasks from the priority queue, and extract all the data reading tasks of which the time indicated by the time stamp is prior to or equal to the extraction time.
In the case of periodic extraction, one extraction process in this example is shown in fig. 6. Assume that each timestamp is a numerical value representing a time, with smaller values representing earlier times: if the value representing 8:30 a.m. on a given day is X and the value representing 9:00 a.m. on the same day is Y, then Y > X; if the value representing 8:30 a.m. on the following day is Z, then Z > Y > X. Assume the timestamps of the 6 tasks in the priority queue are 10, 20, 30, 40, 50, and 60, and the value representing the extraction time is 32. The times indicated by timestamps 10, 20, and 30 precede the extraction time, so the tasks with timestamps 10, 20, and 30 are extracted. For ease of understanding, the timestamps in fig. 6 are relatively simple values; timestamps in practical applications are not limited to this example.
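The fig. 6 extraction can be reproduced in a few lines (a sketch assuming numeric timestamps as described above, with a heap keeping the smallest value at the front):

```python
import heapq

timestamps = [10, 20, 30, 40, 50, 60]  # timestamps of the 6 queued tasks (fig. 6)
heapq.heapify(timestamps)
extraction_time = 32

extracted = []
while timestamps and timestamps[0] <= extraction_time:
    extracted.append(heapq.heappop(timestamps))

print(extracted)  # [10, 20, 30]; the tasks with timestamps 40, 50, 60 stay queued
```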
In the case of periodic extraction, the central control system may first establish reading threads for the extracted data reading tasks without putting them into the thread pool, that is, without starting them yet, and order the unstarted reading threads according to the order of their tasks in the priority queue. When the thread pool can accept a new reading thread, the central control system puts the established reading threads into the pool in that order (i.e., starts them in sequence). For example, the central control system extracts 3 data reading tasks; when the thread pool can accept a new thread, it puts in the reading thread established for the data reading task that was first in the task queue at extraction time, and the next time the pool can accept a new thread, it puts in the reading thread established for the task that was second in the queue at extraction time.
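The gating just described, where established reading threads start only while the pool is below its upper limit and finished threads are reaped, might be sketched as follows (the class and method names are hypothetical):

```python
import threading

class BoundedReadPool:
    """Holds at most `upper_limit` live reading threads; finished threads
    are deleted from the pool, and pending (established but unstarted)
    threads are started in their queue order as room becomes available."""

    def __init__(self, upper_limit):
        self.upper_limit = upper_limit
        self.active = []

    def try_start(self, pending):
        # Delete finished reading threads from the pool.
        self.active = [t for t in self.active if t.is_alive()]
        # Start pending threads in order while the pool is below the limit.
        while pending and len(self.active) < self.upper_limit:
            thread = pending.pop(0)
            thread.start()
            self.active.append(thread)
```

In use, the central control system would call `try_start` whenever it learns the pool may have room; threads left in `pending` wait for the next call.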
Alternatively, when there is room in the thread pool, the central control system may extract from the priority queue, in order, a number of data reading tasks corresponding to the number of reading threads that can be put in, and establish the reading threads. When extracting, only data reading tasks whose timestamps indicate a time prior to the extraction time may be extracted; for example, as shown in fig. 6, only 3 data reading tasks are extracted regardless of whether 3, 4, or more new threads can currently be established.
When only tasks whose timestamps indicate a time prior to or equal to the extraction time are extracted, tasks whose timestamps indicate a later time are left unprocessed, which achieves the effect of a sleep.
When a partition generates a new data reading task, the central control system generates a timestamp for it and places it at the appropriate position in the priority queue according to the timestamp. As shown in fig. 7, assuming the timestamps of the 6 data reading tasks already in the priority queue are 10, 20, 30, 40, 50, and 60, and the timestamp of the new data reading task is 35, the central control system places the new task after the task with timestamp 30 and before the task with timestamp 40. If several data reading tasks happen to have exactly equal timestamps, they may be ordered by the order in which they were added to the priority queue. For ease of understanding, the timestamps in fig. 7 are relatively simple values; timestamps in practical applications are not limited to this example.
For the first data reading task of a partition, the central control system may use a numerical value representing the current time as the timestamp; alternatively, it may obtain a predicted execution time by adding a predetermined time length to the current time and use a value representing the predicted execution time as the timestamp. The current time refers to, for example but not limited to, the time when the data reading task is added to the priority queue, the time when it is received, or the time when the timestamp is generated for it.
For a non-first data reading task of a partition, the central control system may obtain the predicted execution time by adding the determined delay length to the current time and use a value representing the predicted execution time as the timestamp. The delay length is determined according to the amount of data read by the partition's previous data reading task (i.e., the amount of data read by the reading thread established for that task): the larger the data amount, the shorter the delay length. In this example, Table one is used to determine the delay length. The current time refers to, for example but not limited to, the time when the data reading task is added to the priority queue, the time when it is received, or the time when the timestamp is generated for it.
Table one: correspondence between the amount of data last read and the delay length

Amount of data last read (bits)        | Delay length     | Remarks
0                                      | 5 seconds        | partition may have no data
1000 or less                           | 1 second         |
greater than 1000 and less than 5000   | 500 milliseconds |
greater than 5000 and less than 10000  | 200 milliseconds |
greater than 10000                     | 50 milliseconds  | highest priority
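Table one translates directly into a lookup function (a sketch; the function name is illustrative, and since the table leaves the boundary values of exactly 5000 and 10000 bits unassigned, this sketch gives them the shorter delay):

```python
def delay_length(last_read_bits):
    """Delay length in seconds per Table one. Boundary values 5000 and
    10000, unspecified in the table, fall into the shorter-delay bucket."""
    if last_read_bits == 0:
        return 5.0      # may have no data: wait the longest before retrying
    if last_read_bits <= 1000:
        return 1.0
    if last_read_bits < 5000:
        return 0.5      # 500 milliseconds
    if last_read_bits < 10000:
        return 0.2      # 200 milliseconds
    return 0.05         # 50 milliseconds: the largest reads retry soonest
```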
In a second embodiment, a data processing apparatus, as shown in fig. 8, includes:
the queue management module 81 is configured to place a data reading task generated by a partition into a task queue;
the extracting module 82 is configured to extract a data reading task from the task queue when the number of the reading threads in the thread pool does not reach a predetermined upper limit, and establish a reading thread according to the extracted task and place the reading thread into the thread pool; the thread pool is used for storing reading threads which occupy processing resources in turn.
In this embodiment, the queue management module 81 is a part of the data processing apparatus that is responsible for adding a data reading task to a task queue, and may be software, hardware, or a combination of the two.
In this embodiment, the extracting module 82 is a part of the data processing apparatus responsible for generating a read thread according to a data read task in a task queue, and may be software, hardware, or a combination of the two.
In one implementation, the data processing apparatus is integrated in a computing device (such as but not limited to a server) that reads partition data, and the data processing apparatus may further include:
and the reading module is used for executing the reading threads in the thread pool in turn.
The reading module is a part of the data processing apparatus responsible for executing the reading thread to read data from the partition, and may be software, hardware, or a combination of the two.
In another implementation, the data processing apparatus may be independent of the computing device for reading the partition data, and the computing device for reading the partition data may execute the read thread in the thread pool.
In one implementation, the queue management module placing the data reading task generated by the partition into the task queue may include:
the queue management module generates a time stamp for a data reading task generated by a partition, and the time stamp is used for indicating the moment when the data reading task starts to be executed; putting the data reading task carrying the timestamp into the task queue;
The extraction module extracting the data reading task from the task queue includes:
the extracting module extracts a data reading task from the task queue, wherein the time indicated by the time stamp is prior to or equal to the extracting time.
In an alternative of this implementation, the generating, by the queue management module, a timestamp for the data reading task generated by the partition may include:
the queue management module obtains an estimated execution time by adding the determined delay length to the current time for the data reading task generated by the partition, and takes the information representing the estimated execution time as a timestamp of the data reading task; the delay length is determined according to the data volume read by the last data reading task of the partition; the larger the amount of data, the shorter the delay length.
In this alternative, the determining, according to the data amount read by the last data reading task of the partition, the delay length may include:
and determining the delay length according to the interval to which the data volume read by the last data reading task of the partition belongs and the corresponding relation between the preset delay length and the interval of the data volume.
In an alternative scheme of this implementation, in the task queue, the data reading tasks may be sorted from first to last according to the time indicated by the carried timestamp;
The queue management module putting the data reading task generated by the partition into the task queue may further include:
and the queue management module puts the data reading task into a corresponding position in the task queue according to the time indicated by the timestamp carried by the data reading task.
In this alternative, the task queue may be a priority queue, and a timestamp carried by the data reading task may be a priority of the data reading task; the earlier the time indicated by the timestamp, the higher the priority.
In one implementation, the predetermined upper limit on the number of threads in the thread pool may be twice the number of CPUs for executing read threads in the thread pool.
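With Python's standard library, this sizing rule might read as follows (assuming `os.cpu_count` as the source for the number of CPUs; the fallback to 1 covers platforms where the count is unknown):

```python
import os

# Predetermined upper limit: twice the number of CPUs executing read threads.
upper_limit = 2 * (os.cpu_count() or 1)
```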
Operations performed by the modules of the apparatus of the present embodiment correspond to steps S110 to S120 of the first embodiment, and other details of each module can be found in the first embodiment.
In a third embodiment, an electronic device for data processing includes: a memory and a processor;
the memory is used for storing programs for data processing; the program for data processing, when read and executed by the processor, performs the following operations:
putting a data reading task generated by the partition into a task queue;
when the number of the reading threads in the thread pool does not reach the preset upper limit, extracting a data reading task from the task queue, establishing a reading thread according to the extracted task, and putting the reading thread into the thread pool; the thread pool is used for storing reading threads which occupy processing resources in turn.
The operations performed by the program for data processing in this embodiment when the program is read and executed by the processor correspond to steps S110 to S120 in the first embodiment, and other details of the operations performed by the program can be found in the first embodiment.
It will be understood by those skilled in the art that all or part of the steps of the above methods may be implemented by instructing the relevant hardware through a program, and the program may be stored in a computer readable storage medium, such as a read-only memory, a magnetic or optical disk, and the like. Alternatively, all or part of the steps of the above embodiments may be implemented using one or more integrated circuits. Accordingly, each module/unit in the above embodiments may be implemented in the form of hardware, and may also be implemented in the form of a software functional module. The present application is not limited to any specific form of hardware or software combination.
There are, of course, many other embodiments of the invention, and it will be apparent to those skilled in the art that various changes and modifications can be made without departing from the spirit and scope of the invention.

Claims (17)

1. A method of data processing, comprising:
putting a data reading task generated by the partition into a task queue;
when the number of the reading threads in the thread pool does not reach the preset upper limit, extracting a data reading task from the task queue, establishing a reading thread according to the extracted task, and putting the reading thread into the thread pool; the thread pool is used for storing reading threads which occupy processing resources in turn; the number of threads in the thread pool is less than the number of partitions, and each partition only generates one data reading task at a time.
2. The data processing method of claim 1, wherein the placing the data reading tasks generated by the partitions into the task queue comprises:
generating a time stamp for the data reading task generated by the partition, wherein the time stamp is used for indicating the moment of starting to execute the data reading task; putting the data reading task carrying the timestamp into the task queue;
the extracting of the data reading task from the task queue comprises:
and extracting the data reading task with the time indicated by the timestamp being prior to or equal to the extraction time from the task queue.
3. The data processing method of claim 2, wherein generating the time stamp for the partition generated data read task comprises:
for the data reading task generated by the partition, the current time is added with the determined delay length to obtain the predicted execution time, and the information representing the predicted execution time is used as the timestamp of the data reading task; the delay length is determined according to the data volume read by the last data reading task of the partition; the larger the amount of data, the shorter the delay length.
4. The data processing method of claim 3, wherein the determining of the delay length according to the amount of data read by the last data reading task of the partition comprises:
and determining the delay length according to the interval to which the data volume read by the last data reading task of the partition belongs and the corresponding relation between the preset delay length and the interval of the data volume.
5. The data processing method of claim 2, wherein: in the task queue, the data reading tasks are sequenced from first to last according to the time indicated by the carried time stamps;
the step of putting the data reading task generated by the partition into the task queue further comprises the following steps:
and according to the time indicated by the timestamp carried by the data reading task, putting the data reading task into a corresponding position in a task queue.
6. The data processing method of claim 5, wherein:
the task queue is a priority queue, and a timestamp carried by the data reading task is taken as the priority of the data reading task; the earlier the time indicated by the timestamp, the higher the priority.
7. The data processing method of claim 1, wherein:
the predetermined upper limit of the number of threads in the thread pool is twice the number of CPUs for executing read threads in the thread pool.
8. The data processing method of any of claims 1 to 7, further comprising: and executing the reading threads in the thread pool in turn.
9. A data processing apparatus, comprising:
the queue management module is used for putting the data reading task generated by the partition into a task queue;
the extraction module is used for extracting a data reading task from the task queue when the number of the reading threads in the thread pool does not reach a preset upper limit, and establishing a reading thread according to the extracted task and putting the reading thread into the thread pool; the thread pool is used for storing reading threads which occupy processing resources in turn; the number of threads in the thread pool is less than the number of partitions, and each partition only generates one data reading task at a time.
10. The data processing apparatus of claim 9, wherein the queue management module placing the partition-generated data read tasks into a task queue comprises:
the queue management module generates a time stamp for a data reading task generated by a partition, and the time stamp is used for indicating the moment when the data reading task starts to be executed; putting the data reading task carrying the timestamp into the task queue;
the extraction module extracting the data reading task from the task queue comprises:
the extracting module extracts a data reading task from the task queue, wherein the time indicated by the time stamp is prior to or equal to the extracting time.
11. The data processing apparatus of claim 10, wherein the queue management module generating timestamps for partition-generated data read tasks comprises:
the queue management module obtains an estimated execution time by adding the determined delay length to the current time for the data reading task generated by the partition, and takes the information representing the estimated execution time as a timestamp of the data reading task; the delay length is determined according to the data volume read by the last data reading task of the partition; the larger the amount of data, the shorter the delay length.
12. The data processing apparatus of claim 11, wherein the determining of the delay length based on the amount of data read by the last data reading task of the partition comprises:
and determining the delay length according to the interval to which the data volume read by the last data reading task of the partition belongs and the corresponding relation between the preset delay length and the interval of the data volume.
13. The data processing apparatus of claim 10, wherein: in the task queue, the data reading tasks are sequenced from first to last according to the time indicated by the carried time stamps;
the queue management module putting the data reading task generated by the partition into the task queue further comprises:
and the queue management module puts the data reading task into a corresponding position in the task queue according to the time indicated by the timestamp carried by the data reading task.
14. The data processing apparatus of claim 13, wherein:
the task queue is a priority queue, and a timestamp carried by the data reading task is taken as the priority of the data reading task; the earlier the time indicated by the timestamp, the higher the priority.
15. The data processing apparatus of claim 9, wherein:
the predetermined upper limit of the number of threads in the thread pool is twice the number of CPUs for executing read threads in the thread pool.
16. The data processing apparatus of any of claims 9 to 15, further comprising:
and the reading module is used for executing the reading threads in the thread pool in turn.
17. An electronic device for data processing, comprising: a memory and a processor;
the method is characterized in that: the memory is used for storing programs for data processing; the program for data processing, when read and executed by the processor, performs the following operations:
putting a data reading task generated by the partition into a task queue;
when the number of the reading threads in the thread pool does not reach the preset upper limit, extracting a data reading task from the task queue, establishing a reading thread according to the extracted task, and putting the reading thread into the thread pool; the thread pool is used for storing reading threads which occupy processing resources in turn; the number of threads in the thread pool is less than the number of partitions, and each partition only generates one data reading task at a time.
CN201610818710.4A 2016-09-12 2016-09-12 Data processing method and device and electronic equipment Active CN107818012B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610818710.4A CN107818012B (en) 2016-09-12 2016-09-12 Data processing method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610818710.4A CN107818012B (en) 2016-09-12 2016-09-12 Data processing method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN107818012A CN107818012A (en) 2018-03-20
CN107818012B true CN107818012B (en) 2021-08-27

Family

ID=61601210

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610818710.4A Active CN107818012B (en) 2016-09-12 2016-09-12 Data processing method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN107818012B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109086138A (en) * 2018-08-07 2018-12-25 北京京东金融科技控股有限公司 Data processing method and system
CN111367627B (en) * 2018-12-26 2024-02-13 三六零科技集团有限公司 Method and device for processing read-write disk task
CN109840149B (en) * 2019-02-14 2021-07-30 百度在线网络技术(北京)有限公司 Task scheduling method, device, equipment and storage medium
CN111259246A (en) * 2020-01-17 2020-06-09 北京达佳互联信息技术有限公司 Information pushing method and device, electronic equipment and storage medium
CN114519017B (en) * 2020-11-18 2024-03-29 舜宇光学(浙江)研究院有限公司 Data transmission method for event camera, system and electronic equipment thereof
CN117082307B (en) * 2023-10-13 2023-12-29 天津幻彩科技有限公司 Three-dimensional scene stream data play control method and device based on fluency improvement

Citations (3)

Publication number Priority date Publication date Assignee Title
CN102591721A (en) * 2011-12-30 2012-07-18 北京新媒传信科技有限公司 Method and system for distributing thread execution task
CN103324525A (en) * 2013-07-03 2013-09-25 东南大学 Task scheduling method in cloud computing environment
CN103955491A (en) * 2014-04-15 2014-07-30 南威软件股份有限公司 Method for synchronizing timing data increment

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US10089142B2 (en) * 2013-08-21 2018-10-02 Hasso-Plattner-Institut Fur Softwaresystemtechnik Gmbh Dynamic task prioritization for in-memory databases

Patent Citations (3)

Publication number Priority date Publication date Assignee Title
CN102591721A (en) * 2011-12-30 2012-07-18 北京新媒传信科技有限公司 Method and system for distributing thread execution task
CN103324525A (en) * 2013-07-03 2013-09-25 东南大学 Task scheduling method in cloud computing environment
CN103955491A (en) * 2014-04-15 2014-07-30 南威软件股份有限公司 Method for synchronizing timing data increment

Also Published As

Publication number Publication date
CN107818012A (en) 2018-03-20


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant