CN110990357A - Data processing method, device and system, electronic equipment and storage medium - Google Patents

Data processing method, device and system, electronic equipment and storage medium

Info

Publication number
CN110990357A
Authority
CN
China
Prior art keywords
cluster
data
target data
clusters
storage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911311681.2A
Other languages
Chinese (zh)
Inventor
韦皓诚
赵伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Sensetime Technology Co Ltd
Original Assignee
Shenzhen Sensetime Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Sensetime Technology Co Ltd filed Critical Shenzhen Sensetime Technology Co Ltd
Priority to CN201911311681.2A
Publication of CN110990357A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/17: Details of further file system functions
    • G06F16/172: Caching, prefetching or hoarding of files
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/13: File access structures, e.g. distributed indices

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure relates to a data processing method, a data processing device, a data processing system, an electronic device and a storage medium. The method is applied to a data processing system that comprises a plurality of clusters, at least part of which are used for storing target data to be processed; the plurality of clusters comprise a first cluster and a second cluster, and the target data comprises first target data. The method comprises the following steps: acquiring first index data, and acquiring the first target data from a storage area of the second cluster according to the first index data; caching the first target data into a storage area of the first cluster, and generating second index data of the target data; and acquiring the first target data from the storage area of the first cluster according to the second index data, so as to process the target data through hardware resources of the first cluster.

Description

Data processing method, device and system, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of data storage, and in particular, to a data processing method, apparatus, system, electronic device, and storage medium.
Background
With the rapid development and wide application of artificial intelligence technologies such as computer vision, the business of computer vision companies is also iterating rapidly. To handle larger and more complex business scenarios, companies continuously expand their computing power and purchase large numbers of Graphics Processing Unit (GPU) machines for training computation. However, as business requirements grow, a single cluster cannot in practice be expanded without limit. A data processing method is therefore needed that meets the business requirements of mass data processing without expanding a single cluster.
Disclosure of Invention
The present disclosure presents a data processing scheme.
According to an aspect of the present disclosure, a data processing method is provided, which is applied to a data processing system, the data processing system including a plurality of clusters, at least a part of the plurality of clusters being used for storing target data to be processed, the plurality of clusters including a first cluster and a second cluster, the target data including first target data; the method comprises the following steps: acquiring first index data, and acquiring first target data from a storage area of the second cluster according to the first index data; caching the first target data into a storage area of the first cluster, and generating second index data of the target data; and acquiring the first target data from the storage area of the first cluster according to the second index data so as to process the target data through the hardware resources of the first cluster.
In one possible implementation, the target data includes second target data, and the method further includes: and acquiring the second target data from the storage areas of the clusters except the first cluster in the plurality of clusters according to the second index data.
In one possible implementation manner, the storage area includes a memory area and a cache area; the acquiring, according to the first index data, first target data from the storage area of the second cluster includes: and acquiring the first target data from the memory area of the second cluster according to the first index data and the capacity of the cache area of the first cluster, wherein the data volume of the first target data is less than or equal to the capacity of the cache area of the first cluster.
In a possible implementation manner, the storage area includes a cache area, and before the obtaining of the first target data from the storage area of the second cluster according to the first index data, the method further includes: determining the second cluster from the plurality of clusters according to a transmission bandwidth between each of the other clusters and the first cluster and/or a capacity of the cache area of each of the other clusters, wherein the other clusters include the clusters other than the first cluster.
In one possible implementation manner, the determining of the second cluster from the plurality of clusters according to the transmission bandwidth between each of the other clusters and the first cluster and the capacity of the cache area of each of the other clusters includes: respectively obtaining a proportion parameter of each cluster in the plurality of clusters, wherein the proportion parameter includes the ratio of the capacity of the cache area to the available bandwidth of the transmission bandwidth; and determining the second cluster according to the proportion parameters, so that the sum of the capacities of the cache areas of all clusters in the second cluster is less than or equal to the capacity of the cache area of the first cluster, and the sum of the proportion parameters of all clusters in the second cluster is greater than or equal to a threshold value.
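The two constraints above can be pictured as a simple greedy selection over candidate clusters. The `ClusterInfo` structure, the largest-ratio-first fill order, and all names below are illustrative assumptions, not anything prescribed by the disclosure:

```python
from dataclasses import dataclass

@dataclass
class ClusterInfo:
    name: str
    cache_capacity: float       # capacity of the cluster's cache area (e.g. GB)
    available_bandwidth: float  # available transmission bandwidth to the first cluster (e.g. GB/s)

    @property
    def ratio(self):
        # proportion parameter: ratio of cache-area capacity to available bandwidth
        return self.cache_capacity / self.available_bandwidth

def select_second_cluster(others, first_cache_capacity, threshold):
    """Greedily pick clusters so that their cache capacities sum to at most the
    first cluster's cache capacity while their proportion parameters sum to at
    least the threshold. Returns the chosen clusters, or None if the greedy
    pass cannot satisfy both constraints."""
    chosen, cap_sum, ratio_sum = [], 0.0, 0.0
    # Consider clusters with the largest proportion parameter first (assumption).
    for c in sorted(others, key=lambda c: c.ratio, reverse=True):
        if cap_sum + c.cache_capacity <= first_cache_capacity:
            chosen.append(c)
            cap_sum += c.cache_capacity
            ratio_sum += c.ratio
            if ratio_sum >= threshold:
                return chosen
    return None
```

A cluster whose cache area is too large to fit is simply skipped, which keeps the capacity constraint invariant throughout the pass.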
In one possible implementation, the storage area includes a memory area; prior to the obtaining the first index data, the method further comprises: acquiring a storage space of a memory area of each cluster in the plurality of clusters; according to the storage space, storing data to be stored in a distributed mode into a memory area of at least part of the clusters in the plurality of clusters, and generating index data of the data to be stored, wherein the index data of the data to be stored comprises the first index data.
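The distributed write described above might look like the following minimal sketch, which places each item into a cluster's memory area and records its location as first index data. The most-free-space placement policy and every name here are assumptions for illustration only:

```python
def distribute(items, free_space):
    """Distribute data to be stored across cluster memory areas.

    items: list of (data_id, size) pairs to store.
    free_space: dict mapping cluster name -> free bytes in its memory area
                (mutated in place as items are placed).
    Returns first index data mapping each data_id to the cluster holding it."""
    first_index = {}
    for data_id, size in items:
        # Place each item in the cluster with the most remaining space (assumption).
        target = max(free_space, key=free_space.get)
        if free_space[target] < size:
            raise RuntimeError("no cluster has room for " + data_id)
        free_space[target] -= size
        first_index[data_id] = target
    return first_index
```

The returned mapping plays the role of the first index data: given any stored item, it answers which cluster's memory area holds it.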
In one possible implementation, the data processing system includes a storage device for storing the first index data and/or the second index data.
In one possible implementation, the first index data includes a storage path of the target data, and the second index data includes a cache path of at least part of the target data. The storage path of the target data comprises a storage identifier, an identifier of the cluster storing the target data, an identifier of the data block to which the target data belongs, and the target data identifier; the cache path of the target data comprises a cache identifier, an identifier of the data block to which the target data belongs, and the target data identifier.
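As a rough illustration of these two path formats, the hypothetical helpers below compose them as slash-separated strings. The separator and the literal identifiers `storage` and `cache` are assumptions, not values fixed by the disclosure:

```python
def storage_way(cluster_id, block_id, data_id):
    """First index data entry: a storage identifier, the identifier of the
    cluster storing the data, the identifier of the data block the data
    belongs to, and the target data identifier."""
    return f"storage/{cluster_id}/{block_id}/{data_id}"

def cache_way(block_id, data_id):
    """Second index data entry for cached data: a cache identifier, the data
    block identifier, and the target data identifier. No cluster identifier
    is needed, since cached data sits in the first cluster's cache area."""
    return f"cache/{block_id}/{data_id}"
```

Note the asymmetry: only the storage path carries a cluster identifier, because an uncached item may live in any cluster, while a cached item is always in the first cluster.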
According to an aspect of the present disclosure, a data processing apparatus is provided, where the apparatus is applied to a data processing system, where the data processing system includes a plurality of clusters, at least part of the plurality of clusters are used for storing target data to be processed, the plurality of clusters include a first cluster and a second cluster, and the target data includes first target data; the device comprises: the acquisition module is used for acquiring first index data and acquiring first target data from the storage area of the second cluster according to the first index data; the generating module is used for caching the first target data into a storage area of the first cluster and generating second index data of the target data; and the processing module is used for acquiring the first target data from the storage area of the first cluster according to the second index data so as to process the target data through the hardware resources of the first cluster.
In one possible implementation, the target data includes second target data, and the apparatus is further configured to: and acquiring the second target data from the storage areas of the clusters except the first cluster in the plurality of clusters according to the second index data.
In one possible implementation manner, the storage area includes a memory area and a cache area; the acquisition module is configured to: and acquiring the first target data from the memory area of a second cluster in the plurality of clusters according to the first index data and the capacity of the cache area of the first cluster, wherein the data volume of the first target data is less than or equal to the capacity of the cache area of the first cluster.
In a possible implementation manner, the storage area includes a cache area, and the apparatus further includes a determining module configured to: determine the second cluster from the plurality of clusters according to a transmission bandwidth between each of the other clusters and the first cluster and/or a capacity of the cache area of each of the other clusters, wherein the other clusters include the clusters other than the first cluster.
In one possible implementation, the determining module is further configured to: respectively obtain a proportion parameter of each cluster in the plurality of clusters, wherein the proportion parameter includes the ratio of the capacity of the cache area to the available bandwidth of the transmission bandwidth; and determine the second cluster according to the proportion parameters, so that the sum of the capacities of the cache areas of all clusters in the second cluster is less than or equal to the capacity of the cache area of the first cluster, and the sum of the proportion parameters of all clusters in the second cluster is greater than or equal to a threshold value.
In one possible implementation, the storage area includes a memory area, and before the acquisition performed by the acquisition module, the apparatus is further configured to: acquire the storage space of the memory area of each cluster in the plurality of clusters; and, according to the storage space, store data to be stored in a distributed manner into the memory areas of at least part of the clusters in the plurality of clusters, and generate index data of the data to be stored, wherein the index data of the data to be stored includes the first index data.
In one possible implementation, the data processing system includes a storage device for storing the first index data and/or the second index data.
In one possible implementation, the first index data includes a storage path of the target data, and the second index data includes a cache path of at least part of the target data. The storage path of the target data comprises a storage identifier, an identifier of the cluster storing the target data, an identifier of the data block to which the target data belongs, and the target data identifier; the cache path of the target data comprises a cache identifier, an identifier of the data block to which the target data belongs, and the target data identifier.
According to an aspect of the present disclosure, there is provided a data processing system comprising a plurality of clusters, at least a part of the plurality of clusters being used for storing target data to be processed, the plurality of clusters comprising a first cluster and a second cluster, the target data comprising first target data; the second cluster is used for storing the first target data through a storage area of the second cluster; the first cluster is used for caching the first target data stored in the storage area of the second cluster into the storage area of the first cluster; the data processing system further comprises a data processing apparatus as described in any above to perform the method as described in any above.
In one possible implementation manner, the system includes a storage device configured to store first index data and/or second index data, where the first index data includes a storage path of the target data, and the second index data includes a cache path of at least part of the target data.
According to an aspect of the present disclosure, there is provided an electronic device including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the above-described data processing method.
According to an aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described data processing method.
In the embodiment of the present disclosure, a data processing system including a plurality of clusters allows target data to be processed to be stored in the plurality of clusters in a distributed manner, with the storage location of the target data obtained through first index data. Meanwhile, by caching the first target data in the storage area of the first cluster, second index data is generated, and the first target data is obtained from the storage area of the first cluster according to the second index data, so that the target data is processed by the hardware resources of the first cluster.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure. Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure.
Fig. 1 shows a flow diagram of a data processing method according to an embodiment of the present disclosure.
Fig. 2 shows a block diagram of a data processing apparatus according to an embodiment of the present disclosure.
FIG. 3 shows a block diagram of a data processing system according to an embodiment of the present disclosure.
Fig. 4 shows a schematic diagram of an application example according to the present disclosure.
Fig. 5 shows a schematic diagram of an application example according to the present disclosure.
Fig. 6 illustrates a block diagram of an electronic device in accordance with an embodiment of the disclosure.
Fig. 7 shows a block diagram of an electronic device in accordance with an embodiment of the disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
The term "and/or" herein is merely an association describing an associated object, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.
Fig. 1 shows a flowchart of a data processing method according to an embodiment of the present disclosure, which may be applied to a terminal device, a server or other processing device, and the like. The terminal device may be a User Equipment (UE), a mobile device, a User terminal, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like.
In some possible implementations, the data processing method may also be implemented by the processor calling computer readable instructions stored in the memory.
The technical scheme provided by the embodiment of the application can be realized through a data processing system. In one implementation, the data processing system may include a plurality of clusters, the plurality of clusters including a first cluster and a second cluster, at least a portion of the plurality of clusters to store target data to be processed, the target data including first target data.
For convenience of description, in the embodiment of the present application, the target data may be processed by the hardware resources of the first cluster, and the data storage, the cache, and the like involved in the data processing system are described with the purpose of processing the target data by the hardware resources of the first cluster. It should be noted that, in the case that some specific data are processed by hardware resources of a cluster other than the first cluster, the corresponding data processing process may also be implemented by using the technical solution provided in the embodiment of the present application, which is not described herein again.
As shown in fig. 1, in one possible implementation manner, the data processing method may include:
step S11, acquiring the first index data, and acquiring the first target data from the storage area of the second cluster according to the first index data.
In step S12, the first target data is cached in the storage area of the first cluster, and second index data of the target data is generated.
Step S13, according to the second index data, obtaining the first target data from the storage area of the first cluster, so as to process the target data through the hardware resource of the first cluster.
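The three steps above can be sketched in miniature as follows. The `Cluster` class, the dict-based memory and cache areas, and the index representation (data identifier mapped to cluster name) are simplifying assumptions for illustration, not the disclosure's actual data structures:

```python
class Cluster:
    """A cluster whose storage area holds a persistent memory area and a
    temporary cache area (both simplified to dicts)."""
    def __init__(self, name):
        self.name = name
        self.memory = {}  # memory area: data_id -> data
        self.cache = {}   # cache area: data_id -> data

def process_on_first_cluster(first, second, first_index):
    # Step S11: use the first index data (data_id -> storing cluster) to
    # fetch the first target data from the second cluster's memory area.
    first_target = {data_id: second.memory[data_id]
                    for data_id, where in first_index.items()
                    if where == second.name}

    # Step S12: cache it in the first cluster's cache area and generate
    # second index data recording where each item now lives.
    first.cache.update(first_target)
    second_index = {data_id: (first.name if data_id in first.cache else where)
                    for data_id, where in first_index.items()}

    # Step S13: read the first target data back from the first cluster's
    # cache area according to the second index data.
    cached = {data_id: first.cache[data_id]
              for data_id, where in second_index.items()
              if where == first.name}
    return cached, second_index
```

After one pass, subsequent reads of the cached items hit the first cluster locally instead of crossing the inter-cluster link.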
The target data may be data to be subjected to data processing, and the type of the target data is not limited, and may be picture, text, audio or video data, and the like. The application category of the target data can be flexibly selected according to actual situations, and can include, but is not limited to, the following exemplified situations. In a possible implementation manner, the target data may be training data, that is, the data processing method proposed in the embodiment of the present disclosure may be applied to a process of processing training data of a neural network. In another possible implementation manner, the target data may also be verification data or application data, that is, the data processing method proposed in the embodiment of the present disclosure is applied to a verification or application process of a neural network.
It is further provided in the above-mentioned disclosed embodiments that the target data includes first target data; as can be seen from steps S11 to S13, the first target data is the data cached to the first cluster in the data processing system. The data size of the first target data is not limited in the embodiments of the present disclosure. Considering that the first target data is the data to be cached in the first cluster, its data size may vary with the capacity of the storage area of the first cluster. In a possible implementation manner, when the capacity of the storage area of the first cluster is large enough, the first target data may be all of the target data; when that capacity is limited, the first target data may instead be a part of the target data. Which part it is can be flexibly determined according to the actual situation of the storage area of the first cluster, as described in detail in the following disclosed embodiments, and is not expanded here.
As can be seen from the foregoing disclosure, the data processing method provided in the present disclosure may be applied to a data processing system including multiple clusters, where the multiple clusters are used to store the target data; the data processing system can thus process the target data in a distributed manner through the multiple clusters. The specific implementation of the data processing system may therefore be flexibly determined according to the actual situation, and any storage system having a distributed storage function may serve as an implementation of the data processing system. The number of clusters included in the data processing system may be flexibly determined according to the actual situation, and the implementations of the clusters may be the same or different; neither is limited in the embodiments of the present disclosure.
In one possible implementation, the data processing system may be a Ceph file storage system, a Lustre file storage system, or another file storage system that can store data in different clusters in a distributed manner. The implementation of the clusters included in the data processing system may vary with the implementation of the data processing system itself. Taking the Ceph file storage system as an example, the number of clusters is not limited in the embodiments of the present disclosure and may be flexibly selected according to the actual situation.
It is further provided in the foregoing disclosed embodiment that the plurality of clusters may include a first cluster and a second cluster. In this disclosed embodiment, the first cluster may be the cluster that processes the target data, and the second cluster may be a cluster that stores the first target data; the terms "first" and "second" are only used to distinguish the different roles of the clusters and do not limit their specific implementation forms or numbers. For example, the second cluster may be one of the plurality of clusters or several of them, and the number of clusters included in the second cluster is not limited in the embodiment of the present disclosure. The first cluster may also vary depending on where the target data is processed. For example, the plurality of clusters included in the data processing system may be respectively referred to as Cluster A, Cluster B, …, Cluster G, and so on. When the data processing system performs one round of training on the target data, the target data may be trained using the hardware resources on Cluster A, in which case the first cluster in the data processing system is Cluster A. In the next training process, the target data may be trained using the hardware resources of Cluster B, and the first cluster at this time is Cluster B. Similarly, as the first cluster changes, the specific clusters included in the second cluster may also change.
Although the second cluster is a cluster for storing the first target data, the second cluster may also store data other than the first target data. In a possible implementation manner, the data stored in the second cluster may consist entirely of the first target data, or the first target data may be only a part of it; neither the amount of first target data stored nor the categories of the other stored data are limited in the embodiment of the present disclosure, and they may be flexibly chosen according to the actual situation.
Based on the above-mentioned roles of the first cluster and the second cluster, it can be seen that, in a possible implementation manner, the second cluster may be configured to store the first target data through a storage area of the second cluster; the first cluster may be configured to cache the first target data stored in the storage area of the second cluster into the storage area of the first cluster.
As can be seen from steps S11 to S12 in the above-described disclosed embodiment, in the process of performing data processing on the target data, the first index data may be acquired, so that the first target data is acquired from the storage area of the second cluster according to the first index data; the first target data is then cached in the storage area of the first cluster, and second index data of the target data is generated. In the above process, the first index data may be used to record the storage path of the target data in the data processing system, that is, the position of each piece of target data in the data processing system. The second index data may be used to record the location of the target data in the data processing system after caching has occurred; specifically, it may include the cache path of the cached data and the storage path of the uncached data.
Based on the content recorded in the second index data, the first target data may be acquired from the storage area of the first cluster according to the second index data, so as to process the target data through the hardware resources of the first cluster in step S13. In a possible implementation manner, when the capacity of the storage area of the first cluster is sufficient, all of the target data may be taken as the first target data. In this case, all of the first target data stored in the storage area of the second cluster may be cached in the storage area of the first cluster according to the first index data, and the second index data generated. In this way, during data processing, all of the target data may be acquired from the storage area of the first cluster, so that the target data can be processed by the hardware resources of the first cluster.
Through a data processing system comprising a plurality of clusters, target data to be processed can be stored in the plurality of clusters in a distributed manner, and the storage location of the target data is obtained through the first index data. Meanwhile, the second index data is generated by caching the first target data to the storage area of the first cluster, and the first target data is obtained from the storage area of the first cluster according to the second index data, so that the target data is processed through the hardware resources of the first cluster. Through this process, the first target data is cached in the storage area of the first cluster, which improves the processing efficiency of the first cluster when it processes the target data.
It has been proposed in the above-mentioned disclosed embodiments that the first target data may be all of the target data or a part of it. When the capacity of the storage area of the first cluster is limited, it cannot accommodate all of the target data, so a part of the target data may be cached to the first cluster as the first target data. Therefore, in one possible implementation, the target data may include second target data, and the method proposed in the embodiment of the present disclosure may further include:
in step S14, second target data is acquired from the storage areas of the clusters other than the first cluster among the plurality of clusters based on the second index data.
The second target data may be the data among the target data that is not cached in the first cluster. The above disclosure proposed that the second index data may be used to record the location of the target data in the data processing system after caching has occurred. Since the second target data is not cached, the content recorded in the second index data may include the cache path of the first target data after caching and the storage path of the second target data.
In addition, the foregoing disclosure also proposed that the second cluster may store all of the first target data or only part of it. When the second cluster stores only part of the first target data, its remaining space may be used to store second target data. Therefore, when the second target data is obtained according to the second index data, the second target data may be located in a cluster other than the first cluster and the second cluster, or in the second cluster itself; this may be determined flexibly according to the actual storage location.
By acquiring the second target data from the storage areas of the clusters other than the first cluster according to the second index data, only a part of the target data needs to be cached in the first cluster when the storage capacity of the first cluster is insufficient. During data processing, the cached part of the target data can be read from the first cluster and the remaining target data from the other clusters, so that all of the target data is processed while the efficiency of data processing is improved.
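One way to picture reading through the second index data is the sketch below: cached entries resolve to the first cluster's cache area, and uncached ones (the second target data) resolve to the memory area of whichever cluster stores them. The dict-based layout and tuple-valued index entries are illustrative assumptions:

```python
def read_all(second_index, cache_areas, memory_areas, first_name):
    """Read every piece of target data through the second index data.

    second_index: data_id -> ("cache", None) for items cached in the first
                  cluster, or ("storage", cluster_name) for uncached items.
    cache_areas / memory_areas: cluster name -> {data_id: data}."""
    out = {}
    for data_id, (kind, where) in second_index.items():
        if kind == "cache":
            # First target data: fast local read from the first cluster's cache area.
            out[data_id] = cache_areas[first_name][data_id]
        else:
            # Second target data: remote read from the storing cluster's memory area.
            out[data_id] = memory_areas[where][data_id]
    return out
```

The same loop thus mixes local cache hits and remote storage reads transparently, which is the behavior the paragraph above describes.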
In a possible implementation manner, the acquiring of the first target data from the storage area of the second cluster in the plurality of clusters according to the first index data may include: acquiring the first target data from the memory area of the second cluster according to the first index data and the capacity of the cache area of the first cluster, wherein the data volume of the first target data is less than or equal to the capacity of the cache area of the first cluster.
As can be seen from the above disclosed embodiments, the target data is stored in a cluster and may be cached in the first cluster. Therefore, for each cluster, the storage area can be further divided into a cache area and a memory area, where the memory area is used for storing the target data and the cache area is used for caching (i.e., temporarily storing) it. In one possible implementation, data written to a cluster is written to that cluster's memory area; when a cluster is to serve as the first cluster during the processing of target data, data may additionally be cached in the cache area of that first cluster. The target data in the cache area can be erased or replaced as data processing completes, without affecting the target data in the memory area.
The specific implementation manners of the memory area and the cache area are not limited in the embodiment of the present disclosure, and may be flexibly determined according to the specific implementation form of the cluster. In one example, the memory region may be a Ceph cluster. In one example, the cache region may be a MemCache cache.
Based on the above division of the storage area, the process of acquiring the first target data from the storage area of the second cluster according to the first index data may be: according to the capacity of the cache area of the first cluster, select part or all of the target data in the clusters other than the first cluster as the first target data, so that the data volume of the first target data is less than or equal to that capacity; then determine, according to the first index data, the storage path of the first target data within those clusters; and finally obtain the first target data from the memory areas of those clusters.
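As one minimal sketch of this capacity-bounded selection (the item names and sizes are invented for illustration, and a simple greedy fill stands in for whatever selection strategy an implementation actually uses):

```python
def select_first_target_data(candidates, cache_capacity):
    """Pick data items stored in clusters other than the first cluster
    until adding another item would exceed the capacity of the first
    cluster's cache area.

    `candidates` is a list of (item_id, size) pairs; all names and
    sizes here are illustrative, not part of the patent's method.
    """
    selected, used = [], 0
    for item_id, size in candidates:
        if used + size <= cache_capacity:
            selected.append(item_id)
            used += size
    return selected, used

# Four items against a 100-unit cache area: "c" (50 units) is skipped
# because it would overflow the cache, "d" (20 units) still fits.
items = [("a", 40), ("b", 30), ("c", 50), ("d", 20)]
chosen, total = select_first_target_data(items, 100)
print(chosen, total)  # ['a', 'b', 'd'] 90
```

The constraint from the text (data volume of the first target data ≤ cache capacity of the first cluster) is exactly the `used + size <= cache_capacity` check.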
Through the process, the data amount cached to the cache region of the first cluster can be within the capacity of the cache region of the first cluster, so that the condition of cache overflow is reduced, and the stability of the data processing process is improved.
Further, besides the data volume, the storage path of the first target data before caching may also affect the efficiency of the caching process, and in turn the efficiency of the data processing process. Therefore, in a possible implementation manner, before acquiring the first target data from the storage area of the second cluster according to the first index data, the method may further include:
determining the second cluster from the plurality of clusters according to the transmission bandwidth between each of the other clusters and the first cluster, and/or the capacity of the cache area of each of the other clusters, wherein the other clusters are the clusters in the plurality of clusters except the first cluster.
In the above-described disclosed embodiment, the plurality of clusters are clusters included in the data processing system, and the other clusters are remaining clusters of the plurality of clusters excluding the first cluster. In the embodiment of the present disclosure, the second cluster determined from the plurality of clusters may be regarded as an optimal subset of the plurality of clusters, and the first target data stored by the cluster in the optimal subset is cached to the first cluster, so that the caching effect is better than that of other subsets.
For the caching process, the caching effect can be considered from two aspects. On the one hand, the cache space, that is, the cache area of the first cluster, should be used as fully as possible to exploit the performance advantage of caching. On the other hand, bandwidth usage among the clusters should reach a balanced state, so that the situation where the network path between two clusters is congested under excessive pressure while another path sits idle is reduced as much as possible; that is, the network bandwidth among different clusters can be fully utilized.
Based on the above two considerations, the second cluster may be determined according to the transmission bandwidth between each cluster and the first cluster, and the capacity of each cluster's cache area. Specifically, both factors may be considered at the same time; alternatively, when the difference between them is too large, only one of them may be considered. Which factor to consider may be determined flexibly according to the actual situation, and any other factor affecting the caching effect may also be taken into account; the method is not limited to the following disclosed embodiments.
The second cluster is determined from the plurality of clusters according to the transmission bandwidth between each of the other clusters and the first cluster and/or the capacity of each cluster's cache area. Through this process, it can be determined selectively from which clusters data is chosen as the first target data to be cached in the first cluster, so that the caching effect is effectively improved, and the effect of the data processing process is further improved.
In one possible implementation, the second cluster may be determined by considering both the bandwidth and the capacity, that is, by determining the second cluster from the plurality of clusters according to the transmission bandwidth between each of the other clusters and the first cluster and the capacity of the cache area of each of the other clusters. This process may include:
and respectively acquiring a proportion parameter of each cluster in the plurality of clusters, wherein the proportion parameter comprises the ratio of the capacity of the cache region to the available bandwidth in the transmission bandwidth.
And determining the second cluster according to the proportion parameter, so that the sum of the capacities of the cache areas of all the clusters in the second cluster is less than or equal to the capacity of the cache area of the first cluster, and the sum of the proportion parameters of all the clusters in the second cluster is greater than or equal to the threshold value.
In this embodiment of the present disclosure, the proportion parameter of the i-th cluster may be denoted as v(i); the available bandwidth in the transmission bandwidth between the i-th cluster and the first cluster (i.e., the remaining bandwidth of the network path between them) may be denoted as L(A, i), where A represents the first cluster; and the cache capacity of the i-th cluster may be denoted as p(i). From the above considerations, for the i-th cluster, the larger the available bandwidth to the first cluster, the lower the bandwidth utilization of that path, so L(A, i) is inversely correlated with the caching effect of the cluster. In addition, for the i-th cluster, the larger the cache capacity, the more of the data stored in that cluster is cached in the first cluster before being processed; compared with processing after fetching the data directly from that cluster, this makes fuller use of the cache area of the first cluster and improves the caching effect, so p(i) is positively correlated with the caching effect of the cluster. Thus, in one possible implementation, the proportion parameter v(i) may satisfy the following formula: v(i) = p(i) / L(A, i). It should be noted that this formula is only a reference calculation method for the proportion parameter; the specific calculation is not limited to it, and other methods, such as taking a root or a logarithm, may also be used, without limitation here.
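The proportion parameter itself is a one-line computation; the sketch below uses invented capacity and bandwidth figures purely for illustration:

```python
def proportion_parameter(cache_capacity, available_bandwidth):
    """v(i) = p(i) / L(A, i): the cache capacity p(i) of cluster i divided
    by the available (remaining) bandwidth L(A, i) on the network path
    between cluster i and the first cluster A."""
    return cache_capacity / available_bandwidth

# Illustrative figures only: per the text, a cluster with a large cache
# capacity and little spare bandwidth to the first cluster scores higher.
v_b = proportion_parameter(cache_capacity=64, available_bandwidth=2)   # 32.0
v_c = proportion_parameter(cache_capacity=64, available_bandwidth=16)  # 4.0
```

A higher v(i) marks a cluster as a better caching candidate under the formula above, consistent with L(A, i) being inversely correlated and p(i) positively correlated with the caching effect.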
After the proportion parameter of each cluster is obtained, a subset meeting a preset condition can be selected from the set formed by the plurality of clusters, so that a better caching effect is achieved when the clusters in that subset serve as the second cluster. The content of the preset condition can be set flexibly according to requirements. In a possible implementation manner, the preset condition may be: the sum of the capacities of the cache areas of all clusters in the second cluster is less than or equal to the capacity of the cache area of the first cluster, and the sum of the proportion parameters of all clusters in the second cluster is greater than or equal to a threshold. It should be noted that the threshold on the sum of proportion parameters may be a fixed preset value, or a dynamic value continuously updated according to the calculated sums. When the threshold is dynamic, the preset condition can be regarded as: the sum of the capacities of the cache areas of all clusters in the second cluster is less than or equal to the capacity of the cache area of the first cluster, and the sum of the proportion parameters of all clusters in the second cluster reaches a maximum.
In an example, when the preset condition includes that the sum of the proportion parameters of all clusters in the second cluster reaches a maximum, determining the second cluster may be converted into solving a 0/1 (indivisible) knapsack problem. In this formulation, the capacity of the cache area of the first cluster is regarded as the upper limit of the knapsack weight, each cluster other than the first cluster is an item to be loaded into the knapsack, the weight of an item is the data amount of the target data stored in that cluster, and the value function may be defined as the v(i) provided in the above-described disclosed embodiment.
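A standard dynamic-programming solver for the 0/1 knapsack formulation above might look as follows; the cluster names, data amounts, and v(i) values are invented, and integer data amounts are assumed so that a classic DP table applies:

```python
def choose_second_cluster(clusters, cache_capacity):
    """0/1 knapsack: maximize the sum of proportion parameters v(i),
    subject to the cached data amounts fitting within the first
    cluster's cache area.

    `clusters` is a list of (name, data_amount, v) tuples with integer
    data amounts; all concrete values below are illustrative.
    """
    # dp[w] = (best value achievable with capacity w, chosen cluster names)
    dp = [(0.0, [])] * (cache_capacity + 1)
    for name, weight, value in clusters:
        # Iterate capacities downward so each cluster is used at most once
        for w in range(cache_capacity, weight - 1, -1):
            cand_val = dp[w - weight][0] + value
            if cand_val > dp[w][0]:
                dp[w] = (cand_val, dp[w - weight][1] + [name])
    return dp[cache_capacity]

# Three remote clusters: (name, stored data amount, v(i))
clusters = [("B", 60, 3.0), ("C", 40, 2.5), ("D", 50, 2.0)]
best_value, subset = choose_second_cluster(clusters, cache_capacity=100)
print(subset, best_value)  # ['B', 'C'] 5.5
```

Here B and C together weigh exactly 100 and give the maximal value sum 5.5, so they form the "optimal subset" the text describes; D is left out because swapping it in for either chosen cluster lowers the total value.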
A proportion parameter of each cluster, determined by the ratio of the capacity of its cache area to the available bandwidth in the transmission bandwidth, is obtained respectively, and the second cluster is determined according to the proportion parameters, so that the sum of the capacities of the cache areas of all clusters in the second cluster is less than or equal to the capacity of the cache area of the first cluster, and the sum of the proportion parameters of all clusters in the second cluster is greater than or equal to the threshold. Through this process, in realizing data processing through the cache, the storage resources of the target storage unit can be fully utilized, and the bandwidth occupied by each storage unit during data reading can be balanced, thereby further improving caching efficiency and performance.
In an actual application process, the data processing system first writes the target data, and then can perform processing according to the target data, and therefore, in a possible implementation manner, before acquiring the first index data, the method proposed in the embodiment of the present disclosure may further include:
the method comprises the steps of obtaining a storage space of a memory area of each cluster in a plurality of clusters.
According to the storage space, storing the data to be stored in a distributed mode into the memory area of at least part of the clusters in the plurality of clusters, and generating index data of the data to be stored, wherein the index data of the data to be stored comprises first index data.
In the above-described disclosed embodiment, the data to be stored may consist entirely of target data that needs subsequent processing, or may include content unrelated to the target data, as determined by the actual data storage requirement; this is not limited in the embodiment of the present disclosure. Therefore, when the data to be stored is written into the plurality of clusters of the data processing system in a distributed manner, the generated index data of the data to be stored can record the storage paths of both the data to be processed and the data not to be processed; the index data can thus include the first index data as well as index data of other data unrelated to the data to be processed. In one example, where all of the data to be stored serves as data to be processed, the index data may simply be the first index data.
It can be seen from the foregoing disclosure that, when writing data to be stored into a data processing system, which data is specifically written into which clusters can be flexibly determined according to a storage space of a memory area of each cluster, and storing the data to be stored into the data processing system in such a manner can make data distribution in the data processing system more balanced, thereby improving the utilization efficiency of the data processing system and also improving the data writing effect.
In one possible implementation, the data processing system may further include a storage device for storing the first index data and/or the second index data.
In the embodiments of the present disclosure, the plurality of clusters in the data processing system may be used to store the target data, and may also be used to store the first index data, the second index data, or both. Because the data processing system contains a plurality of clusters, exactly which cluster the index data is stored in can be determined flexibly according to the actual situation.
However, in some possible implementations, it is also contemplated to add a storage device to the data processing system for storing the first index data and/or the second index data. The implementation manner of the storage device is not limited in the embodiments of the present disclosure, and any device that can be used to store data may be used as an implementation form of the storage device.
The index data is stored through the additional storage device, so that the capacity of the to-be-processed data which can be stored by the cluster in the data processing system can be increased on one hand, and on the other hand, the index data and the to-be-processed data can be respectively stored, and the possibility of the situation of index confusion is greatly reduced under the condition that the target data is obtained through the index data.
It has been proposed in the above-mentioned disclosure that the first index data may be used to record a storage path of the target data in the data processing system, that is, a location of each target data in the data processing system may be recorded; the second index data may be used to record a storage path of the target data after the target data is cached in the data processing system, that is, a location of each target data in the data processing system after the caching process occurs. That is, the first index data may record the storage location of the target data in the memory area of each cluster after the target data is written into the data processing system, and the second index data may record the storage location of the target data in the cache or memory area of each cluster after the first target data is cached in the first cluster.
Thus, in one possible implementation, the first index data includes the storage path of the target data, and the second index data includes the cache path of at least part of the target data.
In the above-described disclosed embodiment, when all of the target data is first target data, that is, when all of it is cached in the cache area of the first cluster, the second index data may include the cache paths of all of the target data. When only part of the target data is first target data, that is, part is cached in the first cluster while the remainder, as second target data, stays in the memory area of its original cluster, the second index data may include both the cache paths of the first target data and the storage paths of the second target data.
How the index data specifically records the storage path and cache path of the target data can be determined flexibly according to actual requirements. In one possible implementation manner, the storage path of the target data may include a storage identifier, the identifier of the cluster storing the target data, the identifier of the data block to which the target data belongs, and a target data identifier; the cache path of the target data may include a cache identifier, the identifier of the data block to which the target data belongs, and a first target data identifier.
In the above disclosed embodiment, the storage identifier may be used to indicate that the target data is stored in the memory area, and a specific expression form of the storage identifier may be flexibly determined according to an implementation manner of the memory area.
By the storage path comprising the storage identifier and the cache path comprising the cache identifier, the corresponding identifier can be directly modified to obtain the index data of the cached first target data under the condition that the target data is cached into the cache region of the first cluster from the memory region of the second cluster. Through the process, the second index data can be obtained based on the modification of the first index data, and the convenience degree of obtaining the second index data is greatly improved. Meanwhile, in the process of acquiring the target data, the target data can be acquired in the corresponding storage area directly according to whether the identification is the storage identification or the cache identification, and the convenience degree of data acquisition is also improved.
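As a sketch of the identifier modification described above, assuming the path layout from the application example later in the text (a `ceph:<Cluster>/<Block>/<File>` storage path rewritten to an `mc:<Block>/<File>` cache path — this layout is an assumption for illustration, not a normative format):

```python
def to_cache_path(storage_path):
    """Rewrite a storage path into a cache path: swap the storage
    identifier for a cache identifier and drop the cluster identifier,
    which is implicit because the cache is always the first cluster's.

    Path format "ceph:<Cluster>/<Bucket>/<File>" is assumed here.
    """
    tag, rest = storage_path.split(":", 1)
    assert tag == "ceph", "only memory-area paths are rewritten"
    _cluster, block_and_file = rest.split("/", 1)  # cluster id omitted in cache path
    return "mc:" + block_and_file

print(to_cache_path("ceph:ClusterB/Bucket1/img_001.jpg"))  # mc:Bucket1/img_001.jpg
```

This is the "directly modify the corresponding identifier" step: the second index data entry is derived from the first index data entry without relocating any data description.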
Further, the storage path of the target data also includes the cluster identifier of the target data. Since the data processing system includes a plurality of clusters, the cluster identifier makes it possible to determine which cluster the target data is located in, and then to further locate the target data within that cluster. During data processing, the target location of caching is always the cache area of the first cluster, so the cluster identifier can be omitted when recording the cache path of the target data.
In addition, the storage path and the cache path of the target data each further include a data block identifier to which the target data belongs, that is, a data block to which the target data specifically belongs in a certain cluster, and a specific representation form of the data block identifier may be flexibly determined according to an implementation form of the cluster.
Similarly, the storage path and the cache path of the target data each further include a target data identifier of the target data, which is used to indicate specific data content of the target data, and a specific implementation form of the target data identifier is not limited.
With the implementations of the storage path and the cache path in the above-described disclosed embodiment, when the target data is obtained through the second index data, its position may be found according to the recorded storage path or cache path. When the record includes a storage identifier, the cluster to which the target data belongs can be determined from its cluster identifier, and the position within that cluster located through the data block identifier and the target data identifier. When the record includes a cache identifier, the position of the target data in the first cluster can be located directly from the data block identifier and the target data identifier, thereby achieving the acquisition of the target data.
According to the embodiment of the disclosure, the target data stored in the cache area and the storage area can be recorded in different identification forms, so that the generation efficiency of the index data and the search efficiency of the target data are improved.
Since the target data may be stored in either the memory area or the cache area of a cluster, and these areas may have different implementations, in a possible implementation manner the corresponding interface may be selected according to the area to which the target data belongs in order to obtain it. The implementation form of the interface is determined correspondingly by the implementation form of the storage area. In one example, when the cache area is MemCache, the interface may be the MemCache interface; in one example, when the memory area is a Ceph cluster, the interface may be an S3 interface, i.e., a RESTful interface accessible via the HTTP protocol. Further, as can be seen from the above disclosed embodiments, when the path of the target data contains a cache identifier, the data is by default located in the first cluster, so in one example it may be obtained directly from the first cluster through the MemCache interface; when the path contains a storage identifier, the location of the target data may first be determined from the remaining identifiers, and the data then read through the S3 interface.
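A minimal dispatch sketch, with stand-in callables in place of real MemCache and S3 clients (the path format and the backend dictionaries are invented for illustration):

```python
def read_target_data(path, memcache_get, s3_get):
    """Dispatch a read to the MemCache interface or the S3 interface
    based on the identifier prefix of the recorded path.

    `memcache_get` and `s3_get` stand in for real client calls; the
    "mc:"/"ceph:" path layout is an assumed illustration.
    """
    tag, rest = path.split(":", 1)
    if tag == "mc":
        # Cache identifier: the data is by default in the first cluster
        return memcache_get(rest)
    # Storage identifier: resolve the cluster first, then read via S3
    cluster, key = rest.split("/", 1)
    return s3_get(cluster, key)

# Toy backends standing in for a MemCache client and per-cluster S3 endpoints
cache = {"Bucket1/a.bin": b"cached"}
stores = {("ClusterB", "Bucket1/b.bin"): b"remote"}
data = read_target_data("mc:Bucket1/a.bin", cache.get,
                        lambda c, k: stores[(c, k)])  # -> b"cached"
```

The design point is that the index entry alone carries enough information to choose the interface, so no lookup of "where is this item cached?" is needed at read time.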
Reading target data in different storage areas through different interfaces allows the training data to be read efficiently from each kind of storage area, improving the efficiency of the whole training-data reading process.
Based on the foregoing disclosed embodiments, after data processing is completed, part of the target data may remain cached in the cache area of the first cluster. In a possible implementation manner, the storage resources in the cache area of the first cluster may be released manually after data processing is completed. In another possible implementation manner, those storage resources need not be released: on the next round of data processing, or when the cache area of the first cluster is requested again, the previously cached target data may be replaced automatically through a Least Recently Used (LRU) replacement policy.
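An LRU replacement policy for a size-bounded cache area can be sketched with an ordered map; this is a generic textbook LRU, not the patent's specific implementation, and it assumes no single item exceeds the total capacity:

```python
from collections import OrderedDict

class LRUCacheArea:
    """Minimal LRU replacement for a cluster's cache area: when a new
    item does not fit, the least recently used entries are evicted."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0
        self.items = OrderedDict()  # key -> size, least recently used first

    def get(self, key):
        if key in self.items:
            self.items.move_to_end(key)  # mark as most recently used
            return self.items[key]
        return None

    def put(self, key, size):
        while self.used + size > self.capacity and self.items:
            _, evicted_size = self.items.popitem(last=False)  # evict LRU entry
            self.used -= evicted_size
        self.items[key] = size
        self.used += size

cache = LRUCacheArea(capacity=100)
cache.put("a", 60)
cache.put("b", 40)
cache.get("a")      # touching "a" makes "b" the least recently used
cache.put("c", 30)  # evicts "b", keeps "a"
```

After the final `put`, the cache holds "a" and "c" (90 of 100 units), matching the policy's promise that stale entries are displaced only as new caching demand arrives.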
Fig. 2 shows a block diagram of a data processing apparatus according to an embodiment of the present disclosure. As shown, the apparatus 20 may be applied to a data processing system, where the data processing system includes a plurality of clusters, at least a part of the plurality of clusters are used for storing target data to be processed, the plurality of clusters include a first cluster and a second cluster, and the target data includes first target data;
the apparatus 20 comprises:
an obtaining module 21, configured to obtain first index data, and obtain first target data from a storage area of a second cluster according to the first index data;
the generating module 22 is configured to cache the first target data in the storage area of the first cluster, and generate second index data of the target data;
and the processing module 23 is configured to obtain the first target data from the storage area of the first cluster according to the second index data, so as to process the target data through the hardware resource of the first cluster.
In one possible implementation, the target data includes second target data, and the apparatus is further configured to: and acquiring second target data from the storage areas of the clusters except the first cluster in the plurality of clusters according to the second index data.
In one possible implementation, the storage area includes a memory area and a cache area; the acquisition module is used for: and acquiring first target data from a memory area of a second cluster in the plurality of clusters according to the first index data and the capacity of the cache area of the first cluster, wherein the data volume of the first target data is less than or equal to the capacity of the cache area of the first cluster.
In a possible implementation manner, the storage area includes a cache area, and the apparatus further includes a determining module, configured to operate before the obtaining module: determine a second cluster from the plurality of clusters according to the transmission bandwidth between each of the other clusters and the first cluster and/or the capacity of the cache area of each of the other clusters, wherein the other clusters include the clusters in the plurality of clusters except the first cluster.
In one possible implementation, the determining module is further configured to: respectively obtaining a proportion parameter of each cluster in a plurality of clusters, wherein the proportion parameter comprises a ratio of the capacity of a cache region to the available bandwidth in a transmission bandwidth; and determining the second cluster according to the proportion parameter, so that the sum of the capacities of the cache areas of all the clusters in the second cluster is less than or equal to the capacity of the cache area of the first cluster, and the sum of the proportion parameters of all the clusters in the second cluster is greater than or equal to the threshold value.
In one possible implementation, the storage area includes a memory area, and the apparatus is further configured to, before the acquisition by the obtaining module: acquire the storage space of the memory area of each cluster in the plurality of clusters; and store the data to be stored, in a distributed manner according to the storage space, into the memory areas of at least part of the clusters, generating index data of the data to be stored, wherein the index data of the data to be stored includes the first index data.
In one possible implementation, the data processing system includes a storage device for storing the first index data and/or the second index data.
In one possible implementation, the first index data includes a storage way of the target data, and the second index data includes a cache way of at least part of the target data; the storage path of the target data comprises a storage identifier, a cluster identifier for storing the target data, a data block identifier to which the target data belongs, and a target data identifier; the caching path of the target data comprises a caching identifier, a data block identifier to which the target data belongs, and a target data identifier.
FIG. 3 shows a block diagram of a data processing system according to an embodiment of the present disclosure. As shown, the system 30 may include a plurality of clusters, at least a portion of the plurality of clusters being used for storing target data to be processed, the plurality of clusters including a first cluster 31 and a second cluster 32, the target data including first target data;
a second cluster 32 for storing the first target data through a storage area of the second cluster;
a first cluster 31 for caching first target data stored in a storage area of a second cluster into a storage area of the first cluster;
the data processing system 30 further comprises a data processing apparatus 20 as described in any of the above to perform the data processing method as described in any of the above.
In one possible implementation manner, the system includes a storage device, and the storage device is configured to store first index data and/or second index data, where the first index data includes a storage way of the target data, and the second index data includes a cache way of at least part of the target data.
Fig. 4 and Fig. 5 are schematic diagrams illustrating an application example according to the present disclosure, wherein Fig. 4 illustrates a data processing system according to an application example of the present disclosure, based on which processing of target data, such as training or verification, can be implemented.
As can be seen from Fig. 4, the data processing system is mainly composed of two parts, which may be referred to as a cluster set and a storage device. In the application example of the present disclosure, the structure of the cluster set is shown in the lower half of Fig. 4: the cluster set may be composed of a plurality of clusters, referred to as Cluster A, B, C, …, G, and when training is required, target data may be written to and read from the clusters in a distributed manner. In the disclosed example, these clusters may be implemented as Ceph clusters, each of which may provide an S3 interface to the outside.
The structure of the storage device is shown in the upper half of Fig. 4. As shown in the figure, the storage device may be composed of a complete Redis file storage system. In the application example of the present disclosure, this Redis file storage system may be used to hold a file list of all target data, that is, to record the storage location of the target data in each cluster.
Based on this data processing system, the application example of the present disclosure provides a method for writing target data, so that the target data can be stored into the system. In one example, the write process may be: automatically determine which Ceph cluster the target data is stored in according to the storage capacity of each Ceph cluster; assign the target data a storage path within that cluster (BucketName/FileName); store the target data into the cluster; then add a prefix identifying the cluster to the path and store the path into the Redis file list. For example, the path of target data written into Cluster A may contain ClusterA in its prefix, and the path of target data written into Cluster E may contain ClusterE. At this point, the target data has been written into the data processing system, and the specific storage location of each piece of target data has also been recorded within the system.
After the writing of the target data is completed, when the target data needs to be read to train a neural network model, the computing resources of a particular cluster may be selected for computation. For example, when GPU resources on Cluster A are chosen for model training while the target data is distributed across clusters A, B, C, …, G, the data on the other clusters (B, C, …, G) needs to be read to Cluster A during training, and the target data in the training set is read randomly over multiple rounds. To improve the reading efficiency of the target data, MemCache may be deployed on each cluster to provide a memory-level cache. Reading target data from MemCache improves the data reading efficiency of the whole system.
Fig. 5 shows a schematic structural diagram of each cluster according to the application example of the present disclosure, taking Cluster A as an example. As can be seen from the figure, each cluster may include two storage regions: MemCache, serving as the cache area, and Ceph, serving as the memory area. As mentioned in the above application example, Ceph is mainly used for storing written target data, while the target data in MemCache enables efficient reading. When target data is written, it is essentially always written into the Ceph of a cluster, so the prefix of the written data's path may include the storage area identifier Ceph. In one example, for target data written into Cluster A, the location format stored in Redis may be: ceph:ClusterA/BucketName/FileName.
In one example, if the memory cache resources on Cluster A are sufficient, all the target data on the remote clusters can be read into the cache region of Cluster A, and during training the device performing the reads, such as a processor, can read data from the MemCache of Cluster A. The specific process may be: traverse the file list stored in Redis, parse each prefix, and read the target data from the corresponding Cluster so that it is cached into the MemCache of Cluster A; then change the prefix to MC and write the entry into a temporary file list used for training.
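The construction of the temporary file list can be sketched as follows. This is an illustrative model under the prefix format given earlier (Ceph:Cluster/BucketName/FileName); the entry values and cluster names are hypothetical, and a Python list again stands in for the Redis file list.

```python
# Sketch of building the temporary file list: traverse the (simulated)
# Redis file list, parse each prefix, and for entries whose data has
# been cached into Cluster A's MemCache, rewrite the prefix to "MC".

def parse_entry(entry):
    """Split 'Ceph:Cluster/Bucket/File' into its four components."""
    region, rest = entry.split(":", 1)
    cluster, bucket, filename = rest.split("/", 2)
    return region, cluster, bucket, filename

def build_temp_list(file_list, cached_clusters):
    temp = []
    for entry in file_list:
        region, cluster, bucket, filename = parse_entry(entry)
        if cluster in cached_clusters:
            # Data copied into the local MemCache: keep only the
            # cache identifier and the in-cluster path.
            temp.append(f"MC/{bucket}/{filename}")
        else:
            temp.append(entry)
    return temp

file_list = ["Ceph:ClusterB/Bkt/f1", "Ceph:ClusterC/Bkt/f2"]
print(build_temp_list(file_list, {"ClusterB"}))
# → ['MC/Bkt/f1', 'Ceph:ClusterC/Bkt/f2']
```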
However, since memory resources are often limited, it is usually not possible to cache all the data in the MemCache of Cluster A. Thus, in one possible implementation, the training data of some subset of the remote Ceph Clusters may be selected and copied to the MemCache of Cluster A. To improve reading performance, two goals should be pursued: on one hand, the cache region should be used as fully as possible, to maximize the performance advantage of caching; on the other hand, bandwidth usage among the clusters should be kept as balanced as possible, reducing the situation where the network path between two clusters is congested while another path is nearly idle, a situation in which the network bandwidth among the different clusters cannot be fully used.
In one possible implementation, the above problem can be abstracted as a knapsack problem in computer algorithms (specifically, the 0/1 knapsack problem, in which items cannot be split). The total amount of data that the cache can finally hold can be regarded as the weight limit of the knapsack, each remote cluster can be regarded as an item to be loaded into the knapsack, and the weight of an item can be regarded as the amount of target data on that remote cluster. The cost function may then be defined according to the two goals described above.
Further, the solution process may be described as follows: assume the total amount of data the cache can finally hold is P, the amounts of training data on the remote clusters are Pb, Pc, … Pg, and the revenue function for caching the training data of a remote cluster is V. In the application example of the present disclosure, a subset Z of {B, C, …, G} needs to be found such that, under the constraint that the sum of the training data amounts of the elements Z1, Z2 … Zi in Z is less than P, the sum of the revenues of the elements in Z is maximized.
For the revenue function V, based on the two objectives proposed in the application example above, it can be defined as:
V(i)=P(i)/L(a,i)
where L(a, i) is the remaining bandwidth of the network path between cluster A and cluster i at the beginning of training, and P(i) represents the performance gain available from caching the data of cluster i. Using L(a, i) as the denominator adjusts the gain according to how busy the network path between the two clusters is.
The 0/1 knapsack problem can be solved quickly using dynamic programming, so an optimal set Z can be obtained with a dynamic programming algorithm, and the target data of the clusters in Z is cached into the MemCache of Cluster A by the method described above. This solves the problem of choosing which remote clusters to cache when cache resources are insufficient. After this process is completed, a temporary file list may be generated: the prefix of target data cached in the MemCache of Cluster A is modified to MC, while the prefixes of the other data remain the Ceph identifier plus the cluster name. That is, the storage format of target data located in the MemCache of Cluster A may be MC/BucketName/FileName, while the storage format of target data still residing on a cluster may be, for example, Ceph:ClusterA/BucketName/FileName.
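The cluster selection can be sketched as a standard 0/1 knapsack dynamic program over the remote clusters. This is a minimal illustration, not the system's implementation: cluster names, data amounts, and bandwidth figures are hypothetical, and data amounts are assumed to be integers in some cache unit so they can index the DP table.

```python
# Sketch: choose which remote clusters to cache via 0/1 knapsack
# dynamic programming. Each cluster i has data amount P(i) and gain
# V(i) = P(i) / L(a, i), where L(a, i) is the remaining bandwidth of
# the path to cluster i; the cache capacity of Cluster A is the
# knapsack weight limit. All figures below are hypothetical.

def select_clusters(clusters, capacity):
    """clusters: {name: (data_amount, remaining_bandwidth)}.
    Returns the subset maximizing total gain V subject to the total
    data amount fitting within the given cache capacity."""
    # best[w] = (total_gain, chosen_clusters) using at most w units.
    best = [(0.0, frozenset())] * (capacity + 1)
    for name, (amount, bandwidth) in clusters.items():
        gain = amount / bandwidth  # V(i) = P(i) / L(a, i)
        # Iterate weights downward so each cluster is used at most once.
        for w in range(capacity, amount - 1, -1):
            cand = best[w - amount][0] + gain
            if cand > best[w][0]:
                best[w] = (cand, best[w - amount][1] | {name})
    return best[capacity][1]

remote = {"B": (4, 2), "C": (3, 1), "G": (5, 5)}  # (P(i), L(a, i))
print(sorted(select_clusters(remote, 7)))  # → ['B', 'C']
```

With a cache capacity of 7, caching B and C (gains 2.0 and 3.0, total weight 7) beats any combination involving G, matching the maximize-revenue-under-capacity formulation above.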
After training is completed, the cache resources occupied in the MemCache can be released without manual operation: during the next training run, or when cache space is requested again, the MemCache automatically replaces the previously cached data through its LRU replacement strategy.
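The LRU replacement strategy mentioned above can be illustrated with a minimal cache built on an OrderedDict. This is a generic sketch of the policy, not MemCache's actual eviction code; keys and capacity are hypothetical.

```python
# Minimal sketch of LRU replacement: a hit moves the entry to the
# "recent" end, and inserting into a full cache evicts the least
# recently used entry from the other end.
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()

    def get(self, key):
        if key not in self.store:
            return None
        self.store.move_to_end(key)  # mark as most recently used
        return self.store[key]

    def put(self, key, value):
        if key in self.store:
            self.store.move_to_end(key)
        self.store[key] = value
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict least recently used

cache = LRUCache(2)
cache.put("f1", b"...")
cache.put("f2", b"...")
cache.get("f1")          # f1 becomes most recently used
cache.put("f3", b"...")  # evicts f2, the least recently used entry
print(list(cache.store))  # → ['f1', 'f3']
```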
Through the above process, the target data written into the system can be redistributed through the cache regions, enabling efficient reading. In an application example of the present disclosure, after the training data in the cache region has been updated through the above process, the position of each piece of target data may be obtained from the temporary file list during training: if the prefix is MC, the data is read from MemCache through the MemCache interface; if the prefix is Ceph, the cluster where the target data is located is further parsed, and the data is then obtained from the corresponding Ceph cluster through the S3 interface.
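The prefix-based read dispatch can be sketched as follows. The two reader callables are stand-ins for the real MemCache and S3 client interfaces (their signatures here are assumptions for illustration), and the entry strings follow the prefix formats given earlier.

```python
# Sketch of the read dispatch described above: entries whose prefix is
# "MC" are served from the local MemCache, while "Ceph" entries are
# resolved to their cluster and fetched through the S3 interface.
# memcache_get and s3_get are hypothetical stand-ins for the real
# client APIs.

def read_target_data(entry, memcache_get, s3_get):
    if entry.startswith("MC/"):
        _, bucket, filename = entry.split("/", 2)
        return memcache_get(bucket, filename)
    region, rest = entry.split(":", 1)        # "Ceph" : "ClusterX/..."
    cluster, bucket, filename = rest.split("/", 2)
    return s3_get(cluster, bucket, filename)

mc = lambda b, f: f"mc:{b}/{f}"
s3 = lambda c, b, f: f"s3:{c}:{b}/{f}"
print(read_target_data("MC/Bkt/f1", mc, s3))             # → mc:Bkt/f1
print(read_target_data("Ceph:ClusterB/Bkt/f2", mc, s3))  # → s3:ClusterB:Bkt/f2
```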
Through the above process, training data can be stored and training performed across multiple clusters, providing better scalability; at the same time, the cache and the network bandwidth between clusters can be fully utilized during cross-cluster training to provide better training performance.
It is understood that the above-mentioned method embodiments of the present disclosure can be combined with each other to form combined embodiments without departing from the principles and logic; the details are omitted here due to space limitations.
It will be understood by those skilled in the art that, in the method of the present disclosure, the order in which the steps are written does not imply a strict order of execution or any limitation on the implementation; the specific order of execution of the steps should be determined by their functions and possible inherent logic.
Embodiments of the present disclosure also provide a computer-readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the above-mentioned method. The computer readable storage medium may be a volatile computer readable storage medium or a non-volatile computer readable storage medium.
An embodiment of the present disclosure further provides an electronic device, including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured as the above method.
In practical applications, the memory may be a volatile memory (RAM); or a non-volatile memory (non-volatile memory) such as a ROM, a flash memory (flash memory), a Hard Disk (Hard Disk Drive, HDD) or a Solid-State Drive (SSD); or a combination of the above types of memories and provides instructions and data to the processor.
The processor may be at least one of ASIC, DSP, DSPD, PLD, FPGA, CPU, controller, microcontroller, and microprocessor. It is understood that the electronic devices for implementing the above-described processor functions may be other devices, and the embodiments of the present disclosure are not particularly limited.
The electronic device may be provided as a terminal, server, or other form of device.
Based on the same technical concept of the foregoing embodiments, the embodiments of the present disclosure also provide a computer program, which when executed by a processor implements the above method.
Fig. 6 is a block diagram of an electronic device 800 according to an embodiment of the disclosure. For example, the electronic device 800 may be a terminal such as a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, or a personal digital assistant.
Referring to fig. 6, electronic device 800 may include one or more of the following components: processing component 802, memory 804, power component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and communication component 816.
The processing component 802 generally controls overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing components 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the electronic device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 806 provides power to the various components of the electronic device 800. The power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 800.
The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the electronic device 800 is in an operation mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the electronic device 800. For example, the sensor assembly 814 may detect an open/closed state of the electronic device 800, the relative positioning of components, such as a display and keypad of the electronic device 800, the sensor assembly 814 may also detect a change in the position of the electronic device 800 or a component of the electronic device 800, the presence or absence of user contact with the electronic device 800, orientation or acceleration/deceleration of the electronic device 800, and a change in the temperature of the electronic device 800. Sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium, such as the memory 804, is also provided that includes computer program instructions executable by the processor 820 of the electronic device 800 to perform the above-described methods.
Fig. 7 is a block diagram of an electronic device 1900 according to an embodiment of the disclosure. For example, the electronic device 1900 may be provided as a server. Referring to fig. 7, electronic device 1900 includes a processing component 1922 further including one or more processors and memory resources, represented by memory 1932, for storing instructions, e.g., applications, executable by processing component 1922. The application programs stored in memory 1932 may include one or more modules that each correspond to a set of instructions. Further, the processing component 1922 is configured to execute instructions to perform the above-described method.
The electronic device 1900 may also include a power component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 may operate based on an operating system stored in memory 1932, such as Windows Server, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM, or the like.
In an exemplary embodiment, a non-transitory computer readable storage medium, such as the memory 1932, is also provided that includes computer program instructions executable by the processing component 1922 of the electronic device 1900 to perform the above-described methods.
The present disclosure may be systems, methods, and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for causing a processor to implement various aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present disclosure may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), can execute computer-readable program instructions to implement various aspects of the present disclosure by utilizing state information of the computer-readable program instructions to personalize the electronic circuitry.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (20)

1. A data processing method is applied to a data processing system, wherein the data processing system comprises a plurality of clusters, at least part of the plurality of clusters are used for storing target data to be processed, the plurality of clusters comprise a first cluster and a second cluster, and the target data comprises first target data;
the method comprises the following steps:
acquiring first index data, and acquiring first target data from a storage area of the second cluster according to the first index data;
caching the first target data into a storage area of the first cluster, and generating second index data of the target data;
and acquiring the first target data from the storage area of the first cluster according to the second index data so as to process the target data through the hardware resources of the first cluster.
2. The method of claim 1, wherein the target data comprises second target data, the method further comprising:
and acquiring the second target data from the storage areas of the clusters except the first cluster in the plurality of clusters according to the second index data.
3. The method according to claim 1 or 2, wherein the storage area comprises a memory area and a cache area;
the acquiring, according to the first index data, first target data from the storage area of the second cluster includes:
and acquiring the first target data from the memory area of the second cluster according to the first index data and the capacity of the cache area of the first cluster, wherein the data volume of the first target data is less than or equal to the capacity of the cache area of the first cluster.
4. The method according to any one of claims 1 to 3, wherein the storage area comprises a cache area, and before the obtaining of the first target data from the storage area of the second cluster according to the first index data, the method further comprises:
determining the second cluster from the plurality of clusters according to a transmission bandwidth between each of the other clusters and the first cluster and/or a capacity of a buffer area of each of the other clusters, wherein the other clusters include the clusters other than the first cluster.
5. The method of claim 4, wherein the determining the second cluster from the plurality of clusters according to the transmission bandwidth between each of the other clusters and the first cluster and the capacity of the buffer area of each of the other clusters comprises:
respectively obtaining a proportion parameter of each cluster in the plurality of clusters, wherein the proportion parameter comprises a ratio of the capacity of a cache region to the available bandwidth in a transmission bandwidth;
and determining the second cluster according to the proportion parameter, so that the sum of the capacities of the cache areas of all clusters in the second cluster is smaller than or equal to the capacity of the cache area of the first cluster, and the sum of the proportion parameter of all clusters in the second cluster is larger than or equal to a threshold value.
6. The method of any one of claims 1 to 5, wherein the storage area comprises a memory area;
prior to the obtaining the first index data, the method further comprises:
acquiring a storage space of a memory area of each cluster in the plurality of clusters;
according to the storage space, storing data to be stored in a distributed mode into a memory area of at least part of the clusters in the plurality of clusters, and generating index data of the data to be stored, wherein the index data of the data to be stored comprises the first index data.
7. The method according to any one of claims 1 to 6, wherein the data processing system comprises a storage device for storing the first index data and/or the second index data.
8. The method of any of claims 1 to 7, wherein the first index data comprises a storage path of the target data, and the second index data comprises a cache path of at least a portion of the target data;
the storage path of the target data comprises a storage identifier, a cluster identifier used for storing the target data, a data block identifier to which the target data belongs, and the target data identifier;
the cache path of the target data comprises a cache identifier, a data block identifier to which the target data belongs, and the target data identifier.
9. A data processing device is applied to a data processing system, the data processing system comprises a plurality of clusters, at least part of the plurality of clusters are used for storing target data to be processed, the plurality of clusters comprise a first cluster and a second cluster, and the target data comprises first target data;
the device comprises:
the acquisition module is used for acquiring first index data and acquiring first target data from the storage area of the second cluster according to the first index data;
the generating module is used for caching the first target data into a storage area of the first cluster and generating second index data of the target data;
and the processing module is used for acquiring the first target data from the storage area of the first cluster according to the second index data so as to process the target data through the hardware resources of the first cluster.
10. The apparatus of claim 9, wherein the target data comprises second target data, the apparatus further configured to:
and acquiring the second target data from the storage areas of the clusters except the first cluster in the plurality of clusters according to the second index data.
11. The apparatus according to claim 9 or 10, wherein the storage area comprises a memory area and a cache area;
the acquisition module is configured to:
and acquiring the first target data from the memory area of a second cluster in the plurality of clusters according to the first index data and the capacity of the cache area of the first cluster, wherein the data volume of the first target data is less than or equal to the capacity of the cache area of the first cluster.
12. The apparatus according to any one of claims 9 to 11, wherein the storage area comprises a cache area, and before the obtaining module, the apparatus further comprises a determining module configured to:
determining the second cluster from the plurality of clusters according to a transmission bandwidth between each of the other clusters and the first cluster and/or a capacity of a buffer area of each of the other clusters, wherein the other clusters include the clusters other than the first cluster.
13. The apparatus of claim 12, wherein the determining module is further configured to:
respectively obtaining a proportion parameter of each cluster in the plurality of clusters, wherein the proportion parameter comprises a ratio of the capacity of a cache region to the available bandwidth in a transmission bandwidth;
and determining the second cluster according to the proportion parameter, so that the sum of the capacities of the cache areas of all clusters in the second cluster is smaller than or equal to the capacity of the cache area of the first cluster, and the sum of the proportion parameter of all clusters in the second cluster is larger than or equal to a threshold value.
14. The apparatus of any one of claims 9 to 13, wherein the storage area comprises a memory area;
before the obtaining module, the apparatus is further configured to:
acquiring a storage space of a memory area of each cluster in the plurality of clusters;
according to the storage space, storing data to be stored in a distributed mode into a memory area of at least part of the clusters in the plurality of clusters, and generating index data of the data to be stored, wherein the index data of the data to be stored comprises the first index data.
15. The apparatus of any of claims 9 to 14, wherein the data processing system comprises a storage device to store the first index data and/or the second index data.
16. The apparatus of any of claims 9 to 15, wherein the first index data comprises a storage path of the target data, and the second index data comprises a cache path of at least a portion of the target data;
the storage path of the target data comprises a storage identifier, a cluster identifier used for storing the target data, a data block identifier to which the target data belongs, and the target data identifier;
the cache path of the target data comprises a cache identifier, a data block identifier to which the target data belongs, and the target data identifier.
17. A data processing system, wherein the data processing system comprises a plurality of clusters, at least some of the plurality of clusters being configured to store target data to be processed, the plurality of clusters comprising a first cluster and a second cluster, the target data comprising first target data;
the second cluster is used for storing the first target data through a storage area of the second cluster;
the first cluster is used for caching the first target data stored in the storage area of the second cluster into the storage area of the first cluster;
the data processing system further comprises a data processing apparatus according to any of claims 9 to 16 to perform the method of any of claims 1 to 8.
18. The system of claim 17, comprising a storage device configured to store first index data and/or second index data, wherein the first index data comprises a storage path of the target data, and the second index data comprises a cache path of at least a portion of the target data.
19. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to invoke the memory-stored instructions to perform the method of any one of claims 1 to 8.
20. A computer readable storage medium having computer program instructions stored thereon, which when executed by a processor implement the method of any one of claims 1 to 8.
CN201911311681.2A 2019-12-18 2019-12-18 Data processing method, device and system, electronic equipment and storage medium Pending CN110990357A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911311681.2A CN110990357A (en) 2019-12-18 2019-12-18 Data processing method, device and system, electronic equipment and storage medium


Publications (1)

Publication Number Publication Date
CN110990357A true CN110990357A (en) 2020-04-10

Family

ID=70095575

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911311681.2A Pending CN110990357A (en) 2019-12-18 2019-12-18 Data processing method, device and system, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110990357A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111782592B (en) * 2020-06-30 2024-06-07 北京百度网讯科技有限公司 Method, device and system for dividing data

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102375853A (en) * 2010-08-24 2012-03-14 中国移动通信集团公司 Distributed database system, method for building index therein and query method
CN103136340A (en) * 2013-02-04 2013-06-05 北京大学 Method for migrating massive spatial data in cluster fast
US20170011054A1 (en) * 2015-07-11 2017-01-12 International Business Machines Corporation Intelligent caching in distributed clustered file systems
CN109597567A (en) * 2017-09-30 2019-04-09 网宿科技股份有限公司 Data processing method and device
CN110109891A (en) * 2018-01-18 2019-08-09 伊姆西Ip控股有限责任公司 Method, equipment and computer program product for Data Migration



Similar Documents

Publication Publication Date Title
CN112738623B (en) Video file generation method, device, terminal and storage medium
CN112911379B (en) Video generation method, device, electronic equipment and storage medium
JP2020515124A (en) Method and apparatus for processing multimedia resources
WO2021104315A1 (en) Photographed image sharing method and apparatus, and mobile terminal and readable storage medium
CN106991018B (en) Interface skin changing method and device
CN107040591A Method and device for controlling a client
CN113535032A (en) Information display method and device, electronic equipment and storage medium
CN110781349A (en) Method, equipment, client device and electronic equipment for generating short video
CN110968364A (en) Method and device for adding shortcut plug-in and intelligent equipment
CN111258952A (en) Data storage control method, device and storage medium
CN109447258B (en) Neural network model optimization method and device, electronic device and storage medium
CN110826697A (en) Method and device for obtaining sample, electronic equipment and storage medium
US11494117B2 (en) Method and system for data processing
CN112667741B (en) Data processing method and device and data processing device
CN107231283A (en) Information management method and device, message pre-head method and device
CN110913276B (en) Data processing method, device, server, terminal and storage medium
CN111694768B (en) Operation method, device and related product
CN114237450B (en) Virtual resource transfer method, device, equipment, readable storage medium and product
CN110990357A (en) Data processing method, device and system, electronic equipment and storage medium
CN114428589A (en) Data processing method and device, electronic equipment and storage medium
CN109600409A Resource management method and terminal for an application
CN114610656A (en) Data replacement method and device, electronic equipment and storage medium
CN109981729B (en) File processing method and device, electronic equipment and computer readable storage medium
CN110611839B (en) Interactive content processing method, device and storage medium
CN114020505B (en) Data processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination