CN118170737A - Data processing method and device and related equipment - Google Patents

Data processing method and device and related equipment

Info

Publication number
CN118170737A
Authority
CN
China
Prior art keywords
data, written, target, cluster, storage position
Prior art date
Legal status
Pending
Application number
CN202410579112.0A
Other languages
Chinese (zh)
Inventor
姜康
Current Assignee
China Mobile Communications Group Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Application filed by China Mobile Communications Group Co Ltd, China Mobile Suzhou Software Technology Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority to CN202410579112.0A
Publication of CN118170737A

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a data processing method, a device and related equipment, wherein the method comprises the following steps: acquiring a plurality of data to be written and a plurality of state information corresponding to a plurality of clusters; generating cache information according to the plurality of state information and the plurality of data to be written; storing each data to be written in a corresponding first storage position based on the cache information; and migrating the data to be written from the first storage position to the corresponding second storage position under the condition that the storage duration of the data to be written in the corresponding first storage position exceeds a preset duration. According to the method and the device, after the plurality of state information corresponding to the plurality of clusters are determined, corresponding cache information is allocated to the plurality of data to be written based on the plurality of state information, so that a storage cluster is accurately allocated to each data to be written, and the data processing efficiency is improved.

Description

Data processing method and device and related equipment
Technical Field
The embodiment of the application relates to the field of cloud computing, in particular to a data processing method, a data processing device and related equipment.
Background
Elasticsearch is a widely used search engine tool in the industry and is generally used for writing, storing and searching logs in log scenarios. However, due to the architecture of Elasticsearch itself, when the shard count of a single cluster reaches tens of thousands and the number of data nodes reaches hundreds, nodes frequently drop out of the cluster, causing network glitches; query performance drops noticeably under concurrent query, aggregation and scripting scenarios, normal access to the data may even be affected in severe cases, and the maintenance cost of the cluster rises significantly.
Disclosure of Invention
The embodiment of the application provides a data processing method, a data processing device and related equipment, which are used for solving the problem of low data writing efficiency in the prior art.
To solve the above problems, the present application is achieved as follows:
In a first aspect, an embodiment of the present application provides a data processing method, where the method includes:
acquiring a plurality of data to be written and a plurality of state information corresponding to a plurality of clusters, wherein the plurality of clusters are in one-to-one correspondence with the plurality of state information, and the state information is used for indicating the current load state of the corresponding clusters;
Generating cache information according to the plurality of state information and the plurality of data to be written, wherein the cache information is used for indicating storage information of each data to be written in the plurality of data to be written, and the storage information of target data comprises: the time point when the target data is written into a target cluster and the first storage position when the target data is written into the target cluster, wherein the target data is any one of the plurality of data to be written, and the target cluster is the cluster with the highest matching degree with the target data in the plurality of clusters;
Storing each data to be written in a corresponding first storage position based on the cache information;
And migrating the data to be written from the first storage position to the corresponding second storage position under the condition that the storage duration of the data to be written stored in the corresponding first storage position exceeds a preset duration, wherein the query performance of the first storage position corresponding to the target data is higher than that of the second storage position corresponding to the target data.
Optionally, the generating cache information according to the plurality of state information and the plurality of data to be written includes:
calculating the plurality of state information and the plurality of data to be written according to a load dynamic evaluation algorithm to obtain a first calculation result, wherein the first calculation result is used for indicating a first storage position corresponding to the data to be written;
Determining a second calculation result according to the first calculation result, wherein the second calculation result is used for indicating a writing time point corresponding to the data to be written;
and generating the cache information according to the first calculation result and the second calculation result.
Optionally, the storing each data to be written in a corresponding first storage location based on the cache information includes:
determining a plurality of target time points based on the cache information, wherein the target time points correspond to the data to be written one by one, and the target time points are time points when the corresponding data to be written is written into a cluster;
Generating a time wheel according to the plurality of target time points, wherein the time wheel comprises the plurality of target time points which are sequentially arranged according to a time sequence;
And storing each data to be written in a corresponding first storage position according to the time wheel and the cache information.
Optionally, the storing each data to be written in the corresponding first storage location according to the time wheel and the cache information includes:
Generating a plurality of metadata mapping tables based on the clusters and the cache information, wherein the metadata mapping tables are in one-to-one correspondence with the clusters, and the metadata mapping tables are used for indicating the positions and the storage time of the stored contents of the corresponding clusters;
and storing each data to be written in a corresponding first storage position according to the time wheel and the metadata mapping table information.
Optionally, the storing each data to be written in a corresponding first storage location according to the time wheel and the metadata mapping tables includes:
Under the condition that the index corresponding to the target data is other than a preset index, determining a first cluster in the plurality of clusters, wherein the first cluster is the cluster with the lowest current load in the plurality of clusters;
writing the target data into the first cluster according to the time wheel and the metadata mapping tables;
Updating the time wheel to obtain a target time wheel under the condition that the index corresponding to the target data is a preset index and the load index of the target cluster is larger than a preset threshold;
Storing the target data in the corresponding first storage position according to the target time wheel and the metadata mapping tables;
Under the condition that the index corresponding to the target data is a preset index and the load index of the target cluster is smaller than a preset threshold, determining a second cluster in the clusters, wherein the second cluster is the cluster with the lowest current load in the clusters;
And writing the target data into the second cluster for storage according to the time wheel and the metadata mapping table information.
Optionally, after the migration of the data to be written from the first storage location to the corresponding second storage location in the case that the storage duration of the data to be written stored in the corresponding first storage location exceeds the preset duration, the method further includes:
Acquiring a data query request, wherein the data query request is used for querying first data in the clusters;
Acquiring storage information in the plurality of clusters, wherein the storage information is used for indicating at least one storage position of the first data in the plurality of clusters;
Screening the at least one storage position according to preset conditions to determine a target storage position, wherein the preset conditions comprise at least one of the following: whether the storage position is within the preset duration, whether the storage position is positioned at the first storage position and whether the storage position is positioned at the second storage position;
and acquiring the first data according to the target storage position.
In a second aspect, an embodiment of the present application further provides a data processing apparatus, including:
the system comprises an acquisition module, a storage module and a storage module, wherein the acquisition module is used for acquiring a plurality of data to be written and a plurality of state information corresponding to a plurality of clusters, the clusters are in one-to-one correspondence with the state information, and the state information is used for indicating the current load state of the corresponding cluster;
the generating module is configured to generate cache information according to the plurality of state information and the plurality of data to be written, where the cache information is used to indicate storage information of each data to be written in the plurality of data to be written, and the storage information of the target data includes: the time point when the target data is written into a target cluster and the first storage position when the target data is written into the target cluster, wherein the target data is any one of the plurality of data to be written, and the target cluster is the cluster with the highest matching degree with the target data in the plurality of clusters;
The writing module is used for storing each data to be written in a corresponding first storage position based on the cache information;
The migration module is used for migrating the data to be written from the first storage position to the corresponding second storage position under the condition that the storage duration of the data to be written stored in the corresponding first storage position exceeds a preset duration, wherein the query performance of the first storage position corresponding to the target data is higher than that of the second storage position corresponding to the target data.
In a third aspect, an embodiment of the present application further provides an electronic device, including: a transceiver, a memory, a processor, and a program stored on the memory and executable on the processor; the processor is configured to read a program in the memory to implement the steps in the method according to the foregoing first aspect.
In a fourth aspect, embodiments of the present application also provide a readable storage medium storing a program which, when executed by a processor, implements the steps of the method as described in the foregoing first aspect.
In a fifth aspect, embodiments of the present application also provide a computer program product stored in a storage medium, the computer program product being executable by at least one processor to implement the steps in the method according to the first aspect.
The application provides a data processing method, a device and related equipment, wherein the method comprises the following steps: acquiring a plurality of data to be written and a plurality of state information corresponding to a plurality of clusters, wherein the plurality of clusters are in one-to-one correspondence with the plurality of state information, and the state information is used for indicating the current load state of the corresponding cluster; generating cache information according to the plurality of state information and the plurality of data to be written, wherein the cache information is used for indicating storage information of each data to be written in the plurality of data to be written, and the storage information of target data comprises: the time point when the target data is written into a target cluster and the first storage position when the target data is written into the target cluster, wherein the target data is any one of the plurality of data to be written, and the target cluster is the cluster with the highest matching degree with the target data among the plurality of clusters; storing each data to be written in a corresponding first storage position based on the cache information; and migrating the data to be written from the first storage position to the corresponding second storage position under the condition that the storage duration of the data to be written stored in the corresponding first storage position exceeds a preset duration, wherein the query performance of the first storage position corresponding to the target data is higher than that of the second storage position corresponding to the target data. According to the method and the device, after the plurality of state information corresponding to the plurality of clusters are determined, corresponding cache information is allocated to the plurality of data to be written based on the plurality of state information, so that a storage cluster is accurately allocated to each data to be written, and the data processing efficiency is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments of the present application will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort to a person of ordinary skill in the art.
FIG. 1 is a schematic flow chart of a data processing method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a server according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a data writing process according to an embodiment of the present application;
FIG. 4 is a second schematic diagram of a data writing process according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a data query flow provided in an embodiment of the present application;
FIG. 6 is a second schematic diagram of a server according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a data processing apparatus according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The terms "first," "second," and the like in embodiments of the present application are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. Furthermore, the use of "and/or" in the present application means at least one of the connected objects, such as a and/or B and/or C, means 7 cases including a alone a, B alone, C alone, and both a and B, both B and C, both a and C, and both A, B and C.
Referring to fig. 1, fig. 1 is a flow chart of a data processing method according to an embodiment of the present application. The data processing method shown in fig. 1 may be performed by a server.
As shown in fig. 1, the data processing method may include the steps of:
Step 101, acquiring a plurality of data to be written and a plurality of state information corresponding to a plurality of clusters, wherein the plurality of clusters are in one-to-one correspondence with the plurality of state information, and the state information is used for indicating the current load state of the corresponding clusters.
In this embodiment, the data processing method of the present application can be applied to an Elasticsearch search engine. Elasticsearch is a widely used search engine tool in the industry and is generally used for writing, storing and searching logs in log scenarios. It conveniently equips large amounts of data with search, analysis and exploration capabilities, and by fully exploiting the horizontal scalability of Elasticsearch, it makes the data more valuable in a production environment. The working principle of Elasticsearch is mainly divided into the following steps: first, a user submits data to the Elasticsearch database; a word segmentation controller then segments the corresponding sentences, and the weights and segmentation results are stored together with the data; when the user searches the data, the results are ranked and scored according to the weights, and the ranked results are returned and presented to the user.
Specifically, as shown in fig. 2, fig. 2 is a schematic structural diagram of a server in this embodiment. The Elasticsearch deployment includes a plurality of clusters. Data written from a client is first written into a distributed cache queue, and the write traffic is then distributed according to the dynamically balanced bearing capacity of each cluster, node and shard by means of directional routing and dynamic load evaluation in a multi-cluster controller. Precise management of index data is achieved through the lifecycle management capability of Elasticsearch.
The data stored in different clusters may differ. For example, cluster 1 stores the data with indexes A3, B1 and C2, cluster 2 stores the data with index A2, and cluster 3 stores the data with indexes A1, B2, C3, B3 and C1. It should be noted that each cluster includes cold nodes and hot nodes, and the query performance of a cold node is lower than that of a hot node. Therefore, data that is queried more frequently is generally stored on hot nodes, and data that is queried less frequently is generally stored on cold nodes.
The state information corresponding to the plurality of clusters is used to indicate the load state of each cluster; for example, a higher load state means higher pressure on the cluster for writing data, and a lower load state means lower pressure on the cluster for writing data.
Step 102, generating cache information according to the plurality of state information and the plurality of data to be written, where the cache information is used to indicate storage information of each data to be written in the plurality of data to be written, and the storage information of the target data includes: the time point when the target data is written into a target cluster and the first storage position when the target data is written into the target cluster, where the target data is any one of the plurality of data to be written, and the target cluster is the cluster with the highest matching degree with the target data among the plurality of clusters.
In this embodiment, cache information is generated through a plurality of state information and index information corresponding to a plurality of data to be written, where the cache information indicates storage information of each data to be written stored in a cluster. Specifically, a first storage location where each data to be written exists in a certain cluster and a time point where each data to be written is stored in a certain cluster are both included in storage information, so that a plurality of storage information are generated, cache information is generated through the plurality of storage information as a whole, and storage paths of the plurality of data to be written are indicated through the cache information.
In this embodiment, target data is defined. The target data is any one of the plurality of data to be written, and the target data is written into the target cluster corresponding to it, where the target cluster is the cluster with the highest matching degree with the target data. The matching degree may, for example, require that the target cluster holds the same index as the target data and that the current load state of the target cluster is not high, so that the data can be written.
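For illustration, the following Python sketch shows one possible shape of the cache information of step 102; the field names and identifiers are assumptions, not taken from the application.

```python
# A minimal sketch of the cache information described in step 102: one
# storage-information entry per piece of data to be written, recording the
# target cluster, the first storage location and the planned write time point.
# All field names below are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class StorageInfo:
    target_cluster: str   # cluster with the highest matching degree
    node: str             # first storage location: node ...
    shard: int            # ... and shard within the target cluster
    write_time: float     # time point at which the data is to be written

# cache information: data identifier -> storage information
cache_info = {
    "log-0001": StorageInfo("cluster-1", "node-2", 3, write_time=1715580000.0),
    "log-0002": StorageInfo("cluster-2", "node-1", 0, write_time=1715580005.0),
}
print(cache_info["log-0001"].target_cluster)
```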
And step 103, storing each data to be written in a corresponding first storage position based on the cache information.
In this embodiment, each data to be written is stored in a first storage location of the corresponding cluster through the determined cache information, where the first storage location is a hot node, so that the newly written data is generally stored in the hot node preferentially.
And step 104, migrating the data to be written from the first storage position to the corresponding second storage position under the condition that the storage duration of the data to be written stored in the corresponding first storage position exceeds the preset duration, wherein the query performance of the first storage position corresponding to the target data is higher than that of the second storage position corresponding to the target data.
In this embodiment, the preset duration is the life cycle of the data storage, and the second storage location is a cold node. After data has been stored for one life cycle, it is assumed to be queried less frequently, so the data is moved from the hot node to the cold node. Specifically, the hot-cold cluster architecture scheme separates hot and cold data: data with different access frequencies are stored separately, so that data with high access volume resides on disks with better performance, achieving more reasonable resource allocation and scheduling. The life cycle mainly serves business requirements such as log systems, where the retrieval demand on stored data gradually decreases over time and older data needs to be compressed or deleted; Elasticsearch lifecycle management provides a means of managing such expired data for this scenario.
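As an illustration only, the following Python sketch mimics the hot-to-cold migration of step 104 under simple assumptions (the record structure and the preset duration value are hypothetical); in an actual Elasticsearch deployment this movement would typically be driven by index lifecycle management.

```python
# A minimal sketch of the migration in step 104, assuming a simple in-memory
# record structure. The duration value is an assumption for illustration.
import time

PRESET_DURATION = 7 * 24 * 3600  # assumed life cycle length in seconds

def migrate_expired(records, now=None):
    """Move records whose hot-tier residence time exceeds the preset duration."""
    now = now or time.time()
    for rec in records:
        if rec["tier"] == "hot" and now - rec["written_at"] > PRESET_DURATION:
            rec["tier"] = "cold"  # stand-in for the actual data movement
    return records

records = [{"index": "app-log", "written_at": 0, "tier": "hot"}]
print(migrate_expired(records, now=PRESET_DURATION + 1))
```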
The application provides a data processing method, which comprises the following steps: acquiring a plurality of data to be written and a plurality of state information corresponding to a plurality of clusters, wherein the plurality of clusters are in one-to-one correspondence with the plurality of state information, and the state information is used for indicating the current load state of the corresponding cluster; generating cache information according to the plurality of state information and the plurality of data to be written, wherein the cache information is used for indicating storage information of each data to be written in the plurality of data to be written, and the storage information of target data comprises: the time point when the target data is written into a target cluster and the first storage position when the target data is written into the target cluster, wherein the target data is any one of the plurality of data to be written, and the target cluster is the cluster with the highest matching degree with the target data among the plurality of clusters; storing each data to be written in a corresponding first storage position based on the cache information; and migrating the data to be written from the first storage position to the corresponding second storage position under the condition that the storage duration of the data to be written stored in the corresponding first storage position exceeds a preset duration, wherein the query performance of the first storage position corresponding to the target data is higher than that of the second storage position corresponding to the target data. According to the method and the device, after the plurality of state information corresponding to the plurality of clusters are determined, corresponding cache information is allocated to the plurality of data to be written based on the plurality of state information, so that a storage cluster is accurately allocated to each data to be written, and the data processing efficiency is improved.
In some possible implementations, optionally, the generating cache information according to the plurality of state information and the plurality of data to be written includes:
calculating the plurality of state information and the plurality of data to be written according to a load dynamic evaluation algorithm to obtain a first calculation result, wherein the first calculation result is used for indicating a first storage position corresponding to the data to be written;
Determining a second calculation result according to the first calculation result, wherein the second calculation result is used for indicating a writing time point corresponding to the data to be written;
and generating the cache information according to the first calculation result and the second calculation result.
In this embodiment, as shown in fig. 3, fig. 3 is a flowchart of data writing in the present application. Specifically, with the index as the dimension, data is written into a message buffer queue. The collected operating metrics of the whole cluster are analyzed, including cluster availability, JVM, CPU, memory, shard count, write traffic, thread pool, node state, node type and other information. The load dynamic evaluation algorithm then calculates the cluster, node and shard recommended for the current index write, namely the first calculation result, and further calculates the recommended write time for the cluster, namely the second calculation result.
The load dynamic evaluation algorithm performs weighted min-max normalization on hot-node metrics such as CPU core count, JVM usage, shard count, write volume, write threads and disk usage, converting and mapping each characteristic value into a value between 0 and 1 so that thresholds can be compared when time is allocated:
normalized value = (cur - min) / (max - min)
where cur is the value recorded at the current time, min is the initialization value, and max is the theoretical maximum allowed value of the current metric. For example, the disk initialization value of a cluster is 0.05 and the maximum peak value may be set to 0.85; when 0.85 is exceeded, the disk management limit of Elasticsearch is triggered. Each metric is then given a weight and all parameters are combined, so that a single value is finally obtained for threshold comparison before time-wheel allocation.
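As an illustration of the weighted normalization and scoring described above, the following Python sketch computes a load score for a single hot node; the metric names, bounds and weights are hypothetical examples, not values taken from the application.

```python
# A minimal sketch of the load dynamic evaluation: each hot-node metric is
# min-max normalized to [0, 1] via (cur - min) / (max - min) and combined with
# a weight into a single score used for threshold comparison before
# time-wheel allocation. Metric names, bounds and weights are assumptions.

def normalize(cur: float, init: float, max_allowed: float) -> float:
    """Map the current metric value into [0, 1]; `init` is the initialization
    value and `max_allowed` the theoretical maximum for this metric."""
    if max_allowed <= init:
        return 0.0
    return min(max((cur - init) / (max_allowed - init), 0.0), 1.0)

def load_score(metrics: dict, bounds: dict, weights: dict) -> float:
    """Weighted sum of normalized metrics; higher means a more loaded node."""
    score = 0.0
    for name, cur in metrics.items():
        init, max_allowed = bounds[name]
        score += weights[name] * normalize(cur, init, max_allowed)
    return score

if __name__ == "__main__":
    # Hypothetical snapshot of one hot node.
    metrics = {"cpu": 0.55, "jvm": 0.60, "shards": 800, "disk": 0.42}
    bounds = {"cpu": (0.0, 1.0), "jvm": (0.0, 1.0),
              "shards": (0, 1000), "disk": (0.05, 0.85)}  # disk capped at 0.85 per the text
    weights = {"cpu": 0.3, "jvm": 0.3, "shards": 0.2, "disk": 0.2}
    print(round(load_score(metrics, bounds, weights), 3))
```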
Optionally, the storing each data to be written in a corresponding first storage location based on the cache information includes:
determining a plurality of target time points based on the cache information, wherein the target time points correspond to the data to be written one by one, and the target time points are time points when the corresponding data to be written is written into a cluster;
Generating a time wheel according to the plurality of target time points, wherein the time wheel comprises the plurality of target time points which are sequentially arranged according to a time sequence;
And storing each data to be written in a corresponding first storage position according to the time wheel and the cache information.
In this embodiment, a time wheel is allocated to the write thread that obtains priority, and a time wheel period contains the write time points of a plurality of data to be written, namely the target time points. Within the time wheel period, the write task periodically reads data from the message cache queue and writes it to the designated cluster, node and shard location until the allocated time wheel expires. Metadata information is recorded in the metadata mapping table, including the time range, cluster, node, shard, lifecycle stage and other information. Data that failed to be written is replayed from the queue according to the generation time of the log data, and is written into the target shard and index according to the metadata mapping table. After one time wheel period ends, the judgment for the next time wheel period is performed.
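A minimal Python sketch of the time-wheel-driven write loop described above is given below; the slot count, tick length, queue contents and the stand-in write function are assumptions for illustration, and a real implementation would write through an Elasticsearch client and record metadata in the mapping table on success.

```python
# A minimal sketch of a time wheel whose slots hold write tasks; each task
# drains the message cache queue into its assigned cluster/shard location.
import collections
import time

class TimeWheel:
    def __init__(self, slots: int, tick_seconds: float):
        self.slots = [[] for _ in range(slots)]
        self.tick_seconds = tick_seconds
        self.cursor = 0

    def schedule(self, slot_offset: int, task):
        """Place a write task in the slot that fires `slot_offset` ticks from now."""
        self.slots[(self.cursor + slot_offset) % len(self.slots)].append(task)

    def run_one_round(self):
        """Advance the wheel once around, firing the tasks in each slot."""
        for _ in range(len(self.slots)):
            for task in self.slots[self.cursor]:
                task()
            self.slots[self.cursor] = []
            self.cursor = (self.cursor + 1) % len(self.slots)
            time.sleep(self.tick_seconds)

# Usage sketch with hypothetical data.
queue = collections.deque(["log-1", "log-2"])

def write_batch(cluster="cluster-1", shard=0):
    while queue:
        doc = queue.popleft()
        print(f"write {doc} -> {cluster}/shard-{shard}")  # stand-in for the real write

wheel = TimeWheel(slots=4, tick_seconds=0.01)
wheel.schedule(1, write_batch)
wheel.run_one_round()
```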
Optionally, the storing each data to be written in the corresponding first storage location according to the time wheel and the cache information includes:
Generating a plurality of metadata mapping tables based on the clusters and the cache information, wherein the metadata mapping tables are in one-to-one correspondence with the clusters, and the metadata mapping tables are used for indicating the positions and the storage time of the stored contents of the corresponding clusters;
and storing each data to be written in a corresponding first storage position according to the time wheel and the metadata mapping table information.
In this embodiment, the basic structure of the metadata mapping table information is a multi-linked-list structure. The metadata information of each cluster is stored in a linked list; the linked list records a metadata Map pointer for each index in chronological order of time periods, and the Map uses the index name as the Key and the cluster, node, shard, lifecycle and other information of the index as the Value.
It should be noted that the data structure designed above can accomplish addressing and management of the written data very efficiently: the metadata information of multiple clusters is stored and managed centrally, avoiding problems of synchronization and inconsistency; when an index newly obtains time-wheel write priority, an entry is inserted at the head of the linked list, enabling fast recording of the data; when data needs to be replayed, the linked list can be traversed sequentially to find the corresponding time and query the corresponding write position record in the Map, achieving time-ordered writing of the data; after data enters the deletion phase, the index information in the Map can be deleted directly according to the lifecycle policy stored in the multi-cluster management data, and if no data to be retained remains within the whole time period of a linked-list node, the whole node is deleted.
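The following Python sketch illustrates, under assumed field names, the multi-linked-list structure described above: one linked list per cluster, each node covering a time period and holding a Map keyed by index name; only the head insertion used for fast recording and the traversal used for replay are shown.

```python
# A minimal sketch of the multi-linked-list metadata mapping table.
# Field names are illustrative assumptions.

class PeriodNode:
    def __init__(self, start, end):
        self.start, self.end = start, end
        self.index_map = {}   # index name -> location/lifecycle record
        self.next = None      # next (older) time period

class ClusterMetadata:
    def __init__(self, cluster_name):
        self.cluster_name = cluster_name
        self.head = None      # newest period at the head for fast recording

    def record(self, start, end, index_name, node, shard, phase):
        """Index newly gains write priority: insert a record at the list head."""
        if self.head is None or self.head.start != start:
            period = PeriodNode(start, end)
            period.next = self.head
            self.head = period
        self.head.index_map[index_name] = {
            "cluster": self.cluster_name, "node": node,
            "shard": shard, "phase": phase,
        }

    def lookup(self, timestamp, index_name):
        """Replay path: walk the periods in order and find the write position."""
        cur = self.head
        while cur is not None:
            if cur.start <= timestamp <= cur.end and index_name in cur.index_map:
                return cur.index_map[index_name]
            cur = cur.next
        return None

meta = ClusterMetadata("cluster-1")
meta.record(100, 200, "app-log-2024.05", node="node-2", shard=3, phase="hot")
print(meta.lookup(150, "app-log-2024.05"))
```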
Optionally, the storing each data to be written in a corresponding first storage location according to the time wheel and the metadata mapping tables includes:
Under the condition that the index corresponding to the target data is other than a preset index, determining a first cluster in the plurality of clusters, wherein the first cluster is the cluster with the lowest current load in the plurality of clusters;
writing the target data into the first cluster according to the time wheel and the metadata mapping tables;
Updating the time wheel to obtain a target time wheel under the condition that the index corresponding to the target data is a preset index and the load index of the target cluster is larger than a preset threshold;
Storing the target data in the corresponding first storage position according to the target time wheel and the metadata mapping tables;
Under the condition that the index corresponding to the target data is a preset index and the load index of the target cluster is smaller than a preset threshold, determining a second cluster in the clusters, wherein the second cluster is the cluster with the lowest current load in the clusters;
And writing the target data into the second cluster for storage according to the time wheel and the metadata mapping table information.
In this embodiment, when the target data is to be written into the target cluster, it is necessary to determine whether the index corresponding to the target data is a preset index, where a preset index is an index already present on the current server. Therefore, as shown in fig. 4, it is judged whether the target data belongs to a new index; if it is a new index, a time wheel is allocated directly, and the cluster, node and shard with the lowest load are selected for writing; if it is an existing (stock) index, it is first judged whether the load of the currently written target position exceeds the threshold; if the threshold is exceeded, the time wheel is reallocated and the route is modified so that the data is written to a new target position; if the threshold is not exceeded, the time wheel is automatically extended and writing to the original target position continues. The above steps are repeated until the write task ends.
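The decision flow of fig. 4 described in this paragraph can be summarized by the following Python sketch; the helper names, load scores and the threshold value are hypothetical stand-ins, not values from the application.

```python
# A minimal sketch of the write-routing decision: new index vs. existing index,
# load threshold check, and time-wheel (re)allocation. Threshold is assumed.

LOAD_THRESHOLD = 0.8  # assumed value for illustration

def route_write(index_name, known_indices, cluster_loads, current_target):
    """Return (target_cluster, action) for one write of `index_name`.

    cluster_loads: {cluster_name: load score in [0, 1]} from the evaluation above.
    current_target: cluster currently assigned to this index, if any.
    """
    lowest = min(cluster_loads, key=cluster_loads.get)
    if index_name not in known_indices:
        # New index: allocate a time wheel and pick the least-loaded cluster.
        return lowest, "allocate_time_wheel"
    if cluster_loads[current_target] > LOAD_THRESHOLD:
        # Existing index whose current target is overloaded: re-route.
        return lowest, "reallocate_time_wheel"
    # Existing index, target still healthy: extend the time wheel in place.
    return current_target, "extend_time_wheel"

print(route_write("new-log", {"old-log"}, {"c1": 0.3, "c2": 0.7}, None))
print(route_write("old-log", {"old-log"}, {"c1": 0.9, "c2": 0.4}, "c1"))
```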
Optionally, after the migration of the data to be written from the first storage location to the corresponding second storage location in the case that the storage duration of the data to be written stored in the corresponding first storage location exceeds the preset duration, the method further includes:
Acquiring a data query request, wherein the data query request is used for querying first data in the clusters;
Acquiring storage information in the plurality of clusters, wherein the storage information is used for indicating at least one storage position of the first data in the plurality of clusters;
Screening the at least one storage position according to preset conditions to determine a target storage position, wherein the preset conditions comprise at least one of the following: whether the storage position is within the preset duration, whether the storage position is positioned at the first storage position and whether the storage position is positioned at the second storage position;
and acquiring the first data according to the target storage position.
In this embodiment, for the performance optimization algorithm of the multi-cluster scenario, the algorithm prunes requests using the information recorded in the metadata mapping table and cuts off unnecessary query request links, improving the utilization efficiency of the links across the clusters; in addition, since cold data generally does not change, caches are set up appropriately to improve the query hit rate on cold nodes. Furthermore, an external processing engine is provided for cross-node and cross-cluster scenarios, reducing the pressure on the Elasticsearch coordinating node.
Specifically, as shown in fig. 5, when a data query is executed, high-load or unavailable target nodes are pruned directly, and the request no longer traverses to those nodes; according to the time range recorded in the metadata mapping table, the query continues only if the query time exists in the recorded range; whether the request spans clusters is judged according to the index records in the metadata mapping table; if the request spans clusters, the request is distributed and searched; if the request stays within one cluster, it is matched against the time range and lifecycle of the current index to judge whether it falls in the hot-node interval; if the request falls in the hot interval, it is distributed to the hot nodes for searching; if the request falls in the cold interval, the cold-node historical query cache is consulted, improving the response efficiency of the request. After the above filtering, the remaining requests are those with a larger impact scope, such as requests across lifecycle nodes and across clusters; if the cluster load is judged to be high, the merging, deduplication and sorting of results can optionally be executed in the external engine, preventing the Elasticsearch clusters from rejecting queries.
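The pruning and routing flow of fig. 5 can be sketched as follows; the node records, cold cache and returned plans are illustrative stand-ins for the components described in the text, not an actual implementation.

```python
# A minimal sketch of the query pruning/routing flow: prune bad nodes, filter
# by the recorded time window, handle cross-cluster fan-out, then route to hot
# or cold nodes with a cache for cold hits. All structures are assumptions.

cold_cache = {}  # cold data rarely changes, so results of cold-node hits can be cached

def route_query(query, nodes, metadata, hot_range):
    """Return a routing plan for one query: which nodes to touch and how."""
    # 1. Slow-query pruning: drop unavailable or overloaded nodes up front.
    candidates = [n for n in nodes if n["available"] and n["load"] < 0.9]
    # 2. Time-window filtering from the metadata mapping table.
    if not (metadata["start"] <= query["time"] <= metadata["end"]):
        return {"action": "skip", "reason": "outside recorded time range"}
    # 3. Cross-cluster requests are fanned out and handled separately.
    clusters = {n["cluster"] for n in candidates}
    if len(clusters) > 1:
        return {"action": "distribute", "clusters": sorted(clusters)}
    # 4. In-cluster: route to hot or cold nodes by the index's lifecycle window.
    if hot_range[0] <= query["time"] <= hot_range[1]:
        hot_nodes = [n["name"] for n in candidates if n["tier"] == "hot"]
        return {"action": "search_hot", "nodes": hot_nodes}
    key = (query["index"], query["time"])
    if key in cold_cache:
        return {"action": "cache_hit", "result": cold_cache[key]}
    cold_nodes = [n["name"] for n in candidates if n["tier"] == "cold"]
    return {"action": "search_cold_then_cache", "nodes": cold_nodes}

nodes = [{"name": "n1", "cluster": "c1", "tier": "hot", "available": True, "load": 0.4},
         {"name": "n2", "cluster": "c1", "tier": "cold", "available": True, "load": 0.2}]
meta = {"start": 0, "end": 1000}
print(route_query({"index": "app-log", "time": 950}, nodes, meta, hot_range=(900, 1000)))
print(route_query({"index": "app-log", "time": 100}, nodes, meta, hot_range=(900, 1000)))
```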
It should be noted that the purpose of multi-cluster management is to make a plurality of Elasticsearch clusters behave like a single cluster from the user's perspective, which reduces the cost of using and learning the clusters. A front-end multi-cluster controller is added in front of the Elasticsearch clusters, and the controller uniformly encapsulates the externally exposed controller address. In addition, the multi-cluster controller is also responsible for establishing and distributing the routing rules of the clusters, and for managing and distributing the lifecycle policies, the metadata mapping information and the data collection policies.
As shown in fig. 6, fig. 6 is a second schematic structural diagram of the server according to the present application, which can be divided into four modules: a write control layer, a query control layer, a storage control layer and a management control layer. Specifically, the storage control layer includes monitoring data, metadata and a distributed cache queue. The monitoring data stores, for each Elasticsearch cluster, information such as JVM, I/O, memory, disk capacity, index write traffic, query concurrency, cluster node state, Elasticsearch node state and shards, and serves as the decision basis for allocating time wheels for concurrent multi-cluster writes; the metadata mapping table stores information such as lifecycle policies, cluster states, routing rules, and the mapping relation between indexes and time slices; the distributed cache queue buffers the write traffic to prevent overload from instantaneous data, lets the write tasks that obtain a time wheel consume data into the corresponding routed cluster, and replays data whose write failed.
The write control layer judges the expected write load of a cluster by analyzing the historical monitoring data of the current index, staggers the write traffic of different clusters according to the time-wheel algorithm, and connects seamlessly with the lifecycle policy, thereby keeping cluster writes dynamically load balanced in an optimal state. Meanwhile, using the index-to-time-slice mapping table stored in the metadata, buffered data of failed log writes is replayed and written, in time order, into the index of the designated cluster, ensuring the time-series characteristic of the logs.
The query control layer comprises three management modules: time-window filtering, slow-query pruning and an external processing engine. In a log scenario, data is stored in tiers using the hot-cold architecture, and historical data is migrated from hot nodes to cold nodes by setting a lifecycle policy. For the management of multiple clusters, the application proposes using a plurality of linked lists, each managed and chained by time range, which facilitates access to time-series logs. Queries are pruned using the dimensions managed in the metadata, such as nodes, indexes, shards and lifecycle policies, so that the query range is confined to a specific scope; this prevents query requests from covering all nodes and increasing the load of the whole cluster, which would affect the read and write capability of the cluster. In addition, after data enters a cold node it essentially no longer changes, so query requests that hit a cold node can be fully cached and read directly from the cache the next time the query condition hits. When a query request spans hot nodes, cold nodes or even clusters, the merging, deduplication and sorting of search results can optionally be executed by the external processing engine, avoiding a situation where an overloaded coordinating node affects the concurrent execution efficiency of the whole request.
The management control layer comprises three parts: rule definition, unified entry and lifecycle policy distribution. Rule definition is responsible for defining write, query and index routing rules; the unified entry uniformly encapsulates the externally exposed address; and lifecycle policy distribution uniformly distributes, modifies and deletes the control policies, including lifecycle policies, routing policies, data collection policies and the like.
In the application, a multi-cluster management scheme is built on top of the Elasticsearch lifecycle and the hot-cold architecture. On the write side, an index writing method combining a load dynamic evaluation algorithm with staggered time wheels is introduced, ensuring synchronized, concurrent and cross-link multiplexed writing of data, and a cache-queue replay mechanism is added to guarantee the consistency and time order of written log data. By recording the lifecycle and the write target position of each index in the metadata, the data distribution of the whole cluster can be observed. On the query side, pruning, caching and external-processing-engine techniques for the multi-cluster scenario are provided, limiting the coverage of each request and reducing the resource consumption of a single request, thereby improving the concurrent query performance and capacity of the multiple clusters as a whole. Finally, a complete multi-cluster system management scheme and system are designed based on the above.
In addition, the product of the application has gone live: the technical scheme is currently applied in cloud products and hosts about 200+ multi-cluster nodes in an Internet company log scenario. In some service fields there is a demand for multi-cluster onboarding, with about 180TB of data expected to be accessed in one cycle; the resource pool has already been built and the service access period is about to begin. There is also business access outside China Mobile; for example, the multi-cluster capability of the application can help the cloud log management capability (which calls ES underneath) scale from 30TB to 500TB.
According to the method and the device, after the plurality of state information corresponding to the plurality of clusters are determined, corresponding cache information is distributed for the plurality of data to be written based on the plurality of state information, so that the clusters for storing are accurately distributed for each data to be written, and the data processing efficiency is improved.
Referring to fig. 7, fig. 7 is a block diagram of a data processing apparatus according to an embodiment of the present application. As shown in fig. 7, the data processing apparatus 700 includes:
The obtaining module 710 is configured to obtain a plurality of data to be written and a plurality of state information corresponding to a plurality of clusters, where the plurality of clusters are in one-to-one correspondence with the plurality of state information, and the state information is used to indicate a current load state of the corresponding cluster;
The generating module 720 is configured to generate cache information according to the plurality of state information and the plurality of data to be written, where the cache information is used to indicate storage information of each data to be written in the plurality of data to be written, and the storage information of the target data includes: the time point when the target data is written into a target cluster and the first storage position when the target data is written into the target cluster, wherein the target data is any one of the plurality of data to be written, and the target cluster is the cluster with the highest matching degree with the target data in the plurality of clusters;
A writing module 730, configured to store each data to be written in a corresponding first storage location based on the cache information;
And the migration module 740 is configured to migrate the data to be written from the first storage location to the corresponding second storage location if the storage duration of the data to be written stored in the corresponding first storage location exceeds a preset duration, where the query performance of the first storage location corresponding to the target data is higher than the query performance of the second storage location corresponding to the target data.
Optionally, the generating module 720 includes:
The computing sub-module is used for computing the plurality of state information and the plurality of data to be written according to a load dynamic evaluation algorithm to obtain a first computing result, wherein the first computing result is used for indicating a first storage position corresponding to the data to be written;
The first determining submodule is used for determining a second calculation result according to the first calculation result, and the second calculation result is used for indicating a writing time point corresponding to the data to be written;
And the first generation sub-module is used for generating the cache information according to the first calculation result and the second calculation result.
Optionally, the writing module 730 includes:
The second determining submodule is used for determining a plurality of target time points based on the cache information, the target time points are in one-to-one correspondence with the data to be written, and the target time points are time points when the corresponding data to be written are written into the cluster;
the second generation sub-module is used for generating a time wheel according to the plurality of target time points, and the time wheel comprises the plurality of target time points which are sequentially arranged according to a time sequence;
And the writing sub-module is used for storing each data to be written in a corresponding first storage position according to the time wheel and the cache information.
Optionally, the writing submodule includes:
the generating unit is used for generating a plurality of metadata mapping tables based on the clusters and the cache information, the metadata mapping tables are in one-to-one correspondence with the clusters, and the metadata mapping tables are used for indicating the positions and the storage time of the stored contents of the corresponding clusters;
And the writing unit is used for storing each data to be written in the corresponding first storage position according to the time wheel and the metadata mapping table information.
Optionally, the writing unit is further configured to:
Under the condition that the index corresponding to the target data is other than a preset index, determining a first cluster in the plurality of clusters, wherein the first cluster is the cluster with the lowest current load in the plurality of clusters;
writing the target data into the first cluster according to the time wheel and the metadata mapping tables;
Updating the time wheel to obtain a target time wheel under the condition that the index corresponding to the target data is a preset index and the load index of the target cluster is larger than a preset threshold;
Storing the target data in the corresponding first storage position according to the target time wheel and the metadata mapping tables;
Under the condition that the index corresponding to the target data is a preset index and the load index of the target cluster is smaller than a preset threshold, determining a second cluster in the clusters, wherein the second cluster is the cluster with the lowest current load in the clusters;
And writing the target data into the second cluster for storage according to the time wheel and the metadata mapping table information.
Optionally, the method further comprises:
The query module is used for acquiring a data query request, wherein the data query request is used for querying first data in the plurality of clusters;
An information acquisition module for acquiring storage information in the plurality of clusters, wherein the storage information is used for indicating at least one storage position of the first data in the plurality of clusters;
The screening module is used for screening the at least one storage position according to preset conditions to determine a target storage position, wherein the preset conditions comprise at least one of the following: whether the storage position is within the preset duration, whether the storage position is positioned at the first storage position and whether the storage position is positioned at the second storage position;
And the data acquisition module is used for acquiring the first data according to the target storage position.
According to the method and the device, after the plurality of state information corresponding to the plurality of clusters are determined, corresponding cache information is distributed for the plurality of data to be written based on the plurality of state information, so that the clusters for storing are accurately distributed for each data to be written, and the data processing efficiency is improved.
The embodiment of the application also provides electronic equipment. Referring to fig. 8, the electronic device may include a processor 801, a memory 802, and a program 8021 stored on the memory 802 and executable on the processor 801.
The program 8021, when executed by the processor 801, may implement any of the steps in the corresponding method embodiment of fig. 1:
acquiring a plurality of data to be written and a plurality of state information corresponding to a plurality of clusters, wherein the plurality of clusters are in one-to-one correspondence with the plurality of state information, and the state information is used for indicating the current load state of the corresponding clusters;
Generating cache information according to the plurality of state information and the plurality of data to be written, wherein the cache information is used for indicating storage information of each data to be written in the plurality of data to be written, and the storage information of target data comprises: the time point when the target data is written into a target cluster and the first storage position when the target data is written into the target cluster, wherein the target data is any one of the plurality of data to be written, and the target cluster is the cluster with the highest matching degree with the target data in the plurality of clusters;
Storing each data to be written in a corresponding first storage position based on the cache information;
And migrating the data to be written from the first storage position to the corresponding second storage position under the condition that the storage duration of the data to be written stored in the corresponding first storage position exceeds a preset duration, wherein the query performance of the first storage position corresponding to the target data is higher than that of the second storage position corresponding to the target data.
Optionally, the generating cache information according to the plurality of state information and the plurality of data to be written includes:
calculating the plurality of state information and the plurality of data to be written according to a load dynamic evaluation algorithm to obtain a first calculation result, wherein the first calculation result is used for indicating a first storage position corresponding to the data to be written;
Determining a second calculation result according to the first calculation result, wherein the second calculation result is used for indicating a writing time point corresponding to the data to be written;
and generating the cache information according to the first calculation result and the second calculation result.
Optionally, the storing each data to be written in a corresponding first storage location based on the cache information includes:
determining a plurality of target time points based on the cache information, wherein the target time points correspond to the data to be written one by one, and the target time points are time points when the corresponding data to be written is written into a cluster;
Generating a time wheel according to the plurality of target time points, wherein the time wheel comprises the plurality of target time points which are sequentially arranged according to a time sequence;
And storing each data to be written in a corresponding first storage position according to the time wheel and the cache information.
Optionally, the storing each data to be written in the corresponding first storage location according to the time wheel and the cache information includes:
Generating a plurality of metadata mapping tables based on the clusters and the cache information, wherein the metadata mapping tables are in one-to-one correspondence with the clusters, and the metadata mapping tables are used for indicating the positions and the storage time of the stored contents of the corresponding clusters;
and storing each data to be written in a corresponding first storage position according to the time wheel and the metadata mapping table information.
Optionally, the storing each data to be written in a corresponding first storage location according to the time wheel and the metadata mapping tables includes:
Under the condition that the index corresponding to the target data is an index other than a preset index, determining a first cluster among the plurality of clusters, wherein the first cluster is the cluster with the lowest current load among the plurality of clusters;
writing the target data into the first cluster according to the time wheel and the plurality of metadata mapping tables;
Updating the time wheel to obtain a target time wheel under the condition that the index corresponding to the target data is the preset index and the load index of the target cluster is greater than a preset threshold;
Storing the target data in the corresponding first storage position according to the target time wheel and the plurality of metadata mapping tables;
Under the condition that the index corresponding to the target data is the preset index and the load index of the target cluster is smaller than the preset threshold, determining a second cluster among the plurality of clusters, wherein the second cluster is the cluster with the lowest current load among the plurality of clusters;
And writing the target data into the second cluster for storage according to the time wheel and the plurality of metadata mapping tables.
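The three cases above (index other than the preset index; preset index with the target cluster above the threshold; preset index with the target cluster below the threshold) can be expressed as a single dispatch function. The application leaves the preset index, the load threshold and the rescheduling step abstract, so the marker value, the threshold and the reschedule helper below are placeholders:

```python
PRESET_INDEX = "preset"   # placeholder marker value
LOAD_THRESHOLD = 0.8      # placeholder preset threshold

def reschedule(time_wheel: dict, data_id: str, delay: float = 5.0) -> dict:
    """Placeholder for updating the time wheel: push this entry a bit later."""
    time_wheel[data_id] = time_wheel.get(data_id, 0.0) + delay
    return time_wheel

def choose_write_target(item: dict, clusters: list[dict],
                        target_cluster: dict, time_wheel: dict):
    """Mirror the three cases; returns (cluster id to write to, time wheel)."""
    least_loaded = min(clusters, key=lambda c: c["load"])

    if item["index"] != PRESET_INDEX:
        # case 1: index other than the preset index -> first cluster (lowest load)
        return least_loaded["id"], time_wheel

    if target_cluster["load"] > LOAD_THRESHOLD:
        # case 2: preset index, target cluster above the threshold ->
        # update the time wheel and keep the planned first storage position
        return target_cluster["id"], reschedule(time_wheel, item["id"])

    # case 3: preset index, target cluster below the threshold ->
    # second cluster (again the cluster with the lowest current load)
    return least_loaded["id"], time_wheel
```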
Optionally, after the data to be written is migrated from the first storage position to the corresponding second storage position when its storage duration in the corresponding first storage position exceeds the preset duration, the method further includes:
Acquiring a data query request, wherein the data query request is used for querying first data in the clusters;
Acquiring storage information in the plurality of clusters, wherein the storage information is used for indicating at least one storage position of the first data in the plurality of clusters;
Screening the at least one storage position according to preset conditions to determine a target storage position, wherein the preset conditions comprise at least one of the following: whether the data at the storage position is within the preset duration, whether the storage position is the first storage position, and whether the storage position is the second storage position;
and acquiring the first data according to the target storage position.
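A corresponding query path might collect every position holding the requested data, screen the candidates against the preset conditions, and return the selected target position from which the first data is then read; the ranking order (within-duration first, then the first storage position) is an assumption, since the text only lists the conditions without a priority:

```python
def screen_positions(candidates: list[dict], preset_duration: float, now: float):
    """Pick the target storage position among the candidates for the first data."""
    def rank(position: dict):
        within = (now - position["stored_at"]) <= preset_duration
        in_first = position["tier"] == "first"
        return (not within, not in_first)   # prefer fresh, then the first position
    return min(candidates, key=rank) if candidates else None

def query_first_data(clusters: list[dict], data_id: str,
                     preset_duration: float, now: float):
    """Gather the storage information across clusters, screen it, and return
    the target storage position from which the first data is then read."""
    candidates = []
    for cluster in clusters:
        meta = cluster["metadata"].get(data_id)
        if meta is not None:
            candidates.append(meta)
    return screen_positions(candidates, preset_duration, now)
```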
According to the method and the device, after the plurality of state information corresponding to the plurality of clusters is determined, corresponding cache information is allocated to the plurality of data to be written based on the plurality of state information, so that a suitable cluster for storage is accurately allocated to each data to be written, and data processing efficiency is improved.
The embodiment of the application also provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, it implements the respective processes of the above data processing method embodiment and can achieve the same technical effects; to avoid repetition, the description is not repeated here. The computer-readable storage medium is, for example, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The embodiment of the present application further provides a computer program product. The computer program product is stored in a storage medium and is executed by at least one processor to implement each process of the above data processing method embodiment with the same technical effects; to avoid repetition, no further description is given here.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
From the above description of the embodiments, it will be clear to those skilled in the art that the method of the above embodiments may be implemented by means of software plus a necessary general hardware platform, or by hardware alone, although in many cases the former is the preferred implementation. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the method according to the embodiments of the present application.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive. Many further forms may be made by those of ordinary skill in the art without departing from the spirit of the present application and the scope of the claims, all of which fall within the protection of the present application.

Claims (10)

1. A method of data processing, the method comprising:
acquiring a plurality of data to be written and a plurality of state information corresponding to a plurality of clusters, wherein the plurality of clusters are in one-to-one correspondence with the plurality of state information, and the state information is used for indicating the current load state of the corresponding clusters;
Generating cache information according to the plurality of state information and the plurality of data to be written, wherein the cache information is used for indicating storage information of each data to be written in the plurality of data to be written, and the storage information of target data comprises: the time point at which the target data is written into a target cluster and the first storage position at which the target data is written in the target cluster, wherein the target data is any one of the plurality of data to be written, and the target cluster is the cluster, among the plurality of clusters, with the highest matching degree with the target data;
Storing each data to be written in a corresponding first storage position based on the cache information;
And under the condition that the storage duration of the data to be written stored in the corresponding first storage position exceeds a preset duration, migrating the data to be written from the first storage position to the corresponding second storage position, wherein the query performance of the first storage position corresponding to the target data is higher than that of the second storage position corresponding to the target data.
2. The method of claim 1, wherein generating cache information from the plurality of state information and the plurality of data to be written comprises:
calculating the plurality of state information and the plurality of data to be written according to a dynamic load evaluation algorithm to obtain a first calculation result, wherein the first calculation result is used for indicating the first storage position corresponding to each data to be written;
Determining a second calculation result according to the first calculation result, wherein the second calculation result is used for indicating a writing time point corresponding to the data to be written;
and generating the cache information according to the first calculation result and the second calculation result.
3. The method of claim 2, wherein storing each of the data to be written in the corresponding first storage location based on the cache information comprises:
determining a plurality of target time points based on the cache information, wherein the target time points are in one-to-one correspondence with the data to be written, and each target time point is the time point at which the corresponding data to be written is written into a cluster;
Generating a time wheel according to the plurality of target time points, wherein the time wheel comprises the plurality of target time points arranged in chronological order;
And storing each data to be written in a corresponding first storage position according to the time wheel and the cache information.
4. A method according to claim 3, wherein storing each of the data to be written in the corresponding first storage location according to the time wheel and the cache information comprises:
Generating a plurality of metadata mapping tables based on the plurality of clusters and the cache information, wherein the metadata mapping tables are in one-to-one correspondence with the clusters, and each metadata mapping table is used for indicating the storage positions and storage times of the content stored in the corresponding cluster;
and storing each data to be written in the corresponding first storage position according to the time wheel and the plurality of metadata mapping tables.
5. The method of claim 4, wherein storing each of the data to be written in the corresponding first storage location according to the time wheel and the plurality of metadata mapping tables, comprises:
Under the condition that the index corresponding to the target data is an index other than a preset index, determining a first cluster among the plurality of clusters, wherein the first cluster is the cluster with the lowest current load among the plurality of clusters;
writing the target data into the first cluster according to the time wheel and the plurality of metadata mapping tables;
Updating the time wheel to obtain a target time wheel under the condition that the index corresponding to the target data is the preset index and the load index of the target cluster is greater than a preset threshold;
Storing the target data in the corresponding first storage position according to the target time wheel and the plurality of metadata mapping tables;
Under the condition that the index corresponding to the target data is the preset index and the load index of the target cluster is smaller than the preset threshold, determining a second cluster among the plurality of clusters, wherein the second cluster is the cluster with the lowest current load among the plurality of clusters;
And writing the target data into the second cluster for storage according to the time wheel and the plurality of metadata mapping tables.
6. The method according to claim 1, wherein, after the data to be written is migrated from the first storage position to the corresponding second storage position when its storage duration in the corresponding first storage position exceeds a preset duration, the method further comprises:
Acquiring a data query request, wherein the data query request is used for querying first data in the clusters;
Acquiring storage information in the plurality of clusters, wherein the storage information is used for indicating at least one storage position of the first data in the plurality of clusters;
Screening the at least one storage position according to preset conditions to determine a target storage position, wherein the preset conditions comprise at least one of the following: whether the data at the storage position is within the preset duration, whether the storage position is the first storage position, and whether the storage position is the second storage position;
and acquiring the first data according to the target storage position.
7. A data processing apparatus, the apparatus comprising:
an acquisition module, used for acquiring a plurality of data to be written and a plurality of state information corresponding to a plurality of clusters, wherein the plurality of clusters are in one-to-one correspondence with the plurality of state information, and the state information is used for indicating the current load state of the corresponding cluster;
the generating module is configured to generate cache information according to the plurality of state information and the plurality of data to be written, where the cache information is used to indicate storage information of each data to be written in the plurality of data to be written, and the storage information of the target data includes: the time point when the target data is written into a target cluster and the first storage position when the target data is written into the target cluster, wherein the target data is any one of the plurality of data to be written, and the target cluster is the cluster with the highest matching degree with the target data in the plurality of clusters;
The writing module is used for storing each data to be written in a corresponding first storage position based on the cache information;
The migration module is used for migrating the data to be written from the first storage position to the corresponding second storage position under the condition that the storage duration of the data to be written in the corresponding first storage position exceeds the preset duration, wherein the query performance of the first storage position corresponding to the target data is higher than that of the second storage position corresponding to the target data.
8. An electronic device, comprising: a memory, a processor, and a program stored on the memory and executable on the processor; characterized in that the processor is configured to read the program in the memory to implement the steps in the data processing method according to any one of claims 1 to 6.
9. A readable storage medium storing a program, wherein the program when executed by a processor implements the steps in the data processing method according to any one of claims 1 to 6.
10. A computer program product, characterized in that the computer program product is stored in a storage medium, and the computer program product is executed by at least one processor to implement the steps in the data processing method according to any one of claims 1 to 6.
CN202410579112.0A 2024-05-11 2024-05-11 Data processing method and device and related equipment Pending CN118170737A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410579112.0A CN118170737A (en) 2024-05-11 2024-05-11 Data processing method and device and related equipment

Publications (1)

Publication Number Publication Date
CN118170737A true CN118170737A (en) 2024-06-11

Family

ID=91350812

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410579112.0A Pending CN118170737A (en) 2024-05-11 2024-05-11 Data processing method and device and related equipment

Country Status (1)

Country Link
CN (1) CN118170737A (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109726191A (en) * 2018-12-12 2019-05-07 中国联合网络通信集团有限公司 A kind of processing method and system across company-data, storage medium
CN109726004A (en) * 2017-10-27 2019-05-07 中移(苏州)软件技术有限公司 A kind of data processing method and device
CN111045598A (en) * 2019-10-10 2020-04-21 深圳市金泰克半导体有限公司 Data storage method and device
CN111913917A (en) * 2020-07-24 2020-11-10 北京锐安科技有限公司 File processing method, device, equipment and medium
CN113778746A (en) * 2021-08-11 2021-12-10 北京金山云网络技术有限公司 Time sequence database cluster data processing method, device, medium and electronic equipment
CN116610748A (en) * 2023-04-17 2023-08-18 平安科技(深圳)有限公司 Data writing method, terminal equipment and computer readable storage medium
WO2023185071A1 (en) * 2022-03-31 2023-10-05 北京沃东天骏信息技术有限公司 Data query method, data writing method, related apparatus and system
CN117130998A (en) * 2023-08-25 2023-11-28 北京火山引擎科技有限公司 Log information processing method, device, equipment and storage medium
CN117435569A (en) * 2023-09-12 2024-01-23 中国工商银行股份有限公司 Dynamic capacity expansion method, device, equipment, medium and program product for cache system
CN117544642A (en) * 2022-08-02 2024-02-09 腾讯科技(深圳)有限公司 Capacity expansion method, device and equipment of storage system and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination