CN117008818A - Data processing method, apparatus, computer device, and computer readable storage medium - Google Patents


Info

Publication number
CN117008818A
CN117008818A
Authority
CN
China
Prior art keywords
database
storage
sub
determining
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211233181.3A
Other languages
Chinese (zh)
Inventor
蒋楠
严石伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202211233181.3A priority Critical patent/CN117008818A/en
Publication of CN117008818A publication Critical patent/CN117008818A/en
Pending legal-status Critical Current


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 - Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 - Interfaces specially adapted for storage systems
    • G06F 3/0602 - Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F 3/061 - Improving I/O performance
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/21 - Design, administration or maintenance of databases
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/27 - Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 - Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 - Interfaces specially adapted for storage systems
    • G06F 3/0628 - Interfaces specially adapted for storage systems making use of a particular technique
    • G06F 3/0629 - Configuration or reconfiguration of storage systems
    • G06F 3/0631 - Configuration or reconfiguration of storage systems by allocating resources to storage systems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 - Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 - Interfaces specially adapted for storage systems
    • G06F 3/0628 - Interfaces specially adapted for storage systems making use of a particular technique
    • G06F 3/0638 - Organizing or formatting or addressing of data
    • G06F 3/0644 - Management of space entities, e.g. partitions, extents, pools
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 - Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 - Interfaces specially adapted for storage systems
    • G06F 3/0668 - Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F 3/067 - Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The present application relates to a data processing method, apparatus, computer device, computer-readable storage medium, and computer program product, which can be applied to scenarios such as cloud technology, artificial intelligence, intelligent transportation, and assisted driving. The method includes the following steps: acquiring a database to be stored, and determining a server cluster for storing the database to be stored; dividing the database to be stored into a plurality of data sub-databases, and determining, according to the parameter information of each data sub-database, a sharding algorithm and candidate storage nodes matched with that sub-database, where the device type of each candidate storage node matches the data sub-database; determining, based on the sharding algorithm matched with each data sub-database, the storage location mapping information corresponding to that sub-database, and determining, from the candidate storage nodes, the target storage node corresponding to each piece of storage location mapping information; and storing each data sub-database to its corresponding target storage node. By adopting the method, data processing efficiency can be improved.

Description

Data processing method, apparatus, computer device, and computer readable storage medium
Technical Field
The present application relates to the field of computer technology, and in particular, to a data processing method, apparatus, computer device, computer readable storage medium, and computer program product.
Background
With the rapid development of computer technology, the explosive growth of data size has created unprecedented challenges for data processing.
In a conventional data processing method, the storage node for a database to be stored is determined from the identification information of the database and a mapping between identification information and storage nodes, and the database is then stored on that node. With this approach, the performance of the data processing device may degrade when the data volume of the database becomes too large, so data processing efficiency is low.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a data processing method, apparatus, computer device, computer-readable storage medium, and computer program product that can improve data processing efficiency.
In a first aspect, the present application provides a data processing method. The method comprises the following steps:
acquiring a database to be stored, and determining a server cluster for storing the database to be stored; the server cluster comprises a plurality of storage nodes of at least two device types;
dividing the database to be stored into a plurality of data sub-databases, and determining, according to the parameter information of each data sub-database, a sharding algorithm and candidate storage nodes matched with that data sub-database; the device type of each candidate storage node matches the data sub-database;
determining, based on the sharding algorithm matched with each data sub-database, storage location mapping information corresponding to that data sub-database, and determining, from the candidate storage nodes, a target storage node corresponding to each piece of storage location mapping information; and
storing each data sub-database to the target storage node corresponding to it.
In a second aspect, the present application provides a data processing apparatus. The device comprises:
an acquisition module, configured to acquire a database to be stored and determine a server cluster for storing the database to be stored; the server cluster comprises a plurality of storage nodes of at least two device types;
a sharding algorithm and candidate storage node determining module, configured to divide the database to be stored into a plurality of data sub-databases and determine, according to the parameter information of each data sub-database, a sharding algorithm and candidate storage nodes matched with that data sub-database; the device type of each candidate storage node matches the data sub-database;
a target storage node determining module, configured to determine, based on the sharding algorithm matched with each data sub-database, storage location mapping information corresponding to that data sub-database, and determine, from the candidate storage nodes, a target storage node corresponding to each piece of storage location mapping information; and
a data storage module, configured to store each data sub-database to the target storage node corresponding to it.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor which, when executing the computer program, implements the following steps:
acquiring a database to be stored, and determining a server cluster for storing the database to be stored; the server cluster comprises a plurality of storage nodes of at least two device types;
dividing the database to be stored into a plurality of data sub-databases, and determining, according to the parameter information of each data sub-database, a sharding algorithm and candidate storage nodes matched with that data sub-database; the device type of each candidate storage node matches the data sub-database;
determining, based on the sharding algorithm matched with each data sub-database, storage location mapping information corresponding to that data sub-database, and determining, from the candidate storage nodes, a target storage node corresponding to each piece of storage location mapping information; and
storing each data sub-database to the target storage node corresponding to it.
In a fourth aspect, the present application also provides a computer-readable storage medium. The computer-readable storage medium stores a computer program which, when executed by a processor, implements the following steps:
acquiring a database to be stored, and determining a server cluster for storing the database to be stored; the server cluster comprises a plurality of storage nodes of at least two device types;
dividing the database to be stored into a plurality of data sub-databases, and determining, according to the parameter information of each data sub-database, a sharding algorithm and candidate storage nodes matched with that data sub-database; the device type of each candidate storage node matches the data sub-database;
determining, based on the sharding algorithm matched with each data sub-database, storage location mapping information corresponding to that data sub-database, and determining, from the candidate storage nodes, a target storage node corresponding to each piece of storage location mapping information; and
storing each data sub-database to the target storage node corresponding to it.
In a fifth aspect, the present application also provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the steps of:
acquiring a database to be stored, and determining a server cluster for storing the database to be stored; the server cluster comprises a plurality of storage nodes of at least two device types;
dividing the database to be stored into a plurality of data sub-databases, and determining, according to the parameter information of each data sub-database, a sharding algorithm and candidate storage nodes matched with that data sub-database; the device type of each candidate storage node matches the data sub-database;
determining, based on the sharding algorithm matched with each data sub-database, storage location mapping information corresponding to that data sub-database, and determining, from the candidate storage nodes, a target storage node corresponding to each piece of storage location mapping information; and
storing each data sub-database to the target storage node corresponding to it.
According to the data processing method, apparatus, computer device, computer-readable storage medium, and computer program product above, a database to be stored is acquired, and a server cluster comprising a plurality of storage nodes of at least two device types is determined for storing it. The database to be stored is then divided into a plurality of data sub-databases, and each data sub-database is stored on a different storage node, which, to a certain extent, avoids the performance degradation of the data processing device caused by an excessively large data volume and thus improves data processing efficiency. Moreover, because a sharding algorithm and candidate storage nodes are matched to each data sub-database according to its parameter information during sharded storage, resources can be deployed in a targeted manner according to the characteristics of each data sub-database. This ensures a good match between storage nodes and data sub-databases, improves the resource utilization of each storage node, and further improves data processing efficiency.
Drawings
FIG. 1 is a diagram of an application environment for a data processing method in some embodiments;
FIG. 2 is a flow chart of a data processing method in some embodiments;
FIG. 3 is a schematic diagram illustrating a determination process of a target storage node in some embodiments without setting a new virtual node;
FIG. 4 is a schematic diagram illustrating a determination process of a target storage node in some embodiments when a new virtual node is set;
FIG. 5 is a schematic deployment diagram of a business hash ring in some embodiments;
FIG. 6 is a flow chart of a data processing method in other embodiments;
FIG. 7 is a schematic diagram of a data processing process in some embodiments;
FIG. 8 is a schematic diagram of a multi-dimensional slicing mechanism in some embodiments;
FIG. 9 is a schematic diagram of a process for a copy replication strategy in some embodiments;
FIG. 10 is a block diagram of a data processing apparatus in some embodiments;
FIG. 11 is an internal block diagram of a computer device in some embodiments.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
Cloud technology (Cloud technology) refers to a hosting technology for integrating hardware, software, network and other series resources in a wide area network or a local area network to realize calculation, storage, processing and sharing of data.
Cloud technology is a general term for the network, information, integration, management-platform, and application technologies applied under the cloud computing business model. These technologies can form a resource pool that is used on demand, flexibly and conveniently, and cloud computing will become an important supporting technology. Background services of technical network systems require a large amount of computing and storage resources, for example video websites, picture websites, and additional portal websites. With the continued development of the internet industry, each item may in the future carry its own identification mark, which must be transmitted to a background system for logical processing; data of different levels will be processed separately, and industry data of all kinds requires strong back-end system support, which can only be realized through cloud computing.
Cloud storage (cloud storage) is a new concept extended and developed from cloud computing. A distributed cloud storage system (hereinafter referred to simply as a storage system) is a storage system that, through functions such as cluster applications, grid technology, and distributed storage file systems, integrates a large number of storage devices of various types in a network (storage devices are also referred to as storage nodes) to work cooperatively via application software or application interfaces, jointly providing data storage and service access functions to the outside.
At present, the storage method of such a storage system is as follows: when logical volumes are created, each logical volume is allocated physical storage space, which may be composed of the disks of one or several storage devices. A client stores data on a certain logical volume, that is, the data is stored on a file system. The file system divides the data into a plurality of parts, each part being an object; an object contains not only the data itself but also additional information such as a data identifier (ID). The file system writes each object into the physical storage space of the logical volume and records the storage location information of each object, so that when the client requests access to the data, the file system can let the client access the data according to the recorded storage location information.
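The write path just described, dividing data into identified objects and recording where each one lands, can be modeled with a toy sketch. Every name here is hypothetical; real storage systems track far more metadata.

```python
import uuid

class LogicalVolume:
    """Toy model of the described write path: data is split into objects,
    each object gets an identifier (ID), and the 'file system' records the
    storage location of each object for later access."""

    def __init__(self, chunk_size: int):
        self.chunk_size = chunk_size
        self.space = {}        # physical storage space: offset -> bytes
        self.locations = {}    # object ID -> offset (storage location info)
        self._next_offset = 0

    def write(self, data: bytes) -> list:
        """Split data into objects and record each object's location."""
        obj_ids = []
        for i in range(0, len(data), self.chunk_size):
            obj_id = str(uuid.uuid4())
            self.space[self._next_offset] = data[i:i + self.chunk_size]
            self.locations[obj_id] = self._next_offset
            self._next_offset += self.chunk_size
            obj_ids.append(obj_id)
        return obj_ids

    def read(self, obj_ids: list) -> bytes:
        """Reassemble the data from the recorded object locations."""
        return b"".join(self.space[self.locations[oid]] for oid in obj_ids)

vol = LogicalVolume(chunk_size=4)
ids = vol.write(b"hello world!")
restored = vol.read(ids)
```

The round trip through `write` and `read` mirrors how the client's access request is served purely from the recorded storage location information.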
The process by which the storage system allocates physical storage space to a logical volume is specifically as follows: the physical storage space is divided into stripes in advance according to an estimate of the capacity of the objects to be stored on the logical volume (an estimate that usually leaves a large margin relative to the capacity actually needed) and the configuration of the redundant array of independent disks (RAID, Redundant Array of Independent Disks). A logical volume can be understood as a stripe, and physical storage space is thereby allocated to the logical volume.
Public cloud (Public cloud) generally refers to a cloud that a third-party provider offers for users to use. A public cloud is generally available over the public Internet and may be free or low-cost; its core attribute is shared resource services. Today there are many such clouds providing services over the open public network.
The data processing method provided by the embodiments of the present application can be applied to the application environment shown in FIG. 1. The server 104 may communicate with the terminal 102 and the server cluster 106 over a network. The server cluster 106 contains a plurality of storage nodes of at least two device types and can store the data that the server 104 needs to process. The terminal 102 includes, but is not limited to, a mobile phone, a computer, an intelligent voice interaction device, a smart home appliance, a vehicle-mounted terminal, and the like. The server 104 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, big data, and artificial intelligence platforms. The terminal 102 and the server 104, and the server 104 and the server cluster 106, may be directly or indirectly connected through wired or wireless communication, which is not limited in the present application.
Specifically, the server 104 performs the following data processing: acquiring a database to be stored, and determining a server cluster comprising a plurality of storage nodes of at least two device types for storing the database to be stored; dividing the database to be stored into a plurality of data sub-databases, and determining, according to the parameter information of each data sub-database, a sharding algorithm and candidate storage nodes matched with that sub-database, where the device type of each candidate storage node matches the data sub-database; determining, based on the sharding algorithm matched with each data sub-database, the storage location mapping information corresponding to that sub-database, and determining, from the candidate storage nodes, the target storage node corresponding to each piece of storage location mapping information; and storing each data sub-database to its corresponding target storage node. It should be noted that, in some embodiments, where the data processing capability of the terminal 102 meets the requirement, the above data processing method may also be performed by the terminal 102 connected to the server cluster 106 over the network.
In some embodiments, as shown in FIG. 2, a data processing method is provided. The method may be performed by a terminal or a server, or by the terminal and the server together. Taking its application to the server 104 in FIG. 1 as an example, the method includes the following steps:
Step S202, a database to be stored is obtained, and a server cluster for storing the database to be stored is determined.
The database to be stored refers to a database whose data needs to be stored. A database can be considered an electronic filing cabinet, that is, a place for storing electronic files, in which users can add, query, update, and delete data. A database is a collection of data that is stored together in a way that can be shared by multiple users, has as little redundancy as possible, and is independent of any particular application. The database to be stored may be a relational or non-relational database, and may be used only for data backup or serve as a retrieval database for data retrieval. In short, the present application does not limit the specific type or use of the database to be stored.
A server cluster is a cluster of multiple servers integrated to perform the same service; that is, it contains multiple server nodes, and a higher computing speed can be obtained by performing parallel computation across them. In the present application, the server cluster is used for data storage and retrieval services, specifically to store the database to be stored and to perform data retrieval based on the stored data. Further, the server cluster includes a plurality of storage nodes of at least two device types, such as CPU (Central Processing Unit) nodes or GPU (Graphics Processing Unit) nodes. When the data to be stored is video data, the server cluster may further include VPU (Video Processing Unit) nodes.
Specifically, the server acquires a database to be stored and determines a server cluster for storing it, so that the database can subsequently be stored in partitions across different storage nodes in the cluster. The server may acquire the database to be stored either actively or by passive reception. The server may determine, according to the network environment in which it is located, a server cluster that has a network connection with it as the server cluster for storing the database to be stored.
Step S204: divide the database to be stored into a plurality of data sub-databases, and determine, according to the parameter information of each data sub-database, the sharding algorithm and candidate storage nodes matched with that sub-database.
The parameter information of a data sub-database includes, but is not limited to, its data volume and usage parameters. The usage parameters may include the frequency of use, the type of object to which the user belongs, and so on.
Specifically, the server acquires the database to be stored and divides it into a plurality of data sub-databases based on a database-partitioning algorithm, which may be horizontal partitioning, vertical partitioning, or the like. In one embodiment, based on the target field information of each piece of data in the database to be stored, the server assigns the data corresponding to each target field value to a sub-database according to a hash calculation result or the data range of that value, thereby obtaining a plurality of sub-databases with different structures and data whose union is the complete database to be stored.
In one embodiment, the server assigns each piece of data to a sub-database according to the attributes of the data in the database to be stored and the correspondence between attributes and sub-database identification information. The attributes of the data include, but are not limited to, basic attributes and business attributes. Basic attributes may include the geographic, format, and category attributes of the data. Business attributes may include the types of business the data supports, the type of object to which the data consumer belongs, and so on. Business types may include viewing, querying, editing, and the like. Object types may be divided according to basic object attributes such as importance, age, and region, or according to object interaction attributes. Taking importance as an example, object types may include head objects, whose object attributes satisfy an importance condition, and mid-long-tail objects, whose object attributes do not. The importance condition may be determined from object interaction attributes such as the user's resource transfer amount, frequency of database use, and usage requirements. In one specific application, the database to be stored is a retrieval database, and users who require real-time retrieval and use a relatively large database are determined to be head objects.
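The attribute-based assignment and the head versus mid-long-tail split might look like the sketch below. The thresholds, category names, and library identifiers are invented for illustration; the patent does not specify them.

```python
def classify_object(resource_transfer: float, usage_freq: float,
                    transfer_threshold: float = 1000.0,
                    freq_threshold: float = 50.0) -> str:
    """Head objects are those whose interaction attributes satisfy the
    importance condition (thresholds here are placeholders)."""
    if resource_transfer >= transfer_threshold or usage_freq >= freq_threshold:
        return "head"
    return "mid_long_tail"

# Hypothetical correspondence between (data attribute, object type)
# and sub-database identification information.
ATTRIBUTE_TO_LIBRARY = {
    ("video", "head"): "lib_video_head",
    ("video", "mid_long_tail"): "lib_video_tail",
    ("text", "head"): "lib_text_head",
    ("text", "mid_long_tail"): "lib_text_tail",
}

def route_to_library(category: str, resource_transfer: float,
                     usage_freq: float) -> str:
    """Look up the sub-database for a piece of data from its category
    attribute and its consumer's object type."""
    obj_type = classify_object(resource_transfer, usage_freq)
    return ATTRIBUTE_TO_LIBRARY[(category, obj_type)]
```

In this toy version the importance condition is a simple disjunction of two thresholds; in practice it could combine any of the interaction attributes the paragraph lists.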
Further, sharding is a data storage scheme in which one large database is stored distributed over multiple physical nodes. A sharding algorithm is an algorithm for determining the storage locations of the sub-databases of a database; it may be a key-hash sharding algorithm, a consistent hashing algorithm, a table-based sharding algorithm, and so on. Candidate storage nodes are nodes in the server cluster that can be used to store a sub-database, and the device type of a candidate storage node matches the data sub-database.
Specifically, for each data sub-database, the server may determine the sharding algorithm matched with the sub-database according to its parameter information, determine the target device type matched with the sub-database according to the same parameter information, and then determine, from the plurality of storage nodes of the server cluster, the candidate storage nodes whose device type is the target device type.
Step S206: determine, based on the sharding algorithm matched with each data sub-database, the storage location mapping information corresponding to that sub-database, and determine, from the candidate storage nodes, the target storage node corresponding to each piece of storage location mapping information.
The storage location mapping information is information that can characterize the location of a storage node in the server cluster. For example, when the sharding algorithm is a consistent hashing algorithm, the storage location mapping information may be the position to which the sub-database is mapped on the hash ring; when the sharding algorithm is a table-based sharding algorithm, it may be the position to which the sub-database is mapped in a lookup table.
Specifically, different storage nodes correspond to different mapping positions. Taking CPU nodes as an example, multiple CPU nodes may be mapped to different positions on the hash ring, or to different positions in the lookup table. On this basis, for each data sub-database, the server may determine its storage location mapping information based on the matched sharding algorithm, then match the mapping positions of the candidate storage nodes against the storage location mapping information of the sub-database, thereby determining the target storage node corresponding to each piece of storage location mapping information.
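For the consistent-hashing case, the matching of sub-database positions against node positions on the hash ring can be sketched as follows. The node names, the virtual-node count, and the use of SHA-1 are illustrative assumptions, not details from the patent.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Map nodes onto a hash ring via virtual nodes; a key's target node
    is the first node found walking clockwise from the key's position."""

    def __init__(self, nodes, vnodes_per_node=100):
        self._ring = []  # list of (ring position, node name)
        for node in nodes:
            for v in range(vnodes_per_node):
                self._ring.append((self._hash(f"{node}#{v}"), node))
        self._ring.sort()
        self._positions = [pos for pos, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.sha1(key.encode("utf-8")).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        pos = self._hash(key)
        # First ring position clockwise of pos, wrapping around the ring.
        idx = bisect.bisect_right(self._positions, pos) % len(self._positions)
        return self._ring[idx][1]

ring = ConsistentHashRing(["cpu-node-1", "cpu-node-2", "cpu-node-3"])
target = ring.node_for("sub_db_42")  # the sub-database's target storage node
```

The virtual nodes spread each physical node over many ring positions, so adding or removing a storage node only remaps the keys between adjacent positions rather than reshuffling every sub-database.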
Step S208: store each data sub-database to the target storage node corresponding to it.
Specifically, after determining the target storage node of a data sub-database, the server stores the sub-database on that node. Further, the server may store the data of all sub-databases after determining all of their target storage nodes, or it may determine the target storage node and store the data sub-database by sub-database. In short, the present application does not limit the storage order of the sub-databases.
According to the above data processing method, the database to be stored is obtained, and a server cluster comprising a plurality of storage nodes of at least two device types is determined for storing it. The database is then divided into a plurality of sub-libraries, each stored in a different storage node, which avoids to a certain extent the performance degradation of data processing devices caused by an oversized data scale and improves data processing efficiency. During sharded storage, a matching sharding algorithm and candidate storage nodes are determined for each sub-library according to its parameter information, so that resources can be deployed according to the characteristics of each sub-library. This ensures the degree of matching between storage nodes and sub-libraries, improves the resource utilization of each storage node, and further improves data processing efficiency.
In some embodiments, the parameter information of a database sub-library includes the object attributes of the user of the sub-library and the data volume of the sub-library. In this embodiment, determining the matching sharding algorithm and candidate storage nodes for each sub-library according to its parameter information includes: obtaining the object attributes of the user of the sub-library; if the user is a target object whose object attributes satisfy the importance condition, acquiring the data volume of the sub-library; if the data volume satisfies the data volume condition, determining the table-based sharding algorithm as the sharding algorithm matched with the sub-library and determining image processor (GPU) nodes as its candidate storage nodes; otherwise, determining the consistent hash algorithm as the matched sharding algorithm and central processing unit (CPU) nodes as the candidate storage nodes.
The users of the database sub-libraries include target objects whose object attributes satisfy the importance condition and non-target objects whose object attributes do not satisfy the importance condition. In a specific application, a target object may be, for example, a head (Key Account, KA) object, and a non-target object may be, for example, a mid-long tail object. The importance condition can be determined according to the user's resource transfer quantity, the user's frequency of use of the database, usage requirements, and the like. The data volume condition may be that the data volume is greater than a data volume threshold, or that the data volume is greater than or equal to the threshold. The consistent hash algorithm is an algorithm that uses a hash function to establish a mapping from data to storage nodes. The table-based sharding algorithm is an algorithm that establishes a mapping from data to storage nodes through a lookup table.
Specifically, in an actual application scenario, target objects whose object attributes satisfy the importance condition usually account for a relatively small proportion of users but place relatively high requirements on database performance. Based on this, the server acquires the object attributes of the user of a sub-library, and if the user is a target object whose object attributes satisfy the importance condition, further acquires the data volume of the sub-library. If the data volume satisfies the data volume condition, the table-based sharding algorithm is determined as the sharding algorithm matched with the sub-library and GPU nodes are determined as its candidate storage nodes; otherwise, the consistent hash algorithm is determined as the matched sharding algorithm and CPU nodes as the candidate storage nodes.
Further, the object type of the user of a database sub-library may be defined by the producer of the sub-library, or by a third party different from the producer and the user; correspondingly, the server may acquire the object type of the user from the producer or the third party. It will be appreciated that the producer and the user of a sub-library may be the same or different. For example, the producer may be a search service platform and the user may be a customer of that platform. For another example, the sub-library may be a search library for picture content review, containing a plurality of problem sample pictures, and the producer and the user may be the same institution that performs the review.
In this embodiment, for a sub-library whose user is a target object and whose data volume satisfies the data volume condition, the table-based sharding algorithm and GPU nodes are configured, so that the sub-library of the target object can be routed to a designated GPU machine based on the lookup table; owing to the excellent computing performance of GPUs, real-time data processing for the target object's sub-library can be ensured. Meanwhile, for a sub-library whose user is a target object but whose data volume does not satisfy the data volume condition, a relatively low-cost CPU can already meet the performance requirement, so the consistent hash algorithm and CPU nodes are configured; the target object's search sub-library can be routed to a designated CPU machine based on the hash algorithm, and the scalability of CPU machines allows their resources to be fully exploited. This data processing mode ensures data processing performance while reducing machine cost.
As described above, the user of the database may also be a non-target object whose object attributes do not satisfy the importance condition. In some embodiments, determining the respective matched slicing algorithm and candidate storage node of each database sub-base according to the parameter information of each database sub-base, further comprises: if the user is a non-target object of which the object attribute does not meet the importance condition, the consistent hash algorithm is determined to be a slicing algorithm matched with the database sub-database, and the central processing unit node is determined to be a candidate storage node matched with the database sub-database.
Herein, the specific definitions of the importance condition, the consistent hash algorithm, and the like are given above and are not repeated. Specifically, in an actual application scenario, non-target objects account for a relatively large proportion of users but place relatively low requirements on database performance. Based on this, the server acquires the object attributes of the user of the sub-library; if the user is a non-target object whose object attributes do not satisfy the importance condition, the consistent hash algorithm is determined as the sharding algorithm matched with the sub-library and CPU nodes are determined as its candidate storage nodes.
In this embodiment, for the search sub-libraries corresponding to non-target objects, which account for a relatively large proportion, the consistent hash algorithm and CPU nodes are configured; benefiting from the scalability and low cost of CPU machines, machine cost can be reduced while data processing performance is ensured.
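The selection logic of the two embodiments above can be sketched as follows. This is a minimal illustrative Python sketch, not the patent's implementation: the field names (`is_target_object`, `data_volume`) and the threshold value are assumptions introduced for illustration.

```python
# Illustrative sketch of sharding-algorithm / node-type selection.
# Field names and the threshold are assumptions, not from the patent text.

DATA_VOLUME_THRESHOLD = 10_000_000  # assumed value of the data volume condition

def match_sharding(sub_library: dict) -> tuple:
    """Return (sharding_algorithm, candidate_node_type) for one sub-library."""
    if sub_library["is_target_object"]:                  # importance condition met
        if sub_library["data_volume"] >= DATA_VOLUME_THRESHOLD:
            return "lookup_table", "GPU"                 # table-based sharding on GPU nodes
        return "consistent_hash", "CPU"                  # important but small: CPU suffices
    return "consistent_hash", "CPU"                      # non-target (mid-long tail) objects

print(match_sharding({"is_target_object": True, "data_volume": 20_000_000}))
# -> ('lookup_table', 'GPU')
print(match_sharding({"is_target_object": False, "data_volume": 5_000}))
# -> ('consistent_hash', 'CPU')
```

The branch structure mirrors the embodiments: only a large, important sub-library receives the table-based algorithm and GPU nodes; every other case falls back to consistent hashing on CPU nodes.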
In some embodiments, determining the respective storage location mapping information for each database based on a sharding algorithm that matches the database sub-databases includes: when the slicing algorithm is a consistent hash algorithm, acquiring the identification information corresponding to each database, and calculating the first hash value corresponding to each identification information; and determining the position information of the first hash value on the storage hash ring as storage position mapping information corresponding to the database sub-library.
The storage hash ring is determined based on the respective second hash values of the candidate storage nodes whose device type matches the database sub-library. Specifically, the server may map each candidate storage node to an original virtual node, obtain the node identifier of each candidate storage node, calculate the second hash value of each node identifier based on a hash algorithm, and arrange the original virtual nodes in order of their second hash values to form the storage hash ring. A node identifier may include at least one of the address, the machine number, and other information of the candidate storage node. The hash algorithm may be MD5 (Message-Digest Algorithm 5), SHA (Secure Hash Algorithm), or the like.
Further, the identification information of the database sub-database may include at least one of information such as a file name and a number of the database sub-database. Specifically, when the slicing algorithm is a consistent hash algorithm, the server acquires the identification information corresponding to each database sub-base, calculates the first hash value corresponding to each identification information based on the hash algorithm, and then determines the position information of the first hash value on the storage hash ring as the storage position mapping information corresponding to the database sub-base.
In the above embodiment, the storage location mapping information of the database sub-libraries is determined based on the storage hash ring, so that when machines are added or removed, data migration is confined to two adjacent nodes rather than affecting the whole cluster, and the availability of each sub-library can be ensured.
In a specific application, the data processing method further comprises: and configuring newly added virtual nodes corresponding to each candidate storage node on the storage hash ring. In the case of this embodiment, determining, from among the candidate storage nodes, the target storage node to which each storage location mapping information corresponds, includes: for each storage position mapping information, searching a virtual node closest to the position corresponding to the storage position mapping information according to a set direction on the storage hash ring, and determining a candidate storage node corresponding to the virtual node as a target storage node corresponding to the storage position mapping information.
The virtual nodes comprise original virtual nodes and newly added virtual nodes, the positions of the original virtual nodes on the storage hash ring are determined according to the second hash values of the candidate storage nodes, and the original virtual nodes and the newly added virtual nodes of the same candidate storage node are arranged at intervals. Further, the number of virtual nodes in the interval between the original virtual node and the newly added virtual node of the same candidate storage node may be one or two or more. The number of newly added virtual nodes of the same candidate storage node can be one or a plurality of newly added virtual nodes.
Specifically, after determining the original virtual nodes corresponding to each candidate storage node on the storage hash ring, the server sets the newly added virtual nodes corresponding to each candidate storage node on the storage hash ring based on the positions of the original virtual nodes. Then, for each piece of storage location mapping information, the position to which it maps on the storage hash ring is determined, the virtual node closest to that position in a set direction is found on the storage hash ring, and the candidate storage node corresponding to that virtual node is determined as the target storage node for that storage location mapping information. The set direction may be either clockwise or counterclockwise.
As shown in fig. 3 and 4, the positions of the storage location mapping information of the database sub-libraries K1, K2, K3, and K4 on the storage hash ring are k1, k2, k3, and k4, respectively. On the storage hash ring, the original virtual node of candidate storage node N1 is N1V1 and its newly added virtual node is N1V2; the original virtual node of candidate storage node N2 is N2V1 and its newly added virtual node is N2V2; the original virtual node of candidate storage node N3 is N3V1 and its newly added virtual node is N3V2; the original virtual node of candidate storage node N4 is N4V1 and its newly added virtual node is N4V2.
Taking the clockwise set direction as an example, as shown in fig. 3, when no newly added virtual nodes are set, the target storage node for each of the sub-libraries K1, K2, and K3 is candidate storage node N1, the target storage node for sub-library K4 is candidate storage node N3, and candidate storage nodes N2 and N4 are empty; because the load on candidate storage node N1 is larger, data skew may occur and overall performance may be affected. As shown in fig. 4, by setting newly added virtual nodes interleaved with the original virtual nodes on the storage hash ring, the target storage node of sub-library K1 changes to candidate storage node N2 and the target storage node of sub-library K2 changes to candidate storage node N4, ensuring the data balance of the candidate storage nodes and improving data processing efficiency.
In the above embodiment, by setting newly added virtual nodes corresponding to each original virtual node at intervals on the storage hash ring, the uniformity of node distribution on the storage hash ring can be increased, so that the data uniformity among candidate storage nodes is improved, and the data processing efficiency is improved.
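A minimal Python sketch of such a storage hash ring follows. It is illustrative only: MD5 is used as the hash algorithm (one of the options mentioned above), each node gets one original plus one newly added virtual node, and the hash-derived positions only statistically approximate the interleaved arrangement described in the text.

```python
import bisect
import hashlib

def h(key: str) -> int:
    # hash value of a node identifier (second hash value) or of a
    # sub-library's identification information (first hash value)
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes, virtual_per_node=2):
        # each candidate node contributes an original virtual node (#V0)
        # and newly added virtual nodes (#V1, ...) on the ring
        self.ring = sorted(
            (h(f"{node}#V{i}"), node)
            for node in nodes
            for i in range(virtual_per_node)
        )
        self.points = [p for p, _ in self.ring]

    def target_node(self, sub_library_id: str) -> str:
        # clockwise search: first virtual node at or after the key's position,
        # wrapping around the ring if necessary
        i = bisect.bisect(self.points, h(sub_library_id)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["N1", "N2", "N3", "N4"])
print(ring.target_node("K1"))  # one of N1..N4, deterministic for a given key
```

Because only the virtual nodes adjacent to a key's position matter, adding or removing one machine only moves the keys between that machine and its ring neighbours, matching the locality property described above.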
In some embodiments, step S206 includes: when the slicing algorithm is a table-based slicing algorithm, acquiring identification information corresponding to each database, and determining storage position mapping information corresponding to each database based on the position of the identification information in a lookup table; determining mapping positions of a plurality of candidate storage nodes, of which the equipment types are matched with the database, in a lookup table respectively; and carrying out consistency matching on the mapping positions and the storage position mapping information, and determining target storage nodes corresponding to the storage position mapping information.
The specific limitation on the identification information of a database sub-library is given above and is not repeated here. Specifically, the lookup table includes a database sub-library identifier column and a storage node identifier column. When the sharding algorithm is the table-based sharding algorithm, the server acquires the identification information of each sub-library, determines the position of each piece of identification information in the lookup table based on the sub-library identifier column, and determines the location information of that position as the storage location mapping information of the sub-library. Then, the server acquires the node identifiers of the candidate storage nodes whose device type matches the sub-library, determines the mapping position of each candidate storage node in the lookup table based on the storage node identifier column, and performs consistency matching between the mapping positions and the storage location mapping information: if a mapping position corresponds to a piece of storage location mapping information in the lookup table, the candidate storage node at that mapping position is determined as the target storage node for that storage location mapping information.
In the above embodiment, the method is equivalent to determining the target storage node of the database based on the correspondence between the database and the candidate storage node in the lookup table, and has simple algorithm, thereby being beneficial to improving the data processing efficiency.
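As the text notes, table-based sharding amounts to a direct correspondence between sub-library identifiers and node identifiers. A minimal sketch follows; the table contents and node names are hypothetical.

```python
# Illustrative table-based sharding: the lookup table maps each sub-library
# identifier directly to a storage node identifier. Entries are hypothetical.

lookup_table = {
    "K1": "gpu-node-01",
    "K2": "gpu-node-02",
}

def target_node(sub_library_id: str) -> str:
    """Resolve a sub-library's target storage node from the lookup table."""
    node = lookup_table.get(sub_library_id)
    if node is None:
        raise KeyError(f"sub-library {sub_library_id!r} not in lookup table")
    return node

print(target_node("K1"))  # -> gpu-node-01
```

The simplicity is the point: resolution is a single table read, which is why the text describes this algorithm as simple and favourable to data processing efficiency, at the cost of having to maintain the table explicitly.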
In one embodiment, the device type of each candidate storage node is image processor (GPU). In this embodiment, the data processing method further includes: determining a plurality of alternative storage nodes whose device type is image processor; acquiring the remaining video memory of each alternative storage node, and determining the alternative storage nodes whose remaining video memory satisfies the sub-database storage condition as candidate storage nodes.
Wherein the video memory is the memory of the image processor. The storage condition of the sub-database may be that the remaining video memory is greater than a set value, or that the remaining video memory is greater than or equal to the set value. The set value may be determined based on the amount of data in the sub-database. For example, the set value may be a minimum value of the data amount of each sub-database. In one specific application, the sub-database storage condition is that the remaining video memory is greater than zero.
Specifically, since the storage space of an image processor is relatively small, in order to ensure that the video memory is sufficient, the server may, in the process of determining candidate storage nodes, first determine a plurality of alternative storage nodes whose device type is image processor, then obtain the remaining video memory of each alternative storage node, and screen the alternative nodes based on the remaining video memory: the alternative storage nodes whose remaining video memory satisfies the sub-database storage condition are determined as candidate storage nodes.
In this embodiment, the remaining video memories of the candidate storage nodes are obtained, and the candidate storage nodes whose remaining video memories meet the storage conditions of the sub-database are determined as candidate storage nodes, so that the availability of each candidate storage node can be ensured, further, the smooth proceeding of the subsequent data storage process can be ensured, and the reliability of the data processing method can be improved.
In some embodiments, the target storage node is a primary storage node. In this embodiment, the data processing method further includes: determining the replica parameters corresponding to each database sub-library according to the usage parameters of each sub-library; determining the replica storage nodes corresponding to each sub-library based on the replica parameters; and storing each sub-library in its corresponding replica storage nodes.
The regions where the plurality of replica storage nodes corresponding to the same database sub-library are located are different. In a specific application, the regions where the primary storage node and the replica storage nodes storing the same sub-library are located also differ. The usage parameters of a sub-library may include its frequency of use and the object type of its user. The frequency of use may be determined empirically by the producer of the sub-library, or by the server based on the sub-library's historical frequency of use. The specific definition of the object type of the user is given above and is not repeated here. Further, the replica parameters may include the replica storage priority, the number of replicas, the replica storage regions, and so on.
For each database sub-library, the server can also deploy replica storage nodes in addition to the primary storage node. Specifically, the server determines the replica parameters of each sub-library according to its usage parameters, determines the replica storage nodes of each sub-library based on the replica parameters, and then stores each sub-library in its corresponding replica storage nodes.
In the above embodiment, the server deploys replica storage nodes across regions for each sub-library based on its usage parameters, which reduces the risk of the database becoming unavailable due to the abnormality of individual nodes and improves the reliability of the data processing method.
In some embodiments, the usage parameters include the frequency of usage, and the type of object to which the user belongs. The copy parameters include copy storage priority and the number of copies. In the case of this embodiment, for each database sub-bank, determining, according to the respective usage parameter of each database sub-bank, the respective corresponding duplicate parameter of each database sub-bank includes: determining the number of copies of the database sub-database according to the use frequency of the database sub-database; and determining the copy storage priority of the database according to the use frequency of the database and the type of the object to which the user belongs.
It will be appreciated that sub-libraries with a higher frequency of use are hotter and more likely to receive concurrent requests. Based on this, the server may set the number of replicas of a sub-library whose frequency of use satisfies the high-frequency use condition to a first, relatively large replica number, and set that of a sub-library whose frequency of use does not satisfy the condition to a second, relatively small replica number. The high-frequency use condition may mean that the frequency of use is greater than, or greater than or equal to, a frequency threshold.
Further, since the machine resources of the server cluster are limited, the storage requirements of the copies of all the databases may not be met, based on this, the server may determine the priority of storing the copies of the databases according to the frequency of use of the databases and the object type to which the user belongs, and sequentially store the copies according to the priority order of storing the copies, so as to ensure that the databases with higher priorities can configure a sufficient number of copies.
In one specific application, the users of the database sub-libraries include head objects whose object attributes satisfy the importance condition and mid-long tail objects whose object attributes do not; a sub-library whose frequency of use satisfies the high-frequency use condition is called a hot sub-library, and one whose frequency of use does not is called a cold sub-library. The replica storage priority, from high to low, is: head object hot sub-library, head object cold sub-library, mid-long tail object hot sub-library, mid-long tail object cold sub-library.
In the above embodiment, the number of copies and the priority of storing the copies of the database are determined based on the use frequency and the object type to which the user belongs, which is equivalent to associating the copy configuration process with the application condition of the database, so that the high concurrency use requirement of the database used at high frequency can be met, the resource waste of the database used at low frequency is avoided, and the high availability of the system is improved.
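The replica-parameter rule described above can be sketched in a few lines. This is an illustrative assumption-laden sketch: the frequency threshold, the two replica counts, and the numeric priority encoding (lower = stored earlier) are all made up for the example.

```python
# Illustrative replica-parameter rule; threshold, counts, and priority
# encoding are assumptions, not values from the patent text.

HIGH_FREQ_THRESHOLD = 100            # assumed high-frequency use condition
HOT_REPLICAS, COLD_REPLICAS = 3, 1   # assumed first / second replica numbers

# lower number = higher replica storage priority:
# head-hot > head-cold > mid-long-tail-hot > mid-long-tail-cold
PRIORITY = {("head", True): 0, ("head", False): 1,
            ("tail", True): 2, ("tail", False): 3}

def replica_params(use_frequency: int, object_type: str) -> tuple:
    """Return (number_of_replicas, storage_priority) for one sub-library."""
    hot = use_frequency >= HIGH_FREQ_THRESHOLD
    copies = HOT_REPLICAS if hot else COLD_REPLICAS
    return copies, PRIORITY[(object_type, hot)]

print(replica_params(500, "head"))  # hot head sub-library -> (3, 0)
```

Sorting sub-libraries by the returned priority and storing replicas in that order realizes the behaviour described above: high-priority sub-libraries are guaranteed their full replica count before machine resources run out.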
In a specific application, the database to be stored is a search database, and the data processing method further includes: mapping each primary storage node and each replica storage node to a service hash ring; acquiring a retrieval request carrying search library information; determining the target primary storage node and the target replica storage nodes of the target search library corresponding to the search library information; and determining, from the target primary storage node and the target replica storage nodes, the service node for responding to the retrieval request, based on the mapping position of the retrieval request on the service hash ring and the respective mapping positions of the target primary storage node and the target replica storage nodes on the service hash ring.
The mapping positions of the primary storage node and the replica storage nodes of the same database sub-library on the service hash ring are separated by at least one node. Specifically, the server may map the node identifier of each primary storage node and each replica storage node to the service hash ring via the hash value obtained after hash calculation. The specific limitation on node identifiers is given above. Since the regions where the replica storage nodes storing the same sub-library are located differ, their node identifiers differ considerably, and therefore their mapping positions on the service hash ring are separated by at least one node. In a specific application, the server may configure the primary storage node and the replica storage nodes storing the same sub-library at intervals, so as to improve the uniformity of node distribution. As shown in fig. 5, the mapping positions of the two storage nodes of the sub-library "slice-SET2" on the service hash ring are SET2-1 and SET2-2, and the mapping positions of the three storage nodes of the sub-library "slice-SET4" are SET4-1, SET4-2, and SET4-3.
Further, in the application process, the server receives a search request initiated by the terminal, acquires search library information carried in the search request, determines a target original storage node and a target copy storage node of a target search library corresponding to the search library information, and calculates a hash value corresponding to the search request based on a hash algorithm and determines a mapping position of the search request on a service hash ring according to the hash value. The server then determines a service node from the target primary storage node and the target replica storage node to respond to the retrieval request based on the mapping location of the retrieval request on the service hash ring and the respective mapping locations of the target primary storage node and the target replica storage node on the service hash ring.
Further, the server may search, in a preset direction on the service hash ring, for the storage node mapping position closest to the mapping position of the retrieval request, and determine the storage node at that position as the service node for responding to the retrieval request. Alternatively, the server may directly determine the storage node mapping position at the shortest distance from the mapping position of the retrieval request, and determine the corresponding storage node as the service node.
In addition, if the service node is in an available state, the server forwards the retrieval request to the service node and acquires the retrieval result it feeds back. If the service node is in an unavailable state, the server determines another service node storing the target search library according to the same rule. The service node being in an available state may mean that the storage node is idle, or that the storage node has no abnormality. As shown in fig. 5, taking the case where the mapping position of the retrieval request on the service hash ring is Q and the target search library is "slice-SET4" as an example, the server may search clockwise and determine the storage node "Node3" corresponding to SET4-3 as the service node; if "Node3" is unavailable, the server may determine the storage node "Node1" corresponding to SET4-1 as the service node, and so on, until the determined service node is available.
After obtaining the retrieval request, the server may perform parameter verification on it and filter out illegal retrieval requests. An illegal retrieval request may be one whose search library information is illegal, one whose initiating user does not have access rights to the search library corresponding to the search library information, and so on.
In this embodiment, each primary storage node and each replica storage node is mapped to the service hash ring, and access to the search library is configured based on the service hash ring, so that balanced access to the storage nodes of the same search sub-library can be realized and retrieval efficiency improved.
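The clockwise-with-fallback lookup on the service hash ring can be sketched as follows. It is a minimal illustrative sketch: MD5 positions, the node names from fig. 5, and the availability map are assumptions for the example.

```python
import bisect
import hashlib

def h(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def pick_service_node(request_key, node_positions, available):
    """node_positions: sorted list of (position, node_id) on the service hash
    ring for the target search library's primary and replica storage nodes.
    Walk clockwise from the request's position until an available node is
    found; return None if no storage node of this library is available."""
    points = [p for p, _ in node_positions]
    start = bisect.bisect(points, h(request_key))
    for step in range(len(node_positions)):
        node = node_positions[(start + step) % len(node_positions)][1]
        if available.get(node, False):
            return node
    return None

# hypothetical: three storage nodes of one search library, one unavailable
ring = sorted((h(n), n) for n in ["SET4-1", "SET4-2", "SET4-3"])
print(pick_service_node("Q", ring, {"SET4-1": True, "SET4-2": True, "SET4-3": False}))
```

The fallback loop is what the text describes with "and so on, until the determined service node is available": unavailable nodes are simply skipped in clockwise order.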
In some embodiments, as shown in fig. 6, the data processing method includes:
step S601, obtaining a database to be stored, and determining a server cluster for storing the database to be stored;
wherein the server cluster comprises a plurality of storage nodes of at least two device types;
step S602, dividing a database to be stored into a plurality of data sub-databases, and determining the object attribute of each user of each data sub-database;
step S603, if the user of the database is a target object whose object attribute satisfies the importance condition, acquiring the data volume of the database;
step S604, if the data volume meets the data volume condition, determining a table-based slicing algorithm as a slicing algorithm matched with the database sub-database, and determining an image processor node as a candidate storage node matched with the database sub-database;
step S605, if the data quantity does not meet the data quantity condition, determining a consistent hash algorithm as a slicing algorithm matched with the database, and determining a central processing unit node as a candidate storage node matched with the database;
Step S606, if the user of the database is a non-target object whose object attribute does not meet the importance condition, determining the consistent hash algorithm as a slicing algorithm matched with the database, and determining the central processing unit node as a candidate storage node matched with the database;
step S607, when the slicing algorithm is the table-based slicing algorithm, acquiring the identification information of each data sub-database, and determining the storage position mapping information of each data sub-database based on the position of its identification information in the lookup table;
step S608, determining the respective mapping positions, in the lookup table, of the plurality of candidate storage nodes whose device types match the data sub-databases;
step S609, performing consistency matching between the mapping positions and the storage position mapping information, and determining the target storage node corresponding to each piece of storage position mapping information;
step S610, when the slicing algorithm is the consistent hash algorithm, determining a storage hash ring based on the respective second hash values of the candidate storage nodes whose device types match the data sub-databases;
step S611, acquiring the identification information corresponding to each data sub-database, calculating the first hash value corresponding to each piece of identification information, and determining the position information of the first hash value on the storage hash ring as the storage position mapping information corresponding to the sub-database;
Step S612, configuring newly added virtual nodes corresponding to each candidate storage node on the storage hash ring;
wherein the original virtual nodes and the newly added virtual nodes of the same candidate storage node are arranged at intervals on the ring; the position of an original virtual node on the storage hash ring is determined according to the second hash value of the candidate storage node;
step S613, for each storage position mapping information, searching a virtual node closest to a position corresponding to the storage position mapping information according to a set direction on the storage hash ring, and determining a candidate storage node corresponding to the virtual node as a target storage node corresponding to the storage position mapping information;
the virtual nodes comprise original virtual nodes and newly added virtual nodes;
step S614, obtaining the use frequency of each data sub-database and the object type of the user to which each data sub-database belongs;
step S615, for each data sub-database, determining the number of copies of the sub-database according to its use frequency, and determining the copy storage priority of the sub-database according to its use frequency and the object type of the user to which it belongs;
step S616, determining the copy storage nodes corresponding to each data sub-database based on the copy storage priority and the number of copies, and storing each data sub-database to its corresponding copy storage nodes;
wherein the plurality of copy storage nodes corresponding to the same data sub-database are located in different regions;
step S617, mapping each original storage node and each copy storage node to a service hash ring respectively;
step S618, obtaining a service request carrying database information, and determining a target original storage node and a target copy storage node of a target database corresponding to the database information;
step S619, determining a service node from the target original storage node and the target copy storage node to respond to the service request based on the mapping position of the service request on the service hash ring and the respective mapping positions of the target original storage node and the target copy storage node on the service hash ring.
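The branch logic of steps S602 to S606 can be sketched as follows. This is an illustrative Python sketch only: the threshold value, the function name, and the boolean importance flag are assumptions, not values given by the application.

```python
# Illustrative threshold for the "data volume condition" (an assumption;
# the application only says the condition distinguishes large libraries).
DATA_VOLUME_THRESHOLD = 100_000

def select_strategy(is_target_object: bool, data_volume: int):
    """Return (slicing_algorithm, candidate_node_type) for one data sub-database."""
    if is_target_object and data_volume >= DATA_VOLUME_THRESHOLD:
        # Important user with a large sub-database: table-based slicing on GPU nodes.
        return "table_lookup", "GPU"
    # All other sub-databases: consistent hashing on CPU nodes.
    return "consistent_hash", "CPU"
```

In this sketch only the combination of an important user and a large library takes the GPU path; every other combination falls through to the consistent-hash/CPU branch, matching steps S605 and S606.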
In some embodiments, the data processing method provided by the application can be applied to a data backup scenario. In this scenario, the server obtains a database to be stored and determines a server cluster for storing the database to be stored, the server cluster comprising a plurality of storage nodes of at least two device types; the server divides the database to be stored into a plurality of data sub-databases and determines, according to the parameter information of each data sub-database, the slicing algorithm and the candidate storage nodes matched with each sub-database, the device type of a candidate storage node being matched with the sub-database; the server then determines, based on the slicing algorithm matched with each sub-database, the storage position mapping information corresponding to the sub-database, and determines, from the plurality of candidate storage nodes whose device types match the sub-database, the target storage node corresponding to each piece of storage position mapping information; finally, each data sub-database is stored to its corresponding target storage node.
In some embodiments, the data processing method provided by the application can be applied to content retrieval scenarios, such as content auditing, intelligent security, smart community, and smart retail scenarios. The content to be retrieved may be a biometric feature such as a face, a voice, or a fingerprint, or may be an object image.
Taking the public cloud image content auditing scenario as an example, the retrieval library in this scenario holds a very large volume of data and may face high-concurrency retrieval demands. If a conventional data processing algorithm is adopted, the following problems may arise:
First, system availability is poor. For example, if a conventional modulo hash algorithm is used to shard the retrieval library, a change in the number of machines in the retrieval cluster (due to failure or capacity expansion) changes the hash modulus, so the routing of historical shard data becomes invalid and retrieval is unavailable. For another example, if replicas are placed by a random full-permutation mechanism, then as the number of failed nodes increases, a certain retrieval library may be left with no surviving copy at all.
Second, resource utilization is low. For example, if the conventional hash algorithm is used to shard the retrieval library, data skew easily occurs: most of the retrieval data is distributed to a small fraction of the machines, while the majority of machines hold little retrieval data, leaving the computing resources of most machines in the retrieval system idle. For another example, if the number of copies is fixed when replicas are made in the traditional way, the count is generally set high to satisfy the high-concurrency requests of hot libraries, so cold-library resources may sit idle and be wasted.
Based on the above, the application provides a retrieval library deployment mechanism based on multi-dimensional retrieval library slicing and set-based multi-copy replication, which can be applied to public cloud image content auditing scenarios of CV (Computer Vision) technology. Specifically, the multi-dimensional slicing mechanism enables efficient scheduling of heterogeneous CPU and GPU machines and improves resource utilization; the cross-region, de-randomized characteristics of the set-based multi-copy replication mechanism strongly guarantee retrieval availability; and the adjustable replica count fully supports the high-concurrency retrieval demands of hot retrieval libraries.
Multi-dimensional slicing comprises two slicing strategies: consistent hash slicing and table-lookup slicing. Consistent hash slicing maps the CPU cluster machines uniformly onto a ketama hash ring based on the ketama virtual-node mechanism, routes the retrieval data of mid-to-long-tail clients or small KA (key account) client libraries to the corresponding CPU clusters, and ensures fault tolerance after a CPU cluster failure, scalability of the CPU machines, and uniform distribution of the retrieval libraries, so as to fully exploit CPU machine resources. For table-lookup slicing, the routing table of KA client libraries is configured in the configuration center by the operations side, and the retrieval system routes KA client retrieval data to the designated GPU machines based on this routing table, using the excellent computing performance of the GPU to guarantee real-time retrieval of KA client libraries.
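The table-lookup branch can be illustrated with a minimal sketch. All names are hypothetical, and the round-robin table construction below is an assumption for demonstration; in the application the routing table is configured by the operations side in the configuration center rather than generated.

```python
def build_route_table(ka_library_ids, gpu_nodes):
    """Hypothetical routing table: pin each large KA client library to a GPU node
    (round-robin here; a real table would be operator-configured)."""
    return {lib_id: gpu_nodes[i % len(gpu_nodes)]
            for i, lib_id in enumerate(ka_library_ids)}

def route_by_table(route_table, lib_id):
    """Route retrieval data for a KA library to its designated GPU machine."""
    node = route_table.get(lib_id)
    if node is None:
        # No entry means the operations side has not configured this library yet.
        raise KeyError(f"no routing entry configured for library {lib_id}")
    return node
```

Because the table is explicit, routing for large KA libraries is deterministic and independent of cluster size changes, unlike hash-based routing.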
In the set-based multi-copy replication mechanism, physical nodes are abstracted into a replica set corresponding to one business retrieval library. Access within a set is random and the number of replicas is adjustable, so a hotspot retrieval library can be expanded quickly; access across sets is balanced by consistent hashing; and the physical nodes within a set are generally configured across regions, so that the unavailability of an entire retrieval library caused by multi-machine failures is reduced as much as possible.
The method is suitable for the massive-scale retrieval scenarios of high-availability, high-concurrency head clients in current mainstream image auditing. After this technical scheme is introduced, the public cloud image content auditing system obtains more available and more timely retrieval information, so that auditing users can formulate more reasonable and safer operation strategies according to timely retrieval results, helping to safeguard content security and client revenue.
In one embodiment, as shown in fig. 7, in the application stage the data access layer provides a standard interface through which the data storage requests of KA clients and mid-to-long-tail clients are accessed uniformly, and filters out illegal requests. Multi-dimensional slicing is then performed according to client type and library size (the threshold is configurable, e.g., 100,000 entries): large KA client libraries are sharded by table lookup and their retrieval data is cached on the GPU cluster, so that high-performance GPU cards guarantee retrieval real-time performance; mid-to-long-tail clients and small KA client libraries are sharded by consistent hashing and their retrieval data is cached on the low-cost CPU cluster, the consistent hashing principle ensuring uniform distribution of the retrieval data. In addition, to avoid retrieval service unavailability caused by damage to a retrieval library deployed on a single node, a replica mechanism is introduced. Unlike the conventional random replication with a fixed number of copies, the application correlates the replica count with the business-layer retrieval library: a hot retrieval library is given more replicas (such as set3 in fig. 7) and a cold retrieval library fewer, and the replica set is selected at random from cross-region machine rooms, which reduces as much as possible the chance that random node failures make an entire retrieval library unavailable.
As shown in fig. 8, multi-dimensional slicing includes two phases: node hashing and data hashing. The node hashing phase determines the mapping positions of CPU nodes on the hash ring and of GPU nodes in the retrieval routing table. In this phase, the server determines the node hashing strategy from client statistics, by client type and library size: for mid-to-long-tail clients and small KA client libraries, a node consistent-hash initialization flow is performed, while a GPU routing cluster table is established for large KA client libraries. While initializing the retrieval routing table, the remaining video memory of each GPU node is obtained and checked to ensure that all GPU nodes in the table are available. In the data hashing phase, the server receives retrieval data from multiple clients and decides the data hashing strategy by client type and library size: retrieval data of large KA client libraries is routed to the designated GPU cluster for caching according to the routing table, while retrieval data of small KA client libraries and mid-to-long-tail clients is sharded by consistent hashing.
The specific procedure of consistent hash sharding is as follows. As shown in figs. 3 and 4, the whole hash value space is mapped onto a virtual ring with value range 0 to 2^32 - 1, each integer on the ring representing a position. First, the CPU cluster machines to be used are placed on the ring: a hash function is computed over each CPU machine's IP to obtain a hash value (e.g., 100), and the CPU node is mapped to the corresponding position on the ring. Then a hash function is computed over the retrieval data, e.g., over the retrieval group_id, to obtain a hash value; the nearest node in the clockwise direction from each value is the node where that retrieval data is cached.
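The ring-entry and clockwise-lookup procedure just described can be sketched as follows, assuming an MD5-based hash reduced modulo 2^32; the actual hash function used by the application is not specified.

```python
import bisect
import hashlib

RING_SPACE = 2 ** 32  # hash value space of the ring: 0 .. 2^32 - 1

def ring_hash(key: str) -> int:
    """Map a machine IP or a retrieval group_id to a position on the ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % RING_SPACE

class ConsistentHashRing:
    def __init__(self, node_ips):
        # Place each CPU node at the hash of its IP, sorted for clockwise search.
        self.ring = sorted((ring_hash(ip), ip) for ip in node_ips)

    def route(self, group_id: str) -> str:
        """The nearest node clockwise from the data's hash caches the data."""
        idx = bisect.bisect_right(self.ring, (ring_hash(group_id),))
        return self.ring[idx % len(self.ring)][1]  # wrap past 2^32 - 1
```

The point of this structure is that adding or removing a node only remaps the keys on the arc between that node and its counterclockwise neighbor, which is why historical shard routing survives cluster changes.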
As shown in fig. 9, after the retrieval data is sharded and stored according to the multi-dimensional slicing mechanism, each retrieval sub-library actually resides on only one physical machine, and if that machine fails, retrieval fails outright; a multi-copy mechanism is therefore introduced to avoid this situation. First, client types and library access hotness are counted, priorities are ordered by client type and library hotness, and replicas of the retrieval libraries are copied in turn by priority, from high to low: KA client hot library, KA client cold library, long-tail client hot library, long-tail client cold library. A hot library is given more replicas than a cold library, and the replicas of the same retrieval sub-library are stored on replica nodes in different regions to avoid unavailability caused by machine failures within one region. If the remaining machine resources are insufficient during replication, an alarm is returned directly. Note that, because mid-to-long-tail client retrieval libraries are generally smaller than KA client retrieval libraries, as shown in fig. 9, if the current remaining machine resources are insufficient to replicate a KA client retrieval library, it may be further determined whether they can meet the processing requirements of a mid-to-long-tail client hot library. As shown in fig. 5, the storage nodes of the retrieval libraries may form a service hash ring to achieve balanced access in subsequent business processing.
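The priority ordering and the hot/cold replica policy just described can be sketched as follows; the concrete replica counts and the hotness threshold are illustrative assumptions, not values given by the application.

```python
# Priority order for replication, from high to low, as described above.
PRIORITY = ["KA_hot", "KA_cold", "longtail_hot", "longtail_cold"]

def classify(is_ka: bool, access_heat: float, hot_threshold: float = 100.0) -> str:
    """Categorize a retrieval library by client type and access hotness."""
    tier = "hot" if access_heat >= hot_threshold else "cold"
    return ("KA_" if is_ka else "longtail_") + tier

def replica_count(category: str) -> int:
    """Hot libraries get more replicas than cold ones (counts are illustrative)."""
    return {"KA_hot": 3, "longtail_hot": 2, "KA_cold": 1, "longtail_cold": 1}[category]

def replication_order(libraries):
    """libraries: iterable of (name, is_ka, access_heat) tuples.
    Returns library names in the order their replicas should be copied."""
    return [name for name, is_ka, heat in
            sorted(libraries, key=lambda x: PRIORITY.index(classify(x[1], x[2])))]
```

Replicating in this order means that when remaining machine resources run out, it is the lowest-priority (long-tail cold) libraries that go without extra copies first.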
The data processing method has the beneficial effects that:
First, high-concurrency application requirements can be met. For the characteristics of public cloud image auditing (high real-time retrieval requirements, massive retrieval libraries, and head clients), the application reasonably shards the retrieval data of different client types onto suitable machines under the limited resources of heterogeneous CPU and GPU machines, guaranteeing retrieval performance; and for the public cloud's high-availability retrieval requirement, set-based multi-copy deployment of the retrieval libraries, with adjustable in-set replica counts and cross-region placement, greatly improves the concurrency and availability of the retrieval system.
Second, retrieval performance is excellent. The application shards the retrieval library, whose scale in image content auditing is large and growing markedly. Based on the observations that in content auditing retrieval more than 90% of mid-to-long-tail client retrieval libraries are small (< 1K) while fewer than 10% of KA client retrieval libraries are large (> 1K), and that GPU resources are expensive, the mid-to-long-tail client libraries are distributed uniformly onto inexpensive CPU clusters by the consistent hash slicing strategy, and the large KA client libraries are sharded onto designated GPU clusters by the table-lookup strategy; economical CPU resources guarantee the retrieval performance of ordinary libraries and high-performance GPU resources guarantee that of large libraries, so the retrieval performance of the whole system is excellent.
Third, high availability. For the public cloud's high-availability requirement, the application performs multi-copy backup of the retrieval libraries. To address hotspot libraries in image content auditing and the multi-node failures caused by pseudo-random replica distribution, a set-based multi-copy mechanism is formulated: several cross-region physical machines are abstracted into a set corresponding to one retrieval library, access within the set is random, the number of replicas is adjustable, cold libraries are given fewer replicas and hot libraries more, and the availability of the retrieval libraries is thereby greatly improved.
Fourth, machine cost is low. Through multi-dimensional slicing, the small number of large KA client libraries is sharded onto expensive GPU machines while the large number of small mid-to-long-tail libraries is sharded onto inexpensive CPU machines, and the virtual nodes of ketama consistent hashing keep the shard distribution uniform, fully exploiting the computing resources. In addition, the dynamically adjustable replica count of the set-based multi-copy mechanism, with differentiated settings for hot and cold libraries, avoids machine idling caused by a fixed replica count.
It should be understood that, although the steps in the flowcharts of the above embodiments are shown in sequence as indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the execution order of the steps is not strictly limited, and the steps may be performed in other orders. Moreover, at least some of the steps in those flowcharts may comprise multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments; their order of execution is not necessarily sequential, and they may be performed in turn or alternately with at least part of the other steps or stages.
Based on the same inventive concept, the embodiment of the application also provides a data processing device for realizing the above related data processing method. The implementation of the solution provided by the device is similar to the implementation described in the above method, so the specific limitation of one or more embodiments of the data processing device provided below may refer to the limitation of the data processing method hereinabove, and will not be repeated herein.
In some embodiments, as shown in FIG. 10, there is provided a data processing apparatus 1000 comprising: an acquisition module 1002, a sharding algorithm and candidate storage node determination module 1004, a target storage node determination module 1006, and a data storage module 1008, wherein:
an obtaining module 1002, configured to obtain a database to be stored, and determine a server cluster for storing the database to be stored; the server cluster includes a plurality of storage nodes of at least two device types;
the slicing algorithm and candidate storage node determining module 1004 is configured to divide the database to be stored into a plurality of data sub-databases, and determine, according to the parameter information of each data sub-database, the slicing algorithm and candidate storage nodes matched with each sub-database; the device type of a candidate storage node is matched with the sub-database;
A target storage node determining module 1006, configured to determine storage location mapping information corresponding to each database based on a slicing algorithm matched with the database, and determine, from each candidate storage node, a target storage node corresponding to each storage location mapping information;
and the data storage module 1008 is used for respectively storing each database sub-database to the target storage node corresponding to the database sub-database.
In some embodiments, the parameter information of a data sub-database includes the object attribute of the user to which the sub-database belongs, and the data volume of the sub-database. In this embodiment, the slicing algorithm and candidate storage node determination module 1004 includes: an object attribute acquisition sub-module, configured to acquire the object attribute of the user to which the sub-database belongs; a data volume acquisition sub-module, configured to acquire the data volume of the sub-database if the user is a target object whose object attribute satisfies the importance condition; and a slicing algorithm and candidate storage node configuration module, configured to, if the data volume satisfies the data volume condition, determine the table-based slicing algorithm as the slicing algorithm matched with the sub-database and determine a graphics processor node as a candidate storage node matched with the sub-database, and, if the data volume does not satisfy the data volume condition, determine the consistent hash algorithm as the slicing algorithm matched with the sub-database and determine a central processing unit node as a candidate storage node matched with the sub-database.
In some embodiments, the sharding algorithm and candidate storage node configuration module is further to: if the user is a non-target object of which the object attribute does not meet the importance condition, the consistent hash algorithm is determined to be a slicing algorithm matched with the database sub-database, and the central processing unit node is determined to be a candidate storage node matched with the database sub-database.
In some embodiments, the target storage node determination module 1006 includes a mapping information determination sub-module for: when the slicing algorithm is a consistent hash algorithm, acquiring the identification information corresponding to each database, and calculating the first hash value corresponding to each identification information; determining the position information of the first hash value on the storage hash ring as storage position mapping information corresponding to the database sub-library; the storage hash ring is determined based on respective second hash values of the candidate storage nodes whose device types match the database.
In some embodiments, the mapping information determination sub-module is further configured to: configure newly added virtual nodes corresponding to each candidate storage node on the storage hash ring, the original virtual nodes and newly added virtual nodes of the same candidate storage node being arranged at intervals on the ring; the location of an original virtual node on the storage hash ring is determined based on the second hash value of the candidate storage node. In this embodiment, the target storage node determination module 1006 further includes a target storage node determination sub-module configured to: for each piece of storage position mapping information, search, in a set direction on the storage hash ring, for the virtual node closest to the position corresponding to the storage position mapping information, and determine the candidate storage node corresponding to that virtual node as the target storage node corresponding to the storage position mapping information; the virtual nodes include the original virtual nodes and the newly added virtual nodes.
In some embodiments, the target storage node determination module 1006 is specifically configured to: when the slicing algorithm is a table-based slicing algorithm, acquiring identification information corresponding to each database, and determining storage position mapping information corresponding to each database based on the position of the identification information in a lookup table; determining the mapping positions of the candidate storage nodes in the lookup table; and carrying out consistency matching on the mapping positions and the storage position mapping information, and determining target storage nodes corresponding to the storage position mapping information.
In some embodiments, the device type of each candidate storage node is graphics processor. In this embodiment, the slicing algorithm and candidate storage node determination module 1004 is specifically configured to: determine a plurality of alternative storage nodes whose device type is graphics processor; and acquire the remaining video memory of each alternative storage node, and determine the alternative storage nodes whose remaining video memory satisfies the sub-database storage condition as candidate storage nodes.
In some embodiments, the target storage node is an original storage node; the data processing apparatus 1000 further comprises a replica storage node determination module configured to: determine the copy parameters corresponding to each data sub-database according to its usage parameters; determine the copy storage nodes corresponding to each sub-database based on the copy parameters, the plurality of copy storage nodes corresponding to the same sub-database being located in different regions; and store each sub-database to its corresponding copy storage nodes.
In some embodiments, the usage parameters include the frequency of usage, and the type of object to which the user belongs; the copy parameters include copy storage priority and the number of copies. In the case of this embodiment, the duplicate storage node determination module is specifically configured to: determining the number of copies of each database sub-base according to the use frequency of the database sub-base; and determining the copy storage priority of the database according to the use frequency of the database and the type of the object to which the user belongs.
In some embodiments, the database to be stored is a search library; the data processing apparatus 1000 further comprises a service node determining module for: mapping each original storage node and each copy storage node to a service hash ring respectively; acquiring a retrieval request carrying retrieval library information; determining a target original storage node and a target copy storage node of a target retrieval library corresponding to the retrieval library information; the service node for responding to the search request is determined from the target primary storage node and the target replica storage node based on the mapping position of the search request on the service hash ring and the respective mapping positions of the target primary storage node and the target replica storage node on the service hash ring.
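The service-node selection in this module can be sketched as a second, per-library hash ring over the original node and its replicas; the hash choice and all names below are assumptions.

```python
import bisect
import hashlib

def service_hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2 ** 32)

def pick_service_node(request_key: str, original_node: str, replica_nodes):
    """Map the target library's original and replica storage nodes onto a
    service hash ring, then serve the request from the node nearest
    clockwise to the request's own hash, spreading load across copies."""
    ring = sorted((service_hash(n), n) for n in [original_node, *replica_nodes])
    idx = bisect.bisect_right(ring, (service_hash(request_key),))
    return ring[idx % len(ring)][1]
```

Because different request keys hash to different ring positions, concurrent retrieval requests for the same library are distributed over the original node and all of its replicas rather than concentrating on one copy.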
Each of the modules in the above-described data processing apparatus may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In some embodiments, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 11. The computer device includes a processor, a memory, an Input/Output interface (I/O) and a communication interface. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is used for storing data involved in the data processing method. The input/output interface of the computer device is used to exchange information between the processor and the external device. The communication interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a data processing method.
It will be appreciated by those skilled in the art that the structure shown in FIG. 11 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In some embodiments, a computer device is provided, comprising a memory, in which a computer program is stored, and a processor, which implements the steps of the data processing method described above when executing the computer program.
In some embodiments, a computer readable storage medium is provided, on which a computer program is stored which, when executed by a processor, implements the steps of the data processing method described above.
In some embodiments, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the data processing method described above.
It should be noted that, the object information (including, but not limited to, object device information, object personal information, etc.) and the data (including, but not limited to, data for analysis, stored data, presented data, etc.) related to the present application are both information and data authorized by the object or sufficiently authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards of the related countries and regions.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in embodiments provided herein may include at least one of non-volatile and volatile memory. The nonvolatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical Memory, high density embedded nonvolatile Memory, resistive random access Memory (ReRAM), magnetic random access Memory (Magnetoresistive Random Access Memory, MRAM), ferroelectric Memory (Ferroelectric Random Access Memory, FRAM), phase change Memory (Phase Change Memory, PCM), graphene Memory, and the like. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory, and the like. By way of illustration, and not limitation, RAM can be in the form of a variety of forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM), and the like. The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processor referred to in the embodiments provided in the present application may be a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, a data processing logic unit based on quantum computing, or the like, but is not limited thereto.
The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features contains no contradiction, it should be considered to fall within the scope of this description.
The foregoing examples represent only a few embodiments of the application and are described in detail, but they are not to be construed as limiting the scope of the application. It should be noted that those skilled in the art can make several variations and modifications without departing from the spirit of the application, all of which fall within the protection scope of the application. Accordingly, the scope of the application shall be determined by the appended claims.

Claims (13)

1. A data processing method, the method comprising:
acquiring a database to be stored, and determining a server cluster for storing the database to be stored; the server cluster comprises a plurality of storage nodes of at least two device types;
dividing the database to be stored into a plurality of sub-databases, and determining, according to parameter information of each sub-database, a sharding algorithm and candidate storage nodes matched with the sub-database; the device type of the candidate storage nodes is matched with the sub-database;
determining storage-location mapping information corresponding to each sub-database based on the sharding algorithm matched with the sub-database, and determining, from the candidate storage nodes, a target storage node corresponding to each piece of storage-location mapping information; and
storing each sub-database to the target storage node corresponding to the sub-database.
2. The method according to claim 1, wherein the parameter information of a sub-database comprises an object attribute of a user of the sub-database and a data amount of the sub-database; and
the determining, according to the parameter information of each sub-database, the sharding algorithm and the candidate storage nodes matched with the sub-database comprises:
acquiring the object attribute of the user of the sub-database;
if the user is a target object whose object attribute meets an importance condition, acquiring the data amount of the sub-database;
if the data amount meets a data-amount condition, determining a table-based sharding algorithm as the sharding algorithm matched with the sub-database, and determining graphics processor nodes as the candidate storage nodes matched with the sub-database; and
otherwise, determining a consistent hashing algorithm as the sharding algorithm matched with the sub-database, and determining central processing unit nodes as the candidate storage nodes matched with the sub-database.
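As an illustrative sketch of the selection rule in claims 2 and 3 (the function name, boolean predicate, and threshold value below are assumptions for illustration, not values taken from the claims):

```python
# Hypothetical sketch of claims 2-3: pick a sharding algorithm and a candidate
# node type for one sub-database. The threshold is an illustrative assumption.
def match_shard_strategy(meets_importance: bool, data_amount: int,
                         amount_threshold: int = 1_000_000):
    """Return (sharding_algorithm, candidate_node_type) for one sub-database."""
    if meets_importance and data_amount >= amount_threshold:
        # Important user with a large sub-database: table-based sharding on
        # graphics processor (GPU) nodes.
        return "table_based", "gpu"
    # All other cases fall back to consistent hashing on CPU nodes.
    return "consistent_hash", "cpu"
```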
3. The method according to claim 2, wherein the determining, according to the parameter information of each sub-database, the sharding algorithm and the candidate storage nodes matched with the sub-database further comprises:
if the user is a non-target object whose object attribute does not meet the importance condition, determining a consistent hashing algorithm as the sharding algorithm matched with the sub-database, and determining central processing unit nodes as the candidate storage nodes matched with the sub-database.
4. The method according to claim 1, wherein the determining storage-location mapping information corresponding to each sub-database based on the sharding algorithm matched with the sub-database comprises:
when the sharding algorithm is a consistent hashing algorithm, acquiring identification information corresponding to each sub-database, and calculating a first hash value corresponding to each piece of identification information; and
determining the position information of the first hash value on a storage hash ring as the storage-location mapping information corresponding to the sub-database; the storage hash ring is determined based on respective second hash values of the candidate storage nodes whose device type is matched with the sub-database.
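The "first hash value" of claim 4 can be sketched as a position on a fixed-size ring; MD5 and the 2^32 ring size here are illustrative choices, not specified by the claims:

```python
import hashlib

RING_SIZE = 2 ** 32  # illustrative ring size

def ring_position(identifier: str) -> int:
    """Map a sub-database identifier (or a storage-node identifier, for the
    second hash value) onto the storage hash ring."""
    digest = hashlib.md5(identifier.encode("utf-8")).hexdigest()
    return int(digest, 16) % RING_SIZE
```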
5. The method according to claim 4, further comprising:
configuring, on the storage hash ring, newly added virtual nodes corresponding to the candidate storage nodes, and setting an interval between an original virtual node and a newly added virtual node of the same candidate storage node; the position of the original virtual node on the storage hash ring is determined according to the second hash value of the candidate storage node; and
the determining, from the candidate storage nodes, the target storage node corresponding to each piece of storage-location mapping information comprises:
for each piece of storage-location mapping information, searching, in a set direction, for the virtual node closest to the position corresponding to the storage-location mapping information on the storage hash ring; and
determining the candidate storage node corresponding to the found virtual node as the target storage node corresponding to the storage-location mapping information; the virtual nodes comprise the original virtual nodes and the newly added virtual nodes.
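One conventional way to realize claim 5's ring of original and newly added virtual nodes, assuming a clockwise "set direction" and a `#i` suffix scheme for spreading a node's virtual nodes (both assumptions, not details from the claim):

```python
import bisect
import hashlib

RING_SIZE = 2 ** 32

def _pos(key: str) -> int:
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % RING_SIZE

def build_ring(nodes, virtuals_per_node=2):
    # Virtual node i == 0 plays the role of the "original" virtual node; the
    # rest act as the "newly added" virtual nodes of the same candidate node.
    return sorted((_pos(f"{node}#{i}"), node)
                  for node in nodes for i in range(virtuals_per_node))

def lookup(ring, key: str):
    # Walk clockwise to the nearest virtual node at or after the key's
    # position, wrapping around, and return its owning candidate node.
    positions = [p for p, _ in ring]
    idx = bisect.bisect_right(positions, _pos(key)) % len(ring)
    return ring[idx][1]
```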
6. The method according to claim 1, wherein the determining storage-location mapping information corresponding to each sub-database based on the sharding algorithm matched with the sub-database, and determining, from the candidate storage nodes, the target storage node corresponding to each piece of storage-location mapping information comprises:
when the sharding algorithm is a table-based sharding algorithm, acquiring identification information corresponding to each sub-database, and determining the storage-location mapping information corresponding to each sub-database based on the position of the identification information in a lookup table;
determining the mapping position of each candidate storage node in the lookup table; and
performing consistency matching between the mapping positions and the storage-location mapping information to determine the target storage node corresponding to each piece of storage-location mapping information.
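Claim 6's table-based matching could be sketched as a slot lookup in a fixed table that maps positions to nodes; the modulo slot rule and the use of MD5 are illustrative assumptions:

```python
import hashlib

def table_based_targets(sub_db_ids, node_table):
    """node_table: a list acting as the lookup table, slot index -> node.
    Each sub-database's storage-location mapping is its slot in the table;
    matching the slot against node positions yields the target node."""
    targets = {}
    for db_id in sub_db_ids:
        slot = int(hashlib.md5(db_id.encode()).hexdigest(), 16) % len(node_table)
        targets[db_id] = node_table[slot]
    return targets
```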
7. The method according to claim 6, wherein the device type of each of the candidate storage nodes is graphics processor; and the method further comprises:
determining a plurality of alternative storage nodes whose device type is graphics processor; and
acquiring the remaining video memory of each alternative storage node, and determining the alternative storage nodes whose remaining video memory meets a sub-database storage condition as the candidate storage nodes.
8. The method according to any one of claims 1 to 7, wherein the target storage node is a primary storage node; and the method further comprises:
determining, according to respective usage parameters of each sub-database, copy parameters corresponding to the sub-database;
determining, based on the copy parameters, copy storage nodes corresponding to each sub-database; the plurality of copy storage nodes corresponding to the same sub-database are located in different regions; and
storing each sub-database to the copy storage nodes corresponding to the sub-database.
9. The method according to claim 8, wherein the usage parameters comprise a usage frequency and an object type of the user; the copy parameters comprise a copy storage priority and a number of copies; and
the determining, according to the respective usage parameters of each sub-database, the copy parameters corresponding to the sub-database comprises:
for each sub-database, determining the number of copies of the sub-database according to the usage frequency of the sub-database; and
determining the copy storage priority of the sub-database according to the usage frequency of the sub-database and the object type of the user.
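A minimal sketch of claim 9's mapping from usage parameters to copy parameters; every threshold and label below is an illustrative assumption, since the claim does not fix concrete values:

```python
def copy_parameters(usage_frequency: float, important_object: bool):
    """Derive (number_of_copies, copy_storage_priority) for one sub-database."""
    if usage_frequency > 100:       # frequently used sub-database
        copies = 3
    elif usage_frequency > 10:      # moderately used
        copies = 2
    else:                           # rarely used
        copies = 1
    # Priority also considers the object type of the user, per the claim.
    priority = "high" if (usage_frequency > 100 or important_object) else "normal"
    return copies, priority
```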
10. The method according to claim 8, wherein the database to be stored is a retrieval library; and the method further comprises:
mapping each original storage node and each copy storage node to a service hash ring;
acquiring a retrieval request carrying retrieval-library information;
determining a target original storage node and target copy storage nodes of a target retrieval library corresponding to the retrieval-library information; and
determining, from the target original storage node and the target copy storage nodes, a service node for responding to the retrieval request, based on the mapping position of the retrieval request on the service hash ring and the respective mapping positions of the target original storage node and the target copy storage nodes on the service hash ring.
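Claim 10's request routing can be sketched the same way as the storage ring: the request's position on the service hash ring selects one node among the target original node and its copy nodes (MD5 and the clockwise search are assumptions carried over from the earlier sketches):

```python
import bisect
import hashlib

RING_SIZE = 2 ** 32

def _pos(key: str) -> int:
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % RING_SIZE

def pick_service_node(request_id: str, original_node: str, copy_nodes):
    """Route a retrieval request to one node of the target retrieval library,
    chosen by clockwise proximity on the service hash ring."""
    ring = sorted((_pos(n), n) for n in [original_node, *copy_nodes])
    positions = [p for p, _ in ring]
    idx = bisect.bisect_right(positions, _pos(request_id)) % len(ring)
    return ring[idx][1]
```

Because both storage placement and request routing use hash rings, adding or removing a node only remaps the keys adjacent to it rather than reshuffling the whole cluster, which is the usual motivation for consistent hashing.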
11. A data processing apparatus, the apparatus comprising:
an acquisition module, configured to acquire a database to be stored and determine a server cluster for storing the database to be stored; the server cluster comprises a plurality of storage nodes of at least two device types;
a sharding-algorithm and candidate-storage-node determining module, configured to divide the database to be stored into a plurality of sub-databases and determine, according to parameter information of each sub-database, a sharding algorithm and candidate storage nodes matched with the sub-database; the device type of the candidate storage nodes is matched with the sub-database;
a target-storage-node determining module, configured to determine storage-location mapping information corresponding to each sub-database based on the sharding algorithm matched with the sub-database, and determine, from the candidate storage nodes, a target storage node corresponding to each piece of storage-location mapping information; and
a data storage module, configured to store each sub-database to the target storage node corresponding to the sub-database.
12. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 10 when executing the computer program.
13. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 10.
CN202211233181.3A 2022-10-10 2022-10-10 Data processing method, apparatus, computer device, and computer readable storage medium Pending CN117008818A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211233181.3A CN117008818A (en) 2022-10-10 2022-10-10 Data processing method, apparatus, computer device, and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211233181.3A CN117008818A (en) 2022-10-10 2022-10-10 Data processing method, apparatus, computer device, and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN117008818A true CN117008818A (en) 2023-11-07

Family

ID=88567840

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211233181.3A Pending CN117008818A (en) 2022-10-10 2022-10-10 Data processing method, apparatus, computer device, and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN117008818A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117289878A (en) * 2023-11-23 2023-12-26 苏州元脑智能科技有限公司 Method and device for creating volume, computer equipment and storage medium
CN117312326A (en) * 2023-11-28 2023-12-29 深圳市移卡科技有限公司 Data storage method based on Yun Yuansheng database and related equipment
CN117421129A (en) * 2023-12-14 2024-01-19 之江实验室 Service execution method and device based on heterogeneous storage cluster and electronic equipment
CN117555906A (en) * 2024-01-12 2024-02-13 腾讯科技(深圳)有限公司 Data processing method, device, electronic equipment and storage medium


Similar Documents

Publication Publication Date Title
CN117008818A (en) Data processing method, apparatus, computer device, and computer readable storage medium
US20120166403A1 (en) Distributed storage system having content-based deduplication function and object storing method
JP4538454B2 (en) Search for electronic document replicas in computer networks
US8495013B2 (en) Distributed storage system and method for storing objects based on locations
CN110213352B (en) Method for aggregating dispersed autonomous storage resources with uniform name space
US7689764B1 (en) Network routing of data based on content thereof
US20130031229A1 (en) Traffic reduction method for distributed key-value store
WO2017067117A1 (en) Data query method and device
US20100161657A1 (en) Metadata server and metadata management method
US20110153606A1 (en) Apparatus and method of managing metadata in asymmetric distributed file system
EP2564306A1 (en) System and methods for mapping and searching objects in multidimensional space
CN112565325B (en) Mirror image file management method, device and system, computer equipment and storage medium
CN109933312B (en) Method for effectively reducing I/O consumption of containerized relational database
US8924513B2 (en) Storage system
US20200065306A1 (en) Bloom filter partitioning
EP4295235A1 (en) Cache indexing using data addresses based on data fingerprints
CN107547657A (en) A kind of method, apparatus and storage medium numbered based on one point data in cloud storage system
CN114338718B (en) Distributed storage method, device and medium for massive remote sensing data
US7058773B1 (en) System and method for managing data in a distributed system
CN115114294A (en) Self-adaption method and device of database storage mode and computer equipment
Cao et al. Data allocation of large-scale key-value store system using kinetic drives
CN117478304B (en) Block chain management method, system and computer equipment
CN110597809A (en) Consistency algorithm system supporting tree-shaped data structure and implementation method thereof
CN110609810A (en) File storage method and system
US20230376461A1 (en) Supporting multiple fingerprint formats for data file segment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination