CN117150060B

CN117150060B - Data processing method and related device

Info

Publication number: CN117150060B
Application number: CN202311427862.8A
Authority: CN
Inventors: 刘建朋
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2023-10-31
Filing date: 2023-10-31
Publication date: 2024-02-09
Anticipated expiration: 2043-10-31
Also published as: CN117150060A

Abstract

The application provides a data processing method and a related device. Embodiments of the application can be applied to the field of cloud storage. The method comprises the following steps: acquiring a graph data set comprising M graph data; determining, according to the graph data set, M graph calculation indexes corresponding to the M graph data; acquiring M pieces of storage attribute information corresponding to the M graph data, and determining M distributed indexes corresponding to the M graph data according to the M pieces of storage attribute information; generating M pieces of storage period information according to the M graph calculation indexes and the M distributed indexes; and deleting all or part of the M graph data according to the M pieces of storage period information. In the data processing method provided by the embodiments of the application, the storage period of the graph data is measured by combining the node centrality characteristic and the storage attribute characteristic, a data deletion strategy suited to the graph data is determined according to the storage period information, and graph data whose storage period has expired is deleted, so that the occupation of storage space can be effectively reduced, the storage cost is saved, and the cache hit rate is improved.

Description

Data processing method and related device

Technical Field

The present disclosure relates to the field of cloud storage technologies, and in particular, to a data processing method and a related device.

Background

Graph data is unstructured data that is cached in the graph database as it is generated. However, since the cache size of the graph database is limited, if the amount of graph data in the graph database keeps growing, the graph database can no longer accommodate new data, the cache hit rate drops, and system performance suffers. To avoid this, data in the graph database needs to be eliminated to make room for new data.

Currently, cache elimination strategies for graph data generally include algorithms such as least recently used (Least Recently Used, LRU), largest item first (Largest Item First, LIF), and smallest item first (Smallest Item First, SIF). These algorithms are all based on spatial locality or capacity size considerations. For example, when the cache capacity is limited, eliminating frequently accessed elements may cause cache jitter, i.e., frequent replacement of data in the cache, which reduces the cache hit rate and thereby affects system performance.

Disclosure of Invention

The embodiment of the application provides a data processing method and a related device, wherein storage period information of graph data is generated through graph calculation indexes representing the centrality characteristics of nodes in a graph and distributed indexes representing the storage attribute characteristics of the graph data, the storage period of the graph data is measured through two dimensions, and then a data deletion strategy suitable for the graph data is determined according to the storage period information, so that the problem of reduced cache hit rate in the prior art is solved.

One aspect of the present application provides a data processing method, including:

obtaining a graph data set, wherein the graph data set comprises M graph data, the M graph data are used for representing M nodes of a graph to be generated and edge information corresponding to the M nodes, and M is an integer greater than 1;

determining, according to the graph data set, M graph calculation indexes corresponding to the M graph data, wherein the graph calculation indexes are used for representing the centrality characteristics of the nodes;

acquiring M pieces of storage attribute information corresponding to the M graph data, and determining M distributed indexes corresponding to the M graph data according to the M pieces of storage attribute information, wherein the distributed indexes are used for representing storage attribute characteristics of the graph data;

generating, according to the M graph calculation indexes and the M distributed indexes, M pieces of storage period information corresponding to the M graph data;

and deleting all or part of the M graph data according to the M pieces of storage period information.

Another aspect of the present application provides a data processing apparatus comprising: a graph data acquisition module, a graph calculation index generation module, a distributed index generation module, a storage period information generation module and a data deletion module; specifically:

the graph data acquisition module is used for acquiring a graph data set, wherein the graph data set comprises M graph data, the M graph data are used for representing M nodes and edge information corresponding to the M nodes, and M is an integer greater than 1;

the graph calculation index generation module is used for determining M graph calculation indexes corresponding to the M graph data according to the graph data set, wherein the graph calculation indexes are used for representing the centrality characteristics of the nodes;

the distributed index generation module is used for acquiring M pieces of storage attribute information corresponding to the M graph data, and determining M distributed indexes corresponding to the M graph data according to the M pieces of storage attribute information, wherein the distributed indexes are used for representing storage attribute characteristics of the graph data;

the storage period information generation module is used for generating M pieces of storage period information corresponding to the M graph data according to the M graph calculation indexes and the M distributed indexes;

and the data deletion module is used for deleting all or part of the M graph data according to the M pieces of storage period information.

In another implementation manner of the embodiment of the present application, the graph calculation index generating module is further configured to:

performing semantic decoding on the M graph data in the graph data set to generate M nodes and edge information corresponding to the M nodes;

generating a corresponding graph according to the M nodes and the edge information corresponding to the M nodes, wherein the graph comprises the M nodes and the edge relations of the M nodes;

and calculating the centrality values of the M nodes according to the M nodes and the edge relations of the M nodes in the graph to obtain M graph calculation indexes.

determining the number of first edge relations corresponding to the M nodes according to the M nodes and the edge relations of the M nodes in the graph, wherein the first edge relations corresponding to a node are the edge relations pointing to that node;

and calculating the centrality values of the M nodes according to the number of first edge relations corresponding to each of the M nodes, the centrality values of the M nodes being the M graph calculation indexes.
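The in-edge counting described above can be illustrated with a minimal Python sketch. The edge-list input format, the function name `in_degree_centrality`, and the normalization by (M - 1) are assumptions made for illustration; the text only states that the centrality value is calculated from the number of edge relations pointing to each node.

```python
from collections import Counter

def in_degree_centrality(nodes, edges):
    """Return {node: centrality} from the count of edges pointing to each node.

    nodes: iterable of node identifiers (the M nodes).
    edges: iterable of (source, target) pairs (the edge relations).
    Dividing by (M - 1) is an illustrative assumption, not taken from the source.
    """
    nodes = list(nodes)
    # Count the "first edge relations": edges whose target is the node.
    incoming = Counter(target for _, target in edges)
    scale = max(len(nodes) - 1, 1)
    return {node: incoming[node] / scale for node in nodes}

nodes = ["a", "b", "c"]
edges = [("a", "b"), ("c", "b"), ("b", "c")]
graph_indexes = in_degree_centrality(nodes, edges)
```

In this toy graph, node `b` receives two of the three edges and therefore gets the highest graph calculation index.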

In another implementation manner of the embodiment of the present application, the distributed index generating module is further configured to:

obtaining M pieces of storage attribute information corresponding to M pieces of graph data;

determining, according to the M pieces of storage attribute information, N storage units for storing the M graph data, wherein N is an integer greater than 1 and less than or equal to M;

acquiring N pieces of access delay information corresponding to the N storage units;

establishing, according to the storage unit in which each graph data is stored, a mapping relation list between the M graph data and the N pieces of access delay information;

and determining M distributed indexes corresponding to the M graph data according to the mapping relation list.

determining a data fragment for storing each graph data, wherein the data fragment carries address information of a storage unit;

and determining N storage units for storing the M graph data according to the address information of the storage units carried by the data fragments of each graph data in the M graph data.
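The mapping from graph data to storage units and access delays can be sketched as follows. This is a hypothetical illustration: the dictionary layouts, the example addresses, and the choice to use the raw access delay (in milliseconds) as the distributed index are all assumptions, since the text does not specify how the index is derived from the mapping relation list.

```python
def distributed_indexes(shard_of, unit_address_of_shard, delay_ms_of_unit):
    """Map each graph data key to the access delay of the storage unit holding it.

    shard_of: {graph_data_key: shard_id} (the data fragment of each graph data)
    unit_address_of_shard: {shard_id: storage_unit_address}
    delay_ms_of_unit: {storage_unit_address: access delay in milliseconds}
    Using the raw delay as the distributed index is an assumption.
    """
    mapping = {}
    for key, shard_id in shard_of.items():
        unit = unit_address_of_shard[shard_id]   # address carried by the fragment
        mapping[key] = delay_ms_of_unit[unit]    # entry of the mapping relation list
    return mapping

# M = 3 graph data stored on N = 2 storage units (made-up values).
shard_of = {"a": 0, "b": 1, "c": 0}
unit_address_of_shard = {0: "10.0.0.1:9000", 1: "10.0.0.2:9000"}
delay_ms_of_unit = {"10.0.0.1:9000": 0.2, "10.0.0.2:9000": 3.5}
dist_indexes = distributed_indexes(shard_of, unit_address_of_shard, delay_ms_of_unit)
```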

In another implementation manner of the embodiment of the present application, the data deleting module is further configured to:

sequencing M pieces of graph data according to M pieces of storage period information to generate a graph data sequence table;

and deleting all or part of M pieces of graph data according to the graph data sequence table.

obtaining a cache capacity value and a cache threshold value of a graph data set;

and if the cache capacity value of the graph data set is greater than or equal to the cache threshold value, deleting K graph data according to the sequence of M graph data in the graph data sequence table, wherein K is an integer greater than or equal to 1 and less than or equal to M.
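A minimal sketch of the threshold check and ordered deletion is given below. It assumes that the cache capacity value is the number of cached entries and that entries with the lowest storage period value are deleted first; neither assumption is stated in this passage, and the function name `evict` is hypothetical.

```python
def evict(cache, storage_period, cache_threshold, k):
    """Delete up to k entries from cache once its size reaches the threshold.

    cache: {key: value} holding the cached graph data.
    storage_period: {key: score}; lowest score is deleted first
    (the ordering direction is an assumption for illustration).
    Returns the list of deleted keys.
    """
    if len(cache) < cache_threshold:
        return []
    # The "graph data sequence table": keys sorted by storage period score.
    sequence_table = sorted(cache, key=lambda key: storage_period[key])
    victims = sequence_table[:k]
    for key in victims:
        del cache[key]
    return victims

cache = {"a": ["b"], "b": ["c"], "c": ["b"]}
scores = {"a": 0.1, "b": 0.9, "c": 0.4}
deleted = evict(cache, scores, cache_threshold=3, k=2)
```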

In another implementation manner of the embodiment of the present application, the graph data acquisition module is further configured to acquire target graph data, where the target graph data is used to represent a target node and edge information corresponding to the target node;

the graph calculation index generation module is further used for determining a target graph calculation index corresponding to the target graph data according to the target graph data and the graph data set, wherein the target graph calculation index is used for representing the centrality characteristic of the target node;

The distributed index generation module is further used for acquiring target storage attribute information corresponding to the target graph data and determining target distributed indexes corresponding to the target graph data according to the target storage attribute information, wherein the target distributed indexes are used for representing storage attribute characteristics of the target graph data;

the storage period information generation module is also used for generating storage period information corresponding to the target graph data according to the target graph calculation index and the target distributed index;

and the data deleting module is also used for updating the graph data sequence table according to the storage period information corresponding to the target graph data.

generating a target graph according to the target graph data and the graph data set, wherein the target graph comprises the M nodes, the target node, the edge relations of the M nodes and the edge relations of the target node;

determining the number of first target edge relations corresponding to the target node according to the edge relations of the M nodes and the edge relations of the target node, wherein the first target edge relations corresponding to the target node are the edge relations pointing to the target node;

and calculating the centrality value of the target node according to the number of first target edge relations corresponding to the target node, the centrality value of the target node being the target graph calculation index.

performing semantic decoding on the target graph data to generate the target node and the edge information corresponding to the target node;

and generating the target graph according to the target node, the edge information corresponding to the target node, the M nodes and the edge information corresponding to the M nodes.
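Updating the graph data sequence table for newly arrived target graph data can be sketched as a sorted insertion. The representation of the table as a list of `(score, key)` tuples in ascending order is an assumption, and the sketch only places the new entry; it does not recompute the scores of existing nodes whose incoming edges change when the target node is added.

```python
import bisect

def update_sequence_table(sequence_table, key, score):
    """Insert (score, key) into a list kept sorted by ascending score.

    sequence_table: list of (score, key) tuples, already sorted.
    The tuple layout is an illustrative assumption.
    """
    bisect.insort(sequence_table, (score, key))
    return sequence_table

table = [(0.1, "a"), (0.4, "c"), (0.9, "b")]
update_sequence_table(table, "t", 0.5)  # "t" is the hypothetical target graph data
```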

In another implementation manner of the embodiment of the present application, the storage period information generating module is further configured to:

and performing weighted calculation according to the graph calculation index and the distributed index corresponding to each graph data in the M graph data, and generating M storage period information corresponding to the M graph data.
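The weighted calculation can be sketched as a per-item weighted sum. The weights `w_graph` and `w_dist` and their default values are hypothetical; the text does not give concrete weights or state whether the two indexes are normalized before being combined.

```python
def storage_period_scores(graph_indexes, dist_indexes, w_graph=0.5, w_dist=0.5):
    """Combine the two indexes of each graph data item into one score.

    graph_indexes: {key: graph calculation index}
    dist_indexes: {key: distributed index}
    w_graph, w_dist: illustrative weights, not taken from the source.
    """
    return {
        key: w_graph * graph_indexes[key] + w_dist * dist_indexes[key]
        for key in graph_indexes
    }

period_info = storage_period_scores({"a": 0.0, "b": 1.0}, {"a": 0.5, "b": 2.0})
```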

Another aspect of the present application provides a computer device comprising:

memory, transceiver, processor, and bus system;

wherein the memory is used for storing programs;

the processor is used for executing the programs in the memory, including performing the methods of the above aspects;

the bus system is used to connect the memory and the processor so that the memory and the processor can communicate.

Another aspect of the present application provides a computer-readable storage medium having instructions stored therein which, when run on a computer, cause the computer to perform the methods of the above aspects.

Another aspect of the present application provides a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the methods provided in the above aspects.

From the above technical solutions, the embodiments of the present application have the following advantages:

The application provides a data processing method and a related device, wherein the method comprises the following steps: obtaining a graph data set, wherein the graph data set comprises M graph data, the M graph data are used for representing M nodes of a graph to be generated and edge information corresponding to the M nodes, and M is an integer greater than 1; determining, according to the graph data set, M graph calculation indexes corresponding to the M graph data, wherein the graph calculation indexes are used for representing the centrality characteristics of the nodes; acquiring M pieces of storage attribute information corresponding to the M graph data, and determining M distributed indexes corresponding to the M graph data according to the M pieces of storage attribute information, wherein the distributed indexes are used for representing storage attribute characteristics of the graph data; generating, according to the M graph calculation indexes and the M distributed indexes, M pieces of storage period information corresponding to the M graph data; and deleting all or part of the M graph data according to the M pieces of storage period information.
According to the data processing method provided by the embodiment of the application, the graph calculation index for representing the centrality characteristic of the node in the graph and the distributed index for representing the storage attribute characteristic of the graph data are used for generating the storage period information of the graph data, the storage period information of the graph data is measured by combining the two dimensions of the centrality characteristic and the storage attribute characteristic of the node, the data deleting strategy suitable for the graph data is further determined according to the storage period information, the graph data in the graph database is deleted through the data deleting strategy, and the occupation of storage space can be effectively reduced by deleting the graph data with the overdue storage period, so that the storage cost is saved, and the cache hit rate is improved; over time, some of the graph data may have lost its value or have been updated, deleting these outdated graph data may ensure timeliness and quality of the data, avoiding misleading decisions; deleting expired graph data can reduce the amount of data, thereby improving query performance.

Drawings

FIG. 1 is a schematic diagram of a data processing system according to one embodiment of the present application;

FIG. 2 is a flow chart of a data processing method according to an embodiment of the present application;

FIG. 3 is a flow chart of a data processing method according to another embodiment of the present application;

FIG. 4 is a flow chart of a data processing method according to another embodiment of the present application;

FIG. 5 is a flow chart of a data processing method according to another embodiment of the present application;

FIG. 6 is a flow chart of a data processing method according to another embodiment of the present application;

FIG. 7 is a flow chart of a data processing method according to another embodiment of the present application;

FIG. 8 is a flow chart of a data processing method according to another embodiment of the present application;

FIG. 9 is a flowchart of a data processing method according to another embodiment of the present application;

FIG. 10 is a flow chart of a data processing method according to another embodiment of the present application;

FIG. 11 is a block diagram of a graph database provided in an embodiment of the present application;

FIG. 12 is a schematic diagram of a graph data generation graph provided in an embodiment of the present application;

FIG. 13 is a schematic diagram illustrating a structure of a data processing apparatus according to an embodiment of the present application;

FIG. 14 is a schematic structural diagram of a server according to an embodiment of the present application.

Description of the embodiments

The embodiment of the application provides a data processing method, which is characterized in that through graph calculation indexes for representing the centrality characteristics of nodes in a graph and distributed indexes for representing the storage attribute characteristics of graph data, storage period information of the graph data is generated, the storage period of the graph data is measured by combining the two dimensions of the centrality characteristics and the storage attribute characteristics of the nodes, and then a data deleting strategy suitable for the graph data is determined according to the storage period information, the graph data in a graph database is deleted through the data deleting strategy, and the occupation of storage space can be effectively reduced by deleting the graph data with the outdated storage period, so that the storage cost is saved, and the cache hit rate is improved; over time, some of the graph data may have lost its value or have been updated, deleting these outdated graph data may ensure timeliness and quality of the data, avoiding misleading decisions; deleting expired graph data can reduce the amount of data, thereby improving query performance.

The terms "first," "second," "third," "fourth" and the like in the description and in the claims of this application and in the above-described figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be capable of operation in sequences other than those illustrated or described herein, for example. Furthermore, the terms "comprises," "comprising," and "includes" and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.

Artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of sensing, reasoning and decision-making.

Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.

Cloud technology (Cloud technology) refers to a hosting technology for integrating hardware, software, network and other series resources in a wide area network or a local area network to realize calculation, storage, processing and sharing of data.

Cloud technology (Cloud technology) is a general term for the network technology, information technology, integration technology, management platform technology, application technology and the like that are applied on the basis of the cloud computing business model; these can form a resource pool that is used on demand, flexibly and conveniently. Cloud computing technology will become an important support. Background services of technical network systems, such as video websites, picture websites and other portals, require large amounts of computing and storage resources. With the rapid development and application of the internet industry, each item may in the future have its own identification mark, which needs to be transmitted to a background system for logical processing; data of different levels will be processed separately, and all kinds of industry data require strong backend system support, which can only be realized through cloud computing.

Cloud storage (cloud storage) is a new concept that extends and develops in the concept of cloud computing, and a distributed cloud storage system (hereinafter referred to as a storage system for short) refers to a storage system that integrates a large number of storage devices (storage devices are also referred to as storage nodes) of various types in a network to work cooperatively through application software or application interfaces through functions such as cluster application, grid technology, and a distributed storage file system, so as to provide data storage and service access functions for the outside.

At present, the storage method of the storage system is as follows: when creating logical volumes, each logical volume is allocated a physical storage space, which may be a disk composition of a certain storage device or of several storage devices. The client stores data on a certain logical volume, that is, the data is stored on a file system, the file system divides the data into a plurality of parts, each part is an object, the object not only contains the data but also contains additional information such as a data Identification (ID) and the like, the file system writes each object into a physical storage space of the logical volume, and the file system records storage position information of each object, so that when the client requests to access the data, the file system can enable the client to access the data according to the storage position information of each object.

The process of allocating physical storage space for the logical volume by the storage system specifically includes: the physical storage space is divided into stripes in advance according to capacity estimates for the objects to be stored on the logical volume (these estimates tend to leave a large margin relative to the capacity of the objects actually stored) and the group of the redundant array of independent disks (RAID, Redundant Array of Independent Disks); a logical volume can be understood as a stripe, whereby physical storage space is allocated to the logical volume.

In order to facilitate understanding of the technical solutions provided in the embodiments of the present application, some key terms used in the embodiments of the present application are explained here:

Data sharding: data sharding (partitioning) is a strategy for storing data in a distributed manner in a database; it splits a large amount of data into smaller, more manageable portions called "shards". Each shard is a subset of the database data.

PageRank algorithm: the PageRank algorithm is an algorithm for assessing the importance of web pages, proposed by Larry Page, one of the co-founders of Google. The algorithm is based on a simple idea: the importance of a web page can be measured by the number and quality of the links pointing to it from other web pages.
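For readers unfamiliar with the idea, a compact power-iteration sketch of PageRank follows. It is a generic textbook formulation, not necessarily the variant used in the embodiments; the damping factor of 0.85, the fixed iteration count, and the uniform redistribution of rank from nodes without outgoing links are conventional choices assumed here.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    out_links: {node: [nodes it links to]}; every target must also be a key.
    damping and iterations are conventional values assumed for illustration.
    """
    nodes = list(out_links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread uniformly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n for node in nodes
        }
        for node in nodes:
            targets = out_links[node]
            for target in targets:
                # Each node shares its rank equally among the pages it links to.
                new_rank[target] += damping * rank[node] / len(targets)
        rank = new_rank
    return rank

ranks = pagerank({"a": ["b"], "b": ["c"], "c": ["b"]})
```

Node `b`, which is linked to by both other nodes, ends up with the highest rank, while `a`, which nothing links to, ends up with the lowest.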

Centrality algorithm: the centrality algorithm is a class of algorithms used to measure the importance of nodes in a network. They determine the centrality of the nodes by calculating their connection patterns and locations in the network. The centrality algorithm can help us understand network structure, identify key nodes, influence propagation, and the like.

Graph database: a graph database is a database that uses graph structures for semantic queries, using nodes, edges, and attributes to represent and store data.

The graph data is unstructured data which is cached in the graph database as it is generated. In order to avoid the situation that the data amount of the graph data in the graph database is continuously increased, and finally the graph database cannot accommodate new graph data, so that the cache hit rate is reduced, and the performance of the system is affected, the cached graph data in the graph database needs to be eliminated (in the embodiment of the application, "eliminating" refers to deleting the graph data, specifically deleting the cached graph data in the graph database to release the cache space of the graph database), so that space is reserved for the new data.

Currently, cache elimination strategies for graph data generally include: least recently used (Least Recently Used, LRU), largest item first (Largest Item First, LIF), smallest item first (Smallest Item First, SIF), etc.

Least recently used algorithm (Least Recently Used, LRU): the LRU algorithm eliminates the least recently accessed graph data; it maintains an access record of the graph data and deletes the graph data with the lower access frequency. The disadvantage of the LRU algorithm is that it does not make efficient use of spatial locality. If recently accessed data is eliminated from the cache while earlier accessed data is still in the cache, the efficiency of the cache may be reduced.

Largest item first algorithm (Largest Item First, LIF): the disadvantage of the LIF algorithm is that it may cause a "bump" effect, i.e., recently accessed data may be eliminated while earlier accessed data is still in the cache. This may lead to reduced cache efficiency and, in some cases, to data loss.

Smallest item first algorithm (Smallest Item First, SIF): the disadvantage of the SIF algorithm is that it may cause a "starvation" effect, i.e., data accessed earlier may remain in the cache, whereas data accessed recently may not be cached. This may lead to reduced cache efficiency and, in some cases, to data loss.

These algorithms are all based on spatial locality or capacity considerations; they completely ignore the graph-specific attributes of the graph data and do not analyze the time needed to acquire data in a distributed setting.
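For reference, the LRU baseline mentioned above can be sketched in a few lines of Python. This is a generic recency-based implementation built on `collections.OrderedDict`, shown only to make the contrast concrete: its eviction decision uses access order alone and consults neither the graph topology nor where the data is stored. The class name and interface are illustrative assumptions.

```python
from collections import OrderedDict

class LRUCache:
    """Fixed-capacity cache that evicts the least recently used entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()  # oldest access first, newest last

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # drop the least recently used entry

lru = LRUCache(2)
lru.put("a", 1)
lru.put("b", 2)
lru.get("a")     # "a" becomes the most recently used entry
lru.put("c", 3)  # capacity exceeded, so "b" is evicted
```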

The data processing method provided by the embodiment of the application combines the characteristics of the distributed scenario with the topological structure of the graph (for convenience of description, hereinafter the graph topology) and provides a cache elimination strategy for distributed scenarios with graph semantics. Specifically: first, each piece of graph data in the cache is evaluated with a graph algorithm over the graph topology, so that each piece of graph data obtains a corresponding calculated value, called the graph calculation index; then, a value is assigned to each piece of graph data according to the source of that graph data in the cache, called the distributed index; finally, the graph calculation index and the distributed index are combined by weighted summation to obtain a result; when the cache is full and data needs to be eliminated, the graph data in the cache is eliminated in the order of the weighted summation results.

According to the data processing method provided by the embodiment of the application, the storage period information of the graph data is generated through the graph calculation index representing the centrality characteristic of the node in the graph and the distributed index representing the storage attribute characteristic of the graph data, the storage period information of the graph data is measured by combining the two dimensions of the centrality characteristic and the storage attribute characteristic of the node, the data deleting strategy suitable for the graph data is further determined according to the storage period information, the graph data in the graph database is deleted through the data deleting strategy, and the occupation of storage space can be effectively reduced by deleting the graph data with the overdue storage period, so that the storage cost is saved, and the cache hit rate is improved; over time, some of the graph data may have lost its value or have been updated, deleting these outdated graph data may ensure timeliness and quality of the data, avoiding misleading decisions; deleting expired graph data can reduce the amount of data, thereby improving query performance.

For ease of understanding, refer to FIG. 1, which is an application environment diagram of the data processing method in an embodiment of the present application. As shown in FIG. 1, the data processing method in the embodiment of the present application is applied to a data processing system. The data processing system includes a server and a terminal device. The server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a content delivery network (Content Delivery Network, CDN), and big data and artificial intelligence platforms. The terminal device may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, etc. The terminal device and the server may be directly or indirectly connected through wired or wireless communication, which is not limited herein.

The embodiment of the application is applied to a graph data product, can be used for managing data in a graph database cache, and when the cache is full, space needs to be made for new data.

Specifically, the server first obtains a graph data set, which may be a graph data set in a graph data cache unit in a graph database. The graph data set comprises M graph data, wherein the M graph data are used for representing M nodes of a graph to be generated and edge information corresponding to the M nodes; secondly, the server determines M graph calculation indexes corresponding to the M graph data according to the graph data set, wherein the graph calculation indexes are used for representing the centrality characteristics of the nodes; then, the server acquires M pieces of storage attribute information corresponding to the M graph data, and determines M distributed indexes corresponding to the M graph data according to the M pieces of storage attribute information, wherein the distributed indexes are used for representing storage attribute characteristics of the graph data; then, the server generates M pieces of storage period information corresponding to the M graph data according to the M graph calculation indexes and the M distributed indexes; and finally, the server deletes all or part of the M graph data according to the M pieces of storage period information.

The data processing method in the present application will be described from the perspective of the server. Referring to fig. 2, the data processing method provided in the embodiment of the present application includes step S110 to step S150. Specifically:

S110, acquiring a graph data set.

The graph data set includes M pieces of graph data, which are used for representing M nodes of a graph to be generated and the edge information corresponding to the M nodes, and M is an integer greater than 1.

It will be appreciated that the graph data set may be the graph data set in a graph data cache unit of a graph database, and this step may obtain the graph data set from that cache unit. The graph data set includes M pieces of graph data, where graph data refers to a graph characterized by key-value pairs. Graph data is typically represented as a collection of unordered pairs, where each unordered pair contains one node and one or more edges pointing to other nodes. In key-value storage, each node in the graph may be used as a key, and the edges pointing to that node may be represented as the value associated with that key. In particular, a hash table may be used to store the key-value pairs: for each node, an entry is created in the hash table with the node as the key and the edges pointing to the node as the value. Thus, the edges associated with a particular node can be obtained by querying the hash table, which realizes the representation and operation of the graph.
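The key-value representation described above can be sketched as follows. This is only an illustrative sketch: the function name `build_graph_store`, the node names, and the choice of storing each edge as a `(source, target)` tuple are assumptions and are not specified by the application.

```python
from collections import defaultdict


def build_graph_store(edges):
    """Build a hash table keyed by node; the value lists the edges pointing to that node."""
    store = defaultdict(list)
    for src, dst in edges:
        store[dst].append((src, dst))  # edge src -> dst is filed under its target node
        store.setdefault(src, [])      # nodes without in-edges still get an entry
    return dict(store)


# Hypothetical edge list used only for illustration.
edges = [("A", "B"), ("C", "B"), ("B", "C")]
graph_store = build_graph_store(edges)
print(graph_store["B"])  # edges pointing to node B
```

Querying `graph_store` with a node key returns the edges associated with that node, which corresponds to the hash table lookup described in the text.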

S120, determining M graph calculation indexes corresponding to the M graph data according to the graph data set.

The graph calculation index is used for representing the centrality characteristic of the node.

It can be understood that, first, semantic decoding is performed on the M pieces of graph data in the graph data set to generate M nodes and the edge information corresponding to the M nodes; then, a corresponding graph is generated according to the M nodes and the edge information corresponding to the M nodes; next, the generated graph is computed with a graph algorithm to obtain a calculated value for each node. The calculated value of each node is the graph calculation index of that node, and the graph calculation indexes of all nodes form the M graph calculation indexes. The calculation of the graph calculation indexes of the M nodes can be expressed by the following formula:

VectorG = Algorithm(SubGraph)；

wherein VectorG is the set of M graph calculation indexes, Algorithm is the graph algorithm, and SubGraph is the graph generated from the M nodes and the edge information corresponding to the M nodes.

Preferably, in the calculation of the graph calculation index, the graph calculation index of each of the M nodes may be calculated by a centrality calculation algorithm (PageRank).

PageRank is a commonly used centrality algorithm among graph algorithms. It measures the importance of a node in the graph by calculating a PageRank value for each node. Specifically, the PageRank value is the ranking score of a node in the whole graph and represents the probability that the node is visited in the network. It is calculated iteratively: the PageRank score of each node is distributed to its neighbors, and the score of each node is updated based on the scores it receives. Thus, the PageRank value reflects the importance and influence of the node in the graph. Nodes with high PageRank values are typically important nodes in the graph; they may have higher in-degree and out-degree and may play a critical role in the graph.
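The iterative score distribution described above can be sketched with a simple power iteration. The damping factor of 0.85, the fixed iteration count, the handling of nodes without outgoing edges, and the example graph are assumptions chosen for illustration; the application does not specify them. The sketch also assumes that every linked node appears as a key of `out_links`.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict: node -> list of nodes it links to."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = out_links[node]
            if targets:
                # distribute this node's score evenly to the nodes it points to
                share = damping * rank[node] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # node without outgoing edges: spread its score over all nodes
                for t in nodes:
                    new_rank[t] += damping * rank[node] / n
        rank = new_rank
    return rank


# Hypothetical graph: A, C and D all point to B, and B points to C.
out_links = {"A": ["B"], "B": ["C"], "C": ["B"], "D": ["B"]}
vector_g = pagerank(out_links)
```

In this sketch `vector_g` plays the role of VectorG: one score per node, with the scores summing to 1 and node B, which receives the most incoming edges, scoring highest.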

S130, M pieces of storage attribute information corresponding to the M pieces of graph data are obtained, and M distributed indexes corresponding to the M pieces of graph data are determined according to the M pieces of storage attribute information.

Wherein the distributed index is used for characterizing the storage attribute characteristics of the graph data.

It can be understood that M pieces of storage attribute information corresponding to the M pieces of graph data are acquired; N storage units storing the M pieces of graph data are determined according to the M pieces of storage attribute information, where N is an integer greater than or equal to 1 and less than or equal to M because one storage unit may store one or more pieces of graph data; N pieces of access delay information corresponding to the N storage units are acquired; a mapping relation list between the M pieces of graph data and the N pieces of access delay information is established according to the storage unit in which each piece of graph data is stored; and M distributed indexes corresponding to the M pieces of graph data are determined according to the mapping relation list. The storage attribute characteristic may be information of the storage unit storing the graph data, and the distributed index may be the access delay of that storage unit. According to the storage attribute information of each piece of graph data in the cache, a value is added to each piece of graph data; this value is the distributed index. The distributed index of each node can be calculated by the following formula:

VectorD = Latency(storage unit information)；

where Latency differs according to the region of the storage unit, and VectorD is the delay of accessing the graph data in the storage unit, i.e., the distributed index.

S140, generating M pieces of storage period information corresponding to the M pieces of graph data according to the M graph calculation indexes and the M distributed indexes.

It can be understood that, for each of the M pieces of graph data, the corresponding graph calculation index and distributed index are combined by weighted summation to obtain the M pieces of storage period information corresponding to the M pieces of graph data. The storage period information is the time period or duration for which the graph data may be stored in the graph data cache unit; when the time the graph data has been stored in the graph data cache unit reaches the storage period information, the graph data is deleted from the graph data cache unit.

The storage period of each node can be calculated by the following formula:

VectorTtl = αVectorG + βVectorD；

wherein α and β are adjustable parameters representing the weights of the two indexes; VectorG is the set of M graph calculation indexes; VectorD is the set of M distributed indexes; and VectorTtl is the set of M pieces of storage period information.

For ease of understanding, the storage period information is described by way of example. For node A in the graph, the graph calculation index calculated according to step S120 is 1/6, and the distributed index (the access delay of the storage unit storing node A) calculated according to step S130 is 0.03 s. According to the storage period formula, the storage period information of node A is (1/6 × α + 0.03 × β) × T, where T is a set period and the value of (1/6 × α + 0.03 × β) is a number smaller than 1. For example, when T is set to 10 days and the value of (1/6 × α + 0.03 × β) is 1/2, the storage period information of node A is 5 days.
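The arithmetic of this example can be checked with a short sketch. The weights α = 6/5 and β = 10 are assumed values, chosen only so that the weighted sum equals the 1/2 used in the example; the application leaves α and β as adjustable parameters. The function name `storage_period` is likewise hypothetical.

```python
from fractions import Fraction


def storage_period(graph_index, distributed_index, alpha, beta, period_days):
    """Return (alpha * graph_index + beta * distributed_index) * T, in days."""
    weight = alpha * graph_index + beta * distributed_index
    return weight * period_days


# Node A: graph calculation index 1/6, access delay 0.03 s, T = 10 days.
# alpha = 6/5 and beta = 10 are assumed weights (not given in the text).
ttl_a = storage_period(Fraction(1, 6), Fraction(3, 100), Fraction(6, 5), 10, 10)
print(ttl_a)  # 5
```

Exact fractions are used so that the weighted sum 1/5 + 3/10 = 1/2 is reproduced without floating-point rounding.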

S150, deleting all or part of the M pieces of graph data according to the M pieces of storage period information.

It will be appreciated that the M pieces of graph data are sorted according to the M pieces of storage period information to obtain a graph data sequence table. Preferably, whether all or part of the M pieces of graph data are deleted is determined based on the size of the graph data to be cached and the remaining storage space of the cache unit. For example, if the total capacity of the graph data cache unit is 1 MB, the currently occupied capacity is 900 KB, and the size of the graph data to be cached is 950 KB, all the graph data in the graph data cache unit needs to be deleted to release enough cache space for the 950 KB of graph data to be cached. For another example, if the total capacity of the graph data cache unit is 1 MB, the currently occupied capacity is 900 KB, and the size of the graph data to be cached is 300 KB, only part of the graph data needs to be deleted (for example, K pieces of graph data whose sizes sum to 300 KB) to meet the storage requirement of the graph data to be cached.
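The decision between full and partial deletion can be sketched as follows. The function name `evict_for_incoming`, the interpretation of 1 MB as 1024 KB, and the sizes of the three cached entries are assumptions for illustration; the sketch simply deletes entries in sequence-table order until the remaining space fits the incoming graph data.

```python
def evict_for_incoming(sequence_table, capacity_kb, incoming_kb):
    """Delete entries in table order until the incoming data fits; return deleted keys.

    sequence_table: list of (key, size_kb), already sorted by ascending storage period.
    """
    used = sum(size for _, size in sequence_table)
    deleted = []
    for key, size in sequence_table:
        if capacity_kb - used >= incoming_kb:
            break  # enough free space has been released
        used -= size
        deleted.append(key)
    return deleted


# Hypothetical cache: 1024 KB in total, 900 KB occupied by three entries.
table = [("a", 200), ("b", 300), ("c", 400)]
print(evict_for_incoming(table, 1024, 300))  # ['a']
print(evict_for_incoming(table, 1024, 950))  # ['a', 'b', 'c']
```

With 300 KB of incoming data only part of the cached graph data is deleted, while with 950 KB all of it is deleted, matching the two cases in the example.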

In another preferred embodiment, when the occupied capacity of the graph data cache unit reaches a threshold, all or part of the M pieces of graph data are deleted according to the order of the graph data in the graph data sequence table. For example, if the capacity of the graph data cache unit is 900 KB and the threshold is 100 KB, all or part of the M pieces of graph data can be deleted.

In an alternative embodiment of the data processing method provided in the embodiment corresponding to fig. 2 of the present application, referring to fig. 3, step S120 further includes sub-steps S121 to S123. Specifically:

S121, performing semantic decoding on the M pieces of graph data in the graph data set to generate M nodes and the edge information corresponding to the M nodes.

It may be understood that, to facilitate storage of the graph, the nodes in the graph and the edge information corresponding to the nodes are semantically encoded to generate graph data in key-value form, and the key-value graph data is stored in the graph data cache unit in shards, where each piece of graph data corresponds to one node and the edge information of that node. In step S121, after the graph data is obtained, it needs to be semantically decoded; specifically, the M pieces of graph data are semantically decoded to obtain the M nodes and the edge information corresponding to the M nodes.

S122, generating a corresponding graph according to the edge information corresponding to the M nodes.

The graph comprises M nodes and edge relations of the M nodes.

It can be understood that the edge relationship among the M nodes is established according to the edge information corresponding to the M nodes, and a graph including the M nodes is generated according to the M nodes and the edge relationship among the M nodes.

S123, calculating the centrality values of the M nodes according to the M nodes and the edge relations of the M nodes in the graph to obtain M graph calculation indexes.

It will be appreciated that a centrality value is an indicator of the importance of a node in a graph and is typically related to the degree, closeness, eigenvector centrality, etc. of the node. Preferably, the centrality value of a node may be calculated by degree centrality (Degree Centrality) or closeness centrality (Closeness Centrality).

Degree centrality (Degree Centrality): degree centrality is a centrality measure that reflects the number of connections between one node and other nodes. It is calculated by dividing the degree of a node (the number of edges connected to the node) by the sum of the degrees of all nodes in the graph.

Closeness centrality (Closeness Centrality): closeness centrality measures the average distance between one node and all other nodes in the graph. It is calculated by dividing the sum of the shortest path lengths from all nodes to the node by the diameter of the graph (the longest shortest-path length between any two nodes in the graph).
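The degree centrality calculation described above can be sketched as follows. The sketch follows the normalization stated in the text (node degree divided by the sum of all node degrees); other common definitions normalize by the number of other nodes instead. The function name and the undirected edge list are assumptions for illustration.

```python
def degree_centrality(edges):
    """Degree of each node divided by the sum of the degrees of all nodes."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    total = sum(degree.values())
    return {node: d / total for node, d in degree.items()}


# Hypothetical undirected graph with four nodes.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
centrality = degree_centrality(edges)
print(centrality["C"])  # 0.375
```

Node C touches three of the four edges, so its degree of 3 divided by the total degree of 8 gives the largest centrality value in this graph.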

According to the data processing method provided by the embodiment of the application, the graph data in the graph data set are subjected to semantic decoding to generate the corresponding graph, and the graph calculation index corresponding to each graph data is calculated through calculating the centrality value of each node in the graph, so that the node centrality characteristic is calculated, and a foundation is laid for providing a reliable graph data deletion strategy subsequently.

In an alternative embodiment of the data processing method provided in the embodiment corresponding to fig. 3 of the present application, referring to fig. 4, sub-step S123 further includes sub-steps S1231 to S1232. Specifically:

S1231, determining the number of first edge relations corresponding to each of the M nodes according to the M nodes and the edge relations of the M nodes in the graph.

The first edge relation corresponding to the node is an edge relation pointing to the node.

It will be appreciated that the edge relations of a node in the graph include in-edge relations, which are edges pointing to the node, and out-edge relations, which are edges pointing from the node to other nodes. The first edge relation in the embodiment of the present application is an in-edge relation. In this step, the number of first edge relations of each of the M nodes is determined.

S1232, calculating the centrality values of the M nodes according to the number of first edge relations corresponding to each of the M nodes, to obtain the M graph calculation indexes.

It can be understood that the centrality value of a node can be calculated according to the number of in-edge relations of the node and the total number of edge relations of the node, and the calculated centrality value is used as the graph calculation index of the node. Specifically, the graph calculation index can be calculated as the number of in-edge relations divided by the total number of edge relations. For ease of understanding, the calculation is illustrated as follows: if node A has 5 edge relations in total, of which 1 is an in-edge relation, the graph calculation index of node A is 1/5; if node A has 4 edge relations in total, of which 3 are in-edge relations, the graph calculation index of node A is 3/4. It will be appreciated that the greater the graph calculation index, the greater the importance of the node in the graph.
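The ratio of in-edge relations to total edge relations can be sketched as follows. The function name `in_edge_index` and the directed edge lists are assumptions for illustration; edges are written as `(source, target)` pairs.

```python
from fractions import Fraction


def in_edge_index(node, edges):
    """Number of in-edges of `node` divided by the total number of its edge relations."""
    in_edges = sum(1 for _, dst in edges if dst == node)
    out_edges = sum(1 for src, _ in edges if src == node)
    total = in_edges + out_edges
    return Fraction(in_edges, total) if total else Fraction(0)


# Node A with 5 edge relations, 1 of which points to A.
edges_first = [("B", "A"), ("A", "C"), ("A", "D"), ("A", "E"), ("A", "F")]
print(in_edge_index("A", edges_first))  # 1/5

# Node A with 4 edge relations, 3 of which point to A.
edges_second = [("B", "A"), ("C", "A"), ("D", "A"), ("A", "E")]
print(in_edge_index("A", edges_second))  # 3/4
```

The two calls reproduce the values 1/5 and 3/4 from the example above.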

According to the data processing method provided by the embodiment of the application, the graph calculation index of the graph data corresponding to each node is obtained from the in-edge relations of the node in the graph, so that the accuracy of calculating the graph calculation index is improved, and a foundation is laid for subsequently providing a reliable graph data deletion strategy.

In an alternative embodiment of the data processing method provided in the embodiment corresponding to fig. 2 of the present application, referring to fig. 5, step S130 further includes sub-steps S131 to S135. Specifically:

S131, acquiring M pieces of storage attribute information corresponding to the M pieces of graph data.

It is to be understood that the storage attribute information refers to information related to the storage of the graph data, such as the storage location, storage manner, and storage capacity. Such information may help the system determine how to store and access the graph data. Preferably, the storage unit information of each of the M pieces of graph data is acquired, and the storage attribute information is used for characterizing the machine region information of the storage unit, such as IP information. Since the graph data is stored in data shards, a plurality of pieces of graph data may be stored in the same storage unit; that is, some of the M pieces of storage attribute information corresponding to the M pieces of graph data may be identical.

S132, determining N storage units for storing M pieces of graph data according to M pieces of storage attribute information.

Wherein N is an integer greater than or equal to 1 and less than or equal to M.

It is to be understood that the storage unit storing each of the M pieces of graph data is determined based on the storage unit information of that graph data. According to the M pieces of storage attribute information acquired in step S131, N storage units storing the M pieces of graph data are determined. Because one storage unit may store one or more pieces of graph data, N, the number of storage units storing the graph data, is an integer greater than or equal to 1 and less than or equal to M. The storage unit may be a physical storage device or a virtual storage device.

S133, N access delay information corresponding to the N storage units is obtained.

It is understood that access delay information is obtained for each of the N storage units. The access delay information refers to the time required to read data from or write data to the storage unit. Acquiring the access delay information may help the system determine the performance of the storage unit, thereby better managing and optimizing storage resources.

S134, establishing a mapping relation list between the M pieces of graph data and the N pieces of access delay information according to the storage unit in which each piece of graph data is stored.

It can be understood that a mapping relation list between the M pieces of graph data and the N pieces of access delay information is established according to the storage unit in which each piece of graph data is stored. The mapping relation list records the relation between each piece of graph data and its corresponding storage unit and access delay information. By establishing the mapping relation list, the storage location and access delay information of each piece of graph data can be determined quickly, which improves the efficiency of accessing the graph data. For ease of understanding, the above procedure is illustrated as follows: if the storage attribute information corresponding to graph data a is 123.456.789, it is determined according to the storage attribute information that graph data a is stored in storage unit A, and the access delay information of storage unit A is obtained as Latency_A; then the mapping relation between graph data a and the access delay information Latency_A is established and recorded in the mapping relation list.

S135, determining M distributed indexes corresponding to the M pieces of graph data according to the mapping relation list.

It can be understood that the access delay information corresponding to each of the M pieces of graph data is determined as a distributed index thereof according to the mapping relation list.
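Sub-steps S131 to S135 can be sketched as a pair of dictionary lookups. All identifiers, region names, and delay values below are hypothetical; the sketch only assumes that each piece of storage attribute information identifies one storage unit and that each storage unit has one access delay.

```python
def build_mapping_list(storage_attrs, unit_of_attr, latency_of_unit):
    """Map each graph data key to (storage unit, access delay of that unit)."""
    mapping = {}
    for key, attr in storage_attrs.items():
        unit = unit_of_attr[attr]                    # S132: attribute -> storage unit
        mapping[key] = (unit, latency_of_unit[unit])  # S133/S134: unit -> access delay
    return mapping


# Hypothetical inputs: three pieces of graph data stored on two storage units.
storage_attrs = {"a": "region-1", "b": "region-1", "c": "region-2"}
unit_of_attr = {"region-1": "unit_A", "region-2": "unit_B"}
latency_of_unit = {"unit_A": 0.03, "unit_B": 0.06}  # seconds

mapping_list = build_mapping_list(storage_attrs, unit_of_attr, latency_of_unit)
# S135: the access delay recorded for each graph data is its distributed index.
vector_d = {key: latency for key, (_, latency) in mapping_list.items()}
print(vector_d)
```

Here M is 3 and N is 2, and graph data a and b share the same distributed index because they are stored on the same storage unit.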

According to the data processing method, the data can be processed more accurately by acquiring the storage attribute information, determining the storage unit, acquiring the access delay information, establishing the mapping relation list and determining the distributed index, so that the efficiency and the accuracy of data processing are improved; by establishing a mapping relation list of the graph data and the corresponding access delay information, the storage position and the access delay information of each graph data can be rapidly determined, so that the data access efficiency is improved.

In an alternative embodiment of the data processing method provided in the embodiment corresponding to fig. 5 of the present application, referring to fig. 6, step S132 further includes sub-steps S1321 to S1322. Specifically:

S1321, determining the data shard storing each of the M pieces of graph data.

The data shard carries address information of the storage unit.

It will be appreciated that the data shard storing each of the M pieces of graph data is determined. Data sharding refers to dividing a large data set into smaller data blocks for storage and processing. Each data shard typically carries storage unit address information, which makes it possible to determine the storage unit storing the data shard.

S1322, determining N storage units storing the M pieces of graph data according to the storage unit address information carried by the data shard of each piece of graph data.

It can be understood that N storage units storing the M pieces of graph data are determined according to the data shard of each of the M pieces of graph data. If the M pieces of graph data are all stored in different storage units, M is equal to N; if more than one piece of graph data is stored in the same storage unit, N is less than M. Specifically, the data shard of each piece of graph data is stored in the corresponding storage unit to facilitate subsequent data processing and access.
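The relation between M and N can be illustrated by collecting the distinct storage unit addresses carried by the data shards. The shard identifiers, addresses, and function name below are hypothetical.

```python
def storage_units_from_shards(shard_of_data, address_of_shard):
    """Return the distinct storage unit addresses that hold the given graph data."""
    return {address_of_shard[shard] for shard in shard_of_data.values()}


# Hypothetical layout: graph data a and b share one shard, c is in another.
shard_of_data = {"a": "shard-0", "b": "shard-0", "c": "shard-1"}
address_of_shard = {"shard-0": "unit_A", "shard-1": "unit_B"}

units = storage_units_from_shards(shard_of_data, address_of_shard)
n, m = len(units), len(shard_of_data)
print(n, m)  # 2 3
```

Because two pieces of graph data share a storage unit in this sketch, N (2) is less than M (3).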

According to the data processing method provided by the embodiment of the application, storing the graph data in different storage units can reduce data access conflicts and improve data access efficiency. In addition, by setting corresponding storage unit address information for each data shard, the required data shard can be located and accessed more quickly. Dividing the graph data into several data shards and storing them in different storage units can also reduce the risk of data loss or corruption: if one storage unit fails, only the data shards stored in it are affected, and the graph data as a whole is not affected.

In an alternative embodiment of the data processing method provided in the embodiment corresponding to fig. 2 of the present application, referring to fig. 7, step S150 further includes sub-steps S151 to S152. Specifically:

S151, sorting the M pieces of graph data according to the M pieces of storage period information to generate a graph data sequence table.

It will be appreciated that the M pieces of graph data are sorted according to the M pieces of storage period information to obtain a graph data sequence table. The storage period information may include information such as the storage time and storage location of the graph data. The system sorts the graph data according to this information to determine which graph data needs to be deleted.

S152, deleting all or part of the M pieces of graph data according to the graph data sequence table.

It will be appreciated that the M pieces of graph data are deleted according to the graph data sequence table. The graph data sequence table lists the order in which the graph data is to be deleted, and the system deletes the graph data one by one in this order. Preferably, the graph data is arranged in ascending order of the corresponding storage period information to obtain the graph data sequence table, and when the graph data deletion condition is met, the graph data in the graph data sequence table is deleted in sequence. Specifically, when the occupied storage capacity of the graph data cache unit is greater than or equal to the threshold, the graph data is deleted in the order of the graph data sequence table until the occupied storage capacity of the graph data cache unit after deletion is less than the threshold.
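The ascending sort and threshold-driven deletion can be sketched as follows. The function name `evict_until_below`, the entry sizes, and the 800 KB threshold are assumptions for illustration; storage periods are given in units of the set period T.

```python
def evict_until_below(entries, threshold_kb):
    """Delete entries in ascending storage-period order until usage < threshold.

    entries: dict key -> (storage_period, size_kb).
    Returns (deleted_keys, remaining_usage_kb).
    """
    # S151: graph data sequence table, shortest storage period first
    sequence_table = sorted(entries, key=lambda k: entries[k][0])
    used = sum(size for _, size in entries.values())
    deleted = []
    # S152: delete in table order while the occupied capacity is at or above the threshold
    for key in sequence_table:
        if used < threshold_kb:
            break
        used -= entries[key][1]
        deleted.append(key)
    return deleted, used


# Hypothetical cache contents: 900 KB occupied in total.
entries = {"c": (0.75, 400), "a": (0.2, 200), "b": (0.6, 300)}
print(evict_until_below(entries, 800))  # (['a'], 700)
```

Graph data a has the shortest storage period, so it is deleted first; after that the occupied capacity of 700 KB is below the assumed threshold and deletion stops.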

According to the data processing method provided by the embodiment of the application, the graph data is sorted according to the storage period information to generate the graph data sequence table, and the graph data that is no longer needed is determined according to its order in the graph data sequence table and deleted from the storage space. Deleting graph data that is no longer needed releases storage space and provides more space for storing new graph data; deleting the M pieces of graph data by means of the graph data sequence table cleans up the storage space effectively and improves storage efficiency.

In an alternative embodiment of the data processing method provided in the embodiment corresponding to fig. 2 of the present application, referring to fig. 8, the data processing method further includes steps S210 to S250. Specifically:

S210, obtaining target graph data.

The target graph data is used for representing a target node and the edge information corresponding to the target node.

It can be understood that when new graph data is generated, the new graph data is the target graph data, i.e., new graph data to be stored in the graph data cache unit. The target graph data refers to a target graph characterized by key-value pairs. The target graph is typically represented as an unordered pair that includes a target node and one or more edges pointing to other nodes. In key-value storage, the target node is used as a key, and the edges pointing to the target node are represented as the value associated with the key. In particular, a hash table may be used to store the key-value pairs: for the target node, an entry is created in the hash table with the target node as the key and the edges pointing to the target node as the value. Thus, the edges associated with the target node can be obtained by querying the hash table, which realizes the representation and operation of the target graph.

S220, determining target graph calculation indexes corresponding to the target graph data according to the target graph data and the graph data set.

The target graph calculation index is used for representing the centrality characteristic of the target node.

It can be understood that, first, semantic decoding is performed on the M pieces of graph data in the graph data set to generate M nodes and the edge information corresponding to the M nodes, and semantic decoding is performed on the target graph data to generate the target node and the edge information corresponding to the target node; then, a target graph is generated according to the target node, the edge information corresponding to the target node, the M nodes, and the edge information corresponding to the M nodes; next, the target node in the generated target graph is computed with a graph algorithm to obtain the calculated value of the target node, namely the target graph calculation index. Preferably, the graph calculation index of the target node may be calculated by a centrality calculation algorithm (PageRank).

S230, acquiring target storage attribute information corresponding to the target graph data, and determining a target distributed index corresponding to the target graph data according to the target storage attribute information.

The target distributed index is used for representing storage attribute characteristics of target graph data.

It can be understood that the target storage attribute information corresponding to the target graph data is acquired; a target storage unit storing the target graph data is determined according to the target storage attribute information; access delay information corresponding to the target storage unit is acquired; a mapping relation list between the target graph data and the access delay information is established according to the storage unit storing the target graph data; and the target distributed index corresponding to the target graph data is determined according to the mapping relation list. The storage attribute characteristic may be information of the storage unit storing the target graph data, and the distributed index may be the access delay of that storage unit. According to the storage attribute information of the target graph data in the cache, a value is added to the target graph data; this value is the target distributed index.

S240, generating storage period information corresponding to the target graph data according to the target graph calculation index and the target distributed index.

It can be understood that the target graph calculation index and the target distributed index corresponding to the target graph data are combined by weighted summation to obtain the target storage period information corresponding to the target graph data. The target storage period information is the time period or duration for which the target graph data may be stored in the graph data cache unit; when the time the target graph data has been stored in the graph data cache unit reaches the target storage period information, the target graph data is deleted from the graph data cache unit. For ease of understanding, the target storage period information is described by way of example. For the target node, the target graph calculation index calculated according to step S220 is 1/8, and the target distributed index (the access delay of the storage unit storing the target node) calculated according to step S230 is 0.06 s. With α as the weight of the target graph calculation index and β as the weight of the target distributed index, the storage period information of the target node is (1/8 × α + 0.06 × β) × T, where T is a set period and the value of (1/8 × α + 0.06 × β) is a number smaller than 1. For example, when T is set to 10 days and the value of (1/8 × α + 0.06 × β) is 1/4, the storage period information of the target node is 2.5 days.

S250, updating the graph data sequence table according to the storage period information corresponding to the target graph data.

It can be understood that the target graph data is inserted into the graph data sequence table at the position determined by the value of its storage period information, so as to update the graph data sequence table. For example, in the graph data sequence table shown in table 1, the storage period information of graph data a is 0.2T (T is a preset storage period), that of graph data b is 0.6T, and that of graph data c is 0.75T; the graph data is sorted in ascending order of storage period information to obtain the graph data sequence table. Updating the graph data sequence table according to the storage period information corresponding to the target graph data means inserting the target graph data between graph data a and graph data b; the updated graph data sequence table is shown in table 2.

TABLE 1

TABLE 2
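The update of the graph data sequence table can be sketched with an ordered insertion. The storage period of 0.25T for the target graph data is an assumed value; the text only states that the target graph data is inserted between graph data a and graph data b.

```python
import bisect

# Sequence table from the example, sorted by ascending storage period (in units of T).
sequence_table = [(0.2, "a"), (0.6, "b"), (0.75, "c")]

# 0.25 is an assumed storage period for the target graph data (not given in the text).
bisect.insort(sequence_table, (0.25, "target"))
print([name for _, name in sequence_table])  # ['a', 'target', 'b', 'c']
```

Using an ordered insertion keeps the table sorted without re-sorting all M entries each time new target graph data enters the cache.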

According to the data processing method provided by the embodiment of the application, when new target graph data enters the cache, the calculation is triggered, and the target graph data and the historical graph data are arranged into the graph data sequence table according to their storage periods, so that the graph data in the graph database is deleted by the data deletion strategy. Deleting graph data whose storage period has expired can effectively reduce the occupation of storage space, save storage cost, and improve the cache hit rate.

In an alternative embodiment of the data processing method provided in the embodiment corresponding to fig. 8 of the present application, referring to fig. 9, step S220 further includes sub-steps S221 to S223. Specifically:

S221, generating a target graph according to the target graph data and the graph data set.

The target graph includes the M nodes, the target node, the edge relations of the M nodes, and the edge relations of the target node.

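It can be understood that semantic decoding is performed on the M pieces of graph data in the graph data set to generate M nodes and the edge information corresponding to the M nodes; semantic decoding is performed on the target graph data to generate the target node and the edge information corresponding to the target node; and the target graph is then generated according to the target node, the edge information corresponding to the target node, the M nodes, and the edge information corresponding to the M nodes.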

S222, determining the number of first target edge relations corresponding to the target node according to the edge relations of the M nodes and the edge relations of the target node.

The first target edge relationship corresponding to the target node is an edge relationship pointing to the target node.

It will be appreciated that the edge relations of a node in the graph include in-edge relations, which are edges pointing to the node, and out-edge relations, which are edges pointing from the node to other nodes. The first target edge relation in the embodiment of the present application is an in-edge relation. In this step, the number of edge relations of each of the M nodes and the number of first target edge relations corresponding to the target node are determined.

S223, calculating the centrality value of the target node according to the number of first target edge relations corresponding to the target node, and taking the calculated centrality value of the target node as the target graph calculation index.

It can be understood that the centrality value of the target node can be calculated from the number of in-edge relations of the target node and the total number of edge relations of the target node. Specifically, centrality value = (number of in-edge relations of the target node) / (total number of edge relations of the target node), and the calculated centrality value of the target node is used as the target graph calculation index.

According to the data processing method provided by the embodiment of the application, the target graph calculation index of the target graph data corresponding to the target node is obtained from the in-edge relations of the target node, which improves the accuracy of the target graph calculation index and lays a foundation for subsequently providing a reliable graph data deletion strategy.
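The ratio described in sub-steps S222 and S223 can be sketched as follows. This is a minimal illustration assuming edges are given as (source, destination) pairs; the function name `in_edge_centrality` and the sample edges are hypothetical and are not taken from the figures of the application.

```python
def in_edge_centrality(edges, node):
    """Return (number of in-edges of node) / (total number of edges of node).

    `edges` is an iterable of (source, destination) pairs. A node that has
    no edges at all is given a centrality of 0.0 (an assumption; the
    application does not state how this case is handled).
    """
    in_edges = sum(1 for src, dst in edges if dst == node)
    total_edges = sum(1 for src, dst in edges if src == node or dst == node)
    return in_edges / total_edges if total_edges else 0.0

# Hypothetical target graph: V4 has two in-edges and one out-edge.
edges = [("V1", "V4"), ("V2", "V4"), ("V4", "V5"), ("V6", "V1")]

centrality_v4 = in_edge_centrality(edges, "V4")  # 2 in-edges out of 3 edges
centrality_v6 = in_edge_centrality(edges, "V6")  # no in-edges
```

In this sketch a node that is only a source of edges, such as V6, receives the lowest possible value, which matches the idea that unreferenced data is a better candidate for deletion.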

In an alternative embodiment of the data processing method provided in the embodiment corresponding to fig. 9 of the present application, referring to fig. 10, step S221 further includes sub-steps S2211 to S2213. Specifically:

S2211, performing semantic decoding on the target graph data to generate the target node and the edge information corresponding to the target node.

S2212, performing semantic decoding on the M pieces of graph data in the graph data set to generate the M nodes and the edge information corresponding to the M nodes.

S2213, generating the target graph according to the target node, the edge information corresponding to the target node, the M nodes, and the edge information corresponding to the M nodes.

It can be understood that semantic decoding is first performed on the M pieces of graph data in the graph data set to generate the M nodes and the edge information corresponding to the M nodes, and on the target graph data to generate the target node and the edge information corresponding to the target node. The target graph is then generated according to the target node, the edge information corresponding to the target node, the M nodes, and the edge information corresponding to the M nodes.
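Sub-steps S2211 to S2213 can be sketched as follows. The application does not define the encoding of a piece of graph data, so the record layout used here (a dictionary with a "node" key and an "out" list of destination nodes) and the names `decode` and `build_target_graph` are purely hypothetical stand-ins for the semantic decoding step.

```python
def decode(record):
    """Hypothetical semantic decoding: return (node, list of (src, dst) edges)."""
    node = record["node"]
    return node, [(node, dst) for dst in record.get("out", [])]

def build_target_graph(target_record, graph_data_set):
    """Merge the decoded target node and the decoded M nodes into one graph."""
    nodes, edges = set(), set()
    for record in [target_record, *graph_data_set]:
        node, node_edges = decode(record)
        nodes.add(node)
        edges.update(node_edges)
    return nodes, edges

# Hypothetical cached graph data (M = 2) and newly arrived target graph data.
cached = [{"node": "V1", "out": ["V4"]}, {"node": "V4", "out": []}]
target = {"node": "V7", "out": ["V4"]}

nodes, edges = build_target_graph(target, cached)
```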

In an alternative embodiment of the data processing method provided in the corresponding embodiment of fig. 2 of the present application, step S140 further includes the sub-steps of:

It can be understood that a weighted sum of the graph calculation index and the distributed index corresponding to each of the M pieces of graph data is calculated, so as to obtain the M pieces of storage period information corresponding to the M pieces of graph data.

According to the data processing method, the life cycle of the graph data in the cache is measured from two dimensions, so that a cache eviction strategy suited to graph data is obtained and the cache hit rate is improved.

For ease of understanding, a data processing method applied to graph data cache eviction is described below with reference to fig. 11 and fig. 12. First, a graph algorithm is run over the graph topology formed by the graph data in the cache, so that each piece of graph data obtains a corresponding calculated value, called the graph calculation index. Then, a value is assigned to each piece of graph data according to its source in the cache, called the distributed index. Finally, the graph calculation index and the distributed index are combined by weighted summation. When the cache is full and data needs to be evicted, the graph data in the cache can be evicted in the order of the weighted summation results. Specifically:

As shown in fig. 11, fig. 11 is a block diagram of a graph database with distributed caches according to an embodiment of the present application. The graph database with distributed caches comprises two layers. The upper layer is the graph database service layer, called the graph service layer, which provides read-write services for users; to speed up queries, each machine in this layer caches a part of the data. The lower layer is the persistent data layer, called the graph storage layer, which stores all data. The graph service layer and the graph storage layer are connected through a network and access each other through the network. Because the graph storage machines may be in different network environments, such as different data centers, access delays may differ when accessing the graph storage data.

Specifically, the data processing method provided by the embodiment of the application proceeds as follows:

1) Obtaining a graph data set from a graph data caching unit, carrying out semantic analysis on each graph data in the graph data set, generating nodes and edge relations corresponding to the nodes, and reconstructing a subgraph, as shown in fig. 12;

2) Computing over the generated graph with a graph algorithm to produce a calculated value, namely the graph calculation index, for each node in the graph. This process is formulated as: VectorG = Algorithm(SubGraph), where Algorithm is the graph algorithm used, SubGraph is the graph formed by the data in the cache, and VectorG is the set of graph calculation results corresponding to each piece of graph data in the cache;

3) According to the storage attribute information of each piece of graph data in the graph data caching unit (here, the storage attribute information refers to the server on which the graph data is located), adding a value, namely the distributed index, to each piece of graph data. This process is formulated as: VectorD = Latency(MachineRegion), where Latency varies with the machine region and VectorD is the access delay of each piece of data in the cache;

4) Computing a weighted sum of the graph calculation index and the distributed index to obtain a new value, namely the storage period information of each piece of graph data. This process is formulated as: VectorTtl = α·VectorG + β·VectorD, where α and β are adjustable parameters representing the weights of the two indexes, and VectorTtl is the set of storage period information of the graph data in the cache unit;

5) Correlating the calculated storage period information with corresponding graph data to generate a mapping relation table of the graph data and the storage period information;

6) When new graph data enter the cache unit, the calculation process is triggered;

7) When the capacity of the cache unit reaches the threshold, graph data in the cache unit is evicted according to the storage period of each piece of graph data in the cache.
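Steps 4), 5) and 7) above can be sketched as follows, taking the graph calculation indexes and distributed indexes as already computed. The function names, the sample values, the choice α = β = 0.5, the assumption that both indexes are normalized to a comparable range, and the choice to measure cache capacity as an item count and to evict until the cache falls below the threshold are all illustrative assumptions that the application does not prescribe.

```python
def storage_periods(graph_index, distributed_index, alpha=0.5, beta=0.5):
    """Steps 4) and 5): map each item to Ttl = alpha * G + beta * D."""
    return {key: alpha * graph_index[key] + beta * distributed_index[key]
            for key in graph_index}

def evict(cache, ttl, threshold):
    """Step 7): while the cache holds `threshold` items or more,
    remove the item with the smallest storage period."""
    evicted = []
    while cache and len(cache) >= threshold:
        victim = min(cache, key=lambda key: ttl[key])
        cache.remove(victim)
        evicted.append(victim)
    return evicted

# Hypothetical, already normalized indexes for four cached items.
vector_g = {"V2": 0.3, "V3": 0.3, "V4": 0.9, "V6": 0.0}
vector_d = {"V2": 0.2, "V3": 0.8, "V4": 0.5, "V6": 0.2}

vector_ttl = storage_periods(vector_g, vector_d)
cache = set(vector_ttl)
evicted = evict(cache, vector_ttl, threshold=3)
```

With these sample values V6 and then V2 are removed, because they combine a low graph score with a low access delay, while V3 survives despite having the same graph score as V2.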

It will be appreciated that, as shown in fig. 12, the graph data in the cache unit of the graph service layer has a graph structure; that is, the graph data in the cache constitutes a small subgraph. The graph data in the cache can be ranked both by the topology of the graph and by the data fragments it comes from.

When calculating the graph calculation index, graph algorithms such as PageRank and centrality calculation can be used to score the graph data in the cache. For example, in fig. 12, V4 is referenced the most times and V6 is not referenced by any other point, so PageRank gives V4 a higher score than V6. Therefore, when the cache unit capacity of the graph database reaches the threshold, the data where V6 is located is evicted preferentially. This strategy makes full use of the graph algorithm to score the data in the subgraph.
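To make the scoring concrete, the following is a minimal power-iteration sketch of PageRank. The edge list is a hypothetical graph chosen so that V4 is referenced most often and V6 is not referenced at all; it is not the actual topology of fig. 12. The damping factor 0.85 and the iteration count are conventional defaults, not values given by the application.

```python
def pagerank(nodes, edges, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over (source, destination) edges."""
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    count = len(nodes)
    rank = {node: 1.0 / count for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-edges is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {node: (1.0 - damping) / count + damping * dangling / count
                    for node in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank

# Hypothetical subgraph: V4 has four in-edges, V6 has none.
nodes = ["V1", "V2", "V3", "V4", "V5", "V6"]
edges = [("V1", "V2"), ("V1", "V3"), ("V1", "V5"),
         ("V2", "V4"), ("V3", "V4"), ("V5", "V4"), ("V6", "V4")]

scores = pagerank(nodes, edges)
```

In this sketch V2, V3 and V5 receive identical scores because each is referenced once from the same source, which is the tie that the distributed index is meant to break.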

However, in fig. 12, the three points V2, V3 and V5 are each referenced once (only in-edges are considered) and therefore have the same PageRank value, so the difficulty of obtaining the data in a distributed setting also needs to be considered. As can be seen in the figure, V2, V3 and V5 come from different data fragments (partitions), and each data fragment is on a different storage machine. The delay of accessing each machine can therefore be obtained from the information recorded in the storage attribute features, and V2, V3 and V5 can be ordered by access delay. In this way, the data in the cache is ordered by how difficult it is to obtain, which yields an ordering of the storage period information of the data in the cache. For example, in fig. 12, if the network environment of partition 3 is relatively poor, reading the data of partition 3 takes relatively long, so the storage period of the data of partition 3 in the cache is increased; that is, when in-memory data is evicted, the probability of evicting V3 is reduced.
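The tie-breaking described above can be sketched as a lookup from each piece of graph data to the access delay of the storage unit holding it. The partition assignments and the delay values in milliseconds are hypothetical; only the fact that partition 3 is the slowest follows the example in the text.

```python
def build_latency_mapping(storage_unit_of, latency_of_unit):
    """Map each graph data item to the access delay of its storage unit."""
    return {item: latency_of_unit[unit]
            for item, unit in storage_unit_of.items()}

# Hypothetical storage attribute information and measured delays (ms).
storage_unit_of = {"V2": "partition1", "V3": "partition3", "V5": "partition2"}
latency_ms = {"partition1": 2.0, "partition2": 5.0, "partition3": 40.0}

distributed_index = build_latency_mapping(storage_unit_of, latency_ms)

# With equal graph scores, items that are cheapest to re-fetch go first.
eviction_order = sorted(distributed_index, key=distributed_index.get)
```

Here V3 ends up last in the eviction order because re-reading it from partition 3 would be the most expensive.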

In general, the method provided by the embodiment of the application reconstructs a graph from the cached data by using the graph characteristics of the graph data, and obtains the weight of each point in the subgraph (serving as an index of the storage period) with a graph algorithm. Then, combining the distributed characteristics, it obtains the access delay of the different data fragments from the storage attribute features. The two factors are considered together to obtain the cache eviction strategy. The eviction strategy for the data in the cache is as follows:

Ttl (time to live) = α·Algorithm(SubGraph) + β·Latency;

where α and β are adjustable parameters representing the weights of the graph calculation index and the distributed index; Algorithm is a graph algorithm, preferably a graph calculation method such as PageRank, a centrality algorithm, or community discovery; SubGraph is the graph corresponding to the graph data set; and Latency represents the access delay information of data fragments in different networks.

According to the method provided by the embodiment of the application, the characteristics of the graph in the graph database and the characteristics of data access in a distributed scenario are fully utilized. By combining the graph algorithm with the data fragment information, the storage period of the graph data in the cache is measured from two dimensions, so that a cache eviction strategy better suited to graph data is obtained, which improves the cache hit rate, query efficiency, and equipment utilization.

The data processing apparatus of the present application will be described in detail below, with reference to fig. 13. Fig. 13 is a schematic diagram of an embodiment of a data processing apparatus 10 in an embodiment of the present application, where the data processing apparatus 10 includes: a graph data acquisition module 110, a graph calculation index generation module 120, a distributed index generation module 130, a storage period information generation module 140, and a data deletion module 150. Specifically:

the graph data acquisition module 110 is configured to acquire a graph data set, where the graph data set includes M pieces of graph data, the M pieces of graph data are used to represent M nodes of a graph to be generated and edge information corresponding to the M nodes, and M is an integer greater than 1;

the graph calculation index generation module 120 is configured to determine M graph calculation indexes corresponding to the M pieces of graph data according to the graph data set, where the graph calculation indexes are used to characterize a centrality feature of a node;

the distributed index generation module 130 is configured to obtain M pieces of storage attribute information corresponding to the M pieces of graph data, and determine M distributed indexes corresponding to the M pieces of graph data according to the M pieces of storage attribute information, where the distributed indexes are used to characterize storage attribute features of the graph data;

the storage period information generation module 140 is configured to generate M pieces of storage period information corresponding to the M pieces of graph data according to the M graph calculation indexes and the M distributed indexes;

the data deletion module 150 is configured to delete all or part of the M pieces of graph data according to the M pieces of storage period information.

According to the data processing device provided by the embodiment of the application, the storage period information of the graph data is generated from the graph calculation index, which represents the centrality feature of a node in the graph, and the distributed index, which represents the storage attribute feature of the graph data. The storage period of the graph data is thus measured from the two dimensions of node centrality and storage attributes, a data deletion strategy suited to graph data is determined according to the storage period information, and graph data in the graph database is deleted according to that strategy. Deleting graph data whose storage period has expired effectively reduces the occupation of storage space, saves storage cost, and improves the cache hit rate. Over time, some graph data may lose its value or be superseded by updates; deleting this outdated graph data helps ensure the timeliness and quality of the data and avoids misleading decisions. Deleting expired graph data also reduces the amount of data, thereby improving query performance.

In an alternative embodiment of the data processing apparatus provided in the embodiment corresponding to fig. 13 of the present application, referring to fig. 13, the graph calculation index generating module 120 is further configured to:

According to the data processing device provided by the embodiment of the application, the graph data in the graph data set are subjected to semantic decoding to generate the corresponding graph, and the graph calculation index corresponding to each graph data is calculated through calculating the centrality value of each node in the graph, so that the node centrality characteristic is calculated, and a foundation is laid for providing a reliable graph data deletion strategy subsequently.

According to the data processing device provided by the embodiment of the application, the graph calculation index of the graph data corresponding to a node is obtained from the in-edge relations of the node in the graph, which improves the accuracy of the graph calculation index and lays a foundation for subsequently providing a reliable graph data deletion strategy.

In another implementation manner of the embodiment of the present application, the distributed index generating module 130 is further configured to:

acquiring N access delay information corresponding to N storage units;

According to the data processing device, the data can be processed more accurately by acquiring the storage attribute information, determining the storage unit, acquiring the access delay information, establishing the mapping relation list and determining the distributed index, so that the efficiency and the accuracy of data processing are improved; by establishing a mapping relation list of the graph data and the corresponding access delay information, the storage position and the access delay information of each graph data can be rapidly determined, so that the data access efficiency is improved.

According to the data processing device provided by the embodiment of the application, the graph data is stored in different storage units, which reduces data access conflicts and improves data access efficiency. In addition, by setting corresponding storage unit address information for each data fragment, the required data fragment can be located and accessed more quickly. Dividing the graph data into several data fragments and storing them in different storage units also reduces the risk of data loss or corruption: if one storage unit fails, only the data fragments stored in it are affected, and the graph data as a whole is not.
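As an illustration of how the N storage units might be derived from the data fragments, the sketch below assumes each fragment carries the address of its storage unit as a string. The `DataFragment` class, its field names, and the address format are hypothetical and are not defined by the application.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataFragment:
    fragment_id: str
    storage_unit_address: str  # hypothetical "host:port" address format

def storage_units(fragment_of):
    """Return the distinct storage unit addresses holding the graph data."""
    return sorted({fragment.storage_unit_address
                   for fragment in fragment_of.values()})

# Hypothetical assignment of M = 3 pieces of graph data to fragments.
fragment_of = {
    "V2": DataFragment("partition1", "10.0.0.1:9779"),
    "V3": DataFragment("partition3", "10.0.1.7:9779"),
    "V5": DataFragment("partition1", "10.0.0.1:9779"),
}

units = storage_units(fragment_of)  # the N storage units
```

Because V2 and V5 share a fragment, N is 2 here while M is 3, consistent with N being greater than 1 and no greater than M.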

In another implementation of the embodiment of the present application, the data deleting module 150 is further configured to:

According to the data processing device provided by the embodiment of the application, the graph data is sorted according to the storage period information to generate the graph data sequence table, and the graph data that is no longer needed is determined from the table so that it can be deleted from the storage space. Deleting unneeded graph data releases storage space and provides more space for storing new graph data. Deleting the M pieces of graph data by means of the graph data sequence table effectively cleans up the storage space and improves storage efficiency.

Fig. 14 is a schematic diagram of a server structure provided in an embodiment of the present application. The server 300 may vary considerably in configuration or performance, and may include one or more central processing units (CPU) 322 (e.g., one or more processors), memory 332, and one or more storage media 330 (e.g., one or more mass storage devices) storing applications 342 or data 344. The memory 332 and the storage medium 330 may be transitory or persistent. The program stored on the storage medium 330 may include one or more modules (not shown), each of which may include a series of instruction operations on the server. Still further, the central processing unit 322 may be configured to communicate with the storage medium 330 and execute, on the server 300, a series of instruction operations in the storage medium 330.

The server 300 may also include one or more power supplies 326, one or more wired or wireless network interfaces 350, one or more input/output interfaces 358, and/or one or more operating systems 341, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.

The steps performed by the server in the above embodiments may be based on the server structure shown in fig. 14.

It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.

In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of elements is merely a logical functional division, and there may be additional divisions of actual implementation, e.g., multiple elements or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.

The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.

The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, including several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods of the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a read-only memory (ROM), a random access memory (random access memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.

The above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions.

Claims

1. A method of data processing, comprising:

acquiring M pieces of storage attribute information corresponding to the M pieces of graph data;

determining N storage units for storing the M pieces of graph data according to the M pieces of storage attribute information, wherein N is an integer which is more than 1 and less than or equal to M;

acquiring N pieces of access delay information corresponding to the N storage units;

establishing a mapping relation list of the M pieces of graph data and the N pieces of access delay information according to the storage unit storing each piece of graph data;

determining M distributed indexes corresponding to the M pieces of graph data according to the mapping relation list, wherein the distributed indexes are used for representing storage attribute characteristics of the graph data, the storage attribute characteristics are information of storage units for storing the graph data, and the distributed indexes are access delay of the storage units;

performing weighted calculation according to the graph calculation index and the distributed index corresponding to each graph data in the M graph data to generate M storage period information corresponding to the M graph data;

sequencing the M pieces of graph data according to the M pieces of storage period information to generate a graph data sequence table;

deleting all or part of the M pieces of graph data according to the graph data sequence table;

obtaining target graph data, wherein the target graph data are used for representing a target node and side information corresponding to the target node;

determining target graph calculation indexes corresponding to the target graph data according to the target graph data and the graph data set, wherein the target graph calculation indexes are used for representing the centrality characteristics of the target nodes;

acquiring target storage attribute information corresponding to the target graph data, and determining a target distributed index corresponding to the target graph data according to the target storage attribute information, wherein the target distributed index is used for representing storage attribute characteristics of the target graph data;

generating storage period information corresponding to the target graph data according to the target graph calculation index and the target distributed index;

and updating the graph data sequence table according to the storage period information corresponding to the target graph data.

2. The data processing method according to claim 1, wherein determining M graph calculation indexes corresponding to the M graph data according to the graph data set includes:

performing semantic decoding on the M pieces of graph data in the graph data set to generate M nodes and side information corresponding to the M nodes;

generating a corresponding graph according to the M nodes and the side information corresponding to the M nodes, wherein the graph comprises the M nodes and the side relations of the M nodes;

and calculating the centrality values of the M nodes according to the M nodes in the graph and the edge relations of the M nodes to obtain M graph calculation indexes.

3. The data processing method as claimed in claim 2, wherein calculating the centrality values of the M nodes according to the M nodes in the graph and the edge relationships of the M nodes to obtain M graph calculation indexes includes:

4. The data processing method according to claim 1, wherein the determining N storage units storing the M pieces of graph data based on the M pieces of storage attribute information includes:

determining a data fragment for storing each piece of graph data in the M pieces of graph data, wherein the data fragment carries address information of a storage unit;

and determining N storage units for storing the M pieces of graph data according to the address information of the storage units carried by the data fragments of each piece of graph data.

5. The data processing method according to claim 1, wherein the deleting all or part of the M pieces of graph data according to the graph data sequence table includes:

obtaining a cache capacity value and a cache threshold value of the graph data set;

and deleting K pieces of graph data according to the order of the M pieces of graph data in the graph data sequence table if the cache capacity value of the graph data set is greater than or equal to the cache threshold value, wherein K is an integer greater than or equal to 1 and less than or equal to M.

6. The data processing method according to claim 1, wherein the determining, based on the target graph data and the graph data set, a target graph calculation index corresponding to the target graph data includes:

generating a target graph according to the target graph data and the graph data set, wherein the target graph comprises the M nodes, the target nodes, the edge relations of the M nodes and the edge relations of the target nodes;

7. The method of claim 6, wherein generating the target graph from the target graph data and the graph data set comprises:

performing semantic decoding on the target graph data to generate the target node and side information corresponding to the target node;

8. A data processing apparatus, comprising:

the graph data acquisition module is used for acquiring a graph data set, wherein the graph data set comprises M pieces of graph data, the M pieces of graph data are used for representing M nodes of a graph to be generated and edge information corresponding to the M nodes, and M is an integer larger than 1;

the distributed index generation module is used for acquiring M pieces of storage attribute information corresponding to the M pieces of graph data, and determining M distributed indexes corresponding to the M pieces of graph data according to the M pieces of storage attribute information, wherein the distributed indexes are used for representing storage attribute characteristics of the graph data, the storage attribute characteristics are information of a storage unit for storing the graph data, and the distributed indexes are access delay of the storage unit;

the data deleting module is used for sequencing the M pieces of graph data according to the M pieces of storage period information to generate a graph data sequence table; all or part of M pieces of graph data are deleted according to the graph data sequence table;

the distributed index generation module is further configured to:

acquiring N access delay information corresponding to N storage units;

determining M distributed indexes corresponding to the M graph data according to the mapping relation list;

the data deleting module is further configured to:

9. The device according to claim 8, wherein the graph calculation index generation module is specifically configured to:

10. The apparatus of claim 9, wherein the graph calculation index generation module is further configured to:

11. The apparatus of claim 8, wherein the distributed index generation module is further configured to:

12. A computer device, comprising: memory, transceiver, processor, and bus system;

wherein the memory is used for storing programs;

the processor being configured to execute a program in the memory, including performing the data processing method according to any one of claims 1 to 7;

the bus system is used for connecting the memory and the processor so as to enable the memory and the processor to communicate.

13. A computer readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the data processing method of any of claims 1 to 7.