CN116595226A

CN116595226A - Distributed storage method and system for graphic data based on judicial industry

Info

Publication number: CN116595226A
Application number: CN202310459829.7A
Authority: CN
Inventors: 奚陨; 王晓艳; 裘亮; 李朋; 史毅仁; 傅丽娟; 张阳; 李敏杰; 江颢; 江衎
Original assignee: Sichuang Electronics Co ltd
Current assignee: Sichuang Electronics Co ltd
Priority date: 2023-04-26
Filing date: 2023-04-26
Publication date: 2023-08-15

Abstract

The invention discloses a distributed storage method and system for graphic data based on judicial industry. The method comprises: configuring computing nodes and data node servers, constructing a parallel computing cluster environment, and installing data access middleware matching the version of the parallel computing cluster environment; storing the original files of the picture and document data in a data node server, and processing them through a picture and document processor into the picture and document data finally required by the user; and establishing a data structure according to the characteristics of the processed picture and document data, with at least two of those characteristics selected as the primary key. With the distributed data storage structure, validity checks and business adjustments are performed on the data before it is stored, and a primary key is established for each original picture or document, which safeguards query time and query efficiency.

Description

Distributed storage method and system for graphic data based on judicial industry

Technical Field

The invention relates to the technical field of storage of picture and document data, in particular to a distributed storage method and system of graphic data based on judicial industry.

Background

With the massive use of pictures and documents on websites, systems must edit and present more and more unstructured data, and the volume of that data keeps growing. Currently, a judicial department website adds about 2,000 pieces of picture and document data every day and already holds about 3 million pieces of historical picture and document data, so efficient storage and reading of unstructured picture and document data has become a system bottleneck.

Although existing graphic data storage systems in the judicial industry can record a certain amount of picture and document information, once concurrent queries and processing of pictures and documents reach a certain order of magnitude, query and write efficiency can no longer be guaranteed, and response times suffer severely. Traditional databases show a similar bottleneck when managing large amounts of unstructured data.

Disclosure of Invention

The invention aims to provide a distributed storage method and a system for graphic data based on judicial industry, which solve the following technical problems:

existing graphic data storage systems in the judicial industry cannot guarantee query and write efficiency, which seriously affects response times.

The aim of the invention can be achieved by the following technical scheme:

a distributed storage method for graphic data based on judicial industry, comprising the following steps:

S1: configuring computing nodes and data node servers, constructing a parallel computing cluster environment, and installing data access middleware matching the version of the parallel computing cluster environment;

S2: storing the original files of the picture and document data to be stored in a data node server, and processing them through a picture and document processor into the picture and document data finally required by the user;

S3: establishing a data structure according to the characteristics of the picture and document data processed by the picture and document processor, and selecting at least two of those characteristics as the primary key, forming picture and document information containing the primary key;

S4: splitting the picture and document information into a plurality of key values and storing the key values on different nodes in the shards;

S5: the user terminal queries the required picture and document data with a field containing a key value.

As a further scheme of the invention, step S3 comprises the following steps:

establishing a data structure according to the characteristics of the picture or the document;

collecting, from the characteristics of the data structure, a file type identifier, a processing user, a data source, an original file path, an original file name, a processing time, and client processing parameters, and storing them as data information fields;

and organizing the data information fields that need to be persisted while selecting fields that can uniquely identify one record as the primary key, wherein the primary key comprises at least two of the characteristics of the data structure, forming picture and document information containing the primary key.

As a further scheme of the invention, in step S4, splitting the picture and document information into a plurality of key values and storing the key values on different nodes in the shards comprises:

storing the plurality of key values on different nodes in the shards;

automatically updating the key-value data as new picture or document data is inserted;

storing the data in distributed physical memory using a hash tree data structure;

wherein the hash tree data structure contains: file type identifier, user identifier, data source, original file path, original file name, processing time, and client processing parameters.

As a further scheme of the invention, storing the plurality of key values on different nodes in the shards comprises: calculating the CRC16 value of each key through the hash slot algorithm of Redis Cluster, taking that value modulo the number of hash slots to obtain the hash slot of the key, finding the master node responsible for that hash slot, and storing the key value on that master node.

As a further scheme of the invention, in step S5, the user terminal querying the required picture and document data with a field containing a key value comprises:

the user terminal inputs a user name and the time period in which a picture or document was processed, reads the size of the cached picture or document, and determines how many fragments the file has;

and reading each data fragment and, if reading any fragment fails, discarding the data in the cache and reading the data from the underlying distributed file system.

As a further scheme of the invention, reading data from the underlying distributed file system comprises the following steps:

establishing distributed hash keys for the primary key and for commonly queried fields;

establishing the distributed hash key and designing the hash key;

accessing the picture or document information data source to be stored;

querying with a field containing a hash key;

and merging all the fragments read, and deleting expired picture or document data.

As a further scheme of the invention, merging all the fragments read and deleting expired picture or document data comprises the following steps:

establishing an inverted index in units of one day;

extracting a file name, a path name, and a processing timestamp from each indexed file and, for each picture or document with a processing timestamp, creating a set in Redis per day, the set containing the paths and file names of the files whose processing timestamps fall on that day;

establishing a set for the file source and a set for the file size, the intersection of the sets being the query result;

and obtaining the last access time of the current picture or document file by data analysis, and starting a deletion process if the time elapsed between the last access time and the current time exceeds a specified threshold.
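The deletion rule in the last step can be sketched as follows. This is only a minimal illustration, assuming last-access times are available as UNIX timestamps in a mapping; the function name `find_expired`, the parameter `max_idle_seconds`, and the 30-day threshold are hypothetical and are not specified by the invention.

```python
import time
from typing import Dict, List, Optional


def find_expired(last_access: Dict[str, float],
                 max_idle_seconds: float,
                 now: Optional[float] = None) -> List[str]:
    """Return the paths whose idle time (now - last access) exceeds the threshold."""
    if now is None:
        now = time.time()
    return sorted(path for path, accessed in last_access.items()
                  if now - accessed > max_idle_seconds)


# Hypothetical data: two files and an assumed 30-day threshold.
THIRTY_DAYS = 30 * 24 * 3600
now = 1_700_000_000.0
access_log = {
    "/path/image1.png": now - 40 * 24 * 3600,  # idle for 40 days
    "/path/image2.png": now - 2 * 24 * 3600,   # idle for 2 days
}
expired = find_expired(access_log, THIRTY_DAYS, now=now)
```

Only the paths returned by `find_expired` would be handed to the deletion process; files accessed within the threshold are left untouched.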

A distributed storage system for graphic data based on judicial industry, comprising: a big data cluster for processing unstructured picture and document files, which is connected to a user terminal, a server, and a switch respectively, the data source of the user terminal accessing the system through the switch, wherein:

the big data cluster for processing unstructured picture and document files provides big data platform computing services and data storage services;

the user terminal provides an interface for a user to access the whole system;

the server is used for receiving query requests sent by the user terminal, configuring the master node servers and slave node servers of the big data cluster for processing pictures and documents, constructing a native Redis Cluster environment, establishing a data structure according to the characteristics of the pictures and documents to be processed, selecting at least two characteristics in the data structure as the primary key to form picture or document information containing the primary key, and splitting the picture or document information into a plurality of key values and storing the key values on different nodes in the shards;

and the switch is used for establishing the local area network and exchanging data between the servers and the user terminal within it.

As a further scheme of the invention, the server deploys a Redis Cluster, the Redis Cluster comprising: 3 master nodes, 3 slave nodes, 2 computing nodes, 1 backup computing node, and N data nodes, all connected through 100 Mbps network ports; wherein:

the master node is used for storing the picture and document information data finally required by the user, out of the massive pictures and documents, together with statistical analysis information data;

the slave node is used for hot standby of data and takes over when the master node fails;

the computing node is used for processing operations on the picture and document information;

the backup computing node is used for maintaining high availability of the cluster;

and the data node is used for storing massive picture and document information data.

As a further scheme of the invention, the computing node includes a picture and document processor, which is used for processing the original files of the picture and document data into the picture and document data finally required by the user;

the processed picture and document data finally required by the user is stored in a Ceph file system and served through a multi-level cache scheme consisting of a browser-based client cache, a network-layer cache based on CDN acceleration, a routing-layer cache based on the Nginx load-balancing component, and a service-layer cache based on Redis.
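The lookup order of such a multi-level cache can be sketched as below. This is a simplified illustration only: each tier is modeled as a plain dictionary, whereas in a real deployment the browser, CDN, Nginx, and Redis tiers are separate systems with their own population and expiry rules. The function `tiered_get`, the tier names, and the dictionary standing in for the Ceph file system are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

# Tier order from fastest to slowest; each tier is modeled as a plain dict.
TIER_NAMES = ["browser", "cdn", "nginx", "redis"]


def tiered_get(key: str,
               tiers: List[Dict[str, bytes]],
               load_from_store: Callable[[str], bytes]) -> Tuple[bytes, str]:
    """Look up key in each tier in order; on a miss in every tier,
    load from the backing store and fill all tiers."""
    for i, tier in enumerate(tiers):
        if key in tier:
            value = tier[key]
            # Populate the faster tiers that missed.
            for upper in tiers[:i]:
                upper[key] = value
            return value, TIER_NAMES[i]
    value = load_from_store(key)
    for tier in tiers:
        tier[key] = value
    return value, "store"


# Hypothetical contents: one file cached in the Redis tier, one only in the store.
tiers: List[Dict[str, bytes]] = [{}, {}, {}, {"/path/image1.png": b"cached-bytes"}]
store_stand_in = {"/path/image2.png": b"stored-bytes"}  # stands in for the Ceph file system

hit = tiered_get("/path/image1.png", tiers, store_stand_in.__getitem__)
miss = tiered_get("/path/image2.png", tiers, store_stand_in.__getitem__)
```

The point of the ordering is that a request only reaches the file system when every cache layer in front of it has missed.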

The invention has the beneficial effects that:

Through a distributed data storage structure, the invention establishes a data structure from the characteristics of the processed picture and document data and establishes a primary key from those characteristics, forming picture and document information containing the primary key. The picture and document information is split into a plurality of key values that are stored on different nodes in the shards. When unstructured data such as pictures and documents is ingested, validity checks and business adjustments are performed before the data is stored, so that the data can be analyzed. A primary key is also established for each original picture or document, which safeguards query time and query efficiency and ensures high availability. The system meets the need to store massive amounts of picture and document information data in daily business and, through the data structure designed for the processed picture and document information finally required by the user, also supports fast query and editing, which greatly improves query speed and the user experience. The method can be applied to recording picture and document data in judicial-industry systems, and to similar service scenarios with high concurrency requirements, large data volumes, relatively simple business processing, and frequent reads of unstructured files.

Drawings

The invention is further described below with reference to the accompanying drawings.

FIG. 1 is a system network topology of the present invention;

FIG. 2 is a hierarchical framework of the software technology of the present invention;

FIG. 3 is a flow chart of the method of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Example 1

Referring to fig. 1-3, the invention discloses a distributed storage method and a system for graphic data based on judicial industry, comprising the following steps:

S4: splitting the picture and document information containing the primary key into a plurality of key values and storing the key values on different nodes in the shards;

Specifically, through a distributed data storage structure, a data structure is built from the characteristics of the processed picture and document data, and a primary key is established from those characteristics, forming picture and document information containing the primary key. The picture and document information is split into a plurality of key values that are stored on different nodes in the shards. When unstructured data such as pictures and documents is ingested, validity checks and business adjustments are performed before the data is stored, so that the data can be analyzed. A primary key is also established for each original picture or document, which safeguards query time and query efficiency and ensures high availability. The system meets the need to store massive amounts of picture and document information data in daily business and, through the data structure designed for the processed picture and document information finally required by the user, also supports fast query and editing, which greatly improves query speed and the user experience. The method can be applied to recording picture and document data in judicial-industry systems, and to similar service scenarios with high concurrency requirements, large data volumes, relatively simple business processing, and frequent reads of unstructured files.

In one embodiment of the invention, step S3 comprises the following steps:

Specifically, a data structure is established from the characteristics of the pictures or documents according to service requirements. The collected fields include stored fields such as the file type identifier, processing user, data source, original file path, original file name, processing time, and client processing parameters. According to the specific service requirements, all data information fields that need to be persisted are organized, and fields that can uniquely identify a record are selected as the primary key; here the four characteristics file type, original file path, original file name, and processing time serve as the primary key. The file type identifier, processing user, data source, original file path, original file name, file size, processing time, and client processing parameters together form one piece of data information that the user can query. The data information format is: file type identifier, data source, original file path, original file name, file size, processing time (UNIX timestamp), and client processing parameters, for example png+Zhejianghua+/home/data1/putfile/png+image.png+83kb+"1559928176"+2. After the data structure is built, the key values are stored on different nodes in the shards, and the key-value data is updated automatically as new picture and document data is inserted. The data is stored in the distributed in-memory database using a hash tree data structure with the fields: file type identifier, processing user, data source, original file path, original file name, file size, processing time, and client processing parameters.
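The record layout and the four-field primary key described above can be sketched as a small data class. This is an illustrative sketch only: the class name `FileRecord`, its attribute names, and the choice of a tuple for the primary key are assumptions; the field order and the sample values follow the example in the text.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FileRecord:
    file_type: str
    data_source: str
    original_path: str
    original_name: str
    file_size: str
    processing_time: int  # UNIX timestamp
    client_params: str

    def primary_key(self) -> Tuple[str, str, str, int]:
        """Composite primary key: file type, original path, original name, processing time."""
        return (self.file_type, self.original_path,
                self.original_name, self.processing_time)

    def to_record(self) -> str:
        """Serialize all fields as a '+'-joined string in the documented field order."""
        return "+".join([self.file_type, self.data_source, self.original_path,
                         self.original_name, self.file_size,
                         str(self.processing_time), self.client_params])


# Sample values taken from the example record in the text.
record = FileRecord("png", "Zhejianghua", "/home/data1/putfile/png",
                    "image.png", "83kb", 1559928176, "2")
serialized = record.to_record()
key = record.primary_key()
```

Using several characteristics together as the key means two files with the same name but different paths or processing times remain distinct records.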

In one embodiment of the present invention, in the step S4, splitting the picture and the document information into a plurality of key values and storing the key values in different nodes in the shard includes:

storing a plurality of key values to different nodes in the shard;

In one embodiment of the present invention, storing the plurality of key values on different nodes in the shards comprises: calculating the CRC16 value of each key through the hash slot algorithm of Redis Cluster, taking that value modulo the number of hash slots to obtain the hash slot of the key, finding the master node responsible for that hash slot, and storing the key value on that master node.

Specifically, the CRC16 value of a key is calculated through the hash slot algorithm of Redis Cluster, the value is taken modulo the number of hash slots to find the hash slot corresponding to the key, and the master node corresponding to that hash slot is then located.
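The slot calculation can be sketched in a few lines. Redis Cluster uses the XMODEM variant of CRC16 (polynomial 0x1021, initial value 0) and 16384 hash slots. The mapping of slot ranges to three masters below is an assumed even split chosen for illustration, the node names are hypothetical, and hash tags (the `{...}` key syntax) are deliberately ignored.

```python
HASH_SLOTS = 16384  # number of hash slots in Redis Cluster


def crc16_xmodem(data: bytes) -> int:
    """CRC16-CCITT, XMODEM variant: polynomial 0x1021, initial value 0."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def key_slot(key: str) -> int:
    """Hash slot of a key (hash tags in braces are not handled in this sketch)."""
    return crc16_xmodem(key.encode("utf-8")) % HASH_SLOTS


# Assumed layout: 3 masters with the slot space split evenly (names are hypothetical).
MASTER_RANGES = [("master-1", 0, 5460),
                 ("master-2", 5461, 10922),
                 ("master-3", 10923, 16383)]


def master_for_slot(slot: int) -> str:
    """Find the master whose slot range contains the given slot."""
    for name, first, last in MASTER_RANGES:
        if first <= slot <= last:
            return name
    raise ValueError(f"slot out of range: {slot}")


slot = key_slot("123456789")      # standard CRC16 test vector
owner = master_for_slot(slot)
```

Because the slot depends only on the key, any client can compute where a fragment lives without consulting a central directory.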

For example, 3 fragments of a picture or document are stored in a Redis Cluster under the keys:

/path/image1.png|1

/path/image1.png|2

/path/image1.png|3

The values calculated by Redis Cluster for the 3 stored fragments are: 3158, 10507, 14701.

According to the above calculation, when the Redis Cluster has 3 master nodes, the 3 fragments of the picture are stored on two master nodes (slots 6451 and 10578 on one node and slot 14703 on the other). In this example picture service system, one file fragment is fixed at 8 KB, so a 40 KB picture file, for instance, has 5 file fragments.
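The fixed-size fragmenting can be sketched as follows, assuming the `<path>|<n>` key convention of the example with fragment numbers starting at 1. The function name `split_into_fragments` is hypothetical, and the 40 KB payload is a zero-filled stand-in for real picture data.

```python
from typing import Dict

FRAGMENT_SIZE = 8 * 1024  # 8 KB per fragment, as in the example


def split_into_fragments(path: str, data: bytes,
                         fragment_size: int = FRAGMENT_SIZE) -> Dict[str, bytes]:
    """Split data into fixed-size fragments keyed as '<path>|<n>', with n starting at 1."""
    fragments: Dict[str, bytes] = {}
    for offset in range(0, len(data), fragment_size):
        index = offset // fragment_size + 1
        fragments[f"{path}|{index}"] = data[offset:offset + fragment_size]
    return fragments


picture = bytes(40 * 1024)  # 40 KB stand-in for picture data
fragments = split_into_fragments("/path/image1.png", picture)
```

Each fragment key is then hashed independently, which is why the fragments of one picture can end up on different master nodes.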

In one embodiment of the present invention, in step S5, the user terminal querying the required picture and document data with a field containing a key value comprises:

Specifically, when reading a cached file, the client first reads the size of the cached picture from the Redis Cluster to determine how many fragments the file has, and then reads each data fragment from the Redis Cluster. Note that although the same expiration time is set for every fragment of a picture, some fragment reads may fail because the actual operating state of each node differs; if any single fragment read fails, the entire read is considered to have failed. In that case, the picture system discards the cached data and reads the data from the underlying distributed file system.
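The read path with its all-or-nothing rule can be sketched as below. A plain dictionary stands in for the Redis Cluster and a callable stands in for the distributed file system; the `<path>|size` key used to hold the cached size, and the function names, are assumptions made for the example.

```python
from typing import Callable, Dict, Optional

FRAGMENT_SIZE = 8 * 1024  # 8 KB per fragment


def read_cached(path: str, cache: Dict[str, bytes]) -> Optional[bytes]:
    """Reassemble a file from cached fragments; return None if anything is missing."""
    size_raw = cache.get(f"{path}|size")  # assumed key holding the cached file size
    if size_raw is None:
        return None
    size = int(size_raw.decode("ascii"))
    count = -(-size // FRAGMENT_SIZE)  # ceiling division gives the fragment count
    parts = []
    for index in range(1, count + 1):
        part = cache.get(f"{path}|{index}")
        if part is None:
            return None  # one failed fragment fails the whole read
        parts.append(part)
    return b"".join(parts)


def read_file(path: str, cache: Dict[str, bytes],
              read_from_dfs: Callable[[str], bytes]) -> bytes:
    """Try the cache first; on any failure discard cached data and read from the DFS."""
    data = read_cached(path, cache)
    if data is not None:
        return data
    for key in [k for k in cache if k.startswith(f"{path}|")]:
        del cache[key]
    return read_from_dfs(path)


path = "/path/image1.png"
payload = bytes(range(256)) * 80  # 20480 bytes, i.e. 3 fragments
dfs = {path: payload}             # stands in for the distributed file system
cache = {f"{path}|size": b"20480",
         f"{path}|1": payload[:8192],
         f"{path}|3": payload[16384:]}  # fragment 2 is missing
result = read_file(path, cache, dfs.__getitem__)
```

In this run the missing second fragment causes the cached entries to be dropped and the full file to be served from the stand-in file system.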

In one embodiment of the present invention, reading data on an underlying distributed file system includes the steps of:

establishing a distributed hash key and designing the hash key;

accessing a picture or document information data source to be stored;

querying with a field containing a hash key;

Specifically, a distributed hash key is established for the primary key and for commonly queried fields, for example first for the file type and then for the file name, and the key is designed accordingly. The picture or document information data source to be stored is then accessed. Finally, the user queries with a field containing the hash key, and the system returns the corresponding data within 50 ms. The method can also quickly count the total amount of picture or document data per day: the statistics of the service system are accumulated in a statistics table using the atomic increment operation in Redis, without querying the service information table. This greatly improves the efficiency and timeliness of the statistics, improves the user experience, simplifies the overall statistical display of the system, reduces database load, and allows a daily ranking of popular pictures or documents to be built quickly.
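The counting idea can be sketched with an in-memory stand-in for an atomic increment, which is what the Redis `INCR` command provides on the server side. The class `CounterStore`, the key format `stats:total:<day>`, and the use of threads to simulate concurrent writers are all assumptions made for this illustration; it is not a Redis client.

```python
import threading
from collections import defaultdict
from typing import Dict


class CounterStore:
    """In-memory stand-in for a store offering an atomic increment (like Redis INCR)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._counts: Dict[str, int] = defaultdict(int)

    def incr(self, key: str) -> int:
        """Atomically add 1 to the counter and return the new value."""
        with self._lock:
            self._counts[key] += 1
            return self._counts[key]

    def get(self, key: str) -> int:
        with self._lock:
            return self._counts.get(key, 0)


def record_upload(store: CounterStore, day: str) -> int:
    """Bump the per-day total when a picture or document is stored."""
    return store.incr(f"stats:total:{day}")


store = CounterStore()
threads = [threading.Thread(target=record_upload, args=(store, "2019-06-10"))
           for _ in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()
total = store.get("stats:total:2019-06-10")
```

Because every write updates the counter directly, the daily total is available with a single read instead of a scan over the service information table.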

In one embodiment of the present invention, merging all the fragments read and deleting expired picture or document data comprises the following steps:

establishing an inverted index in units of one day;

Specifically, because multi-condition queries over pictures and documents match inefficiently against distributed stored data, an inverted index is established on the timestamp in units of one day, for example:

Picture A, file name [ image1.png ], path [ hdfs1/home/data1/png ], processing timestamp [ June 10, 2019, 23 minutes, 31 seconds ]

Picture B, file name [ image2.png ], path [ hdfs2/home/data1/png ], processing timestamp [ June 10, 2019, 23:56:38 ]

The inverted index then extracts the file name, path name, and processing timestamp from each indexed file. For picture A and picture B, a set is created in Redis for the day of their processing timestamps, and the paths and file names of both pictures are added to that set, indicating that both files were processed on that day.

Sets can be established for the file source and file size, respectively, with the set intersection being the result of the queried data.
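The day-based inverted index and the set intersection can be sketched with ordinary Python sets standing in for Redis sets. The function names, the source labels, and the hour used for picture A are hypothetical values chosen so the example runs; only the file names, paths, and the date follow the text.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Set

# One set of file identifiers per day and per source (stand-ins for Redis sets).
day_index: Dict[str, Set[str]] = defaultdict(set)
source_index: Dict[str, Set[str]] = defaultdict(set)


def index_file(path: str, name: str, processed_at: datetime, source: str) -> None:
    """Add a file to the set for its processing day and to the set for its source."""
    file_id = f"{path}/{name}"
    day_index[processed_at.strftime("%Y-%m-%d")].add(file_id)
    source_index[source].add(file_id)


def query(day: str, source: str) -> Set[str]:
    """Multi-condition query: intersect the day set with the source set."""
    return day_index.get(day, set()) & source_index.get(source, set())


# Picture A: the hour is assumed for illustration; sources are hypothetical labels.
index_file("hdfs1/home/data1/png", "image1.png", datetime(2019, 6, 10, 0, 23, 31), "source-a")
index_file("hdfs2/home/data1/png", "image2.png", datetime(2019, 6, 10, 23, 56, 38), "source-b")

same_day = day_index["2019-06-10"]
matches = query("2019-06-10", "source-a")
```

Each condition is answered by one set lookup, so combining conditions costs an intersection rather than a scan over every stored record.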

Example 2

Referring to FIG. 1 and FIG. 2, a distributed storage system for graphic data based on judicial industry is disclosed, comprising: a big data cluster for processing unstructured picture and document files, which is connected to a user terminal, a server, and a switch respectively, the data source of the user terminal accessing the system through the switch, wherein:

the user terminal provides an interface for a user to access the whole system;

In one embodiment of the present invention, the server deploys a Redis Cluster, the Redis Cluster comprising: 3 master nodes, 3 slave nodes, 2 computing nodes, 1 backup computing node, and N data nodes, all connected through 100 Mbps network ports; wherein:

the backup computing node is used for maintaining high availability of the cluster;

In one embodiment of the present invention, the computing node includes: the picture and document processor is used for processing the original files of the picture and document data into picture and document data finally required by a user;

In the description of the present invention, it should be understood that the terms "upper," "lower," "left," "right," and the like indicate an orientation or a positional relationship based on that shown in the drawings, and are merely for convenience of description and for simplifying the description, and do not indicate or imply that the apparatus or element in question must have a specific orientation, as well as a specific orientation configuration and operation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. In the description of the present invention, unless otherwise indicated, the meaning of "a plurality" is two or more.

In the description of the present invention, it should be noted that, unless explicitly specified and limited otherwise, the terms "mounted," "connected," and the like are to be construed broadly and may be, for example, fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.

The foregoing describes one embodiment of the present invention in detail, but the description is only a preferred embodiment of the present invention and should not be construed as limiting the scope of the invention. All equivalent changes and modifications within the scope of the present invention are intended to be covered by the present invention.

Claims

1. The distributed storage method of the graphic data based on the judicial industry is characterized by comprising the following steps of:

S2: storing the original files of the picture and document data to be stored in a data node server, and processing them through a picture and document processor into the picture and document data finally required by the user;

2. A distributed storage method of graphic data based on judicial industries according to claim 1, characterized in that: the step S3 comprises the following steps:

and organizing the data information fields that need to be persisted while selecting fields that can uniquely identify one record as the primary key, wherein the primary key comprises at least two of the characteristics of the data structure, forming the picture and document information containing the primary key.

3. A distributed storage method of graphic data based on judicial industries according to claim 1, characterized in that: in the step S4, splitting the picture and document information including the primary key into a plurality of key values and storing the key values in different nodes in the shard, including:

storing a plurality of key values to different nodes in the shard;

4. A distributed storage method of graphic data based on judicial industries according to claim 3, characterized in that: storing the plurality of key values on different nodes in the shards comprises: calculating the CRC16 value of each key through the hash slot algorithm of Redis Cluster, taking that value modulo the number of hash slots to obtain the hash slot of the key, finding the master node responsible for that hash slot, and storing the key value on that master node.

5. A distributed storage method of graphic data based on judicial industries according to claim 1, characterized in that: in the step S5, the user terminal queries the required picture and document data with the field containing the key value, including:

6. The distributed storage method of graphic data based on judicial industries according to claim 5, characterized in that: reading data on the underlying distributed file system comprises the following steps:

establishing a distributed hash key and designing the hash key;

accessing a picture or document information data source to be stored;

querying with a field containing a hash key;

7. The distributed storage method of graphic data based on judicial industries according to claim 6, characterized in that: merging all the fragments read and deleting expired picture or document data comprises the following steps:

establishing an inverted index in a unit of a day;

8. A distributed storage system for graphic data based on judicial industry, comprising: a big data cluster for processing unstructured picture and document files, which is connected to a user terminal, a server, and a switch respectively, the data source of the user terminal accessing the system through the switch, wherein:

the user terminal provides an interface for a user to access the whole system;

9. A distributed storage system for graphic data based on judicial industry according to claim 8, wherein: the server deploys a Redis Cluster, the Redis Cluster comprising: 3 master nodes, 3 slave nodes, 2 computing nodes, 1 backup computing node, and N data nodes, all connected through 100 Mbps network ports; wherein:

the master node is used for storing the picture and document data finally required by the user, out of the massive pictures and documents, together with statistical analysis information data;

the backup computing node is used for maintaining high availability of the cluster;

and the data node is used for storing massive pictures and document data.

10. A distributed storage system for graphic data based on judicial industry according to claim 9, wherein: the computing node includes a picture and document processor, which is used for processing the original files of the picture and document data into the picture and document data finally required by the user;