CN115277858B - Data processing method and system for big data - Google Patents

Data processing method and system for big data

Info

Publication number
CN115277858B
CN115277858B (application CN202211166164.2A)
Authority
CN
China
Prior art keywords
data
node
file
name
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211166164.2A
Other languages
Chinese (zh)
Other versions
CN115277858A (en)
Inventor
陈轮
黄海峰
韩国权
祁纲
李宝东
曹扬
支婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Taiji Computer Corp Ltd
CETC Big Data Research Institute Co Ltd
Original Assignee
Taiji Computer Corp Ltd
CETC Big Data Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Taiji Computer Corp Ltd and CETC Big Data Research Institute Co Ltd
Priority to CN202211166164.2A
Publication of CN115277858A
Application granted
Publication of CN115277858B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G: PHYSICS
        • G06: COMPUTING; CALCULATING OR COUNTING
            • G06F: ELECTRIC DIGITAL DATA PROCESSING
                • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
                    • G06F 16/10: File systems; File servers
                        • G06F 16/13: File access structures, e.g. distributed indices
                        • G06F 16/17: Details of further file system functions
                            • G06F 16/172: Caching, prefetching or hoarding of files
                        • G06F 16/18: File system types
                            • G06F 16/182: Distributed file systems
    • H: ELECTRICITY
        • H04: ELECTRIC COMMUNICATION TECHNIQUE
            • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
                • H04L 47/00: Traffic control in data switching networks
                    • H04L 47/50: Queue scheduling

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the field of information processing and discloses a data processing method and system for big data. The method sets up a name node, data nodes and a user side, wherein the name node is configured as a central management server, stores descriptive metadata in list form in its memory, and provides an internal metadata service in response to user-side requests to access files. A static cache queue is set up at the name node to store the block-level access data of the corresponding hot-spot files, and a redirection message is set up at the data nodes so that data accesses and address changes can be fed back to the user side in time, giving users fast access to data-node information.

Description

Data processing method and system for big data
Technical Field
The invention relates to the field of information processing, in particular to a data processing method and system for big data.
Background
In the big data era, data is no longer merely a byproduct of social production; it has become a renewable and valuable resource of production. Massive data contains enormous amounts of information, and by analyzing and mining it, existing phenomena can be described and explained and the future can be predicted. Big data has reached into every aspect of life, bringing it more intelligence and convenience.
Big data holds great value as a reference for management, decision-making and regulation, but when massive data is delivered to a user side for processing, it is constrained by the storage architecture. Massive data is usually stored in a distributed manner with multiple backups, with the emphasis placed on keeping the data safely replicated; providing availability and a fast user experience is likewise an important factor in the popularization of big data. How to provide access control and processing for big data has therefore become an urgent problem to be solved.
Disclosure of Invention
To solve at least one of the above problems, the present invention provides a data processing method and system for big data.
The method comprises setting up a name node, data nodes and a user side, wherein the name node is configured as a central management server, stores descriptive metadata in list form in its memory, and provides an internal metadata service in response to user-side requests to access files;
the data nodes store the data required by the user side, storing it in blocks of a fixed, configured size with backup copies; they receive access requests forwarded by the name node, create, delete and copy data blocks under the unified scheduling of the name node, and report to the name node periodically;
the data nodes are divided into data nodes within the local domain and data nodes outside the local domain, and further into hot and non-hot data nodes according to how heavily they are accessed;
the user side performs data access through the name node; a routing table of the data nodes is set at the user side, the routing table being an index table for accessing the name node; the user first accesses the name node, and the name node controls and distributes the access requests of the user side;
and a static cache queue is set up at the name node to store the block-level access data of the corresponding hot-spot files.
Optionally, the data nodes are divided into hot and non-hot data nodes, and the name node schedules the data on the data nodes; when a user requests the location information of a file, the user does not know whether the requested file is a hot spot or not, and by default the user side sends the request only to the data node responsible for the local domain.
Optionally, a redirection message is configured on the data node to notify the requesting user side that its routing information has changed, and the name node's updated routing index table is sent to the user side to indicate the address of the file to be downloaded.
Optionally, according to the routing index table, the user side sends a request to a data node in the local domain when downloading a hot-spot file; when downloading a non-hot-spot file, it preferably sends the request to a non-hot data node outside the local domain.
Optionally, before sending the access request message, the name node determines whether the address information of the requested file already exists in the local static cache list; if so, a request is sent to a data node in the local domain, the data blocks in the returned cache sequence are sent, and the request is sent to the determined address or data node server.
Optionally, providing the static cache queue and storing the block-level access data of the corresponding hot-spot files includes:
the name node opens up a static cache queue in its memory; the hot data accessed by user sides is obtained by clustering the current user access requests, and the hot data is updated in the memory of the name node; the hot data is stored by allocating a corresponding cache data block according to the aggregate bandwidth of the user access requests and recording the access-request bandwidth of the corresponding user side in the static cache queue.
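By way of illustration only, the following Python sketch shows one way such a static cache queue could be populated at the name node; the access-request clustering is reduced to a simple access-count threshold, and the class, method and field names (StaticCacheQueue, update_from_requests, read_block) are assumptions rather than the patent's own implementation.

from collections import Counter, OrderedDict

class StaticCacheQueue:
    # Illustrative static cache queue held in the name node's memory: hot files are
    # selected from recent access requests and each gets a cache data block sized in
    # proportion to its share of the aggregate request bandwidth.
    def __init__(self, capacity_bytes, hot_threshold=10):
        self.capacity = capacity_bytes
        self.hot_threshold = hot_threshold      # accesses needed to count as a hot spot
        self.queue = OrderedDict()              # file_id -> cached leading bytes

    def update_from_requests(self, requests, read_block):
        # requests: iterable of (file_id, requested_bandwidth); read_block(file_id, size)
        # returns the first `size` bytes of the file from the data nodes.
        requests = list(requests)
        counts = Counter(fid for fid, _ in requests)
        bandwidth = Counter()
        for fid, bw in requests:
            bandwidth[fid] += bw
        hot = [fid for fid, n in counts.most_common() if n >= self.hot_threshold]
        total_bw = sum(bandwidth[fid] for fid in hot) or 1
        self.queue.clear()
        for fid in hot:
            block_size = int(self.capacity * bandwidth[fid] / total_bw)
            if block_size > 0:
                self.queue[fid] = read_block(fid, block_size)

    def lookup(self, file_id):
        return self.queue.get(file_id)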
Optionally, when the data address of the data node changes, a redirection message is sent to the user side or the name node.
Optionally, when the user side wants to access the metadata of a file, the metadata index is mapped onto a segment of a hash ring, and the distributed routing information is then searched to find the data node corresponding to that hash-ring segment.
Optionally, the name node constructs a file request message and sends it to the corresponding data node, the request message including: the ID of the file to be downloaded, the address of the user, and the amount of user information to be acquired.
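A minimal sketch of how such a file request message might be represented and serialized is given below; the field names and the JSON encoding are illustrative assumptions, not part of the claimed method.

import json
from dataclasses import dataclass, asdict

@dataclass
class FileRequestMessage:
    file_id: str        # ID of the file to be downloaded
    user_address: str   # address of the requesting user side
    info_amount: int    # amount of user information to be acquired

    def to_bytes(self) -> bytes:
        return json.dumps(asdict(self)).encode("utf-8")

# Example: the name node builds a request and forwards it to the responsible data node.
msg = FileRequestMessage(file_id="/logs/2022/09/part-0001",
                         user_address="10.0.3.17:50010",
                         info_amount=1)
payload = msg.to_bytes()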
The method disclosed by the present application comprises setting up a name node, data nodes and a user side, wherein the name node is configured as a central management server, stores descriptive metadata in list form in its memory, and provides an internal metadata service in response to user-side requests to access files. A static cache queue is set up at the name node to store the block-level access data of the corresponding hot-spot files, and a redirection message is set up at the data nodes so that data accesses and address changes can be fed back to the user side in time, giving users fast access to data-node information.
Drawings
The features and advantages of the present invention will be more clearly understood by reference to the accompanying drawings, which are illustrative and not to be construed as limiting the invention in any way.
FIG. 1 is a schematic flow chart of the application of the method of the present invention.
Detailed Description
These and other features and characteristics of the present invention, as well as the methods of operation and functions of the related elements of structure and the combination of parts and economies of manufacture, will be better understood upon consideration of the following description and the accompanying drawings, which form a part of this specification. It is to be expressly understood, however, that the drawings are for the purpose of illustration and description only and are not intended as a definition of the limits of the invention. It will be understood that the figures are not drawn to scale. Various block diagrams are used in the present invention to illustrate various variations of embodiments according to the present invention.
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without inventive step based on the embodiments of the present invention, are within the scope of protection of the present invention.
It should be noted that "/" herein means "or"; for example, A/B may mean A or B. "And/or" herein merely describes an association between the associated objects and indicates that three relationships are possible; for example, "A and/or B" may mean that A exists alone, that A and B exist simultaneously, or that B exists alone.
It should be noted that, for the convenience of clearly describing the technical solutions of the embodiments of the present application, in the embodiments of the present application, the terms "first", "second", and the like are used to distinguish the same items or similar items with basically the same functions or actions, and those skilled in the art can understand that the terms "first", "second", and the like do not limit the quantity and execution order. For example, the first information and the second information are for distinguishing different information, not for describing a specific order of information.
It should be noted that, in the embodiments of the present invention, words such as "exemplary" or "for example" are used to indicate examples, illustrations or explanations. Any embodiment or design described as "exemplary" or "such as" in an embodiment of the present invention is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, use of the word "exemplary" or "such as" is intended to present concepts related in a concrete fashion.
Example 1
As shown in FIG. 1, a system applying the method of the present application may include a name node, data nodes and a user side, wherein the name node is configured as a central management server, stores descriptive metadata in list form in its memory, and provides an internal metadata service in response to the user side's requests to access files.
Based on the access structure of big data, the system is organized as a master-slave cluster consisting of a name node, a backup name node, a plurality of data nodes and a plurality of user sides.
It can be seen that the name node is a key component of the system. It is configured as the central management server, mainly providing the internal metadata service, managing the namespace of the file system and responding to client accesses to files, and it stores the system's descriptive metadata in list form in memory so that users can access it quickly. If the name node fails, the entire file system becomes unusable, because the information about all data blocks is stored there and files cannot be reconstructed without it. A backup name node is therefore set up; the name node is backed up at regular intervals, and automatic switchover keeps the cluster running normally. The name node holds the basic information of each file, the mapping between files and data blocks, and the storage locations of the data blocks.
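For illustration, the name node's in-memory metadata could be organized along the following lines; this is a Python sketch under assumed names (FileRecord, BlockLocation, snapshot), since the patent only specifies that basic file information, the file-to-block mapping and the block storage locations are kept in list form and copied to the backup name node.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class BlockLocation:
    block_id: str
    data_nodes: List[str]                # addresses of the data nodes holding replicas

@dataclass
class FileRecord:
    path: str                            # basic information of the file
    size: int
    blocks: List[BlockLocation] = field(default_factory=list)   # file-to-block mapping

class NameNodeMetadata:
    # In-memory metadata held by the name node in list/table form.
    def __init__(self):
        self.files: Dict[str, FileRecord] = {}

    def lookup(self, path: str) -> FileRecord:
        return self.files[path]

    def snapshot(self) -> Dict[str, FileRecord]:
        # copied to the backup name node at regular intervals so it can take over on failure
        return dict(self.files)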
The data node is responsible for storing user data. It divides its local disk into a number of blocks or slices for storing the data and, by default, keeps three copies of each block: one on the local disk, one on another machine in the same rack, and one on a different rack; the metadata of the blocks is kept in memory. The data nodes create, delete and copy data blocks under the unified scheduling of the name node and report to the name node periodically. The user side is the interface for user access and is responsible for interacting with the cluster and performing operations such as reading, writing and uploading files.
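A minimal sketch of the default three-copy placement described above, assuming a simple rack map, might look as follows; the function name and the nodes_by_rack structure are illustrative.

import random

def choose_replica_nodes(local_node, local_rack, nodes_by_rack):
    # Pick the default three replica targets: the local node, another node in the
    # same rack, and a node in a different rack.
    same_rack = [n for n in nodes_by_rack.get(local_rack, []) if n != local_node]
    other_racks = [n for rack, nodes in nodes_by_rack.items()
                   if rack != local_rack for n in nodes]
    return [local_node,
            random.choice(same_rack) if same_rack else None,
            random.choice(other_racks) if other_racks else None]

# Example: replicas of a block first written on dn-01 in rackA
targets = choose_replica_nodes("dn-01", "rackA",
                               {"rackA": ["dn-01", "dn-02"], "rackB": ["dn-07", "dn-08"]})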
In the system, the data nodes are divided into data nodes within the local domain and data nodes outside it, and on that basis further into hot and non-hot nodes. A routing table of the data nodes is set at the user side; the routing table is an index table for accessing the associated information of the name node. The user first accesses the name node, which controls and forwards the access requests of the user side. Optionally, a static cache table or static cache queue is arranged at the name node, storing part of the hot data in blocks according to how heavily the data is accessed.
For example, when performing application identification on character feature data in an industrial plant, segmented key feature data can be stored at the name node, with a free storage space partitioned off at the name node for it. The user's access request is then scheduled to the data node through the name node index, which improves the user's access efficiency.
Optionally, the data nodes are divided into hot and non-hot data nodes and the name node schedules the data of the corresponding data node. When a user requests the location information of a file and does not know whether the requested file is a hot spot or not, the user still sends the request to the data node responsible for the local domain, as the default action. If that data node cannot satisfy the access request, the request is fed back to the name node through the public communication interface of the in-domain node and forwarded to a data node outside the domain.
Optionally, a redirection-message function is added to the data nodes to notify the requesting user side of changes to access information and addresses and of changes to index information, and the name node's configured and updated routing index table is sent to the user side to indicate which files are to be downloaded and to which data nodes the requests should be sent. Using the routing index table, when downloading a hot-spot file the user side sends a request to a hot data node in the local domain and triggers the lookup and delivery of the hot cache information; when downloading a non-hot-spot file, it preferentially sends the request to the corresponding non-hot data node within the domain. Searching for the file's metadata across multiple data nodes is thereby avoided, and the response time for the file is greatly shortened.
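For illustration, the client-side routing decision driven by such a routing index table could be sketched as follows; the layout of the index-table entries is an assumption made only for this example.

def route_request(file_id, routing_index):
    # routing_index is an assumed layout of the client-side routing index table
    # distributed by the name node:
    #   {file_id: {"hot": bool, "local_hot_node": addr, "local_node": addr}}
    entry = routing_index.get(file_id)
    if entry is None:
        return "name-node"                 # unknown file: fall back to the name node
    if entry["hot"]:
        return entry["local_hot_node"]     # hot-spot file: hot data node in the local domain
    return entry["local_node"]             # non-hot-spot file: local non-hot data node first

# Example
table = {"/logs/part-0001": {"hot": True, "local_hot_node": "dn-02", "local_node": "dn-05"}}
target = route_request("/logs/part-0001", table)   # returns "dn-02"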
Before a data download, and before the message is sent, the name node index table of the routing table is searched and the name node forwards the message to the corresponding data node. When the data address of a data node changes, a redirection message (illustratively, one sent from the data node) is sent to the user side or the name node; the data ID and the data node's address information are extracted from the redirection message, and the name node routing index table at the user side is updated.
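A corresponding sketch of how such a redirection message might be applied to the client-side routing index table is shown below; the message fields (data_id, new_address) are assumed rather than taken from the patent text.

def handle_redirect(redirect_msg, routing_index):
    # Apply a redirection message to the client-side routing index table so that
    # later requests go to the data node's new address.
    entry = routing_index.setdefault(redirect_msg["data_id"],
                                     {"hot": False, "local_hot_node": None, "local_node": None})
    entry["local_node"] = redirect_msg["new_address"]
    return routing_index

# Example: a data node reports that the data identified by "/logs/part-0001" has moved.
table = handle_redirect({"data_id": "/logs/part-0001", "new_address": "dn-09"}, {})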
Optionally, a file-index acquisition request is sent to the data server cluster formed by the data nodes to obtain the metadata index returned by the metadata server cluster;
the data server cluster is configured as a fully peer-to-peer distributed storage system based on consistent hashing. When the user side needs to access the metadata of a particular file, a fixed hashing algorithm maps the metadata index onto a segment of a hash ring; the distributed routing information is then searched to find the metadata server (that is, the data node) corresponding to that hash-ring segment, and finally the read or write request for the metadata is sent to that metadata server, i.e. the data node storing the data.
Optionally, when the user side receives a query request from a user, a file request message is constructed and sent to the corresponding data node, the file request message including: the ID of the file to be downloaded, the address of the user, and the amount of user information to be acquired.
Before sending the request message, the name node determines whether the address information of the requested file is already stored in the local static cache list; if so, a request is sent to a data node in the local domain, the data blocks in the returned cache sequence are sent, and the request is sent to the determined address or data node server.
The file access request is parsed as follows: the file ID is decomposed into a storage path and a file name, and the name node, acting as the storage-path index server, is used to obtain the access-control attribute of the storage path.
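A minimal sketch of this request parsing, assuming a POSIX-style file ID and a simple path-to-attribute index held by the name node, might read:

import posixpath

def resolve_request(file_id, path_acl):
    # Decompose the file ID into a storage path and a file name, then look up the
    # access-control attribute of the path; path_acl stands for an assumed
    # {storage_path: attribute} index at the name node.
    storage_path, file_name = posixpath.split(file_id)
    return storage_path, file_name, path_acl.get(storage_path, "deny")

# Example
path, name, acl = resolve_request("/industry/plant1/features.bin",
                                  {"/industry/plant1": "read-only"})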
Optionally, when there are several name nodes, a master-slave arrangement is set up in which one serves as the backbone name node, and a static cache queue is opened up in the backbone node's memory; on each user access, the hot data accessed by users is updated in the name node's memory. The hot data, i.e. the first few data blocks of the files that have been downloaded and accessed most in the recent period, are stored in the corresponding cache blocks of the static queue. Not all file data is placed in the cache queue; rather, a starting segment of each hot-spot file, of a certain length, is pre-stored in the static cache space of the backbone node, so that a user receives the cached data of the corresponding file immediately after sending a data access request.
The cache length for a file is set according to the size and popularity of the accessed file data, with in-domain and out-of-domain settings distinguished, and the data in the static storage space is adjusted accordingly.
Specifically, the hot data accessed by user sides is obtained by clustering the current user access requests, and it is updated in the memory of the name node; the hot data is stored by allocating a corresponding cache data block according to the aggregate bandwidth of the user access requests and recording the access-request bandwidth of the corresponding user side in the static cache queue.
Correspondingly, a dynamic cache area is also provided at the data node, containing a free queue, a write-operation queue, a ready cache queue and a read cache queue. The system checks whether the requested file blocks exist in the file system; if so, they are looked up in turn in the static cache queue, the read cache queue, the ready cache queue and the write-operation queue, and once a block is found it is prepared so that the user can download the requested file block. In the dynamic cache area, a user can only read file blocks that are in the read cache queue or the ready cache queue. A block in the read cache queue moves to the tail of the read queue after being read, while a block in the ready cache queue leaves the ready queue after its first read and is appended to the tail of the read queue. When the read cache queue reaches its upper limit, cache blocks are replaced so that the read queue keeps a fixed size after new cache blocks join it from the ready queue: the cache block nearest the head of the queue that is not being read or requested is emptied and added to the free queue. In addition, if the free queue does not have enough cache blocks for a write, the first block in the ready queue, e.g. the block written earliest but never hit, is replaced.
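The queue handling described above could be sketched as follows; the block representation, the read-queue limit and the class name are illustrative assumptions, and only the lookup order and the read-side movement between the ready, read and free queues are shown.

from collections import deque

class DynamicCache:
    # Illustrative dynamic cache area at a data node with free, write-operation,
    # ready and read queues.
    def __init__(self, read_limit=8):
        self.free = deque()       # emptied cache blocks
        self.write = deque()      # blocks being written
        self.ready = deque()      # blocks prepared for a first read
        self.read = deque()       # blocks already read at least once
        self.read_limit = read_limit

    def find(self, block_id, static_queue):
        # lookup order: static cache queue, read queue, ready queue, write-operation queue
        for queue in (static_queue, self.read, self.ready, self.write):
            for block in queue:
                if block.get("id") == block_id:
                    return block
        return None

    def on_read(self, block):
        # a read block moves to the tail of the read queue; a ready block leaves the
        # ready queue after its first read and joins the tail of the read queue
        if block in self.read:
            self.read.remove(block)
        elif block in self.ready:
            self.ready.remove(block)
        self.read.append(block)
        # keep the read queue at its upper limit: the block nearest the head that is
        # not being read or requested is emptied and returned to the free queue
        while len(self.read) > self.read_limit:
            victim = next((b for b in self.read if not b.get("in_use")), None)
            if victim is None:
                break
            self.read.remove(victim)
            victim["data"] = None
            self.free.append(victim)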
Correspondingly, data uploading also takes place at the user side: the user side communicates with the name node to send a file upload request. The name node checks the metadata to ensure that the file does not already exist and that the user side has permission to create it; if not, an error is returned. The user side physically divides the file to be uploaded into Blocks (optionally 128 MB each) and sends them to the first corresponding data node (the next Block is handled after a Response is received), and the first data node stores the replicas in pipeline fashion. When the user side has a file-write request, the request is first sent to the name node; after receiving the specific request, the name node exchanges information with the data nodes. Specifically, the user side sends the file size and configuration information to the name node, and the name node returns the address information of the data nodes it manages according to the received information; the user side then splits the file to be written into a number of small data blocks according to the data node address information returned by the name node and writes them to the corresponding data nodes in sequence. For reading, the user side first sends a file-read request to the name node; upon receiving it, the name node quickly returns the address information of the data nodes storing the requested file, so that the user side can read the file through those addresses.
Optionally, the file or its fragments are sent to a selected data server cluster for storage, with the replica data nodes chosen with cluster stability and load balance in mind. After all block replicas have been stored, a confirmation message is returned to the user side, which then informs the name node that the upload is finished and closes the connection channel.
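By way of example, the client-side upload loop described in the preceding paragraphs might be sketched as follows; name_node.create, name_node.complete and pipeline_send stand for assumed interfaces to the name node and to the replication pipeline rather than APIs defined by the patent.

BLOCK_SIZE = 128 * 1024 * 1024     # 128 MB per Block, matching the optional size above

def upload(local_path, file_id, name_node, pipeline_send):
    # create() is assumed to verify that the file does not exist and that the user
    # side has creation authority, returning the target data nodes chosen with
    # cluster stability and load balance in mind.
    targets = name_node.create(file_id)
    with open(local_path, "rb") as f:
        index = 0
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            # hand the Block to the first data node, which stores the replicas in
            # pipeline fashion; the next Block is sent after the Response arrives
            pipeline_send(targets, f"{file_id}#blk{index}", block)
            index += 1
    name_node.complete(file_id)    # report that the upload is finished; the channel can be closed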
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic Disk, an optical Disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a Flash Memory (Flash Memory), a Hard Disk Drive (Hard Disk Drive, abbreviated as HDD), or a Solid State Drive (SSD); the storage medium may also comprise a combination of memories of the kind described above.
As used in this application, the terms "component," "module," "system," and the like are intended to refer to a computer-related entity, either hardware, firmware, a combination of hardware and software, or software in execution. For example, a component may be, but is not limited to being: a process running on a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a computing device and the computing device can be a component. One or more components can reside within a process and/or thread of execution and a component can be localized on one computer and/or distributed between two or more computers. In addition, these components can execute from various computer readable media having various data structures thereon. The components may communicate by way of local and/or remote processes such as in accordance with a signal having one or more data packets (e.g., data from one component interacting with another component in a local system, distributed system, and/or across a network such as the internet with other systems by way of the signal).
It should be noted that the above-mentioned embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions can be made to the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention, which should be covered by the claims of the present invention.

Claims (7)

1. A data processing method for big data, characterized in that the method comprises setting up a name node, data nodes and a user side;
the name node is configured as a central management server, stores descriptive metadata in list form in its memory, and provides an internal metadata service in response to a user's request to access a file; a static cache queue is arranged at the name node and stores the block-level access data of the corresponding hot-spot files;
arranging the static cache queue and storing the block-level access data of the corresponding hot-spot files comprises: the name node opens up a static cache queue in its memory; the hot-spot data accessed by the user side is obtained by clustering the current user access requests, and the hot-spot data accessed by the user side is updated in the memory of the name node; the hot data is stored by allocating a corresponding cache data block according to the aggregate bandwidth of the user access requests; when a plurality of name nodes exist, a master-slave mode is set, and a starting segment of preset length of each hot-spot data file is pre-stored in the static cache space of the backbone name node; the cache length of the cache data block is set according to the size and popularity of the accessed hot-spot data file, with the in-domain and out-of-domain settings of the data nodes distinguished;
the data nodes are used for storing the data required by the user side, storing it in blocks of a fixed, configured size with backup copies; they receive access requests forwarded by the name node, create, delete and copy data blocks under the unified scheduling of the name node, and report to the name node periodically; the data nodes are divided into data nodes within the local domain and data nodes outside the local domain, and into hot and non-hot data nodes according to how heavily they are accessed; the name node schedules the data on the data nodes; when a user requests the location information of a file, the user does not know whether the requested file is a hot spot or not, and by default the user side preferentially sends the request to the data node responsible for the local domain;
the user side performs data access through the name node; a routing table of the data nodes is set at the user side, the routing table being an index table for accessing the name node; a redirection message is configured on the data nodes to notify the requesting user side that its routing information has changed, and the name node's updated routing index table is sent to the user side to indicate the address of the file to be downloaded;
the user first accesses the name node, and the name node controls and distributes the access requests of the user side.
2. The method of claim 1, wherein: according to the routing index table, the user side sends a request to a data node in the local domain when downloading a hot-spot file; when downloading a non-hot-spot file, it preferably sends the request to a non-hot data node outside the local domain.
3. The method of claim 2, wherein: before sending an access request message, the name node determines whether the address information of the requested file already exists in the local static cache queue; if so, a request is sent to a data node in the local domain, the data blocks in the returned static cache queue are sent, and the request is sent to the determined address or data node server.
4. The method of claim 3, wherein: when the data address of a data node changes, a redirection message is sent to the user side or the name node.
5. The method of claim 4, wherein: when the user side needs to access the metadata of a file, the metadata index is mapped onto a segment of a hash ring, and the distributed routing information is then searched to find the data node corresponding to that hash-ring segment.
6. The method of claim 5, wherein: the name node constructs a file request message and sends it to the corresponding data node, the request message comprising: the ID of the file to be downloaded, the address of the user, and the amount of user information to be acquired.
7. A big data processing system, the system comprising a name node, data nodes and user sides, and configured to implement the method of any one of claims 1-6.
CN202211166164.2A (filed 2022-09-23, priority 2022-09-23): Data processing method and system for big data; granted as CN115277858B (en); status Active

Priority Applications (1)

Application: CN202211166164.2A; Publication: CN115277858B (en); Title: Data processing method and system for big data


Publications (2)

Publication Number / Publication Date
CN115277858A (en): 2022-11-01
CN115277858B (en): 2022-12-20

Family

ID=83757574

Family Applications (1)

Application: CN202211166164.2A (Active); Publication: CN115277858B (en); Title: Data processing method and system for big data

Country Status (1)

Country Link
CN (1) CN115277858B (en)


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11570247B2 (en) * 2019-11-08 2023-01-31 Goodblock Technologies, Inc. Resilient distributed storage system
CN114385561A (en) * 2022-01-10 2022-04-22 北京沃东天骏信息技术有限公司 File management method and device and HDFS system

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104573119A (en) * 2015-02-05 2015-04-29 重庆大学 Energy-saving-oriented Hadoop distributed file system storage policy in cloud computing
US10498804B1 (en) * 2016-06-29 2019-12-03 EMC IP Holding Company LLC Load balancing Hadoop distributed file system operations in a non-native operating system
CN109144972A (en) * 2017-06-26 2019-01-04 华为技术有限公司 A kind of method and back end of Data Migration
CN108255928A (en) * 2017-11-30 2018-07-06 北京元心科技有限公司 Distributed system method for reading data and device
CN109992575A (en) * 2019-02-12 2019-07-09 哈尔滨学院 The distributed memory system of big data
CN111290710A (en) * 2020-01-20 2020-06-16 北京信息科技大学 Cloud copy storage method and system based on dynamic adjustment replication factor

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Design of a Hadhoop-based cloud testing architecture; Pan Hui et al.; Computer Engineering and Science (计算机工程与科学); 2013-10-31; Vol. 35, No. 10; pp. 75-76 *

Also Published As

Publication Number / Publication Date
CN115277858A (en): 2022-11-01


Legal Events

Code / Description
PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant