CN111078635B - Data processing method based on Hadoop

Info

Publication number: CN111078635B
Application number: CN201911253880.2A
Authority: CN
Inventors: 林森; 唐宁; 马娜
Original assignee: Tianjin Kuaiyou Century Technology Co Ltd
Current assignee: Tianjin kuaiyou Century Technology Co., Ltd
Priority date: 2019-12-09
Filing date: 2019-12-09
Publication date: 2021-03-19
Anticipated expiration: 2039-12-09
Also published as: CN111078635A

Abstract

The invention discloses a Hadoop-based method for judging and processing data acquisition priority, which uses a Hadoop distributed file system to process mass data. The Hadoop distributed file system comprises a user node, a named node, a scanning module and child nodes; the user node comprises a Java virtual machine, the Java virtual machine comprises a Hadoop client, and the Hadoop client interacts with a distributed file subsystem and a data output stream respectively. The named node stores call frequency detection information, which comprises the number of times each child node is called; the named node ranks the priority of the child nodes according to the number of times each is called, and the more often a child node is called, the higher its priority.

Description

Data processing method based on Hadoop

Technical Field

The invention relates to the technical field of data processing, in particular to a data processing method based on Hadoop.

Background

Hadoop is a distributed system infrastructure developed by the Apache Foundation. Hadoop implements a distributed file system that is highly fault-tolerant and designed to be deployed on low-cost hardware; it provides high-throughput access to application data and is suitable for applications with very large data sets. The Hadoop distributed file system allows the data in the file system to be accessed in a streaming manner.

Hadoop was formally introduced by the Apache Software Foundation in autumn 2005 as part of Nutch, a sub-project of Lucene. It was inspired by MapReduce and the Google File System (GFS), both first developed at Google.

The ideas behind Hadoop originated at Google, in a programming model named MapReduce. Google's MapReduce framework can break an application into many parallel computing instructions and run them over a very large data set across a large number of computing nodes. A typical use of this framework is a search algorithm running on web data. Hadoop was initially related only to web page indexing and rapidly developed into a leading platform for analyzing big data.

Hadoop has gained widespread use in big data processing thanks to its natural advantages in data extraction, transformation and loading (ETL). The distributed architecture of Hadoop places the big data processing engine as close to the storage as possible, which suits batch operations such as ETL, because the batch results of such operations can go directly to storage. The MapReduce function of Hadoop breaks a single task into fragments, sends the fragments (Map) to multiple nodes, and then loads (Reduce) the results into a data warehouse as a single data set.

Hadoop was originally designed for high reliability, high scalability, high fault tolerance and high efficiency. These advantages made Hadoop popular with companies as soon as it appeared and also attracted wide attention from the research community. To date, Hadoop has been widely used in the internet field. For example, Yahoo uses a Hadoop cluster of 4000 nodes to support research on advertising systems and Web search; Facebook runs Hadoop on a cluster of 1000 nodes to store log data and to support data analysis and machine learning on it; Hadoop is also used to process 200 TB of data every week for search log analysis and web page data mining; the China Mobile research institute has developed the 'Big Cloud' system based on Hadoop, which is used not only for analyzing related data but also for providing services externally; and Taobao's Hadoop system is used for storing and processing data related to e-commerce transactions. Colleges, universities and scientific research institutes in China carry out Hadoop-based research on data storage, resource management, job scheduling, performance optimization, high system availability and security, and most of the results are contributed to the Hadoop community as open source.

In the prior art, files in the Hadoop distributed file system are write-once and have only one writer at any time. That is, the files support a write-once, read-many model: once information has been written, it cannot be modified, but it can be read many times.

Disclosure of Invention

In order to overcome the problems in the prior art, the invention provides a data processing method based on Hadoop, which uses a Hadoop distributed file system to process mass data, wherein the Hadoop distributed file system comprises a user node, a named node, a scanning module and child nodes, the user node comprises a Java virtual machine, the Java virtual machine comprises a Hadoop client, and the Hadoop client interacts with a distributed file subsystem and a data output stream respectively; the named node stores call frequency detection information, the call frequency detection information comprises the number of times each child node is called, the named node ranks the priority of the child nodes according to the number of times each child node is called, and the more often a child node is called, the higher its priority;

the method further comprises a priority domain, wherein after the named node acquires call information from the Hadoop client and generates its priority division information, the named node renames each child node; the priority domain divides some of the child nodes into priority blocks and, at the same time, sends information to the named node, and the named node names the node information in the priority domain as priority nodes; the scanning module preferentially scans the priority nodes in the priority domain.

Further, when a child node is identified as having class-two or class-three priority, the priority domain divides the child node into the priority blocks and, at the same time, sends information to the named node, and the named node names the node information outside the priority domain as priority nodes; the scanning module preferentially scans the priority nodes outside the priority domain.

Further, the priorities include class-one priority, class-two priority and class-three priority, and are determined as follows: let P be the total number of calls to all child nodes within a preset time; a child node called more than P/2 times within the preset time is determined to have class-one priority, a child node called more than P/4 times (but not more than P/2 times) is determined to have class-two priority, and all other child nodes are determined to have class-three priority.

Further, when the number of times a child node is called within the predetermined time is greater than the sum of the numbers of times any four other child nodes are called, the child node is determined to have class-one priority; and when the number of times a child node is called within the predetermined time is greater than the numbers of times any two other child nodes are called, the child node is determined to have class-two priority.

Further, the named node sends the generated priority determination type information to the scanning module after receiving the append command, and the scanning module starts to scan each child node after receiving the priority determination type. After receiving the priority determination type for the first time, the scanning module still scans the child nodes one by one, so as to mark and classify the child nodes identified as having class-one, class-two and class-three priority. After this initial scan, the named node generates a priority block rule comprising a first priority scanning block that marks all class-one priority child nodes, a second priority scanning block that marks all class-two priority child nodes, and a third priority scanning block that marks all class-three priority child nodes. In each subsequent scan, the scanning module scans the first, second and third priority scanning blocks in sequence, searches for the required file, obtains the block positions of the file blocks, and then fills in the metadata according to the scanned content.

Further, the named node responds to the Hadoop client with a located-block data structure that includes the identifiers of all data nodes that are to append copies of the data block to the existing file, and the Hadoop client directly requests the identified data nodes to append the data block to the existing file by sending them a portion of an extended-block data structure that includes the ID of the data block and the data.

Further, a data node receiving the extended-block data structure uses the ID of the data block in the received structure to access the corresponding block of the existing file, and writes the data in the received structure to the accessed block.

Further, a file placement optimization module is configured to adjust the amount of data that can be stored on a single data node or a single server.

Further, the block scanning module scans the data nodes to find a needed file, obtains the block positions of the file blocks, and then fills in the metadata according to the scanned content, so that the metadata reflects the positions of the blocks and the number of copies; the block scanning module returns the positions of consecutive block files from a single data node to provide the Hadoop client with the illusion that the block files are placed.

Further, the block scanning module obtains the name of the data node and the address of the data block, creates a block ID for the data block and stores the block ID in the metadata, and the named node uses the information received from the block scanning module to update the block list and the location of each copy of each block.

Compared with the prior art, the invention has the following advantages: the Hadoop-based data processing method uses the Hadoop distributed file system to process mass data, and the Hadoop client sends an append command to the named node, the append command carrying a name identifying the existing file to be appended to and a parameter identifying the data to be appended. By writing through file appends, the invention overcomes the prior-art problem that information in the Hadoop distributed file system cannot be modified once it has been written.

Furthermore, call frequency detection information is stored in the named node and comprises the number of times each child node is called; the named node ranks the priority of the child nodes according to the number of times each is called, and the more often a child node is called, the higher its priority, which improves the efficiency of calling the child nodes.

Drawings

Fig. 1 is a schematic overall structure diagram of an embodiment of a server according to the present invention.

Detailed Description

The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.

Referring to Fig. 1, the present invention provides a data processing method based on Hadoop, which uses a Hadoop distributed file system to process mass data. The Hadoop distributed file system includes a user node 261, a named node 301, a block scanning module 560 and child nodes (the embodiment shown in Fig. 1 includes a plurality of child nodes, namely a first child node 311, a second child node 312, …, an nth child node 31n). The user node 261 includes a Java virtual machine 401, the Java virtual machine 401 includes a Hadoop client 221, and the Hadoop client 221 interacts with a distributed file subsystem 402 and a data output stream 403 respectively. The child nodes include data nodes (the first child node 311 includes a first data node 341, the second child node 312 includes a second data node 342, …, the nth child node 31n includes an nth data node 34n). When the Hadoop client 221 calls the child nodes, the named node 301 stores call frequency detection information, which includes the number of times each child node is called; the named node ranks the priority of the child nodes according to the number of times each is called, and the more often a child node is called, the higher its priority.
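By way of illustration only, the call frequency detection information described above can be sketched as a simple counter. The class name `CallFrequencyTracker`, its methods, and the use of reference numerals as node identifiers are hypothetical and are not taken from the embodiment; the sketch only shows counting calls per child node and ordering the child nodes from most to least called.

```python
from collections import Counter


class CallFrequencyTracker:
    """Hypothetical stand-in for the call frequency detection
    information kept by the named node."""

    def __init__(self):
        self._calls = Counter()

    def record_call(self, child_node_id):
        # One more call to this child node within the observation window.
        self._calls[child_node_id] += 1

    def ranked(self):
        # Most frequently called first; ties are broken by node identifier
        # (an assumption made here only to keep the order deterministic).
        return sorted(self._calls, key=lambda n: (-self._calls[n], n))


tracker = CallFrequencyTracker()
for node in ["311", "312", "311", "313", "311", "312"]:
    tracker.record_call(node)

ranking = tracker.ranked()
print(ranking)
```

In this sketch the child node called most often appears first in the ranking, which corresponds to the highest priority.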

Specifically, the priority blocks are used to adjust the amount of data that can be stored on a single data node or a single server. The block scanning module 560 scans the data nodes 341-34n for the required files, obtains the block locations of the file blocks, and then populates the metadata 530 according to the scanned content, so that the metadata reflects the locations of the blocks and the number of copies. The block scanning module 560 obtains the name of the data node and the address of the data block, creates a block ID for the data block and stores the block ID in the metadata 530. The block scanning module 560 returns the locations of contiguous block files from a single data node to provide the Hadoop client 221 with the illusion that the block files are placed. The named node 301 uses the information received from the block scanning module 560 to update the block list and the location of each copy of each block.
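The metadata population step can be sketched as follows. This is a minimal illustration under stated assumptions: the function `scan_for_file`, the tuple layout of the scanned blocks, and the `file#index` block ID scheme are all hypothetical, since the embodiment does not specify how block IDs are formed or how the metadata 530 is laid out.

```python
def scan_for_file(data_nodes, wanted_file):
    """Scan every data node for blocks of wanted_file and build metadata.

    data_nodes maps a data node name to a list of
    (file_name, block_index, address) tuples (a hypothetical layout).
    """
    metadata = {}
    for node_name, blocks in data_nodes.items():
        for file_name, block_index, address in blocks:
            if file_name != wanted_file:
                continue
            # Hypothetical block ID scheme: file name plus block index.
            block_id = f"{file_name}#{block_index}"
            entry = metadata.setdefault(block_id, {"locations": [], "copies": 0})
            # Record where this copy lives and count it.
            entry["locations"].append((node_name, address))
            entry["copies"] += 1
    return metadata


data_nodes = {
    "341": [("a.log", 0, "0x10"), ("b.log", 0, "0x20")],
    "342": [("a.log", 0, "0x30"), ("a.log", 1, "0x40")],
}
metadata = scan_for_file(data_nodes, "a.log")
print(metadata)
```

Here the resulting metadata reflects both the location of each block and the number of copies, which is the information the named node would use to update its block list.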

Specifically, the priorities include class-one priority, class-two priority and class-three priority, and setting the priorities serves to shorten the calling procedure. In some embodiments of the present invention, the priorities are determined as follows: let P be the total number of calls to all child nodes within a preset time; if a child node is called more than P/2 times within the preset time, it is determined to have class-one priority; if a child node is called more than P/4 times (but not more than P/2 times) within the preset time, it is determined to have class-two priority; and all remaining child nodes are determined to have class-three priority.
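The threshold rule above can be sketched as a short function. The name `classify_by_threshold` and the example call counts are illustrative assumptions; P is taken here to be the sum of the call counts of all child nodes within the preset time.

```python
def classify_by_threshold(call_counts):
    """Assign class 1, 2 or 3 to each child node from its call count.

    call_counts maps a child node identifier to the number of times it
    was called within the preset time.
    """
    total = sum(call_counts.values())  # P: calls to all child nodes
    priorities = {}
    for node, count in call_counts.items():
        if count > total / 2:
            priorities[node] = 1  # more than P/2 calls
        elif count > total / 4:
            priorities[node] = 2  # more than P/4 calls
        else:
            priorities[node] = 3  # all remaining child nodes
    return priorities


example_counts = {"311": 60, "312": 30, "313": 10}  # P = 100
priorities = classify_by_threshold(example_counts)
print(priorities)
```

Under this reading, at most one child node can exceed P/2, so the class-one set contains at most a single node in any observation window.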

In other embodiments of the present invention, if one or more child nodes are each called more times within a predetermined time than the sum of the numbers of times any four other child nodes are called, those child nodes are considered to have class-one priority; and if a child node is called more times within the predetermined time than any two other child nodes are called, that child node is considered to have class-two priority.
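The comparative rule admits more than one reading, so the following is only a sketch under explicit assumptions. It reads the class-two condition as a sum over two other child nodes, in parallel with the class-one condition; it assigns class three to every remaining child node; and when fewer than four other child nodes exist it sums those that are available. The function name `classify_by_comparison` is hypothetical. Exceeding the sum of any four other nodes is equivalent to exceeding the sum of the four most-called other nodes, which is what the code checks.

```python
def classify_by_comparison(call_counts):
    """Assign class 1, 2 or 3 by comparing each node with the others.

    Assumption: "any two other child nodes" is read as the sum of the
    two most-called other nodes, mirroring the class-one rule.
    """
    priorities = {}
    for node, count in call_counts.items():
        # Call counts of all other child nodes, largest first.
        others = sorted(
            (c for n, c in call_counts.items() if n != node), reverse=True
        )
        if count > sum(others[:4]):
            priorities[node] = 1
        elif count > sum(others[:2]):
            priorities[node] = 2
        else:
            priorities[node] = 3  # assumed fallback for remaining nodes
    return priorities


dominant = {"311": 50, "312": 12, "313": 10, "314": 8, "315": 5, "316": 4}
moderate = {"311": 40, "312": 30, "313": 5, "314": 4, "315": 3, "316": 2}
dominant_result = classify_by_comparison(dominant)
moderate_result = classify_by_comparison(moderate)
print(dominant_result["311"], moderate_result["311"])
```

In the first example child node 311 exceeds the sum of the four most-called other nodes (35) and is class one; in the second it exceeds only the sum of the two most-called other nodes (35 against a four-node sum of 42) and is class two.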

Specifically, the named node may accept an append command sent by the Hadoop client, and the append command is used to determine the setting of P or the type of priority determination.

Specifically, the named node sends the generated priority determination type information to the scanning module 560 after receiving the append command, and the scanning module 560 starts scanning each child node after receiving the priority determination type.

In some embodiments of the present invention, after receiving the priority determination type for the first time, the scanning module 560 still scans the child nodes one by one, so as to mark and classify the child nodes identified as having class-one, class-two and class-three priority; at the same time, the scanning module 560 searches for the required file, obtains the block positions of the file blocks, and then fills in the metadata 530 according to the scanned content, so that the metadata reflects the positions of the blocks and the number of copies. After the scanning module 560 performs this initial scan, a priority block rule is generated, comprising a first priority scanning block that marks all class-one priority child nodes, a second priority scanning block that marks all class-two priority child nodes, and a third priority scanning block that marks all class-three priority child nodes. In every scan other than the first, the scanning module 560 scans the first, second and third priority scanning blocks in sequence, searches for the required file, obtains the block positions of the file blocks, and then fills in the metadata 530 according to the scanned content.
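The priority block rule and the resulting scan order can be sketched as below. The function names, the dictionary representation of the three priority scanning blocks, and the choice to stop at the first child node holding the required file are assumptions made for illustration; the embodiment states only that the blocks are scanned in sequence.

```python
def build_priority_blocks(priorities):
    """Group child nodes into the first, second and third priority
    scanning blocks from a mapping of node identifier to class (1, 2, 3)."""
    blocks = {1: [], 2: [], 3: []}
    for node in sorted(priorities):
        blocks[priorities[node]].append(node)
    return blocks


def scan_order(blocks):
    # First priority scanning block, then second, then third.
    return blocks[1] + blocks[2] + blocks[3]


def find_file(sequence, node_contents, wanted_file):
    """Scan nodes in the given order; return the first node holding the
    file and how many nodes were scanned (stopping early is an assumption)."""
    for scanned, node in enumerate(sequence, start=1):
        if wanted_file in node_contents.get(node, ()):
            return node, scanned
    return None, len(sequence)


priorities = {"311": 3, "312": 1, "313": 2, "314": 1}
blocks = build_priority_blocks(priorities)
order = scan_order(blocks)
found = find_file(order, {"313": {"a.log"}}, "a.log")
print(order, found)
```

The intended benefit of such an ordering is that files held on frequently called child nodes are reached after scanning fewer nodes than a fixed one-by-one scan would require.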

In other embodiments of the present invention, the system further includes a priority domain 302. After the named node 301 obtains call information from the Hadoop client and generates its priority division information, the named node 301 renames each child node. For example, when the first child node 311 and the second child node 312 are identified as having class-one priority, the priority domain 302 divides the first child node 311 and the second child node 312 into the priority block and, at the same time, sends information to the named node 301, and the named node 301 names the node information in the priority domain 302 as priority node 311 and priority node 312; the scanning module 560 preferentially scans the priority nodes in the priority domain 302 and, after the priority nodes have been scanned, continues to scan the remaining child nodes. As another example, when the first child node 311 and the second child node 312 are identified as having class-two or class-three priority, the priority domain 302 divides the first child node 311 and the second child node 312 into the priority block and, at the same time, sends information to the named node 301, and the named node 301 names the node information outside the priority domain 302 as priority nodes; the scanning module 560 preferentially scans these priority nodes and, after they have been scanned, continues to scan the remaining child nodes.

Specifically, the Hadoop client 221 sends an open command with a parameter to the named node 301, the parameter identifying the name of the file to be read. The named node 301 responds to the Hadoop client with a located-block data structure that includes the identifiers of all data nodes storing the file and the block IDs of all blocks in the file. The Hadoop client requests the blocks of the file directly from the identified data nodes by sending, for each requested block, a request containing the block ID of that block. A data node receiving such a request uses the block ID to access the corresponding block it stores and responds to the Hadoop client with the data of the accessed block. To append, the Hadoop client indicates to the named node that a data block is to be appended to an existing file: the named node receives an append command sent by the Hadoop client, with parameters identifying the name of the existing file to be appended to and the data to be appended.
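The read exchange described above can be sketched with in-memory stand-ins. The classes `NamedNode` and `DataNode`, the function `read_file`, the block IDs, and the choice of the first listed data node as the replica to read from are all hypothetical; they model the message flow only and do not use any real Hadoop API.

```python
class DataNode:
    """Hypothetical in-memory data node holding blocks by block ID."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block ID -> bytes

    def read_block(self, block_id):
        # Use the block ID to access the stored block and return its data.
        return self.blocks[block_id]


class NamedNode:
    """Hypothetical named node mapping file names to block locations."""

    def __init__(self):
        self.files = {}  # file name -> list of (block ID, [data node names])

    def open(self, file_name):
        # Stand-in for the located-block response: block IDs together
        # with the identifiers of the data nodes that store them.
        return list(self.files[file_name])


def read_file(named_node, data_nodes, file_name):
    chunks = []
    for block_id, node_names in named_node.open(file_name):
        # Request the block directly from an identified data node
        # (the first listed replica is chosen here as an assumption).
        chunks.append(data_nodes[node_names[0]].read_block(block_id))
    return b"".join(chunks)


dn341, dn342 = DataNode("341"), DataNode("342")
dn341.blocks["blk_1"] = b"hello "
dn342.blocks["blk_2"] = b"world"
nn = NamedNode()
nn.files["a.txt"] = [("blk_1", ["341"]), ("blk_2", ["342"])]

content = read_file(nn, {"341": dn341, "342": dn342}, "a.txt")
print(content)
```

Note that the named node supplies only locations; the block data itself travels directly between the data nodes and the client.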

Specifically, the named node 301 responds to the Hadoop client 221 with a located-block data structure that includes the identifiers of all data nodes that are to append copies of the data block to the existing file, and the Hadoop client 221 directly requests the identified data nodes to append the data block to the existing file by sending them a portion of an extended-block data structure that includes the ID of the data block and the data.

Specifically, a data node receiving the extended-block data structure uses the ID of the data block in the received structure to access the corresponding block of the existing file, and writes the data in the received structure to the accessed block.
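The append exchange can be sketched in the same spirit. Everything named here (`NamedNode.append`, `DataNode.append_block`, `append_to_file`, the dictionary used as the extended-block structure, and the block ID `blk_7`) is a hypothetical in-memory model of the described message flow, not a real Hadoop interface.

```python
class DataNode:
    """Hypothetical in-memory data node holding blocks by block ID."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block ID -> bytes

    def append_block(self, extended_block):
        # Use the block ID to access the existing block, then write the
        # received data onto the end of it.
        block_id = extended_block["block_id"]
        self.blocks[block_id] = self.blocks.get(block_id, b"") + extended_block["data"]


class NamedNode:
    """Hypothetical named node tracking the block to append to per file."""

    def __init__(self):
        self.last_block = {}  # file name -> (block ID, [data node names])

    def append(self, file_name):
        # Stand-in for the located-block response to an append command:
        # the block ID and every data node that must append a copy.
        return self.last_block[file_name]


def append_to_file(named_node, data_nodes, file_name, data):
    block_id, node_names = named_node.append(file_name)
    extended_block = {"block_id": block_id, "data": data}
    for name in node_names:
        # The client sends the block ID and data directly to each data node.
        data_nodes[name].append_block(extended_block)


dn341, dn342 = DataNode("341"), DataNode("342")
dn341.blocks["blk_7"] = b"abc"
dn342.blocks["blk_7"] = b"abc"
nn = NamedNode()
nn.last_block["a.txt"] = ("blk_7", ["341", "342"])

append_to_file(nn, {"341": dn341, "342": dn342}, "a.txt", b"def")
print(dn341.blocks["blk_7"], dn342.blocks["blk_7"])
```

Every data node holding a copy receives the same extended-block structure, so the copies remain identical after the append.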

Although the invention has been described in detail above with reference to a general description and specific examples, it will be apparent to one skilled in the art that modifications or improvements may be made thereto based on the invention. Accordingly, such modifications and improvements are intended to be within the scope of the invention as claimed.

Claims

1. A data processing method based on Hadoop, characterized in that a Hadoop distributed file system is used to process mass data, the Hadoop distributed file system comprises a user node, a named node, a scanning module and child nodes, the user node comprises a Java virtual machine, the Java virtual machine comprises a Hadoop client, and the Hadoop client interacts with a distributed file subsystem and a data output stream respectively; the named node stores call frequency detection information, the call frequency detection information comprises the number of times each child node is called, the named node ranks the priority of the child nodes according to the number of times each child node is called, and the more often a child node is called, the higher its priority;

the method further comprises a priority domain, wherein after the named node acquires call information from the Hadoop client and generates its priority division information, the named node renames each child node; the priority domain divides some of the child nodes into priority blocks and, at the same time, sends information to the named node, and the named node names the node information in the priority domain as priority nodes; the scanning module preferentially scans the priority nodes in the priority domain;

the priorities comprise class-one priority, class-two priority and class-three priority, and are determined as follows: let P be the total number of calls to all child nodes within a preset time; a child node called more than P/2 times within the preset time is determined to have class-one priority, a child node called more than P/4 times (but not more than P/2 times) within the preset time is determined to have class-two priority, and all other child nodes are determined to have class-three priority;

the named node sends the generated priority determination type information to the scanning module after receiving the append command, and the scanning module starts to scan each child node after receiving the priority determination type; after receiving the priority determination type for the first time, the scanning module still scans the child nodes one by one, so as to mark and classify the child nodes identified as having class-one, class-two and class-three priority; after the scanning module performs this initial scan, the named node generates a priority block rule, the priority block rule comprising a first priority scanning block that marks all class-one priority child nodes, a second priority scanning block that marks all class-two priority child nodes, and a third priority scanning block that marks all class-three priority child nodes; and in each subsequent scan, the scanning module scans the first priority scanning block, the second priority scanning block and the third priority scanning block in sequence, searches for the required file, obtains the block positions of the file blocks, and then fills in the metadata according to the scanned content.

2. The Hadoop-based data processing method according to claim 1, wherein a child node is considered to have a class one priority when the calling frequency of the child node within a predetermined time is greater than the sum of the frequencies called by any four other child nodes, and the child node is considered to have a class two priority when the calling frequency of the child node within a predetermined time is greater than the frequencies called by any two other child nodes.

3. The Hadoop-based data processing method according to claim 1, wherein the named node responds to the Hadoop client with a located-block data structure that includes the identifiers of all data nodes that are to append copies of the data block to the existing file, and the Hadoop client directly requests the identified data nodes to append the data block to the existing file by sending them a portion of an extended-block data structure that includes the ID of the data block and the data.

4. The Hadoop-based data processing method according to claim 3, wherein a data node receiving the extended-block data structure uses the ID of the data block in the received structure to access the corresponding block of the existing file, and writes the data in the received structure to the accessed block.

5. The Hadoop-based data processing method according to claim 4, wherein the file placement optimization module is configured to adjust the amount of data that can be stored on a single data node or a single server.

6. The Hadoop-based data processing method according to claim 4, wherein the block scanning module scans the data nodes for a desired file, obtains the block locations of the file blocks, and then fills in the metadata according to the scanned content, so that the metadata reflects the block locations and the number of copies, and the block scanning module returns the locations of consecutive block files from a single data node to provide the Hadoop client with the illusion that the block files are placed.

7. The Hadoop-based data processing method according to claim 6, wherein the block scanning module obtains the name of the data node and the address of the data block, creates a block ID for the data block and stores the block ID in the metadata, and the named node uses the information received from the block scanning module to update the block list and the location of each copy of each block.