JP2019204472A

JP2019204472A - Method for reading a plurality of small files of 2 MB or smaller from HDFS having a data merge module and an HBase cache module on the basis of Hadoop

Info

Publication number: JP2019204472A
Application number: JP2018147288A
Authority: JP
Inventors: 魏文国; Wenguo Wei; 謝桂園; Guiyuan Xie; 蔡君; Jun Cai; 趙慧民; Huimin Zhao; 彭建烽; Jianfeng Peng
Original assignee: Guangdong Polytechnic Normal University
Current assignee: Guangdong Polytechnic Normal University
Priority date: 2018-05-22
Filing date: 2018-08-04
Publication date: 2019-11-28
Anticipated expiration: 2038-08-04
Also published as: CN108804566A; CN108804566B; JP6695537B2

Abstract

To disclose a method for reading a large number of small files on the basis of Hadoop. SOLUTION: The reading method is applied to an HDFS system having a data merge module and an HBase cache module. The method receives a small-file read command input by a user, the read command including the user ID and the name of the small file; queries the HBase cache module according to the user ID and the name of the small file; returns the queried file content when the corresponding file content is found; and queries a database of the HDFS system when it is not found. By combining small-file merging with an HBase caching mechanism, the reading method improves the read efficiency of small files. SELECTED DRAWING: Figure 1

Description

The present invention relates to the field of computer technology, and more specifically to a method for reading a large number of small files based on Hadoop.

Hadoop was formally introduced by the Apache Foundation in 2005 as part of Nutch, a subproject of Lucene. The two most important designs in Hadoop are HDFS and MapReduce. HDFS stores large amounts of data, and files are stored in the system in the form of data blocks. An HDFS data block is much larger than the data block defined for an ordinary disk (usually 512 B); the current default HDFS block size is 128 MB. If a file stored in HDFS is larger than 128 MB, HDFS splits it into several blocks of the block size and stores them separately. If, however, HDFS keeps storing small files until the data reaches the TB or even PB level, the small-file problem arises: a large amount of metadata is stored in the namenode, the primary node of HDFS, which greatly increases the load on the namenode and degrades the read performance of the system. Here the small-file size is defined as 2 MB; that is, a file stored in HDFS whose size is 2 MB or less is defined as a small file.
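The block arithmetic described above can be illustrated with a minimal Python sketch. The function names `block_count` and `is_small_file` are hypothetical and not part of Hadoop; the sketch only models the 128 MB block size and the 2 MB small-file threshold stated in the text, and it counts blocks as a rough proxy for the number of metadata entries the namenode must keep.

```python
MB = 1024 * 1024
BLOCK_SIZE = 128 * MB       # default HDFS block size cited in the text
SMALL_FILE_LIMIT = 2 * MB   # small-file threshold used in this document


def block_count(file_size: int) -> int:
    """Number of blocks needed for a file of file_size bytes (integer ceiling)."""
    return (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE


def is_small_file(file_size: int) -> bool:
    """A file of 2 MB or less counts as a small file."""
    return file_size <= SMALL_FILE_LIMIT


# 64 separate 2 MB files occupy 64 blocks, while the same data
# merged into a single 128 MB file occupies one block.
separate_blocks = sum(block_count(2 * MB) for _ in range(64))
merged_blocks = block_count(64 * 2 * MB)
```

In this model the merged file needs one block where the separate files need 64, which is the metadata reduction that small-file merging aims for.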

For handling large numbers of small files, the existing technology merges several small files into one large file of block size but does not consider the relationships between the files, so the efficiency of reading small files is unsatisfactory.

Chinese Patent Application Publication No. 102799639

Embodiments of the present invention propose a method for reading a large number of small files based on Hadoop which, by combining file merging with an HBase caching mechanism, can improve the efficiency of reading small files.

An embodiment of the present invention provides a method for reading a large number of small files based on Hadoop. The reading method is applied to an HDFS system having a data merge module and an HBase cache module, and includes:

receiving a small-file read command input by a user, the read command including the user ID and the name of the small file;

querying the HBase cache module according to the user ID and the name of the small file;

if the corresponding file content is found, returning the file content found in the HBase cache module; otherwise, querying the database of the HDFS system by the name of the small file and determining whether the corresponding file content is found;

if so, returning the file content found in the database;

otherwise, calling the API of the Hadoop archive tool to access the HAR file corresponding to the name of the small file, and returning that HAR file.

Further, the data merge method adopted by the data merge module is as follows:

Step A: After the client uploads a file to be stored, traverse all files in HDFS and, using the user access preference model, find the related file collection of the file to be stored, where the user access preference model is derived from the user access log records.

Step B: Add the files in the related file collection and the file to be stored, one after another, to the merge queue.

Step C: Determine whether the total size of all files in the merge queue exceeds 128 MB. If yes, go to Step D; otherwise, go to Step E.

Step D: Merge all files in the merge queue into one data block, clear the file information in the merge queue, delete the source files of the merged files, and return to Step B.

Step E: Determine whether all files in the related file collection and the file to be stored have been added to the merge queue. If yes, merge all files in the merge queue into one data block, clear the file information in the merge queue, delete the source files of the merged files, and go to Step F; otherwise, return to Step B.

Step F: Store all merged data blocks in the HDFS system.

Further, the user access preference model is obtained statistically from the user access log records, specifically by:

computing the active user set from the user access log records;

using a bean object to represent each small file accessed by the active user set, a small file being a file of 2 MB or less, where the properties of the bean object include the ID of the user who accessed the small file, the name of the small file accessed, and the number of times the small file was accessed;

using JDBC to persist the bean objects to the MySQL database, and calculating the similarity of any two different access behaviors from the stored data;

when the similarity of two different access behaviors is positive, treating the users of the two access behaviors as similar users, recording the IDs of the similar users, and using the related file collection to store the information of the related files accessed by the similar users;

building the user access preference model from the related file collection.

Further, the cache method adopted by the HBase cache module is:

obtaining the user access log records and computing the active user set from them;

using a log-linear model to calculate the predicted popularity of the files accessed by each active user in the active user set, sorting the files in descending order of predicted popularity, and marking the top 20% of the files as hotspot files;

obtaining the hotspot files and using an HBase database to cache the related information of the hotspot files.

Further, computing the active user set from the user access log records specifically comprises:

filtering out of the user access log records the record lines whose accessed resource has the suffix jpg, each record line including a user ID, the accessed page URL, the access start time, the access status, and the access traffic;

creating a record-parsing class to parse the record lines, and using a two-dimensional array to store the visitor IPs and the names of the small files;

traversing the visitor IPs in the two-dimensional array and using a HashMap collection to total the traffic of each visitor IP, the Key of the HashMap collection being the visitor IP and the Value being the traffic;

sorting the HashMap collection in descending order of Value, selecting the top 20% of visitor IPs, storing this IP subset in an ArrayList collection, and marking it as the active user set.

Further, using the log-linear model to calculate the predicted popularity of the files accessed by each active user in the active user set, sorting the files in descending order of predicted popularity, and marking the top 20% of the files as hotspot files specifically comprises:

matching the visitor IPs extracted from the ArrayList collection against the visitor IPs extracted from the two-dimensional array;

on a match, using the matching visitor IP as the key to look up each user's access start time, then applying the log-linear model to calculate the predicted popularity of the files accessed by each active user in the active user set, sorting the files in descending order of predicted popularity, and marking the top 20% of the files as hotspot files.

The log-linear model is:

$$\ln N_i = k(t)\,\ln N_i(t) + b(t)$$

where $N_i$ is the predicted popularity of file i, $N_i(t)$ is the traffic of file i during the observation period, t is the length of the observation period, and $k(t)$ and $b(t)$ are the parameters of the linear relationship.

Implementing the embodiments of the present invention has the following beneficial effects:

The method for reading a large number of small files based on Hadoop provided by the embodiments of the present invention is applied to an HDFS system having a data merge module and an HBase cache module. The method receives a small-file read command input by a user, the command including the user ID and the name of the small file; queries the HBase cache module according to the user ID and the name of the small file; returns the queried file content if the corresponding content is found; otherwise queries the database of the HDFS system and, if that succeeds, returns the queried file content; and otherwise calls the API of the Hadoop archive tool to access the corresponding HAR file and returns the HAR file. Compared with the existing technology, which considers neither the relationships between small files nor hotspot files, the reading method of the present invention combines small-file merging with an HBase caching mechanism and can thereby improve the efficiency of reading small files.

FIG. 1 is a schematic flowchart of an embodiment of the method for reading a large number of small files based on Hadoop provided by the present invention.

FIG. 2 is a schematic flowchart of an embodiment of the data merge method provided by the present invention.

FIG. 3 is a schematic flowchart of an embodiment of the cache method provided by the present invention.

FIG. 4 is a schematic flowchart of another embodiment of the cache method provided by the present invention.

FIG. 1 is a schematic flowchart of an embodiment of the method for reading a large number of small files based on Hadoop according to the present invention. The method includes Steps 101 to 105 and is applied to an HDFS system having a data merge module and an HBase cache module. The steps are as follows:

Step 101: Receive a small-file read command input by a user, the read command including the user ID and the name of the small file.

Step 102: Query the HBase cache module according to the user ID and the name of the small file, and determine whether the corresponding file content is found. If yes, go to Step 105; otherwise, go to Step 103.

Step 103: Query the database of the HDFS system according to the name of the small file, and determine whether the corresponding file content is found. If yes, go to Step 105; otherwise, go to Step 104.

Step 104: Call the API of the Hadoop archive tool, access the HAR file corresponding to the name of the small file, and return that HAR file.

Step 105: Return the queried file content.
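The fallback order of Steps 101 to 105 can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: `read_small_file` and the three lookup callables are hypothetical names, and each callable is assumed to return `None` on a miss. In a real deployment they would wrap the HBase cache, the HDFS-side database, and the Hadoop archive access respectively.

```python
from typing import Callable, Optional, Tuple


def read_small_file(
    user_id: str,
    file_name: str,
    cache_get: Callable[[str, str], Optional[bytes]],
    db_get: Callable[[str], Optional[bytes]],
    har_get: Callable[[str], Optional[bytes]],
) -> Tuple[Optional[bytes], str]:
    """Return (content, source) following the three-tier lookup order."""
    # Step 102: query the cache by user ID and file name.
    content = cache_get(user_id, file_name)
    if content is not None:
        return content, "cache"        # Step 105
    # Step 103: query the database by file name only.
    content = db_get(file_name)
    if content is not None:
        return content, "database"     # Step 105
    # Step 104: fall back to the HAR file for this name.
    return har_get(file_name), "har"
```

Passing the lookups in as callables keeps the control flow separate from any particular client library, so the same function can be exercised with plain dictionaries.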

FIG. 2 is a schematic flowchart of an embodiment of the data merge method provided by the present invention. The data merge module of the present invention adopts the data merge method shown in FIG. 2, which includes Steps A to F as follows:

Step A: After the client uploads a file to be stored, traverse all files in HDFS and, using the user access preference model, find the related file collection of the file to be stored, where the user access preference model is derived from the user access log records.

In this embodiment, the user access preference model is obtained statistically from the user access log records. Specifically: the active user set is computed from the user access log records; a bean object is used to represent each small file accessed by the active user set, a small file being a file of 2 MB or less, and the properties of the bean object include the ID of the user who accessed the small file, the name of the small file accessed, and the number of times the small file was accessed; the bean objects are persisted to the MySQL database using JDBC, and the similarity of any two different access behaviors is calculated from the stored data; when the similarity of two different access behaviors is positive, the users of the two access behaviors are similar users, the IDs of the similar users are recorded, and the related file collection is used to store the information of the related files accessed by the similar users; the user access preference model is then built from the related file collection.

In this embodiment, the active user set is computed from the user access log records as follows: the record lines whose accessed resource has the suffix jpg are filtered out of the user access log records, each record line including a user ID, the accessed page URL, the access start time, the access status, and the access traffic; a record-parsing class is created to parse the record lines, and a two-dimensional array is used to store the visitor IPs and the names of the small files; the visitor IPs in the two-dimensional array are traversed and a HashMap collection is used to total the traffic of each visitor IP, the Key of the HashMap collection being the visitor IP and the Value being the traffic; the HashMap collection is sorted in descending order of Value, the top 20% of visitor IPs are selected, and this IP subset is stored in an ArrayList collection and marked as the active user set.
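The active-user computation described above can be sketched as follows. The sketch makes several assumptions that are not in the source: records are taken as already parsed tuples `(visitor_ip, url, start_time, status, traffic_bytes)`, a Python `dict` and `list` stand in for the HashMap and ArrayList collections, and the 20% cut-off is rounded up so that a non-empty log always yields at least one active user. The function name `active_users` is hypothetical.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

Record = Tuple[str, str, str, int, int]  # (ip, url, start_time, status, traffic)


def active_users(records: Iterable[Record], percent: int = 20) -> List[str]:
    """Return the visitor IPs in the top `percent` of total jpg traffic."""
    traffic = defaultdict(int)           # visitor IP -> total traffic
    for ip, url, _start, _status, nbytes in records:
        if url.lower().endswith(".jpg"):  # keep only jpg accesses
            traffic[ip] += nbytes
    # Sort visitor IPs by total traffic, largest first.
    ranked = sorted(traffic, key=lambda ip: traffic[ip], reverse=True)
    # Integer ceiling of len(ranked) * percent / 100 (rounding is an assumption).
    keep = (len(ranked) * percent + 99) // 100
    return ranked[:keep]
```

Integer arithmetic is used for the cut-off to avoid floating-point rounding surprises when the number of visitors is not a multiple of five.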

To better explain how the model of the present invention is built, the following example is given. The concrete process is as follows:

(1) Use a regular expression to filter the record lines whose accessed resource has the suffix jpg.

(2) Create a log-parsing class that parses the five components of a record line separately, and use a two-dimensional array to store the visitor IPs and the names of the small files.

(3) Traverse the visitor IP elements of the two-dimensional array and design a counter that counts the traffic of each visitor IP. A HashMap collection is used, with the visitor IP as the Key and that visitor's traffic as the Value.

(4) Sort the HashMap collection generated in step (3) in descending order of Value, select the top 20% of visitor IPs, store this IP subset in an ArrayList collection, and mark it as the active user set.

(5) Use a bean object as an abstraction of each small file accessed by the active user set. The properties of the object include the ID of the user who accessed the small file, the name of the small file accessed, and the number of times the small file was accessed. Its methods are the get and set methods for these properties.

(6) Using JDBC, persist the bean objects to the MySQL database, forming a table of the stored records (the table is not reproduced in this text).

(7) Take the data of two of the 20 rows and calculate the similarity of the two different users' access behaviors with the formula

$$\mathrm{sim}(a,b) = \frac{\sum_{p}\,(r_{a,p}-\bar{r}_a)(r_{b,p}-\bar{r}_b)}{\sqrt{\sum_{p}\,(r_{a,p}-\bar{r}_a)^2}\;\sqrt{\sum_{p}\,(r_{b,p}-\bar{r}_b)^2}}$$

Here the present invention uses the Pearson correlation coefficient to determine similar users: given a scoring matrix R, sim(a, b) denotes the similarity between user a and user b, and r_a and r_b are the scoring data of users a and b in the "user-traffic" scoring matrix, with $\bar{r}_a$ and $\bar{r}_b$ their means over the files p.

(8) If the value of sim(a, b) is positive, the two different users are determined to be similar users and their user IDs are recorded.

(9) Based on the user IDs of the similar users, use a collection to store the information of all related files accessed by the similar users.
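Steps (7) and (8) can be sketched with the standard Pearson correlation coefficient. The data layout is an assumption: each user is represented by a list of access counts over the same ordered set of files, and a user whose counts are all equal is given similarity 0 so that the division is defined. The names `pearson` and `similar_user_pairs` are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def pearson(a: List[float], b: List[float]) -> float:
    """Pearson correlation of two equal-length score vectors."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # assumption: constant vectors count as "not similar"
    return cov / math.sqrt(var_a * var_b)


def similar_user_pairs(scores: Dict[str, List[float]]) -> List[Tuple[str, str]]:
    """Return every pair of users whose similarity is positive (step 8)."""
    users = sorted(scores)
    pairs = []
    for i, u in enumerate(users):
        for v in users[i + 1:]:
            if pearson(scores[u], scores[v]) > 0:
                pairs.append((u, v))
    return pairs
```

The related file collection of step (9) would then be the union of the files accessed by the users in each returned pair.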

Step B: Add the files in the related file collection and the file to be stored, one after another, to the merge queue.

Step C: Determine whether the total size of all files in the merge queue exceeds 128 MB. If yes, go to Step D; otherwise, go to Step E.

Step D: Merge all files in the merge queue into one data block, clear the file information in the merge queue, delete the source files of the merged files, and return to Step B.

Step E: Determine whether all files in the related file collection and the file to be stored have been added to the merge queue. If yes, merge all files in the merge queue into one data block, clear the file information in the merge queue, delete the source files of the merged files, and go to Step F; otherwise, return to Step B.

Step F: Store all merged data blocks in the HDFS system.
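The queueing logic of the merge steps can be sketched as below. This is a literal reading of the steps under stated assumptions: files are modelled as `(name, size)` pairs already ordered by the related file collection, the function `merge_related` is a hypothetical name, and merging is represented only by grouping names, so deleting source files and writing to HDFS are left out.

```python
from typing import List, Tuple

BLOCK_SIZE = 128 * 1024 * 1024  # merge threshold stated in Step C


def merge_related(
    files: List[Tuple[str, int]], block_size: int = BLOCK_SIZE
) -> List[List[str]]:
    """Group files into merged blocks; each block is a list of file names."""
    blocks: List[List[str]] = []
    queue: List[str] = []
    queued_bytes = 0
    for name, size in files:            # Step B: enqueue files one by one
        queue.append(name)
        queued_bytes += size
        if queued_bytes > block_size:   # Step C: total exceeds the threshold
            blocks.append(queue)        # Step D: merge the queue, then clear it
            queue = []
            queued_bytes = 0
    if queue:                           # Step E: all files queued, merge the rest
        blocks.append(queue)
    return blocks                       # Step F: the caller stores these blocks
```

Because Step C tests the total after a file has been enqueued, a merged group in this literal reading can be slightly larger than the threshold; an implementation that must stay within one block would check before enqueueing instead.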

FIG. 3 is a schematic flowchart of an embodiment of the cache method provided by the present invention. The HBase cache module of the present invention adopts the cache method of FIG. 3, which includes Steps 301 to 303 as follows:

Step 301: Obtain the user access log records and compute the active user set from them.

In this embodiment, the active user set is computed in Step 301 as follows: the record lines whose accessed resource has the suffix jpg are filtered out of the user access log records, each record line including the user IP, the accessed page URL, the access start time, the access status, and the access traffic; a log-parsing class is created to parse the record lines, and a two-dimensional array is used to store the visitor IPs and the names of the small files; the visitor IPs in the two-dimensional array are traversed and a HashMap collection is used to total the traffic of each visitor IP, the Key of the HashMap collection being the visitor IP and the Value being the traffic; the HashMap collection is sorted in descending order of Value, the top 20% of visitor IPs are selected, and this IP subset is stored in an ArrayList collection and marked as the active user set.

The reason for computing the active user set in Step 301 is that user access to small files is not uniformly random but close to a Pareto distribution: most I/O requests access a small amount of popular data, with 80% of the traffic concentrated on 20% of the data. Therefore, if hotspot files can be predicted by a model from the large number of small files stored in the file system and then cached, the efficiency of user access to data can be improved.

Step 302: Use the log-linear model to calculate the predicted popularity of the files accessed by each active user in the active user set, sort the files in descending order of predicted popularity, and mark the top 20% of the files as hotspot files.

In this embodiment, Step 302 is specifically: match the visitor IPs extracted from the ArrayList collection against the visitor IPs extracted from the two-dimensional array; on a match, use the matching visitor IP as the key to look up each user's access start time, then apply the log-linear model to calculate the predicted popularity of the files accessed by each active user in the active user set, sort the files in descending order of predicted popularity, and mark the top 20% of the files as hotspot files.

The log-linear model is:

$$\ln N_i = k(t)\,\ln N_i(t) + b(t)$$

where $N_i$ is the predicted popularity of file i, $N_i(t)$ is the traffic of file i during the observation period, and t is the length of the observation period. $k(t)$ and $b(t)$ are the parameters of the linear relationship, and their optimal values can be calculated by linear regression.
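A minimal sketch of the popularity prediction follows. It assumes the log-linear model has the form ln N = k ln N(t) + b, with slope k and intercept b fitted by ordinary least squares on historical pairs of observed traffic and later popularity; the training data, the function names, and the rounding up of the 20% cut-off are all assumptions made for illustration. All traffic values must be positive for the logarithm to be defined.

```python
import math
from typing import Dict, List, Tuple


def fit_log_linear(observed: List[float], later: List[float]) -> Tuple[float, float]:
    """Least-squares fit of ln(later) = k * ln(observed) + b; returns (k, b)."""
    xs = [math.log(v) for v in observed]
    ys = [math.log(v) for v in later]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    k = sxy / sxx
    return k, mean_y - k * mean_x


def predict_popularity(traffic: float, k: float, b: float) -> float:
    """Predicted popularity for a file with the given observed traffic."""
    return math.exp(k * math.log(traffic) + b)


def hot_files(traffic_by_file: Dict[str, float], k: float, b: float,
              percent: int = 20) -> List[str]:
    """Rank files by predicted popularity and keep the top `percent`."""
    ranked = sorted(traffic_by_file,
                    key=lambda f: predict_popularity(traffic_by_file[f], k, b),
                    reverse=True)
    keep = (len(ranked) * percent + 99) // 100  # integer ceiling
    return ranked[:keep]
```

With five files, `hot_files` returns the single file with the highest predicted popularity, which is then the candidate for caching.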

The length t of the observation period in the log-linear model of the present invention is defined as the time difference between the access start time in a record line of the user access log records and the time at which the user access log records were collected. For example, if the log records were collected at 30/Jan/2018:17:38:20 and the access start time in a record line is 29/Jan/2018:10:35:15, the length of the observation period is the time difference between 29/Jan/2018:10:35:15 and 30/Jan/2018:17:38:20. For ease of calculation, the length of the period is expressed in hours.
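The observation-period calculation in the example above can be reproduced with Python's standard `datetime` module. The format string is an assumption inferred from the timestamps shown, and parsing `Jan` relies on English month abbreviations in the active locale; the function name `observation_hours` is hypothetical.

```python
from datetime import datetime

LOG_TIME_FORMAT = "%d/%b/%Y:%H:%M:%S"  # e.g. 29/Jan/2018:10:35:15


def observation_hours(access_start: str, collected_at: str) -> float:
    """Length of the observation period in hours."""
    start = datetime.strptime(access_start, LOG_TIME_FORMAT)
    end = datetime.strptime(collected_at, LOG_TIME_FORMAT)
    return (end - start).total_seconds() / 3600.0


t = observation_hours("29/Jan/2018:10:35:15", "30/Jan/2018:17:38:20")
```

For the two timestamps in the text the difference is 31 hours, 3 minutes and 5 seconds, so `t` is about 31.05 hours.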

Step 303: Obtain the hotspot files and use an HBase database to cache the related information of the hotspot files.

In this embodiment, an HBase database is used to cache the related information of the hotspot files: the HBase table name is the visitor ID, the HBase RowKey is the name of the small file, the HBase column family name is "file content", and the Value, that is, the cell value, is the content of the small file. When a user accesses a small file in HBase, the content of the corresponding small file can be obtained by using the user ID as the table name and passing the name of the small file to be accessed as the parameter of HBase's get() method.
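The cache layout just described can be modelled in memory to make the addressing explicit. The class below is a stand-in built on nested dictionaries, not an HBase client: `HotspotCacheModel`, its methods, and the family label `file_content` are hypothetical names chosen for the sketch.

```python
from typing import Dict, Optional


class HotspotCacheModel:
    """In-memory model: table = user ID, row key = file name, one column family."""

    FAMILY = "file_content"  # stands in for the "file content" column family

    def __init__(self) -> None:
        # user ID (table) -> file name (row key) -> family -> cell value
        self._tables: Dict[str, Dict[str, Dict[str, bytes]]] = {}

    def put(self, user_id: str, file_name: str, content: bytes) -> None:
        """Cache the content of a hotspot file for one user."""
        self._tables.setdefault(user_id, {})[file_name] = {self.FAMILY: content}

    def get(self, user_id: str, file_name: str) -> Optional[bytes]:
        """Look up by table (user ID) and row key (file name); None on a miss."""
        row = self._tables.get(user_id, {}).get(file_name)
        return None if row is None else row[self.FAMILY]
```

A lookup therefore needs exactly the two values carried by the read command, the user ID and the small-file name, which is why the read path can consult the cache before touching HDFS.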

To describe the cache method of the present invention in more detail, refer to FIG. 4, which is a schematic flowchart of another embodiment of the cache method provided by the present invention. As shown in FIG. 4, the process is: user access record set → filter the required record lines with a regular expression → parse the record lines → encapsulate the record-line information in bean objects → call the JDBC API to persist the bean objects to the MySQL database → store the visitor IPs and small-file names in a two-dimensional array → traverse the array and total the visitor traffic → sort by visitor traffic and store the user IPs of the active user set in an ArrayList collection → match the two-dimensional array against the visitor IPs of the active user set → on a match, use the visitor IP as the key to extract the user's access start time and traffic → calculate the file popularity value with the file popularity prediction formula → sort the file popularity values and mark the hotspot files → cache the related information of the hotspot files in HBase.

As can be seen from the above, the method for reading a large number of small files based on Hadoop provided by the embodiments of the present invention is applied to an HDFS system having a data merge module and an HBase cache module. The reading method receives a small-file read command input by a user, the command including the user ID and the name of the small file, and queries the HBase cache module by the user ID and the name of the small file. If the corresponding file content is found, the queried file content is returned; if not, the database of the HDFS system is queried. If that query succeeds, the queried file content is returned; if it fails, the API of the Hadoop archive tool is called, the corresponding HAR file is accessed, and the HAR file is returned. Compared with the existing technology, which considers neither the associations between small files nor hotspot files, the reading method of the present invention combines small-file merging with the HBase caching mechanism and thereby improves the efficiency of reading small files.

Furthermore, the present invention provides a data merge method that improves the efficiency of reading small files and reduces namenode memory consumption in the HDFS system.

Furthermore, in the data merge method of the present invention, multiple associated small files are merged into one large file before being stored in the system, so the namenode stores only the metadata of that one large file. The amount of metadata the namenode must maintain drops substantially, and memory consumption falls with it.

Furthermore, in the data merge method of the present invention, related files are stored in the same large file and, after merging, reside in the same data block on the same datanode. When user requests for files are strongly correlated, that is, when the small files a user accesses in succession have been merged into the same large file, the system, following the file access principle, reads the data block from the nearer datanode; in other words, it keeps reading from data blocks on the same datanode. Accessing different files therefore does not require jumping between datanodes, which reduces disk addressing overhead, occupies fewer system resources, and greatly improves file read efficiency.

Furthermore, whereas the prior art does not take the hotspot files accessed by users into account, the cache method provided by the present invention caches hotspot files in HBase, which improves not only the cache hit rate but also file read efficiency.

A person skilled in the art will understand that all or part of the processes of the above embodiments can be implemented by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, can include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), or a random access memory (RAM).

The above are preferred embodiments of the present invention. It should be noted that a person of ordinary skill in the art can make certain improvements and modifications without departing from the principle of the present invention, and such improvements and modifications also fall within the scope of protection of the present invention.

The present invention relates to the field of computer technology, and more particularly to a Hadoop-based method for reading a plurality of small files of 2 MB or less from an HDFS having a data merge module and an HBase cache module.

Hadoop was formally introduced in 2005 by the Apache Foundation as part of Nutch, a subproject of Lucene. Its two most important designs are HDFS and MapReduce. HDFS stores large volumes of data, and files are stored in the system as data blocks. An HDFS data block is far larger than the block defined for an ordinary disk (typically 512 B); the current default HDFS block size is 128 MB. If a file stored in HDFS exceeds 128 MB, HDFS splits it into multiple blocks of the block size and stores them separately. When HDFS keeps accumulating small files up to the TB or even PB level, the small-file problem arises: a large amount of metadata is stored in the namenode, the primary node of HDFS, which greatly increases the load on the namenode and degrades the read performance of the system. Here the small-file size is defined as 2 MB; that is, a file stored in HDFS is defined as a small file if its size is 2 MB or less.

The embodiments of the present invention propose a Hadoop-based method for reading a plurality of small files of 2 MB or less from an HDFS having a data merge module and an HBase cache module. By combining file merging with an HBase caching mechanism, the method can improve the efficiency of reading small files.

An embodiment of the present invention provides a Hadoop-based method for reading a plurality of small files of 2 MB or less from an HDFS having a data merge module and an HBase cache module. The reading method is applied to an HDFS system having a data merge module and an HBase cache module, and includes:

receiving a small-file read command input by a user, the read command containing the user ID and the name of the small file;

if the corresponding file content is found, returning the file content queried from the HBase cache module; otherwise, querying the database of the HDFS system by the name of the small file and determining whether the corresponding file content has been found;

if yes, returning the file content queried from the database;

otherwise, calling the API of the Hadoop Archive tool, accessing the HAR file corresponding to the name of the small file, and returning the HAR file;

Step A: after the client uploads a file to be stored, traverse all files in HDFS and use a user access preference model to find the related file collection of the file to be stored, where the user access preference model is built from user access log records.

Step D: merge all files in the merge queue into one data block, clear the file information in the merge queue, delete the source files of the merged files, and return to Step B.

bean objects are used to represent the small files accessed by the active user set, where the properties of a bean object include the ID of the user who accessed the small file, the name of the small file the user accessed, and the number of times the small file was accessed;

using JDBC technology, the bean objects are persisted to the MySQL database, and the similarity of any two different access behaviors is calculated from the stored data;

combining a log-linear model, the popularity prediction value of each file accessed by each active user in the active user set is calculated, the files are sorted in descending order of popularity prediction value, and the top 20% of the files are marked as hotspot files;

when a match is found, the matching visitor IP is used as the key to look up each user's access start time; then, using the log-linear model, the popularity prediction value of each file accessed by each active user in the active user set is calculated, the files are sorted in descending order of popularity prediction value, and the top 20% of the files are marked as hotspot files;

The log-linear model is: [equation not reproduced in this text], where [symbol not reproduced] is the popularity prediction value of file i,

The Hadoop-based method provided by the embodiments of the present invention for reading a plurality of small files of 2 MB or less from an HDFS having a data merge module and an HBase cache module is applied to an HDFS system having a data merge module and an HBase cache module. The reading method receives a small-file read command input by a user, the command containing a user ID and the name of the small file; queries the HBase cache module according to the user ID and the name of the small file; if the corresponding file content is found, returns the queried file content; otherwise, queries the database of the HDFS system; if that query succeeds, returns the queried file content; otherwise, calls the API of the Hadoop Archive tool, accesses the corresponding HAR file, and returns the HAR file. Compared with the prior art, which considers neither the associations between small files nor hotspot files, the reading method of the present invention combines small-file merging with an HBase caching mechanism and can thereby improve the efficiency of reading small files.

FIG. 1 is a schematic flowchart of an embodiment of the Hadoop-based method for reading a large number of small files provided by the present invention.

FIG. 2 is a schematic flowchart of an embodiment of the data merge method provided by the present invention.

FIG. 3 is a schematic flowchart of an embodiment of the cache method provided by the present invention.

FIG. 4 is a schematic flowchart of another embodiment of the cache method provided by the present invention.

Refer to FIG. 1, which is a schematic flowchart of an embodiment of the Hadoop-based method of the present invention for reading a plurality of small files of 2 MB or less from an HDFS having a data merge module and an HBase cache module. The method includes steps 101 to 105. The reading method is applied to an HDFS system having a data merge module and an HBase cache module, and the steps are as follows:

Step 104: call the API of the Hadoop Archive tool, access the HAR file corresponding to the name of the small file, and return that HAR file.

Step 105: return the queried file content.

Step A: after the client uploads a file to be stored, traverse all files in HDFS and use the user access preference model to find the related file collection of the file to be stored, where the user access preference model is built from user access log records.

In this embodiment, the user access preference model is derived statistically from user access log records. Specifically: the active user set is determined from the user access log records; bean objects are used to represent the small files accessed by the active user set, where the properties of a bean object include the ID of the user who accessed the small file, the name of the small file the user accessed, and the number of times the small file was accessed; using JDBC technology, the bean objects are persisted to the MySQL database; the similarity of any two different access behaviors is calculated from the stored data; when the similarity of two different access behaviors is positive, the users of those two access behaviors are similar users, and the IDs of the similar users are recorded; the information on the associated files accessed by similar users is stored in a related file collection; and the user access preference model is built from the related file collection.
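A bean of the kind described, together with its persistence through JDBC, might look like the sketch below. The class names `FileAccessBean` and `FileAccessDao`, the table `file_access`, and its column names are hypothetical; only standard `java.sql` calls are used, and a MySQL JDBC driver and an open `Connection` are assumed to be supplied by the caller.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

/** Bean holding the three properties named in the text. */
final class FileAccessBean {
    private String userId;
    private String fileName;
    private int accessCount;

    public FileAccessBean() { }

    public FileAccessBean(String userId, String fileName, int accessCount) {
        this.userId = userId;
        this.fileName = fileName;
        this.accessCount = accessCount;
    }

    public String getUserId() { return userId; }
    public void setUserId(String userId) { this.userId = userId; }
    public String getFileName() { return fileName; }
    public void setFileName(String fileName) { this.fileName = fileName; }
    public int getAccessCount() { return accessCount; }
    public void setAccessCount(int accessCount) { this.accessCount = accessCount; }
}

final class FileAccessDao {
    // Hypothetical table and column names.
    static final String INSERT_SQL =
            "INSERT INTO file_access (user_id, file_name, access_count) VALUES (?, ?, ?)";

    /** Persists one bean using a parameterized statement; the caller owns the connection. */
    static void save(Connection conn, FileAccessBean bean) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            ps.setString(1, bean.getUserId());
            ps.setString(2, bean.getFileName());
            ps.setInt(3, bean.getAccessCount());
            ps.executeUpdate();
        }
    }
}
```

A parameterized statement is used here so that file names taken from access logs are never concatenated into SQL text.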

In this embodiment, the active user set is determined from the user access log records as follows: record lines whose accessed resource has the suffix jpg are extracted from the user access log records, each record line containing the user ID, the URL of the accessed page, the access start time, the access status, and the access traffic; a record parsing class is created to parse the record lines, and a two-dimensional array stores the visitor IPs and the small-file names; the visitor IPs in the two-dimensional array are traversed, and a HashMap collection tallies the traffic of each visitor IP, with the visitor IP as the key and the traffic as the value; the HashMap collection is sorted in descending order of value, the top 20% of visitor IPs are selected, and this IP subset is stored in an ArrayList collection and marked as the active user set.
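The tallying and top-20% selection can be sketched as below. The parallel `traffic` array, the rounding up of the 20% cut-off, and the tie-break by IP string are assumptions made for this illustration; the patent states only that a HashMap is sorted by value and the top 20% of visitor IPs are kept in an ArrayList.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ActiveUsers {

    /**
     * visits[i] = {visitorIp, fileName}; traffic[i] = traffic of that access
     * (the parallel array is an assumption of this sketch).
     */
    static List<String> activeUserSet(String[][] visits, long[] traffic) {
        // Tally traffic per visitor IP: key = visitor IP, value = total traffic.
        Map<String, Long> trafficByIp = new HashMap<>();
        for (int i = 0; i < visits.length; i++) {
            trafficByIp.merge(visits[i][0], traffic[i], Long::sum);
        }

        // Sort entries by value in descending order; ties are broken by IP for determinism.
        List<Map.Entry<String, Long>> entries = new ArrayList<>(trafficByIp.entrySet());
        entries.sort((x, y) -> {
            int byTraffic = Long.compare(y.getValue(), x.getValue());
            return byTraffic != 0 ? byTraffic : x.getKey().compareTo(y.getKey());
        });

        // Keep the top 20% of visitor IPs, rounded up (integer ceiling of n / 5).
        int keep = (entries.size() + 4) / 5;
        List<String> active = new ArrayList<>();
        for (int i = 0; i < keep; i++) {
            active.add(entries.get(i).getKey());
        }
        return active;
    }
}
```

Integer arithmetic is used for the cut-off to avoid floating-point rounding surprises when multiplying by 0.2.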

(7) Take the data of any two of the 20 rows and calculate the similarity of the two different user access behaviors with the formula [equation not reproduced in this text]. Here the present invention uses the Pearson correlation coefficient to identify similar users: given a scoring matrix R, the similarity between user a and user b is denoted sim(a, b), and r_a and r_b are the scoring data of the "user-traffic" scoring matrix.
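For reference, the Pearson correlation coefficient in its standard textbook form is given below. The patent's own rendering of the formula is not reproduced in this text, so the index set $I_{ab}$ (the items scored by both users), the per-item scores $r_{a,i}$ and $r_{b,i}$, and the means $\bar{r}_a$ and $\bar{r}_b$ over that set are notational assumptions rather than the patent's symbols.

```latex
\[
\mathrm{sim}(a,b) =
\frac{\sum_{i \in I_{ab}} \left(r_{a,i} - \bar{r}_a\right)\left(r_{b,i} - \bar{r}_b\right)}
     {\sqrt{\sum_{i \in I_{ab}} \left(r_{a,i} - \bar{r}_a\right)^{2}}\;
      \sqrt{\sum_{i \in I_{ab}} \left(r_{b,i} - \bar{r}_b\right)^{2}}}
\]
```

The value lies in $[-1, 1]$, which is consistent with the embodiment treating two users as similar when the similarity is positive.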

Step D: merge all files in the merge queue into one data block, clear the file information in the merge queue, delete the source files of the merged files, and return to Step B.
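The queue-based merge loop (Steps B through F as stated in the claims) might be sketched as follows. This is a simplified in-memory illustration: `MergePlanner` and `SmallFile` are hypothetical names, files are represented by name and size only, and deleting source files and writing blocks to HDFS are left to the caller.

```java
import java.util.ArrayList;
import java.util.List;

final class MergePlanner {

    static final long BLOCK_LIMIT = 128L * 1024 * 1024; // 128 MB threshold from Step C

    /** A file to be merged, reduced to name and size for this sketch. */
    static final class SmallFile {
        final String name;
        final long size;

        SmallFile(String name, long size) {
            this.name = name;
            this.size = size;
        }
    }

    /**
     * Takes the related file collection followed by the file to be stored and
     * returns the merged blocks, each as the list of file names it contains.
     */
    static List<List<String>> plan(List<SmallFile> files) {
        List<List<String>> blocks = new ArrayList<>();
        List<String> queue = new ArrayList<>();
        long queuedBytes = 0;

        for (SmallFile file : files) {
            queue.add(file.name);             // Step B: add to the merge queue in sequence
            queuedBytes += file.size;
            if (queuedBytes > BLOCK_LIMIT) {  // Step C: total size exceeds 128 MB?
                blocks.add(new ArrayList<>(queue)); // Step D: merge queue into one block
                queue.clear();                      //         and clear the queue
                queuedBytes = 0;
            }
        }
        if (!queue.isEmpty()) {               // Step E: all files queued, merge the remainder
            blocks.add(new ArrayList<>(queue));
        }
        return blocks;                        // Step F: the caller stores these blocks in HDFS
    }
}
```

With 2 MB files, for example, a block is closed when the 65th file pushes the queue past 128 MB, since 64 files total exactly 128 MB and do not exceed the threshold.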

In this embodiment, step 302 specifically comprises: matching the visitor IPs extracted from the ArrayList collection against the visitor IPs extracted from the two-dimensional array; when a match is found, using the matching visitor IP as the key to look up each user's access start time; then, using the log-linear model, calculating the popularity prediction value of each file accessed by each active user in the active user set, sorting the files in descending order of popularity prediction value, and marking the top 20% of the files as hotspot files.
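The matching and marking in step 302 can be illustrated with the sketch below. The popularity prediction values are taken as an input map, because the log-linear model itself is not reproduced in this text; the class `HotspotMarker`, the default score of 0.0 for unscored files, and the rounding up of the 20% cut-off are assumptions.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class HotspotMarker {

    /**
     * activeIps: the ArrayList of active visitor IPs; visits[i] = {visitorIp, fileName};
     * popularity: precomputed popularity prediction value per file (supplied by the caller).
     */
    static Set<String> markHotspots(List<String> activeIps, String[][] visits,
                                    Map<String, Double> popularity) {
        // Match the visitor IPs of the two-dimensional array against the active user set.
        Set<String> active = new HashSet<>(activeIps);
        Set<String> candidates = new LinkedHashSet<>();
        for (String[] visit : visits) {
            if (active.contains(visit[0])) {
                candidates.add(visit[1]);
            }
        }

        // Sort candidate files by popularity prediction value, descending; ties by name.
        List<String> ranked = new ArrayList<>(candidates);
        ranked.sort((x, y) -> {
            int byScore = Double.compare(popularity.getOrDefault(y, 0.0),
                                         popularity.getOrDefault(x, 0.0));
            return byScore != 0 ? byScore : x.compareTo(y);
        });

        // Mark the top 20% (rounded up) as hotspot files.
        int keep = (ranked.size() + 4) / 5;
        return new LinkedHashSet<>(ranked.subList(0, keep));
    }
}
```

The files returned here are the ones whose related information would then be cached in HBase.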

The log-linear model is: [equation not reproduced in this text], where [symbol not reproduced] is the popularity prediction value of file i, and [remaining symbol definitions not reproduced in this text]

As can be seen from the above, the Hadoop-based method provided by the embodiments of the present invention for reading a plurality of small files of 2 MB or less from an HDFS having a data merge module and an HBase cache module is applied to an HDFS system having a data merge module and an HBase cache module. The reading method receives a small-file read command input by a user, the command containing a user ID and the name of the small file; queries the HBase cache module by the user ID and the name of the small file; if the corresponding file content is found, returns the queried file content; if it is not found, queries the database of the HDFS system; if that query succeeds, returns the queried file content; and if it fails, calls the API of the Hadoop Archive tool, accesses the corresponding HAR file, and returns the HAR file. Compared with the prior art, which considers neither the associations between small files nor hotspot files, the reading method of the present invention combines small-file merging with an HBase caching mechanism and thereby improves the efficiency of reading small files.

Claims

A method for reading a large number of small files based on Hadoop, characterized in that the reading method is applied to an HDFS system having a data merge module and an HBase cache module, and the reading method comprises: receiving a small-file read command input by a user, the read command containing a user ID and the name of the small file; querying the HBase cache module according to the user ID and the name of the small file; if the corresponding file content is found, returning the file content queried from the HBase cache module; otherwise, querying the database of the HDFS system by the name of the small file and determining whether the corresponding file content has been found; if yes, returning the file content queried from the database; otherwise, calling the API of the Hadoop Archive tool, accessing the HAR file corresponding to the name of the small file, and returning the HAR file;
the data merge method employed by the data merge module comprises: Step A: after the client uploads a file to be stored, traversing all files in HDFS and, combining a user access preference model, finding the related file collection of the file to be stored, where the user access preference model is built from user access log records; Step B: adding the files in the related file collection and the file to be stored to a merge queue in sequence; Step C: determining whether the total size of all files in the merge queue exceeds 128 MB, and if yes, proceeding to Step D, otherwise proceeding to Step E; Step D: merging all files in the merge queue into one data block, clearing the file information in the merge queue, deleting the source files of the merged files, and returning to Step B; Step E: determining whether all files in the related file collection and the file to be stored have been added to the merge queue, and if yes, merging all files in the merge queue into one data block, clearing the file information in the merge queue, deleting the source files of the merged files, and proceeding to Step F, otherwise returning to Step B; Step F: storing all merged data blocks in the HDFS system;
the user access preference model is derived statistically from user access log records, specifically: the active user set is determined from the user access log records; bean objects are used to represent the small files accessed by the active user set, a small file being a file of 2 MB or less in size, where the properties of a bean object include the ID of the user who accessed the small file, the name of the small file the user accessed, and the number of times the small file was accessed; combining JDBC technology, the bean objects are persisted to the database; the similarity of any two different access behaviors is calculated from the stored data; when the similarity of any two different access behaviors is positive, the users of those two access behaviors are similar users, and the IDs of the similar users are recorded; the information on the associated files accessed by the similar users is stored in a related file collection; and the user access preference model is built from the related file collection;
the cache method employed by the HBase cache module comprises: acquiring user access log records and determining the active user set from the user access log records; combining a log-linear model, calculating the popularity prediction value of each file accessed by each active user in the active user set, sorting the files in descending order of popularity prediction value, and marking the top 20% of the files as hotspot files; and retrieving the hotspot files and caching the related information of the hotspot files in an HBase database;
determining the active user set from the user access log records specifically comprises: extracting from the user access log records the record lines whose accessed resource has the suffix jpg, each record line containing a user ID, an accessed page URL, an access start time, an access status, and access traffic; creating a record parsing class to parse the record lines, and storing the visitor IPs and small-file names in a two-dimensional array; traversing the visitor IPs in the two-dimensional array and tallying the traffic of each visitor IP in a HashMap collection, where the key of the HashMap collection is the visitor IP and the value is the traffic; and sorting the HashMap collection in descending order of value, selecting the top 20% of visitor IPs, storing this IP subset in an ArrayList collection, and marking it as the active user set;
combining the log-linear model, calculating the popularity prediction value of each file accessed by each active user in the active user set, sorting the files in descending order of popularity prediction value, and marking the top 20% of the files as hotspot files specifically comprises: matching the visitor IPs extracted from the ArrayList collection against the visitor IPs extracted from the two-dimensional array; when a match is found, using the matching visitor IP as the key to look up each user's access start time; and, combining the log-linear model, calculating the popularity prediction value of each file accessed by each active user in the active user set, sorting the files in descending order of popularity prediction value, and marking the top 20% of the files as hotspot files;
the log-linear model is: [equation not reproduced in this text], where [symbol not reproduced] is the popularity prediction value of file i,