JP2019204474A

JP2019204474A - Storage method using user access preference model

Info

Publication number: JP2019204474A
Application number: JP2018147290A
Authority: JP
Inventors: 魏文国; Wenguo Wei; 黄雄; Xiong Huang; 陳木朝; Mu Chao Chen; 蔡君; Jun Cai; 謝桂園; Guiyuan Xie; 趙慧民; Huimin Zhao; 彭建烽; Jianfeng Peng
Original assignee: New H3C Technologies Co Ltd; Guangdong Polytechnic Normal University; Guangdong Communications Polytechnic
Current assignee: New H3C Technologies Co Ltd; Guangdong Polytechnic Normal University; Guangdong Communications Polytechnic
Priority date: 2018-05-22
Filing date: 2018-08-04
Publication date: 2019-11-28
Anticipated expiration: 2038-08-04
Also published as: JP6642651B2; CN108846021A; CN108846021B

Abstract

To provide a storage method for a large number of small files based on a user access preference model. SOLUTION: The method comprises: using a user access preference model to find the files related to a file to be stored and adding them to a merge queue; when the total size of the files in the merge queue exceeds 128 MB, merging the queued files into one data block and deleting the source files of the merged file; when all files have been added and the total size of the files in the merge queue is 128 MB or less, merging the remaining queued files into one data block; and storing all merged data blocks in the HDFS system. The technical solution of the present invention improves the read efficiency of small files and reduces namenode memory consumption in the HDFS system. SELECTED DRAWING: Figure 1

Description

The present invention relates to the field of computer technology, and in particular to a storage method for large numbers of small files based on a user access preference model.

Hadoop was formally introduced in 2005 by the Apache Foundation as part of Nutch, a subproject of Lucene. The two most important components of Hadoop's design are HDFS and MapReduce. HDFS stores large amounts of data, and files are stored in the system in the form of data blocks. An HDFS data block is much larger than the data block defined for an ordinary disk (usually 512 B); the current default HDFS block size is 128 MB. If a file stored in HDFS exceeds 128 MB, HDFS splits it into multiple blocks of the block size and stores them separately. However, when HDFS keeps storing small files until they accumulate to the TB or even PB level, the small-file problem arises: a large amount of metadata is stored in the namenode, the primary node of HDFS, which greatly increases the load on the namenode and degrades the read performance of the system. Here, a small file is defined by a size threshold of 2 MB: a file stored in HDFS whose size is 2 MB or less is defined as a small file.
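The block arithmetic described above can be illustrated with a minimal sketch. The class and method names (`HdfsBlockMath`, `blockCount`, `isSmallFile`) are hypothetical and chosen only for illustration; the 128 MB block size and the 2 MB small-file threshold are the only values taken from the text.

```java
// Illustrative sketch only: names are hypothetical, thresholds come from the text.
final class HdfsBlockMath {
    static final long MB = 1024L * 1024L;
    static final long BLOCK_SIZE = 128 * MB;      // default HDFS block size cited in the text
    static final long SMALL_FILE_LIMIT = 2 * MB;  // small-file threshold cited in the text

    // Number of blocks a file of the given size is split into (ceiling division).
    static long blockCount(long fileSizeBytes) {
        if (fileSizeBytes <= 0) {
            return 0;
        }
        return (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    // A file counts as "small" when its size is 2 MB or less.
    static boolean isSmallFile(long fileSizeBytes) {
        return fileSizeBytes <= SMALL_FILE_LIMIT;
    }

    public static void main(String[] args) {
        System.out.println(blockCount(300 * MB));  // a 300 MB file spans 3 blocks
        System.out.println(isSmallFile(2 * MB));   // exactly 2 MB still counts as small
    }
}
```

Because every file, however small, needs its own metadata entry on the namenode, many files that each fall under the 2 MB threshold cost far more metadata per stored byte than one file that fills a 128 MB block.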

For handling large numbers of small files, the existing technique merges a number of small files into one large file of block size. It does not take the relationships between files into account, so the read efficiency for small files is unsatisfactory.

Chinese Patent Application Publication No. 103500077

Embodiments of the present invention provide a storage method for large numbers of small files based on a user access preference model, which improves the read efficiency of small files and reduces namenode memory consumption in the HDFS system.

Embodiments of the present invention provide a storage method for large numbers of small files based on a user access preference model, specifically comprising:

Step A: After a client uploads a file to be stored, traverse all files in HDFS and use the user access preference model to find the related file set of the file to be stored, where the user access preference model is derived statistically from user access log records.

Step B: Add the files in the related file set and the file to be stored to a merge queue.

Step C: Determine whether the total size of all files in the merge queue exceeds 128 MB. If yes, go to Step D; otherwise, go to Step E.

Step D: Merge all files in the merge queue into one data block, clear the file information in the merge queue, delete the source files of the merged file, and return to Step B.

Step E: Determine whether all files in the related file set and the file to be stored have been added to the merge queue. If yes, merge all files in the merge queue into one data block, clear the file information in the merge queue, delete the source files of the merged file, and go to Step F; otherwise, return to Step B.

Step F: Store all merged data blocks in the HDFS system.
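The control flow of Steps B to F can be sketched as a single loop over the candidate files. This is a minimal illustration under stated assumptions, not the patented implementation: the names `MergeQueueSketch`, `QueuedFile`, and `planMergedBlocks` are hypothetical, files are added one at a time, and merging, source-file deletion, and the HDFS write are represented only by grouping files into lists.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of Steps B to F; all type and method names are hypothetical.
final class MergeQueueSketch {
    static final long BLOCK_LIMIT = 128L * 1024 * 1024; // 128 MB threshold from Step C

    // Minimal stand-in for a file: a name and a size in bytes.
    static final class QueuedFile {
        final String name;
        final long sizeBytes;

        QueuedFile(String name, long sizeBytes) {
            this.name = name;
            this.sizeBytes = sizeBytes;
        }
    }

    // Groups the candidate files (related file set plus the file to be stored)
    // into merged blocks. Each inner list holds the files merged into one block.
    static List<List<QueuedFile>> planMergedBlocks(List<QueuedFile> candidates) {
        List<List<QueuedFile>> mergedBlocks = new ArrayList<>();
        List<QueuedFile> queue = new ArrayList<>();
        long queuedBytes = 0;

        for (QueuedFile file : candidates) {
            queue.add(file);                               // Step B: add the next file to the queue
            queuedBytes += file.sizeBytes;
            if (queuedBytes > BLOCK_LIMIT) {               // Step C: total size exceeds 128 MB?
                mergedBlocks.add(new ArrayList<>(queue));  // Step D: merge the queue into one block
                queue.clear();                             //         and clear the queue
                queuedBytes = 0;
            }
        }
        if (!queue.isEmpty()) {                            // Step E: everything queued, merge the rest
            mergedBlocks.add(new ArrayList<>(queue));
        }
        return mergedBlocks;                               // Step F would store these blocks in HDFS
    }
}
```

Note that, following Step C literally, a block is closed only after the queued total has already passed 128 MB, so a merged block in this sketch can be slightly larger than the threshold.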

Further, the user access preference model is derived statistically from the user access log records,

specifically as follows:

An active user set is derived statistically from the user access log records.

Each small file accessed by the active user set is represented by a bean object, a small file being a file whose size is 2 MB or less. The properties of the bean object include the ID of the user who accessed the small file, the name of the small file accessed by the user, and the number of times the small file was accessed.

Using JDBC, the bean objects are persisted to a MySQL database, and the similarity between any two different access behaviors is computed from the stored data.

If the similarity between two different access behaviors is positive, the users of those two access behaviors are similar users; the IDs of the similar users are recorded, and a related file set is used to store the information of all associated files accessed by the similar users.

The user access preference model is constructed from the related file set.
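The similarity test can be sketched as follows. The description names the Pearson correlation coefficient as the similarity measure, so this sketch computes that coefficient in its standard form. The class and method names are hypothetical, and representing each user's access data as a `double[]` with one entry per file is an assumption made for illustration.

```java
// Illustrative sketch; names and the array-based input format are assumptions.
final class PearsonSimilarity {
    // Pearson correlation of two equal-length score vectors (one entry per file).
    // Returns 0.0 when either vector has zero variance, where the coefficient is undefined.
    static double sim(double[] ra, double[] rb) {
        if (ra.length != rb.length || ra.length == 0) {
            throw new IllegalArgumentException("vectors must be non-empty and of equal length");
        }
        double meanA = 0;
        double meanB = 0;
        for (int i = 0; i < ra.length; i++) {
            meanA += ra[i];
            meanB += rb[i];
        }
        meanA /= ra.length;
        meanB /= rb.length;

        double covariance = 0;
        double varianceA = 0;
        double varianceB = 0;
        for (int i = 0; i < ra.length; i++) {
            double da = ra[i] - meanA;
            double db = rb[i] - meanB;
            covariance += da * db;
            varianceA += da * da;
            varianceB += db * db;
        }
        if (varianceA == 0 || varianceB == 0) {
            return 0.0;
        }
        return covariance / (Math.sqrt(varianceA) * Math.sqrt(varianceB));
    }

    // Two users are treated as similar when the similarity is positive, as in the text.
    static boolean areSimilarUsers(double[] ra, double[] rb) {
        return sim(ra, rb) > 0;
    }
}
```

Treating the zero-variance case as 0.0 (not similar) is a design choice of this sketch; the text does not say how that case is handled.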

Further, deriving the active user set from the user access log records specifically comprises:

Filtering from the user access log records the record lines in which the accessed resource has the suffix jpg, each record line including a user ID, the URL of the accessed page, the access start time, the access status, and the access traffic.

Creating a record-parsing class to parse the record lines, and using a two-dimensional array to store the visitor IPs and the small-file names.

Traversing the visitor IPs in the two-dimensional array and using a HashMap collection to tally the traffic of each visitor IP, where the key of the HashMap collection is the visitor IP and the value is the traffic.

Sorting the HashMap collection in descending order of value, selecting the top 20% of visitor IPs, storing this IP subset in an ArrayList collection, and marking it as the active user set.
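The tally-and-rank procedure above can be sketched with the collections the text names (`HashMap`, `ArrayList`). The class and method names are hypothetical. Two points are assumptions of this sketch: "traffic" is interpreted as the number of filtered records per visitor IP, and the 20% cut-off is rounded up so that at least one visitor is kept.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch; names, the traffic interpretation, and the rounding are assumptions.
final class ActiveUserSketch {
    // records[i][0] is the visitor IP, records[i][1] is the small-file name
    // (the two-dimensional array described in the text).
    static List<String> activeUsers(String[][] records) {
        // Key: visitor IP, value: number of records seen for that IP.
        Map<String, Integer> trafficByIp = new HashMap<>();
        for (String[] row : records) {
            trafficByIp.merge(row[0], 1, Integer::sum);
        }

        // Sort the entries in descending order of value.
        List<Map.Entry<String, Integer>> ranked = new ArrayList<>(trafficByIp.entrySet());
        ranked.sort(Map.Entry.<String, Integer>comparingByValue().reversed());

        // Keep the top 20% of visitor IPs, using integer ceiling division.
        int keep = (ranked.size() + 4) / 5;
        List<String> active = new ArrayList<>();
        for (int i = 0; i < keep; i++) {
            active.add(ranked.get(i).getKey());
        }
        return active;
    }
}
```

Visitors with equal counts at the cut-off are ordered arbitrarily here, because `HashMap` iteration order is unspecified; a deterministic tie-break would need a secondary sort key.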

Implementing the embodiments of the present invention has the following beneficial effects:

The storage method for large numbers of small files based on a user access preference model provided by embodiments of the present invention uses the user access preference model to find the related file set and adds its files in order to a merge queue. When the total size of the files in the merge queue exceeds 128 MB, all files in the queue are merged into one data block, the file information in the merge queue is cleared, and the source files of the merged file are deleted. When all files have been added to the merge queue and the total size of the files in the queue is 128 MB or less, all files in the queue are merged into one data block, the file information in the merge queue is cleared, and the source files of the merged file are deleted. Finally, all merged data blocks are stored in the HDFS system. Compared with existing techniques that do not consider the relationships between small files, the technical solution of the present invention improves the read efficiency of small files and reduces namenode memory consumption in the HDFS system.

FIG. 1 is a flowchart of an embodiment of the storage method for large numbers of small files based on a user access preference model provided by the present invention.

The technical solutions of the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

Referring to FIG. 1, the storage method for large numbers of small files based on a user access preference model provided by the present invention includes Steps A to F, as follows:

Step A: After a client uploads a file to be stored, traverse all files in HDFS and use the user access preference model to find the related file collection of the file to be stored, where the user access preference model is based on user access log records.

In this embodiment, the user access preference model is derived statistically from the user access log records, specifically as follows: an active user set is derived from the user access log records; bean objects are used to represent the small files accessed by the active user set, a small file being a file whose size is 2 MB or less, and the properties of each bean object include the ID of the user who accessed the small file, the name of the small file accessed by the user, and the number of times the small file was accessed; using JDBC, the bean objects are persisted to a MySQL database, and the similarity between any two different access behaviors is computed from the stored data; if the similarity between two different access behaviors is positive, the users of those two access behaviors are similar users, the IDs of the similar users are recorded, and a related file collection is used to store the information of the associated files accessed by the similar users; and the user access preference model is constructed from the related file collection.

In this embodiment, the active user set is derived from the user access log records, specifically as follows: the record lines in which the accessed resource has the suffix jpg are filtered from the user access log records, each record line including a user ID, the URL of the accessed page, the access start time, the access status, and the access traffic; a record-parsing class is created to parse the record lines, and a two-dimensional array is used to store the visitor IPs and the small-file names; the visitor IPs in the two-dimensional array are traversed and a HashMap collection is used to tally the traffic of each visitor IP, the key of the HashMap collection being the visitor IP and the value being the traffic; the HashMap collection is sorted in descending order of value, the top 20% of visitor IPs are selected, and this IP subset is stored in an ArrayList collection and marked as the active user set.
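The filtering and parsing stage can be sketched as below. The text does not give the log line layout, so this sketch assumes a hypothetical format of five whitespace-separated fields in the order listed above (visitor, URL, start time, status, traffic). The class name `JpgLogFilter` and the method `filterJpg` are likewise hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative sketch; the log line layout and all names are assumptions.
final class JpgLogFilter {
    // Matches a requested resource whose name ends with the suffix ".jpg".
    private static final Pattern JPG_RESOURCE = Pattern.compile(".*\\.jpg");

    // Assumed line layout: "<visitor> <url> <startTime> <status> <traffic>".
    // Returns {visitor, fileName} pairs for lines that request a .jpg resource.
    static List<String[]> filterJpg(List<String> lines) {
        List<String[]> pairs = new ArrayList<>();
        for (String line : lines) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length != 5) {
                continue; // skip lines that do not have the five expected components
            }
            String url = fields[1];
            if (!JPG_RESOURCE.matcher(url).matches()) {
                continue; // keep only resources with the jpg suffix
            }
            String fileName = url.substring(url.lastIndexOf('/') + 1);
            pairs.add(new String[] { fields[0], fileName });
        }
        return pairs;
    }
}
```

The returned pairs correspond to the rows of the two-dimensional array of visitor IPs and small-file names that the later tallying step traverses.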

To better explain the model construction process of the present invention, it is illustrated by the following example, whose concrete implementation proceeds as follows:

(1) Use a regular expression to filter the record lines in which the accessed resource has the suffix jpg.

(2) Create a log-parsing class to parse the five components of each record line separately, and use a two-dimensional array to store the visitor IPs and the small-file names.

(3) Traverse the visitor IP elements of the two-dimensional array and design a counter that tallies the traffic of each visitor IP. A HashMap collection is used, with the visitor IP as the key and that visitor's traffic as the value.

(4) Sort the HashMap collection generated in step (3) in descending order of value, select the top 20% of visitor IPs, store this IP subset in an ArrayList collection, and mark it as the active user set.

(5) Abstract each small file accessed by the active user set as a bean object whose properties include the ID of the user who accessed the small file, the name of the small file accessed by the user, and the number of times the small file was accessed. Its methods are the get and set methods for these properties.

(6) Using JDBC, persist the bean objects to a MySQL database, forming a table of the following form:

[Table not reproduced in the source text.]

(7) Take the data of any two of the 20 rows and compute the similarity of the two different user access behaviors. The present invention uses the Pearson correlation coefficient to identify similar users: given a scoring matrix R, the similarity between user a and user b is denoted sim(a, b), and r_a and r_b are the scoring data of the "user-traffic" scoring matrix. In its standard form, with P denoting the set of files, r_{a,p} the score of user a for file p, and \bar{r}_a the mean score of user a, the Pearson correlation coefficient is:

$$
\mathrm{sim}(a,b)=\frac{\sum_{p\in P}\left(r_{a,p}-\bar{r}_a\right)\left(r_{b,p}-\bar{r}_b\right)}{\sqrt{\sum_{p\in P}\left(r_{a,p}-\bar{r}_a\right)^{2}}\,\sqrt{\sum_{p\in P}\left(r_{b,p}-\bar{r}_b\right)^{2}}}
$$

(8) If the value of sim(a, b) is positive, the two different users are determined to be similar users, and their user IDs are recorded.

(9) Based on the user IDs of the similar users, use a collection to store the information of all associated files accessed by the similar users.
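Items (5) and (9) of the example above can be sketched together: a bean with the three listed properties, and a collection of the files accessed by the similar users. The class names `RelatedFilesSketch` and `FileAccessBean` and the method `relatedFiles` are hypothetical. Database persistence through JDBC is omitted, so the beans are passed in as an in-memory list.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch; all names are hypothetical and persistence is omitted.
final class RelatedFilesSketch {
    // Bean with the three properties listed in item (5), exposed via get and set methods.
    static final class FileAccessBean {
        private String userId;
        private String fileName;
        private int accessCount;

        FileAccessBean(String userId, String fileName, int accessCount) {
            this.userId = userId;
            this.fileName = fileName;
            this.accessCount = accessCount;
        }

        String getUserId() { return userId; }
        void setUserId(String userId) { this.userId = userId; }
        String getFileName() { return fileName; }
        void setFileName(String fileName) { this.fileName = fileName; }
        int getAccessCount() { return accessCount; }
        void setAccessCount(int accessCount) { this.accessCount = accessCount; }
    }

    // Item (9): collect the names of all files accessed by any of the similar users.
    // A LinkedHashSet removes duplicates while keeping first-seen order.
    static Set<String> relatedFiles(List<FileAccessBean> rows, Set<String> similarUserIds) {
        Set<String> related = new LinkedHashSet<>();
        for (FileAccessBean row : rows) {
            if (similarUserIds.contains(row.getUserId())) {
                related.add(row.getFileName());
            }
        }
        return related;
    }
}
```

In this sketch the related file collection holds only file names; the text speaks more generally of "file information", which could carry additional fields.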

Step B: Add the files in the related file collection and the file to be stored to the merge queue in order.

Step C: Determine whether the total size of all files in the merge queue exceeds 128 MB. If yes, go to Step D; otherwise, go to Step E.

Step D: Merge all files in the merge queue into one data block, clear the file information in the merge queue, delete the source files of the merged file, and return to Step B.

Step E: Determine whether all files in the related file collection and the file to be stored have been added to the merge queue. If yes, merge all files in the merge queue into one data block, clear the file information in the merge queue, delete the source files of the merged file, and go to Step F; otherwise, return to Step B.

Step F: Store all merged data blocks in the HDFS system.

Furthermore, in the present invention multiple associated small files are merged into one large file and stored in the system, so the namenode only stores the metadata corresponding to the large file. The amount of metadata the namenode maintains is greatly reduced, and its memory consumption decreases accordingly.

Furthermore, the merging method of the present invention merges associated files into the same large file, so that the merged files are stored in the same data block on the same datanode. When user requests for files are strongly correlated, that is, when the small files a user accesses continually have been merged into the same large file, then by the principle of file access the system reads the data block on the nearest datanode, which means that the data is read from a data block on the same datanode. Accessing different files therefore does not require jumping between different datanodes, which reduces disk addressing overhead, consumes fewer system resources, and greatly improves file read efficiency.

A person skilled in the art will understand that all or part of the processes of the above method embodiments may be implemented by a computer program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, may include the processes of the method embodiments described above. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), or a random access memory (RAM).

The above are preferred embodiments of the present invention. It should be noted that a person of ordinary skill in the art may make various improvements and refinements without departing from the principle of the present invention, and such improvements and refinements also fall within the scope of protection of the present invention.

Claims

A storage method using a user access preference model, comprising:

Step A: after a client uploads a file to be stored, traversing all files in HDFS and using the user access preference model to find the related file set of the file to be stored, wherein the user access preference model is derived statistically from user access log records;

Step B: adding the files in the related file set and the file to be stored to a merge queue;

Step C: determining whether the total size of all files in the merge queue exceeds 128 MB, and if yes, going to Step D, otherwise going to Step E;

Step D: merging all files in the merge queue into one data block, clearing the file information in the merge queue, deleting the source files of the merged file, and returning to Step B;

Step E: determining whether all files in the related file set and the file to be stored have been added to the merge queue, and if yes, merging all files in the merge queue into one data block, clearing the file information in the merge queue, deleting the source files of the merged file, and going to Step F, otherwise returning to Step B; and

Step F: storing all merged data blocks in the HDFS system;

wherein deriving the user access preference model statistically from the user access log records specifically comprises: deriving an active user set from the user access log records; representing each small file accessed by the active user set as a bean object, a small file being a file whose size is 2 MB or less, the properties of the bean object including the ID of the user who accessed the small file, the name of the small file accessed by the user, and the number of times the small file was accessed; using JDBC to persist the bean objects to a MySQL database, and computing the similarity between any two different access behaviors from the stored data; if the similarity between two different access behaviors is positive, treating the users of those two access behaviors as similar users, recording the IDs of the similar users, and using a related file set to store the information of all associated files accessed by the similar users; and constructing the user access preference model from the related file set; and

wherein deriving the active user set from the user access log records specifically comprises: filtering from the user access log records the record lines in which the accessed resource has the suffix jpg, each record line including a user ID, the URL of the accessed page, the access start time, the access status, and the access traffic; creating a record-parsing class to parse the record lines, and using a two-dimensional array to store the visitor IPs and the small-file names; traversing the visitor IPs in the two-dimensional array and using a HashMap collection to tally the traffic of each visitor IP, the key of the HashMap collection being the visitor IP and the value being the traffic; and sorting the HashMap collection in descending order of value, selecting the top 20% of visitor IPs, storing this IP subset in an ArrayList collection, and marking it as the active user set.