JP2014174966A

JP2014174966A - Character string data processing method, program and system

Info

Publication number: JP2014174966A
Application number: JP2013050191A
Authority: JP
Inventors: Michihiro Horie; Kazunori Ogata; Kiyokuni Kawachiya; Graeme Johnson; Michael Dawson
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 2013-03-13
Filing date: 2013-03-13
Publication date: 2014-09-22
Anticipated expiration: 2033-03-13
Also published as: JP6103994B2

Abstract

PROBLEM TO BE SOLVED: To improve memory use efficiency by allowing guest VMs to share character string objects in a heap memory. SOLUTION: In order to allow objects in the same memory image to be used among a plurality of JVMs, a system of the invention performs the following: extracting character string objects commonly used by a plurality of guest VMs and collecting them into a file; mapping the file to a predefined address in memory when execution starts; searching the character string objects in the mapped file when a character string object is about to be generated during execution; and, if the same character string object is found, making it usable. Shared character string data sets are extracted and collected for each application and component. To prepare such shared character string data sets, the system of the invention executes one or more Java(R) programs on a plurality of JVMs on a plurality of guest VMs.

Description

The present invention relates to a technique for processing character string data in an environment in which a plurality of guest environments (operating systems and Java(R) VMs) run on a single computer system.

Improving the utilization of physical memory has long been an important issue in environments where a plurality of guest environments run on a single computer system (also called a machine). In particular, when nearly identical sets of software run in a plurality of guest VMs, each guest VM generates the same character string data, which is wasteful.

One conventional technique for addressing this extends the class sharing function of the JVM so that JVM class data and the like are shared among a plurality of guest VMs. For more information, see http://www.ibm.com/developerworks/jp/java/library/j-shared/.

Another known technique creates a cache file of the character string objects generated at JVM startup and shares it among the guest VMs. In this case, however, only the cache file is shared; the objects themselves must be recreated inside each JVM and cannot be shared.

Japanese Patent Laid-Open No. 2002-229793 concerns assigning characters or character strings as variable names and method names in a Java(R) program. For a plurality of groups consisting of a variable name group and method name groups per argument type, it discloses assigning a unique character or character string to each individual target within one group, while assigning common characters or character strings, as appropriate, to individual targets across groups. However, this prior art does not describe a technique for searching for the assigned character strings.

US Patent Publication No. 2012/0017204 discloses placing the character string objects generated at JVM startup in a cache file and, from the next startup onward, generating the character string objects from that cache file, thereby reducing the character string objects that are generated redundantly at JVM startup. This technique targets only character strings that are generated redundantly before the JVM finishes starting up. In addition, after the cache file is loaded, the character string objects must be stored again in the intern table.

U.S. Pat. No. 7,707,583 discloses a technique for sharing objects in a runtime system and isolating user sessions in a scalable manager. In this technique, a user context corresponding to a user session is stored in a shared memory area. When a request corresponding to the user session is received, one process is selected from a set of operating system processes, and one runtime system is selected from a set of runtime systems. The technique reduces physical memory usage by sharing classes and objects among a plurality of virtual machines. It also classifies shareable objects by class and lets the user decide which types of objects to share. However, this prior art does not describe a technique for searching for shared objects at high speed.

U.S. Patent Publication No. 2004/0049493 discloses a string search technique for routing based on ASCII strings in a packet payload. In this technique, registered character strings are registered in a single hash table. At search time, before the search string is looked up in the hash table, substrings are looked up in an array, and a search for a string that cannot possibly be registered is terminated immediately. Two such arrays are provided and are organized hierarchically together with the hash table. Although this technique shows how to abandon a search for a non-matching string at an early stage, it does not suggest any way of arranging how the strings are stored in order to speed up the search.

US Pat. No. 7,418,505 discloses a technique for fast routing in which a separate hash table is used for each IP address prefix length, reducing collisions of values within the hash tables. However, this technique can prepare to avoid collisions of values in the hash table only to a limited extent.

Kiyokuni Kawachiya, Kazunori Ogata, Tamiya Onodera, "Analysis and Reduction of Memory Inefficiencies in Java Strings," in Proceedings of the 23rd Annual ACM Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA '08), pp. 385-401 (Oct. 2008), discloses improving memory use efficiency in the Java(R) heap by focusing on duplicated character strings and unused literals.

JP 2002-229793 A
US Patent Publication No. 2012/0017204
US Pat. No. 7,707,583
US Patent Publication No. 2004/0049493
US Pat. No. 7,418,505

http://www.ibm.com/developerworks/jp/java/library/j-shared/
Kiyokuni Kawachiya, Kazunori Ogata, Tamiya Onodera, "Analysis and Reduction of Memory Inefficiencies in Java Strings," in Proceedings of the 23rd Annual ACM Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA '08), pp. 385-401 (Oct. 2008).

An object of the present invention is to improve memory use efficiency by enabling character string objects in heap memory to be shared among guest VMs.

Another object of the present invention is to provide a technique by which shared character string objects in heap memory can be searched at high speed.

The present invention solves the above problems by storing character string objects that can be shared among a plurality of guest VMs in a directly referenceable format, and by enabling them to be searched efficiently at run time.

Although the present invention is not limited to it, the preferred implementation is in Java(R).

In a Java(R) implementation, in order to use objects in the same memory image among a plurality of JVMs, the system according to the present invention extracts the character string objects commonly used by a plurality of guest VMs and collects them into a file. When execution starts, the file is mapped to a predetermined address in memory. When a character string object is about to be generated during execution, the character string objects in the mapped file are searched, and if an identical one exists, it is made available for use.

In doing so, the shared character string data sets are extracted and collected for each application or component. To prepare such shared character string data sets, the system according to the present invention executes one or more Java(R) programs on a plurality of JVMs on a plurality of guest VMs. The character string data present in the Java(R) heap of each JVM is extracted, and character string data that appears in common across JVMs on different machines is treated as the sharing target character string data. The sharing target character string data can then be referenced directly as Java(R) objects simply by mapping it into the address space of the JVM process.

In one aspect of the present invention, to enable high-speed search of character string objects, the target data is divided into groups based on its characteristics, and each group uses the data structure that makes searching for its objects fastest. The characteristics of the target data here include the length of the character string, and the class file or jar file in which the character string object is generated. A hash table is created for each such group of character strings, and the hash table is later used to search for character strings at high speed.
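The per-group lookup described above can be sketched as follows. This is a minimal illustration, assuming grouping by string length only; the class `GroupedStringTable` and its methods are hypothetical names, and `java.util.HashMap` merely stands in for the purpose-built hash tables of the description. It shows only the control flow: select the group's table, search it, and reuse the stored object when an equal string is found.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration: one lookup table per string-length group.
// java.util.HashMap is a stand-in for the per-group hash tables of the description.
final class GroupedStringTable {
    private final Map<Integer, Map<String, String>> tablesByLength = new HashMap<>();

    // Registers a string as the shared instance in the table for its length group.
    void register(String shared) {
        tablesByLength
            .computeIfAbsent(shared.length(), len -> new HashMap<>())
            .put(shared, shared);
    }

    // Returns the registered instance with equal contents, or the candidate itself if none exists.
    String lookupOrSelf(String candidate) {
        Map<String, String> table = tablesByLength.get(candidate.length());
        if (table == null) {
            return candidate;
        }
        String shared = table.get(candidate);
        return shared != null ? shared : candidate;
    }

    public static void main(String[] args) {
        GroupedStringTable tables = new GroupedStringTable();
        String shared = new String("ISOLATION");
        tables.register(shared);
        String fresh = new String("ISOLATION");
        // The lookup hands back the registered object rather than the new copy.
        System.out.println(tables.lookupOrSelf(fresh) == shared);
    }
}
```

Keying the outer map by length means a lookup never touches the tables of other groups, which is the property the grouping is meant to provide.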

According to the present invention, sharing character string objects in heap memory among the guest VMs improves memory use efficiency.

In addition, character string objects are preferably grouped based on the characteristics of the target data, and each group uses the data structure that makes searching for its objects fastest, which improves the efficiency of character string search.

FIG. 1 is a block diagram of an example hardware configuration for carrying out the present invention.
FIG. 2 is a diagram showing a multiple virtual machine environment.
FIG. 3 is a diagram showing a state in which a plurality of JVMs are running in one guest VM.
FIG. 4 is a flowchart of the processing that collects common target character strings from a plurality of guest VMs.
FIG. 5 is a flowchart of the processing that determines the hash function and bit shift amount used to search for a character string among the common target character strings.
FIG. 6 is a diagram schematically showing the processing that determines the hash function and bit shift amount for grouped character strings.
FIG. 7 is a flowchart of the processing that obtains, from the character string data in a group, the indexes used by the hash function.
FIG. 8 is a flowchart of the processing that determines the hash function with the fewest hash collisions.
FIG. 9 is a flowchart of the processing that finds the shift operation with the fewest collisions of hash table index values.
FIG. 10 is a diagram showing the relationship between the hash table index and the bit shift.
FIG. 11 is a diagram showing a structure for storing shared character string data in a DLL.
FIG. 12 is a diagram showing a state in which shared character string data is concretely stored in a DLL.
FIG. 13 is a diagram showing a memory map in a state in which a plurality of JVMs are running.
FIG. 14 is a flowchart of String constructor call processing.
FIG. 15 is a flowchart of String.intern() call processing.

Embodiments of the present invention are described below with reference to the drawings. It should be understood that these embodiments illustrate preferred modes of the invention and are not intended to limit the scope of the invention to what is shown here. Throughout the following drawings, the same reference numerals denote the same objects unless otherwise noted.

FIG. 1 shows a block diagram of computer hardware, designated generally by reference numeral 100, for realizing the system configuration and processing according to an embodiment of the present invention. In FIG. 1, a CPU 104, a main memory (RAM) 106, a hard disk drive (HDD) 108, a keyboard 110, a mouse 112, and a display 114 are connected to a system bus 102. The CPU 104 is preferably based on a 32-bit or 64-bit architecture; for example, Intel Core(TM) i3, Core(TM) i5, Core(TM) i7, or Xeon(R), or AMD Athlon(TM), Phenom(TM), or Sempron(TM) can be used. The main memory 106 preferably has a capacity of 8 GB or more, more preferably 16 GB or more.

As shown in FIG. 2, a hypervisor 202 for implementing a plurality of virtual machines (VMs) is installed on the hard disk drive 108. Programs usable as the hypervisor 202 include, but are not limited to, VMware(R) and Xen. The description here assumes that Xen is used.

A host VM 204 and a plurality of guest VMs 206a, 206b, ..., 206n are configured on the hypervisor 202. The host VM 204, also called domain 0 in Xen, contains the device drivers that interface with the hardware 100 via the hypervisor 202. The guest VMs 206a, 206b, ..., 206n therefore interface with the hardware 100 via the host VM 204.

The hard disk drive 108 stores an operating system (OS). The operating system may be any one compatible with the CPU 104, such as Linux(TM), Microsoft Windows(TM) 7, or Windows(TM) 2008 Server. The operating system (OS) is indicated by reference numeral 302 in FIG. 3, described later.

The hard disk drive 108 further stores a Java(R) Runtime Environment program for implementing the Java(R) virtual machine (JVM) 204 (FIG. 2). The JVMs are indicated by reference numerals 306a, 306b, ..., 306m in FIG. 3, described later.

The hard disk drive 108 further stores Java(R) application programs that run on the JVM. In this embodiment, the Java(R) application programs include WebSphere(R) Application Server, provided by International Business Machines Corporation. The Java(R) application programs are indicated by reference numerals 308a, 308b, ..., 308m in FIG. 3, described later.

The hard disk drive 108 also stores a program, such as Apache, for operating the system as a Web server.

The keyboard 110 and the mouse 112 are used to operate graphic objects such as icons, task bars, and text boxes displayed on the display 114, in accordance with the graphical user interface provided by the operating system.

The display 114 is preferably, but not limited to, a 32-bit true color LCD monitor with a resolution of 1024 × 768 or higher. The display 114 is used, for example, to display the results of operations performed by application programs running on the JVM.

The communication interface 116 is preferably connected to a network using the Ethernet(R) protocol. Through functions provided by Apache, the communication interface 116 receives processing requests from client computers (not shown) in accordance with a communication protocol such as TCP/IP. The host VM 204 sends each processing request to the designated guest VM, receives the processing result from that guest VM, and returns it to the client computer (not shown).

Next, the guest VMs are described with reference to FIG. 3. Since the guest VMs 206a, 206b, ..., 206n are all functionally identical, the guest VM 206a is described here as a representative.

As shown in FIG. 3, the guest VM 206a can contain, on the OS 302, a plurality of virtual machines 304a, 304b, ..., 304m, each of which includes one of the JVMs 306a, 306b, ..., 306m and an application program.

On this basis, the processing of the present invention is described with reference to the flowcharts of FIG. 4 onward. The processing shown in FIG. 4 is executed for each Java(R) program in each guest VM, so a program that controls the overall processing across the plurality of guest VMs can be placed in the host VM 204.

In FIG. 4, the program of the present invention executes a Java(R) program in one of the guest VMs in step 402. Next, in step 404, the program of the present invention causes that guest VM to output a core file. The core file here means a JVM system dump file. The system dump file is preferably written to a predetermined directory on the hard disk drive 108.

Next, in step 406, the program of the present invention extracts character string object information from the core file.

Once character string object information has been extracted in this way from the core files produced by running the Java(R) programs in all guest VMs, the program of the present invention collects the information from the plurality of guest VMs in step 408 and, in step 410, treats character string data that appears in a plurality of core files as the common target character strings.
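Steps 408 and 410 amount to counting in how many dumps each string occurs. The following is a minimal sketch under stated assumptions: the strings of each core file are taken to be already extracted into a `Set<String>` (the extraction itself is not shown), and the class `CommonStringCollector`, the method `commonStrings`, and the threshold parameter `minDumps` are hypothetical names introduced for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of selecting strings that occur in several dumps.
final class CommonStringCollector {
    // Returns the strings that occur in at least minDumps of the per-dump string sets.
    static Set<String> commonStrings(List<Set<String>> stringsPerDump, int minDumps) {
        Map<String, Integer> dumpCount = new HashMap<>();
        for (Set<String> dump : stringsPerDump) {
            for (String s : dump) {
                dumpCount.merge(s, 1, Integer::sum); // each dump counts a string once
            }
        }
        Set<String> common = new TreeSet<>();
        for (Map.Entry<String, Integer> e : dumpCount.entrySet()) {
            if (e.getValue() >= minDumps) {
                common.add(e.getKey());
            }
        }
        return common;
    }

    public static void main(String[] args) {
        List<Set<String>> dumps = List.of(
            Set.of("ISOLATION", "ASSERTION"),
            Set.of("ASSERTION", "ASSOCIATE"),
            Set.of("ASSOCIATE", "local-only"));
        System.out.println(commonStrings(dumps, 2));
    }
}
```

A threshold of 2 matches the wording "appears in a plurality of core files"; a stricter policy, such as requiring every dump, would only change `minDumps`.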

If the set of common target character strings collected from the plurality of guest VMs were left as it is, searching for a string would take a long time. The program of the present invention therefore attaches search keys to the set of common target character strings, using the processing shown in the flowchart of FIG. 5.

That is, in step 502, the program of the present invention divides the target character string data into groups based on its characteristics. The characteristics of the target data here include the length of the character string, and the class file or jar file in which the character string object is generated. The most typical of these characteristics is the length of the character string.
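Grouping by the most typical characteristic, string length, can be sketched in a few lines. The class `LengthGrouping` and the method `groupByLength` are hypothetical names; grouping by class file or jar file would need origin metadata that this sketch does not model.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical illustration of step 502 for the string-length characteristic.
final class LengthGrouping {
    // Groups strings by length; the TreeMap keeps the groups ordered by length.
    static Map<Integer, List<String>> groupByLength(List<String> strings) {
        return strings.stream()
            .collect(Collectors.groupingBy(String::length, TreeMap::new, Collectors.toList()));
    }

    public static void main(String[] args) {
        List<String> shared = List.of("ISOLATION", "java", "ASSERTION", "void", "ASSOCIATE");
        groupByLength(shared).forEach((len, group) -> System.out.println(len + " -> " + group));
    }
}
```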

In step 504, the program of the present invention uses all the character strings grouped in this way to examine which character positions show the greatest variety of characters within the group, and uses the top few such character indexes in the hash calculation. Step 504 is described in detail later with reference to the flowchart of FIG. 7.

FIG. 6 schematically shows the processing applied to character strings grouped by length. It illustrates the processing of strings of length 9, of length 10, and of length 12. Here, with the strings lined up one above another, three or four indexes (character positions) with high information entropy are selected from among the resulting columns.

In step 506, the program of the present invention selects, from among hash functions prepared in advance, the one that produces the fewest hash value collisions. The hash functions prepared in advance are those shown in FIG. 6 as hashFn1, hashFn2, ..., hashFn10. Examples of such hash functions are given below. In these expressions, ch1, ch2, ch3, and ch4 are the character values of the string at the selected indexes. These hash functions are only examples; various other hash functions that a person skilled in the art could devise can also be used. Step 506 is described in detail later with reference to the flowchart of FIG. 8.

int hash = ch1 * ch2 + ch3;
return (hash * hash);

int hash = ch1 * ch2 * ch3;
return (hash * hash);

int hash = ch1 * ch2 * ch3 + ch4;
return (hash * hash);

In step 508, when deriving the hash table index, the program of the present invention tries every shift operation and selects the calculation that produces the fewest collisions. In FIG. 6 this is shown as 1-bit shift, ..., n-bit shift. Step 508 is described in detail later with reference to the flowchart of FIG. 9.

Next, step 504 is described in detail with reference to the flowchart of FIG. 7. The flowchart of FIG. 7 concerns the processing of the character strings in a group. In step 702, the program of the present invention counts how many distinct characters occur as the i-th character. Strings whose length is less than i are excluded from the count.

In step 704, the program of the present invention finds the top few indexes with the greatest variety of characters. As a result, in step 706, a plurality of character string indexes to be used by the hash function are obtained.
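The counting and selection in steps 702 to 706 can be sketched as below. The sketch assumes 0-based character positions and breaks ties in favor of the lower position; the class `IndexSelector` and its methods are hypothetical names. The three sample strings are the ones the description itself uses as an example of a length-9 group.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of choosing the character positions with the most variety.
final class IndexSelector {
    // Counts distinct characters at position i, skipping strings too short to have one.
    static int distinctAt(List<String> group, int i) {
        Set<Character> seen = new HashSet<>();
        for (String s : group) {
            if (i < s.length()) {
                seen.add(s.charAt(i));
            }
        }
        return seen.size();
    }

    // Returns the k positions with the most distinct characters (ties go to the lower position).
    static List<Integer> topIndexes(List<String> group, int k) {
        int maxLen = 0;
        for (String s : group) {
            maxLen = Math.max(maxLen, s.length());
        }
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < maxLen; i++) {
            positions.add(i);
        }
        positions.sort((a, b) -> {
            int byCount = Integer.compare(distinctAt(group, b), distinctAt(group, a));
            return byCount != 0 ? byCount : Integer.compare(a, b);
        });
        return new ArrayList<>(positions.subList(0, Math.min(k, positions.size())));
    }

    public static void main(String[] args) {
        List<String> group = List.of("ISOLATION", "ASSERTION", "ASSOCIATE");
        System.out.println(topIndexes(group, 2)); // prints [3, 4]
    }
}
```

For these strings, positions 3 and 4 each hold three different characters (L, E, O and A, R, C), while every other position holds at most two.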

Next, step 506 is described in detail with reference to the flowchart of FIG. 8. The flowchart of FIG. 8 takes as input the character string data in a group and the character string indexes, that is, which character positions are used in the calculation. In step 802, the program of the present invention uses the given character string indexes to calculate a hash value with one of the hash functions prepared in advance, as described above, and repeats this for all character strings in the group. In step 804, the program of the present invention counts the number of hash value collisions.

The program of the present invention repeats steps 802 and 804 for all the hash functions prepared in advance and, in step 806, selects the hash function that produced the fewest hash value collisions.
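The selection loop of steps 802 to 806 might look like the following sketch. It uses the three example hash functions listed in the description as candidates; the class `HashFunctionChooser`, its method names, and the convention of passing the selected characters as an `int[]` are assumptions made for illustration. A collision is counted whenever a hash value has already been produced by an earlier string in the group.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.ToIntFunction;

// Hypothetical illustration of picking the candidate hash function with the fewest collisions.
final class HashFunctionChooser {
    // The three example functions from the description; c[0] = ch1, c[1] = ch2, and so on.
    static final List<ToIntFunction<int[]>> CANDIDATES = List.of(
        c -> { int h = c[0] * c[1] + c[2]; return h * h; },
        c -> { int h = c[0] * c[1] * c[2]; return h * h; },
        c -> { int h = c[0] * c[1] * c[2] + c[3]; return h * h; });

    // Extracts the characters of s at the selected indexes.
    static int[] pick(String s, int[] indexes) {
        int[] chars = new int[indexes.length];
        for (int i = 0; i < indexes.length; i++) {
            chars[i] = s.charAt(indexes[i]);
        }
        return chars;
    }

    // Counts strings whose hash value was already produced by an earlier string.
    static int collisions(List<String> group, int[] indexes, ToIntFunction<int[]> fn) {
        Set<Integer> seen = new HashSet<>();
        int collisions = 0;
        for (String s : group) {
            if (!seen.add(fn.applyAsInt(pick(s, indexes)))) {
                collisions++;
            }
        }
        return collisions;
    }

    // Returns the position in CANDIDATES of the function with the fewest collisions.
    static int bestCandidate(List<String> group, int[] indexes) {
        int best = 0;
        int fewest = Integer.MAX_VALUE;
        for (int i = 0; i < CANDIDATES.size(); i++) {
            int c = collisions(group, indexes, CANDIDATES.get(i));
            if (c < fewest) {
                fewest = c;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> group = List.of("ISOLATION", "ASSERTION", "ASSOCIATE");
        System.out.println(bestCandidate(group, new int[] {3, 4, 6, 7}));
    }
}
```

The sketch supplies four indexes so that every candidate can be evaluated on the same input; candidates that use only three characters ignore the fourth.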

Next, step 508 is described in detail with reference to the flowchart of FIG. 9. The flowchart of FIG. 9 takes as input the hash values obtained by applying the hash function to each item of character string data in the group. In step 904, as shown in FIG. 10, the program of the present invention shifts the hash value right by n bits and takes L bits as the hash table index. Here, L is a constant that depends on the size of the hash table. Next, in step 906, the program of the present invention counts the number of collisions among the hash table index values. Steps 904 and 906 are repeated for every possible right-shift amount, and in step 908 the shift operation that produced the fewest hash table index collisions is selected.
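The shift selection of steps 904 to 908 can be sketched as follows. The sketch assumes 32-bit hash values and tries shift amounts from 0 up to 32 minus L, which is one reading of "every possible right-shift amount"; the class `ShiftChooser` and its methods are hypothetical names.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration of choosing the right-shift amount with the fewest index collisions.
final class ShiftChooser {
    // Shifts the hash right by `shift` bits and keeps the low `bits` bits as the table index.
    static int tableIndex(int hash, int shift, int bits) {
        return (hash >> shift) & ((1 << bits) - 1);
    }

    // Counts hashes whose table index was already produced by an earlier hash.
    static int collisions(int[] hashes, int shift, int bits) {
        Set<Integer> seen = new HashSet<>();
        int collisions = 0;
        for (int h : hashes) {
            if (!seen.add(tableIndex(h, shift, bits))) {
                collisions++;
            }
        }
        return collisions;
    }

    // Tries every shift from 0 to 32 - bits and returns the first one with the fewest collisions.
    static int bestShift(int[] hashes, int bits) {
        int best = 0;
        int fewest = Integer.MAX_VALUE;
        for (int shift = 0; shift <= 32 - bits; shift++) {
            int c = collisions(hashes, shift, bits);
            if (c < fewest) {
                fewest = c;
                best = shift;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        int[] hashes = {16, 32, 48, 64}; // all multiples of 16, so the low four bits carry no information
        System.out.println(bestShift(hashes, 2)); // prints 4
    }
}
```

In the sample data the low four bits are always zero, so only a shift of four bits or more exposes bits that actually differ between the hash values.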

As a result, suppose for example that the indexes with high information entropy are 3, 4, and 6, that the hash function with the fewest hash value collisions is hashFn3, and that a right shift by 2 bits gives the fewest hash table index collisions. The index used to find a character string of length 9 is then calculated as follows:
int hc = hashFn3(chars, offset, 3, 4, 6);
int index = (hc >> 2) & ((1 << 12) - 1);
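The two-line calculation above can be made runnable as in the sketch below. The body of `hashFn3` is not given in the description, so this sketch assumes the three-character product form from the listed examples; the interpretation of `offset` as the start position of the string inside the character array, and the class name `SharedStringIndex`, are likewise assumptions.

```java
// Hypothetical illustration of the index calculation for the length-9 group.
final class SharedStringIndex {
    // Assumed body: the description names hashFn3 but does not define it.
    static int hashFn3(char[] chars, int offset, int i1, int i2, int i3) {
        int hash = chars[offset + i1] * chars[offset + i2] * chars[offset + i3];
        return hash * hash;
    }

    // Shifts right by 2 bits, then keeps the low 12 bits (a table of 4096 slots).
    static int indexFor(char[] chars, int offset) {
        int hc = hashFn3(chars, offset, 3, 4, 6);
        return (hc >> 2) & ((1 << 12) - 1);
    }

    public static void main(String[] args) {
        char[] chars = "ISOLATION".toCharArray();
        System.out.println(indexFor(chars, 0));
    }
}
```

The mask `(1 << 12) - 1` corresponds to L = 12 in the terminology of FIG. 9, so the resulting index always lies between 0 and 4095.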

A supplementary note on the hash table index follows. In this embodiment, a hash table is prepared for each group in order to store the shared character strings. To store a shared character string in a hash table, it is necessary to decide at which position (index) of the hash table it is stored.

The value is calculated by a hash function that takes the character string as input, but the result does not necessarily correspond one-to-one to an index of the hash table. The result must therefore be adjusted so that it falls within the index range of the hash table. This adjustment is the processing of "shifting right by n bits and using L bits as the hash table index".

The key policy when using a hash table is that, as far as possible, only one character string should be stored at any given index of the hash table. Two results are obtained from the character strings in a group:
(1) the result of the hash function calculation
(2) the result of the index calculation
and (2) must be made to spread out as much as possible. For (2) to spread out, it is desirable that the values of (1) are already well spread. This is the purpose of the process of "selecting, from among hash functions prepared in advance, the one with the fewest hash value collisions".
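The two selection steps described above, choosing the hash function and then the right shift with the fewest collisions, might be sketched as follows. All class and method names here are hypothetical, and the collision count (the number of strings whose value was already produced by an earlier string in the group) is one reasonable measure assumed for illustration; the text does not specify how collisions are counted.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.ToIntFunction;

// Illustrative sketch only; names and the collision measure are assumptions.
final class HashSelectionSketch {
    // Counts strings whose value was already produced by an earlier string.
    static int collisions(List<String> group, ToIntFunction<String> fn) {
        Set<Integer> seen = new HashSet<>();
        int count = 0;
        for (String s : group) {
            if (!seen.add(fn.applyAsInt(s))) count++;
        }
        return count;
    }

    // Returns the position in 'candidates' of the function with the fewest collisions.
    static int bestFunction(List<String> group, List<ToIntFunction<String>> candidates) {
        int best = 0;
        for (int i = 1; i < candidates.size(); i++) {
            if (collisions(group, candidates.get(i)) < collisions(group, candidates.get(best))) {
                best = i;
            }
        }
        return best;
    }

    // Returns the right shift n (0..maxShift) whose indexBits-wide index collides least.
    static int bestShift(List<String> group, ToIntFunction<String> fn, int indexBits, int maxShift) {
        int mask = (1 << indexBits) - 1;
        int best = 0;
        int bestCount = Integer.MAX_VALUE;
        for (int n = 0; n <= maxShift; n++) {
            final int shift = n;
            int c = collisions(group, s -> (fn.applyAsInt(s) >> shift) & mask);
            if (c < bestCount) {
                bestCount = c;
                best = n;
            }
        }
        return best;
    }
}
```

Ties are resolved in favor of the earlier candidate and the smaller shift, which is an arbitrary choice made for this sketch.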

Furthermore, a character string is used as input to obtain (1), but not all of the information in the character string needs to be used. The character strings in the group are therefore examined, and only the positions that appear necessary are extracted and used. For example, assume that the three character strings "ISOLATION", "ASSERTION", and "ASSOCIATE" belong to a group. When they are lined up as follows,
"ISOLATION"
"ASSERTION"
"ASSOCIATE"
it can be seen that it is sufficient to use only indexes 3 and 4, where the character strings in the group differ the most.
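One way to find the positions that differ the most is to compute the Shannon entropy of the characters at each position across the group and keep the highest-scoring positions. The sketch below assumes that all strings in the group have the same length and that k does not exceed that length; the class and method names are hypothetical, and the text does not state the exact entropy measure used.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; names and the entropy measure are assumptions.
final class PositionEntropySketch {
    // Shannon entropy (in bits) of the characters found at 'pos' across the group.
    static double entropyAt(String[] group, int pos) {
        Map<Character, Integer> counts = new HashMap<>();
        for (String s : group) {
            counts.merge(s.charAt(pos), 1, Integer::sum);
        }
        double h = 0.0;
        for (int c : counts.values()) {
            double p = (double) c / group.length;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    // Returns the k positions with the highest entropy, in ascending position order.
    // Ties are resolved in favor of the lower position.
    static int[] topPositions(String[] group, int length, int k) {
        boolean[] chosen = new boolean[length];
        for (int n = 0; n < k; n++) {
            int best = -1;
            for (int pos = 0; pos < length; pos++) {
                if (!chosen[pos]
                        && (best < 0 || entropyAt(group, pos) > entropyAt(group, best))) {
                    best = pos;
                }
            }
            chosen[best] = true;
        }
        int[] result = new int[k];
        int i = 0;
        for (int pos = 0; pos < length; pos++) {
            if (chosen[pos]) result[i++] = pos;
        }
        return result;
    }
}
```

For the three strings in the example, positions 3 and 4 are the only ones where all three characters differ, so they receive the highest entropy.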

The program of the present invention creates a DLL in which the shared character strings are stored, and causes the JVM to load that DLL at the same address every time. On Linux (trademark), this can be achieved by using the prelink command.

Then, when the JVM starts, only the Class of String and of char[] is placed in a writable location of the DLL. FIG. 11 shows, for this purpose, the definition of the structure string_len3 of java.lang.String and char[] as shared character string data. Such a structure is defined for each character string length, with java.lang.String and char[] as a pair. FIG. 12 shows string_len3 with values actually stored in it. Although FIG. 12 shows only two pairs of java.lang.String and char[], it should be understood that many more are actually included.

FIG. 13 is a diagram showing the DLL storing the shared character strings loaded at the same address across a plurality of JVMs in a single guest VM. Each JVM can thereby search the shared character strings defined in the DLL in the same manner and independently. The search uses the hash function and index selected for each group of shared character string data. The shared character strings are preferably placed outside the heap (off-heap). Access from the JVM to off-heap memory can be achieved by using a technique such as JNI (Java(R) Native Interface).

FIG. 14 is a flowchart of the processing performed when the String constructor is called in a Java(R) program. In step 1402, the program checks the characteristics, such as the length, of the character string represented by the argument char[].

Then, in step 1404, the program determines whether there is a group for those characteristics; if not, the program creates a new String object in step 1406. If there is a group for the characteristics, the program accesses the DLL described above and searches for whether the group contains the same character string as the one represented by the argument char[].

If the program determines in step 1410 that the character string was not found, it creates a new String object in step 1412. If the program determines in step 1410 that the character string was found, then in step 1414 it creates a new String that refers to the char[] in the DLL.
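The decision flow of FIG. 14 can be modeled in ordinary Java as follows. This is only a model under stated assumptions: application code cannot make a java.lang.String refer to an external char[], so the sketch returns the backing array that the new String would use. The shared DLL is replaced by a hypothetical in-memory map keyed by length, and a plain map lookup stands in for the per-group hash function and index.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative model only; the in-memory map stands in for the shared DLL.
final class SharedConstructorSketch {
    // One table per group; here a group is simply "all strings of one length".
    private final Map<Integer, Map<String, char[]>> groups = new HashMap<>();

    // Registers a string as shared (done ahead of time in the embodiment).
    void addShared(String s) {
        groups.computeIfAbsent(s.length(), k -> new HashMap<>()).put(s, s.toCharArray());
    }

    // Returns the char[] the new String would be backed by.
    char[] backingArrayFor(char[] arg) {
        Map<String, char[]> group = groups.get(arg.length);  // steps 1402 and 1404
        if (group == null) {
            return arg.clone();                              // step 1406: ordinary new String
        }
        char[] shared = group.get(new String(arg));          // search within the group
        if (shared == null) {
            return arg.clone();                              // step 1412: not found
        }
        return shared;                                       // step 1414: reuse the shared char[]
    }
}
```

Two calls with equal contents that hit the shared table receive the same array instance, which is where the memory saving comes from.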

FIG. 15 is a flowchart of the processing performed when String.intern() is called in a Java(R) program. In step 1502, the program checks the characteristics, such as the length, of the character string represented by the argument char[].

Then, in step 1504, the program determines whether there is a group for those characteristics; if not, the program executes String.intern() in step 1506. If there is a group for the characteristics, the program accesses the DLL described above and searches for whether the group contains the same character string as the one represented by the argument char[].

If the program determines in step 1510 that the character string was not found, it executes String.intern() in step 1512. If the program determines in step 1510 that the character string was found, then in step 1514 it returns the String object in the DLL.
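The String.intern() flow of FIG. 15 differs from the constructor flow in that a hit returns the shared String object itself, and a miss falls back to the regular intern(). The following sketch models that behavior; as an assumption for illustration, the shared DLL is again represented by a hypothetical in-memory map keyed by length, and the class and method names are not from the source.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative model only; the in-memory map stands in for the shared DLL.
final class SharedInternSketch {
    private final Map<Integer, Map<String, String>> groups = new HashMap<>();

    // Registers 's' as the canonical shared instance for its contents.
    void addShared(String s) {
        groups.computeIfAbsent(s.length(), k -> new HashMap<>()).put(s, s);
    }

    String intern(String s) {
        Map<String, String> group = groups.get(s.length());  // steps 1502 and 1504
        if (group == null) {
            return s.intern();                               // step 1506: no group
        }
        String shared = group.get(s);                        // search within the group
        if (shared == null) {
            return s.intern();                               // step 1512: not found
        }
        return shared;                                       // step 1514: shared String
    }
}
```

In both fallback cases the caller still receives a canonical instance, so code that relies on reference equality of interned strings keeps working.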

The embodiment of the present invention has been described above as implemented on JVMs in guest VMs. However, the invention is not limited to this, and it should be understood that it is applicable to any environment in which the character strings in use can be extracted from a plurality of VM environments and common character strings can be made accessible to a plurality of VM environments. That is, the present invention can be implemented without being limited to a specific language environment or platform.

In addition, although a DLL is used in the above embodiment as the mechanism for making common character strings accessible from a plurality of VM environments, this is merely an example. Any method of placing them in memory may be used, as long as they are placed in memory outside the management of the individual JVMs, such as off-heap memory.

104 CPU
106 RAM
108 Hard disk drive
204 Host VM
206a, 206b, ... 206n Guest VM
306a, 306b, ... 306m JVM

Claims

1. A method in a computer system in which an application program is executed in each of a plurality of guest VMs, the method comprising:
extracting, in each guest VM, a common character string from each heap area used by each application program; and
placing the extracted common character string in a memory of the system so that each application program can access the character string.

2. The method according to claim 1, wherein the guest VM includes a plurality of JVMs, the application program is a Java program, and the JVM is configured to perform class definition at the same address each time, so that the class pointer of the extracted common character string object is the same among the plurality of guest VMs.

3. The method according to claim 2, wherein the character string object is loaded into memory as a DLL.

4. The method according to claim 1, further comprising:
grouping the extracted common character strings according to the length of the character string or the characteristics of how it was created; and
constructing, for each group of the grouped character strings, a hash index for searching for a target character string.

5. The method according to claim 4, wherein constructing the hash index comprises:
finding the positions to use for the index based on how high the information entropy is within the character string;
calculating a plurality of hash functions at the found positions and selecting the hash function having the fewest collisions among the plurality of hash functions; and
bit-shifting the character string to select the shift position with the fewest collisions.

6. A program for a computer system in which an application program is executed in each of a plurality of guest VMs, the program causing the computer system to execute:
extracting, in each guest VM, a common character string from each heap area used by each application program; and
placing the extracted common character string in a memory of the system so that each application program can access the character string.

7. The program according to claim 6, wherein the guest VM includes a plurality of JVMs, the application program is a Java program, and the JVM is configured to perform class definition at the same address each time, so that the class pointer of the extracted common character string object is the same among the plurality of guest VMs.

8. The program according to claim 7, wherein the character string object is loaded into a memory as a DLL.

9. The program according to claim 6, further causing the computer system to execute:
grouping the extracted common character strings according to the length of the character string or the characteristics of how it was created; and
constructing, for each group of the grouped character strings, a hash index for searching for a target character string.

10. The program according to claim 9, wherein constructing the hash index comprises:
finding the positions to use for the index based on how high the information entropy is within the character string;
calculating a plurality of hash functions at the found positions and selecting the hash function having the fewest collisions among the plurality of hash functions; and
bit-shifting the character string to select the shift position with the fewest collisions.

11. A computer system in which an application program is executed in each of a plurality of guest VMs, the computer system comprising:
means for extracting, in each guest VM, a common character string from each heap area used by each application program; and
means for placing the extracted common character string in a memory of the system so that each application program can access the character string.

12. The computer system according to claim 11, wherein the guest VM includes a plurality of JVMs, the application program is a Java program, and the JVM is configured to perform class definition at the same address each time, so that the class pointer of the extracted common character string object is the same among the plurality of guest VMs.

13. The computer system according to claim 12, wherein the character string object is loaded into a memory as a DLL.