JP4079665B2 - Search device

Info

Publication number: JP4079665B2
Application number: JP2002088985A
Authority: JP
Inventors: 泰三阿南; 貴文上戸
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2002-03-27
Filing date: 2002-03-27
Publication date: 2008-04-23
Anticipated expiration: 2022-03-27
Also published as: JP2003288368A

Description

【０００１】
【発明の属する技術分野】
本発明は検索装置に関し、特にパターンの検索を行う検索装置に関する。
【０００２】
【従来の技術】
近年の情報化社会の発展に伴い、扱われる情報量は大規模化しており、マルチメディアサービスの高度化、広域化を実現するためには、記号列（文字列）検索処理の高速化が強く求められている。検索処理では、テキストと呼ばれる与えられた特定の記号列の中から、パターンと呼ばれる指定された記号列を検索する。
【０００３】
従来の最も単純な検索アルゴリズムでは、検索過程で不一致が検出されたら、パターンを１記号ずらして再び照合を繰り返すものであるが、より高速処理を可能とした代表的なアルゴリズムとして、Knuth、Morris、Prattによって提案されたＫＭＰ法や、BoyerとMooreによって提案されたＢＭ法がある。
【０００４】
ＫＭＰ法及びＢＭ法の基本検索操作としては、パターンの記号の照合時に（ＫＭＰ法はパターンの先頭から前→後の方向に、ＢＭ法ではパターンの最後尾から後→前の方向に照合していく）、不一致を検出したときには、ある条件にもとづき、パターンを大きく移動するものである。このような検索を行うことで、記号の不一致が生じたときに常にパターンを１記号ずらして照合を行う単純な方法と比べて、検索性能を向上させることができる。
【０００５】
【発明が解決しようとする課題】
情報通信の分野においては、連続する同一記号を含む構造を持つパターンの検索を行うことが多い。同一記号が連続するパターンとしては、例えば、ＨＤＬＣフラグやＭＰＥＧで圧縮されたストリーム中に存在するヘッダなどがある。
【０００６】
このような特定の構造を持つパターンに対して、上記のような従来のＫＭＰ法やＢＭ法で検索処理を行うと、無駄な記号照合が発生し、検索性能の向上を図れないといった問題があった。
【０００７】
高速検索アルゴリズムを代表するＫＭＰ法やＢＭ法であっても、あらゆる構造のパターンに対して有効とはいえず、同一記号列を持つパターンに対しての検索性能は必ずしも最良とはいえなかった。
【０００８】
本発明はこのような点に鑑みてなされたものであり、連続する同一記号を含む構造を持つパターンに対する検索処理を高速に行う検索装置を提供することを目的とする。
【０００９】
【課題を解決するための手段】
本発明では上記課題を解決するために、図１に示すような、パターンの検索を行う検索装置１０において、パターンの中の連続する同一の記号からなる同一記号列を認識する同一記号列認識部１１と、パターンとテキストとの記号照合を行う検索処理部１２と、を有することを特徴とする検索装置１０が提供される。
【００１０】
ここで、検索処理部１２は、同一記号列の先頭から最後尾まで１〜ｎの番号を付けて、同一記号列の最後尾の記号から、同一記号列の先頭に向かって記号照合を開始し、同一記号列の中の記号の内のｋ（１≦ｋ≦ｎ）の番号が付与された記号の記号照合に失敗したときは、失敗した位置から最後尾方向へパターンを記号ｋ個分シフトして、同一記号列の最後尾の記号から再度記号照合を行う。
【００１１】
【発明の実施の形態】
以下、本発明の実施の形態を図面を参照して説明する。図１は検索装置の原理図である。検索装置１０は、テキスト（またはバイナリデータ）の中からパターンの検索を行う。
【００１２】
同一記号列認識部１１は、パターンの中の連続する同一の記号からなる同一記号列を認識する。検索処理部１２は、同一記号列の最後尾の記号から、パターンとテキストとの記号照合（または単に照合と呼ぶ）を開始する。そして、最後尾の記号の記号照合が失敗した場合には、同一記号の数だけパターンをシフトして再度記号照合をして検索を行う。
【００１３】
例えば、パターンを“000B”とすると、このパターンの同一記号列は“000”である。検索を行う際には、テキストとパターンの左端を合わせて出発して、“000”の最後尾の記号“0”からテキストとの記号照合を開始する。最後尾の記号“0”の記号照合が失敗した場合には（図のテキストは“C”であるから“0”と不一致である）、同一記号列の数（ここでは３記号分）だけシフトして、再び記号照合を行っていく。詳細な動作内容は図５で後述する。
【００１４】
次にＫＭＰ法とＢＭ法の概要について説明する。図２はＫＭＰ法を説明するための図である。テキストの中に、パターンの出現する位置を見つけるために、まず、テキストとパターンの左端を合わせて出発し、パターンの先頭（左端）から後方へテキストと１記号ずつ順に照合していく。
【００１５】
そして、記号の不一致が生じたとき（図の○と△）、パターンのその位置から左方にある部分列（図の太実線枠）と同じものがパターンの先頭部にもあれば、パターンを図の大きさだけ右にずらす。なお、部分列は最大長のものを選ぶ。
【００１６】
例えば、パターン“EFGHEF2”を図のテキストの中から見つける場合、“EFGHEF”まで照合が成功し、最後尾の“２”で不一致が生じた場合、“２”の左方にある部分列“EF”と同じものがパターンの先頭にもあるので、次のパターン位置は、位置Ｐ１までパターンの先頭をずらすことができる。
【００１７】
図３、図４はＢＭ法を説明するための図である。テキストの中にパターンの出現する位置を見つけるために、まず、テキストとパターンの左端を合わせて出発し、パターンの最後尾（右端）から前方へテキストと１記号ずつ順に照合していく。
【００１８】
また、ＢＭ法では、記号の不一致が生じたとき（図の○と△）、パターンを右へどれだけずらすかについて２つのアルゴリズムがある。図３に第１のアルゴリズム、図４に第２のアルゴリズムを示す。
【００１９】
図３に対し、第１のアルゴリズムはＫＭＰ法と似たもので、パターンの右端から不一致の前の位置にある部分列（図の太実線枠）と同じ部分列が左方にもあれば、その部分列がテキストの部分列と重なるようにパターンを右へずらす。
【００２０】
例えば、パターン“NEFGHEF2”を図のテキストの中から見つける場合、“EF2”まで照合が成功し、次の“H”で不一致が生じた場合、“H”の前の位置の部分列“EF”と同じ部分列がパターンの左方にもあるので、この部分列が重なる位置までずらすことができる（パターンの先頭は位置Ｐ２にくる）。
【００２１】
図４に対し、第２のアルゴリズムでは、不一致の際のテキストの記号を見て、パターンの左方に同じ記号があるとき、これとテキストのその記号とが重なるようにパターンをずらす。
【００２２】
例えば、パターン“NKFGHEF2”を図のテキストの中から見つける場合、“EF2”まで照合が成功し、次の“H”で不一致が生じた場合、不一致が生じたときのテキストの記号は“K”である。“K”はパターン左方にもあるので、パターンとテキストの“K”が重なる位置までずらすことができる（パターンの先頭は位置Ｐ３にくる）。
【００２３】
なお、一般的には、ＫＭＰ法よりもＢＭ法の検索速度の方が高速であることが知られている。また、ＢＭ法の中では、第１のアルゴリズムよりも第２のアルゴリズムの検索速度の方が高速であることが知られている。
【００２４】
次に検索装置１０の検索動作について従来のＢＭ法と比較しながら詳しく説明する。なお、以降では検索装置１０の検索動作をパターン検索方法とも呼ぶ。
【００２５】
図５は検索手順を示す図である。テキストを図に示す記号列とし、パターンを“000B”とする。まず、パターン中の連続する最長の同一記号列“000”に着目する。そして、テキストとパターンの左端を合わせて出発し、最初に、パターン“000B”の中の“000”の最後尾の記号“０”からテキストとの照合を行う。
【００２６】
もしも、照合した結果、記号の不一致が発見されたら、パターンの先頭方向には、記号の照合が失敗した文字“0”しか存在しないので、この場合に１文字及び２文字のシフトをして記号を照合しても、一致しないのは明白である。
【００２７】
そこで、パターンを３文字（同一記号列の数分）シフトして、再度同様な記号照合を繰り返していく。図では３回目のシフトのときに“000”の一番右の文字が一致している。したがって、次は“000”の２番目の文字とテキストを比較することになる。
【００２８】
ここで、もし、記号の不一致が発見されれば、１文字分シフトをして記号を照合しても、一致しないのは明白であるので、２文字シフトして記号の照合を再開する。なお、一般には、同一記号列の先頭から最後尾まで１〜ｎの番号を付けた際に、ｋ（２≦ｋ≦ｎ）番目の記号照合に失敗したときは、パターンをｋ個分シフトすることになる。
【００２９】
このような検索処理を行うことにより、シフト回数を大幅に削減することが可能になる。なお、図の例では、３回目のシフトで、テキスト中にパターン“000B”が検出されている（この３回目シフト位置における照合順番は、“000”の最後尾の“0”、中間の“0”、先頭の“0”の順に照合し、その次に“B”を照合して、パターンを検出している）。
【００３０】
図６は従来のＢＭ法の検索手順を示す図である。テキストとパターンの記号列は図５と同様である。図はＢＭ法の第２のアルゴリズムの場合を示している。ＢＭ法では、パターンの最後尾の文字から検索を開始する。
【００３１】
もしも、記号が不一致であれば、テキスト中の不一致であった記号を見て、パターンの左方に同じ記号があるとき、この記号を不一致が発見された位置にくるようにパターンをシフトする。ここでは、テキスト中の不一致であった記号は“0”であるので、この“0”と、“000B”の最後尾から２番目の“0”とが重なるようにパターンをシフトする。
【００３２】
このような処理は０回目シフトから２回目シフト、３回目シフトから６回目シフトにかけて行われている。また、２回目から３回目にシフトする際、テキスト中の“A”とパターン中の“B”は不一致であり、“A”はパターン左方に存在しないので、２回目シフトから３回目シフトへ行く際には、パターン４文字分のシフトが行われている。
【００３３】
図に示すように、ＢＭ法の場合では合計７回のシフト回数を要している。このように、同一記号列を含むパターンの場合では、検索装置１０による検索処理の方が高速であることがわかる。
【００３４】
次に検索効率を導入した場合の検索装置について説明する。図７は検索装置の構成を示す図である。検索装置２０は、検索効率計算部２１、検索処理部２２から構成される。なお、以降では検索装置２０の検索動作を単に検索方法とも呼ぶ。
【００３５】
検索効率計算部２１は、パターンの各記号に対する検索効率を計算する。検索効率とは、記号照合が失敗したときのシフト操作のとりうる値であるシフト量に、そのシフト量が生じるための確率をかけた総和であるシフト量の期待値と、記号照合回数との比で表される。検索処理部２２は、検索効率の値が大きい記号の順に記号照合をして検索を行う。
【００３６】
次に検索効率及び検索装置２０の動作について詳しく説明する。検索速度を上げるためには、シフト量を大きくとって、無駄な記号照合をなくさなければならない。この場合に、パターン中のどこの記号から照合を行えば、照合が失敗したときに、シフト量を大きくとれるかといった指標が必要である。この指標を示すものが検索効率である。検索効率は、シフト量の期待値／記号照合回数である。
【００３７】
まず最初に、パターン検索方法に検索効率を導入した場合について説明する。具体例として、テキストがA、B、Cの３種類の記号からなり、パターンが“AAABCAB”の場合を考える。また、このパターン中の各記号に対して、パターン先頭から（１）、（２）、…、（７）の番号を付ける。
【００３８】
パターン検索方法では同一記号列に着目して検索を行うので、“AAA”に対する（３）、（２）、（１）のそれぞれの記号の検索効率を求めることになる。検索効率をＥ[（ｎ）]と表記すると（ｎはパターンに付けた番号である）、Ｅ[（３）]〜Ｅ[（１）]は以下の式で表される（Ｅ[（３）]とは、“AAA”の最後尾の“A”に関する検索効率、Ｅ[（２）]とは、“AAA”の中間の“A”に関する検索効率、Ｅ[（１）]とは、“AAA”の左端の“A”に関する検索効率のことである）。
【００３９】
【数１】
Ｅ[（３）]＝（Ｐ（Ａ）×０＋Ｐ（Ｂ）×３＋Ｐ（Ｃ）×３）／１＝２…（１ａ）
Ｅ[（２）]＝（Ｐ（Ａ）×０＋Ｐ（Ｂ）×２＋Ｐ（Ｃ）×２）／２＝４／６…（１ｂ）
Ｅ[（１）]＝（Ｐ（Ａ）×０＋Ｐ（Ｂ）×１＋Ｐ（Ｃ）×１）／３＝２／９…（１ｃ）
各式中のＰ（Ａ）は、テキスト中に記号“Ａ”が出現する確率、Ｐ（Ｂ）はテキスト中に記号“Ｂ”が出現する確率、Ｐ（Ｃ）はテキスト中に記号“Ｃ”が出現する確率である。ここでは、Ｐ（Ａ）＝Ｐ（Ｂ）＝Ｐ（Ｃ）＝１／３と仮定する。
【００４０】
一方、Ｅ[（ｎ）]の式の分子は、シフト量の期待値であり、確率変数であるシフト量の確率を重みとする加重平均で示されている。例えば、式（１ａ）のＰ（Ａ）×０に対し、“AAA”の最後尾の“A”をテキストと照合した際に、テキストの記号が“A”であったならば一致する。この場合、シフト操作はないので、シフト量は０である。すなわち、“AAA”の最後尾の“A”をテキストと照合した際に、シフト量が０となるためには、その位置のテキストに“A”が出現すればよく、その確率はＰ（Ａ）である。
【００４１】
また、式（１ａ）のＰ（Ｂ）×３及びＰ（Ｃ）×３に対し、“AAA”の最後尾の“A”をテキストと照合した際に、テキストの記号が“B”または“C”であったならば不一致である。この場合、シフト操作は図５で上述したように３記号分行うので、シフト量は３である。すなわち、“AAA”の最後尾の“A”をテキストと照合した際に、シフト量が３となるためには、その位置のテキストに“B”または“C”が出現すればよく、その確率はＰ（Ｂ）、Ｐ（Ｃ）である。
【００４２】
したがって、期待値とは、確率変数の値に、それぞれの値が実現する確率をかけたものの和であるから、式（１ａ）の分子（シフト量の期待値）は、Ｐ（Ａ）×０＋Ｐ（Ｂ）×３＋Ｐ（Ｃ）×３となる。式（１ｂ）、式（１ｃ）に対しても同様の考え方である。
【００４３】
一方、各式中の分母は、パターンとテキストとの記号照合回数を示している。“AAA”の最後尾の“A”の場合は、一番最初に記号照合されるので式（１ａ）では１となっている。
【００４４】
同様にして、“AAA”の中間の“A”の場合は、最後尾の“A”の照合が終わってその次に記号照合されるので、中間の“A”の照合を行うためにはパターンとテキストとの記号照合回数は２回要する。このため、式（１ｂ）では２となっている。さらに、“AAA”の左端の“A”の場合は、最後尾の“A”、中間の“A”の照合が終わってその次に記号照合されるので、左端の“A”の照合を行うためにはパターンとテキストとの記号照合回数は３回要する。このため、式（１ｃ）では３となっている。
【００４５】
このようにして計算した検索効率の値の大小を比べると、（１）＜（２）＜（３）である。したがって、検索効率が大きい値のものほど、照合が失敗したときに、シフト量を大きくとれることを示しているので、値の大きい（３）、（２）、（１）の順で（“AAA”の最後尾、中間、左端の順で）、記号照合を行えばよいことがわかる。
【００４６】
次にＢＭ法の第２のアルゴリズムに検索効率を導入した場合について説明する。例としては、上記と同様に、テキストはA、B、Cの３種類の記号からなり、パターンを“AAABCAB”とする。また、パターン先頭から（１）、（２）、…、（７）の番号を付ける。パターンの各記号に対してＢＭ法にもとづく検索効率Ｅ[（７）]〜Ｅ[（１）]は以下の式で表される。
【００４７】
【数２】
Ｅ[（７）]＝（Ｐ（Ａ）×１＋Ｐ（Ｂ）×０＋Ｐ（Ｃ）×２）／１＝１…（２ａ）
Ｅ[（６）]＝（Ｐ（Ａ）×０＋Ｐ（Ｂ）×２＋Ｐ（Ｃ）×１）／２＝１／２…（２ｂ）
Ｅ[（５）]＝（Ｐ（Ａ）×２＋Ｐ（Ｂ）×１＋Ｐ（Ｃ）×０）／３＝１／３…（２ｃ）
Ｅ[（４）]＝（Ｐ（Ａ）×１＋Ｐ（Ｂ）×０＋Ｐ（Ｃ）×４）／４＝５／１２…（２ｄ）
Ｅ[（３）]＝（Ｐ（Ａ）×０＋Ｐ（Ｂ）×３＋Ｐ（Ｃ）×３）／５＝２／５…（２ｅ）
Ｅ[（２）]＝（Ｐ（Ａ）×０＋Ｐ（Ｂ）×２＋Ｐ（Ｃ）×２）／６＝４／１８…（２ｆ）
Ｅ[（１）]＝（Ｐ（Ａ）×０＋Ｐ（Ｂ）×１＋Ｐ（Ｃ）×１）／７＝２／２１…（２ｇ）
ここで、例えば、式（２ｃ）について説明すると、Ｐ（Ｃ）×０に対し、“AAABCAB”の先頭から５番目の“C”をテキストと照合した際に、テキストの記号が“C”であったならば一致する。この場合、シフト操作はないので、シフト量は０である。すなわち、“AAABCAB”の先頭から５番目の“C”をテキストと照合した際に、シフト量が０となるためには、その位置のテキストに“C”が出現すればよく、その確率はＰ（Ｃ）である。
【００４８】
また、Ｐ（Ａ）×２に対し、“AAABCAB”の先頭から５番目の“C”をテキストと照合した際に、テキストの記号が“A”であったならば不一致である。この場合、シフト操作は図４で上述したように２記号分行うので（不一致であった位置に（３）の“A”を合わせることになるから）、シフト量は２である。すなわち、“AAABCAB”の先頭から５番目の“C”をテキストと照合した際に、シフト量が２となるためには、その位置のテキストに“A”が出現すればよく、その確率はＰ（Ａ）である。
【００４９】
さらに、Ｐ（Ｂ）×１に対し、“AAABCAB”の先頭から５番目の“C”をテキストと照合した際に、テキストの記号が“B”であったならば不一致である。この場合、シフト操作は図４で上述したように１記号分行うので（不一致であった位置に（４）の“B”を合わせることになるから）、シフト量は１である。すなわち、“AAABCAB”の先頭から５番目の“C”をテキストと照合した際に、シフト量が１となるためには、その位置のテキストに“B”が出現すればよく、その確率はＰ（Ｂ）である。
【００５０】
一方、式（２ｃ）の分母に対し、“AAABCAB”の先頭から５番目の“C”の場合は、パターン最後尾から３番目に照合されるので、パターンとテキストとの記号照合回数は３回要するために３となっている。なお、その他の式についても上記と同様な考え方である。
【００５１】
ここで、パターン検索方法に導入した場合の式（１ａ）〜（１ｃ）によるそれぞれの検索効率の値及びＢＭ法に導入した場合の式（２ａ）〜（２ｇ）によるそれぞれの検索効率の値の大小を比べると、式（１ａ）＝２が最も大きく、その次に式（２ａ）＝１が大きい。
【００５２】
したがって、パターン検索方法とＢＭ法を組み合わせる場合には、最初にパターン検索方法でパターン“AAABCAB”の（３）から照合し、その後にＢＭ法で（７）、（６）、（５）、（４）、（２）、（１）の順で、それぞれの記号の照合を行えば、検索性能を向上させることができる。
【００５３】
次にパターン検索方法及びＫＭＰ法を組み合わせ、これらに検索効率を導入した場合について説明する。例としては、テキストはA、B、Cの３種類の記号からなり、パターンを“ABCABAAA”とする。また、パターン先頭から（１）、（２）、…、（８）の番号を付ける。パターンの各記号に対して、パターン検索方法にもとづく検索効率Ｅ[（８）]〜Ｅ[（６）]は以下の式で表される。
【００５４】
【数３】
Ｅ[（８）]＝（Ｐ（Ａ）×０＋Ｐ（Ｂ）×３＋Ｐ（Ｃ）×３）／１＝２…（３ａ）
Ｅ[（７）]＝（Ｐ（Ａ）×０＋Ｐ（Ｂ）×２＋Ｐ（Ｃ）×２）／２＝４／６…（３ｂ）
Ｅ[（６）]＝（Ｐ（Ａ）×０＋Ｐ（Ｂ）×１＋Ｐ（Ｃ）×１）／３＝２／９…（３ｃ）
パターン検索方法における各式の考え方については上述したので説明は省略する。一方、ＫＭＰ法にもとづく検索効率Ｅ[（１）]〜Ｅ[（８）]は以下の式で表される。
【００５５】
【数４】
Ｅ[（１）]＝（Ｐ（Ａ）×０＋Ｐ（Ｂ）×１＋Ｐ（Ｃ）×１）／１＝２／３…（４ａ）
Ｅ[（２）]＝（Ｐ（Ａ）×１＋Ｐ（Ｂ）×０＋Ｐ（Ｃ）×１）／２＝２／６…（４ｂ）
Ｅ[（３）]＝（Ｐ（Ａ）×１＋Ｐ（Ｂ）×１＋Ｐ（Ｃ）×０）／３＝２／９…（４ｃ）
Ｅ[（４）]＝（Ｐ（Ａ）×０＋Ｐ（Ｂ）×１＋Ｐ（Ｃ）×１）／４＝２／１２…（４ｄ）
Ｅ[（５）]＝（Ｐ（Ａ）×１＋Ｐ（Ｂ）×０＋Ｐ（Ｃ）×１）／５＝２／１５…（４ｅ）
Ｅ[（６）]＝（Ｐ（Ａ）×０＋Ｐ（Ｂ）×３＋Ｐ（Ｃ）×３）／６＝２／６…（４ｆ）
Ｅ[（７）]＝（Ｐ（Ａ）×０＋Ｐ（Ｂ）×１＋Ｐ（Ｃ）×１）／７＝２／２１…（４ｇ）
Ｅ[（８）]＝（Ｐ（Ａ）×０＋Ｐ（Ｂ）×１＋Ｐ（Ｃ）×１）／８＝２／２４…（４ｈ）
ここで、例えば、式（４ｆ）について説明すると、Ｐ（Ａ）×０に対し、“ABCABAAA”の先頭から６番目の“A”をテキストと照合した際に、テキストの記号が“A”であったならば一致する。この場合、シフト操作はないので、シフト量は０である。すなわち、“ABCABAAA”の先頭から６番目の“A”をテキストと照合した際に、シフト量が０となるためには、その位置のテキストに“A”が出現すればよく、その確率はＰ（Ａ）である。
【００５６】
また、式（４ｆ）のＰ（Ｂ）×３及びＰ（Ｃ）×３に対し、“ABCABAAA”の先頭から６番目の“A”をテキストと照合した際に、テキストの記号が“B”または“C”であったならば不一致である。この場合、シフト操作は図２で上述したように３記号分行うので（不一致となった位置から左方に部分列“AB”があり、この部分列はパターン先頭にもあるから、３記号分ずらすことになる）、シフト量は３である。すなわち、“ABCABAAA”の先頭から６番目の“A”をテキストと照合した際に、シフト量が３となるためには、その位置のテキストに“B”または“C”が出現すればよく、その確率はＰ（Ｂ）、Ｐ（Ｃ）である。
【００５７】
一方、式（４ｆ）の分母に対し、“ABCABAAA”の先頭から６番目の“A”の場合は、パターン先頭から６番目に照合されるので、パターンとテキストとの記号照合回数は６回要するため６となっている。なお、その他の式についても上記と同様な考え方である。
【００５８】
ここで、パターン検索方法に導入した場合の式（３ａ）〜（３ｃ）によるそれぞれの検索効率の値及びＫＭＰ法に導入した場合の式（４ａ）〜（４ｈ）によるそれぞれの検索効率の値の大小を比べると、式（３ａ）＝２が最も大きく、その次に式（３ｂ）＝式（４ａ）＝２／３が大きい。
【００５９】
したがって、最初にパターン検索方法でパターン“ABCABAAA”の（８）、（７）を照合し、その後にＫＭＰ法で、（１）、（２）、（３）、（４）、（５）、（６）の順で、それぞれの記号の照合を行えば、検索性能を向上させることができる。
【００６０】
次にパターン検索方法をソフトウェアで実施した場合について説明する。図８は同一記号列の最大長をカウントする際のプログラムを示す図である。プログラム２００は、同一記号列の最大長をカウントする。例えば、パターンが“AAAB”の場合は３、“AABBBBCC”の場合は４を返す。
【００６１】
図９はパターン検索方法の処理手順を示すフローチャートである。なお、Ｔｘｔはテキスト、Ｐａｔはパターンを示す。
〔Ｓ１〕記号照合開始順番をテーブルＴ[ｉ]に記憶する。例えば、パターンを“000B”として、パターン先頭から（０）〜（３）の番号を付ける。同一記号列の最後尾から検索を行うから、Ｔ[０]＝（２）、Ｔ[１]＝（１）、Ｔ[２]＝（０）、Ｔ[３]＝（３）となる。
〔Ｓ２〕記号照合失敗時のシフト量をテーブルＳ[Ｔ[ｉ]]に記憶する。例えば、パターン“000B”の場合、同一記号列の最後尾が不一致であったときには３記号分シフトするので、Ｓ[Ｔ[０]]＝３と表記される。
〔Ｓ３〕初期設定として、ｉ＝０、shift＝０とする。
〔Ｓ４〕ｉがパターンの記号数（パターンの長さ）になるまで、以下のステップＳ５〜Ｓ７のループ処理を行う。
〔Ｓ５〕シフト位置shiftのときのテキストＴｘｔ[shift＋Ｔ[ｉ]]と、パターンＰａｔ[Ｔ[ｉ]]との記号を比較し、不一致ならばステップＳ６へ、一致ならばステップＳ７へ行く。
〔Ｓ６〕あらかじめ記憶してあるシフト量だけシフトする（ｉ＝０、shift＝shift＋Ｓ[Ｔ[ｉ]]）。
〔Ｓ７〕次に照合すべき記号へ移る（ｉ＋＋）。
【００６２】
次に検索方法についてフローチャートを用いて説明する。図１０は検索方法の処理手順を示すフローチャートである。
〔Ｓ１１〕パターンとテキストの記号照合を行う際に、記号照合が失敗したときのシフト操作のとりうる値であるシフト量に、前記シフト量が生じるための確率をかけた総和であるシフト量の期待値と、記号照合回数と、の比である検索効率を計算する。
〔Ｓ１２〕検索効率の値が大きい記号の順に、記号照合をして検索を行う。
【００６３】
図１１はパターン検索方法及びＢＭ法を組み合わせて検索効率を導入したときの検索方法の処理手順を示すフローチャートである。
〔Ｓ２１〕パターンの中から同一記号列を特定する。
〔Ｓ２２〕同一記号列中の記号に対して検索効率を計算する。
〔Ｓ２３〕すべての記号に対してＢＭ法の検索効率を計算する。
〔Ｓ２４〕検索効率から記号照合順番を決定する。
〔Ｓ２５〕記号照合順番にもとづき検索を実行する。
【００６４】
図１２はパターン検索方法及びＫＭＰ法を組み合わせて検索効率を導入したときの検索方法の処理手順を示すフローチャートである。
〔Ｓ３１〕パターンの中から同一記号列の位置を特定する。
〔Ｓ３２〕同一記号列中の記号に対して検索効率を計算する。
〔Ｓ３３〕すべての記号に対してＫＭＰ法の検索効率を計算する。
〔Ｓ３４〕検索効率から記号照合順番を決定する。
〔Ｓ３５〕記号照合順番にもとづき検索を実行する。
【００６５】
図１３はパターン検索方法とＢＭ法とＫＭＰ法を組み合わせて検索効率を導入したときの検索方法の処理手順を示すフローチャートである。
〔Ｓ４１〕パターンの中から同一記号列の位置を特定する。
〔Ｓ４２〕同一記号列中の記号に対して検索効率を計算する。
〔Ｓ４３〕すべての記号に対してＢＭ法の検索効率を計算する。
〔Ｓ４４〕すべての記号に対してＫＭＰ法の検索効率を計算する。
〔Ｓ４５〕検索効率から記号照合順番を決定する。
〔Ｓ４６〕記号照合順番にもとづき検索を実行する。
【００６６】
以上説明したように、検索装置２０及び検索方法によれば、テキストやバイナリデータの検索時に、検索効率を用いて、記号照合手順を動的に変更し、無駄な記号照合を避けることとした。これにより、従来のＫＭＰ法またはＢＭ法だけで検索するよりも、高速にパターンを検索することが可能になる。
【００６７】
次に画像受信装置について説明する。図１４は画像通信システムの構成を示す図である。画像通信システム１００は、ネットワーク１０１で接続される、画像送信装置１２０と画像受信装置１１０とから構成される。
【００６８】
画像受信装置１１０に対し、ストリームデータ受信部１１１は、画像送信装置１２０から送信されたＭＰＥＧ４等のストリームデータを受信し格納する。ヘッダサーチ部１１２は、検索装置１０または検索装置２０の機能を有し、ストリームデータからヘッダ情報（同一記号列が含まれる）を検索し、ヘッダ位置情報を出力する。
【００６９】
情報分離部１１３は、ヘッダ位置情報にもとづき、ストリーム信号からヘッダ情報と画像情報を分離する。ヘッダ処理部１１４は、ヘッダ情報を処理し、デコーダ１１５は、画像情報をデコードして画面に表示する。
【００７０】
なお、上記の説明では、画像通信システムに適用したが、これ以外にもデータベースやファイルなどのサーチ、ＨＤＬＣ等の通信での同期フラグ検出など、様々な分野の検索システムや検索ソフトウェアに適用可能である。
【００７１】
（付記１）パターンの検索を行う検索装置において、
パターン中の連続する同一の記号からなる同一記号列を認識する同一記号列認識部と、
前記同一記号列の最後尾の記号から、パターンとテキストとの記号照合を行い、前記最後尾の記号の記号照合が失敗した場合には、同一記号の数だけパターンをシフトして再度記号照合をして検索を行う検索処理部と、
を有することを特徴とする検索装置。
【００７２】
（付記２）前記検索処理部は、前記同一記号列の先頭から最後尾まで１〜ｎの番号を付けて、前記同一記号列の最後尾の記号から先頭に向かって記号照合を行う際に、ｋ（２≦ｋ≦ｎ）番目の記号照合に失敗したときは、パターンをｋ個分シフトして再度記号照合を行うことを特徴とする付記１記載の検索装置。
【００７３】
（付記３）パターンの検索を行う検索装置において、
パターンとテキストの記号照合を行う際に、記号照合が失敗したときのシフト操作のとりうる値であるシフト量に、前記シフト量が生じるための確率をかけた総和であるシフト量の期待値と、記号照合回数と、の比である検索効率を計算する検索効率計算部と、
前記検索効率の値が大きい記号の順に、記号照合をして検索を行う検索処理部と、
を有することを特徴とする検索装置。
【００７４】
（付記４）前記検索効率計算部は、パターン中の連続する同一の記号からなる同一記号列の最後尾から検索を行うパターン検索方法、ＢＭ法、ＫＭＰ法の少なくとも１つに前記検索効率を導入することを特徴とする付記３記載の検索装置。
【００７５】
（付記５）前記検索処理部は、前記パターン検索方法、前記ＢＭ法、前記ＫＭＰ法を任意に組み合わせて、前記検索効率の値が大きい記号の順にもとづいて検索を行うことを特徴とする付記４記載の検索装置。
【００７６】
（付記６）同一記号列を含むパターンの検索を行うパターン検索方法において、
パターン中の連続する同一の記号からなる同一記号列を認識し、
前記同一記号列の最後尾の記号から、パターンとテキストとの記号照合を行い、
前記最後尾の記号の記号照合が失敗した場合には、同一記号の数だけパターンをシフトして再度記号照合をして検索を行うことを特徴とするパターン検索方法。
【００７７】
（付記７）前記同一記号列の先頭から最後尾まで１〜ｎの番号を付けて、前記同一記号列の最後尾の記号から先頭に向かって記号照合を行う際に、ｋ（２≦ｋ≦ｎ）番目の記号照合に失敗したときは、パターンをｋ個分シフトして再度記号照合を行うことを特徴とする付記６記載のパターン検索方法。
【００７８】
（付記８）パターンの検索を行う検索方法において、
パターンとテキストの記号照合を行う際に、記号照合が失敗したときのシフト操作のとりうる値であるシフト量に、前記シフト量が生じるための確率をかけた総和であるシフト量の期待値と、記号照合回数と、の比である検索効率を計算し、
前記検索効率の値が大きい記号の順に、記号照合をして検索を行うことを特徴とする検索方法。
【００７９】
（付記９）パターン中の連続する同一の記号からなる同一記号列の最後尾から検索を行うパターン検索方法、ＢＭ法、ＫＭＰ法の少なくとも１つに前記検索効率を導入することを特徴とする付記８記載の検索方法。
【００８０】
（付記１０）前記パターン検索方法、前記ＢＭ法、前記ＫＭＰ法を任意に組み合わせて、前記検索効率の値が大きい記号の順にもとづいて検索を行うことを特徴とする付記９記載の検索方法。
【００８１】
（付記１１）画像受信制御を行う画像受信装置において、
画像圧縮されたストリームデータを受信するストリームデータ受信部と、
パターン中の連続する同一の記号からなる同一記号列を認識する同一記号列認識部と、前記同一記号列の最後尾の記号から、パターンとストリームデータとの記号照合を行い、前記最後尾の記号の記号照合が失敗した場合には、同一記号の数だけパターンをシフトして再度記号照合をして、ヘッダの検索を行う検索処理部と、から構成されるヘッダサーチ部と、
を有することを特徴とする画像受信装置。
【００８２】
（付記１２）前記検索処理部は、前記同一記号列の先頭から最後尾まで１〜ｎの番号を付けて、前記同一記号列の最後尾の記号から先頭に向かって記号照合を行う際に、ｋ（２≦ｋ≦ｎ）番目の記号照合に失敗したときは、パターンをｋ個分シフトして再度記号照合を行うことを特徴とする付記１１記載の画像受信装置。
【００８３】
（付記１３）画像受信制御を行う画像受信装置において、
画像圧縮されたストリームデータを受信するストリームデータ受信部と、
パターンと前記ストリームデータの記号照合を行う際に、記号照合が失敗したときのシフト操作のとりうる値であるシフト量に、前記シフト量が生じるための確率をかけた総和であるシフト量の期待値と、記号照合回数と、の比である検索効率を計算する検索効率計算部と、前記検索効率の値が大きい記号の順に、記号照合をして、ヘッダの検索を行う検索処理部と、から構成されるヘッダサーチ部と、
を有することを特徴とする画像受信装置。
【００８４】
(付記１４) 前記検索効率計算部は、パターン中の連続する同一の記号からなる同一記号列の最後尾から検索を行うパターン検索方法、ＢＭ法、ＫＭＰ法の少なくとも１つに前記検索効率を導入することを特徴とする付記１３記載の画像受信装置。
【００８５】
（付記１５）前記検索処理部は、前記パターン検索方法、前記ＢＭ法、前記ＫＭＰ法を任意に組み合わせて、前記検索効率の値が大きい記号の順にもとづいて検索を行うことを特徴とする付記１４記載の画像受信装置。
【００８６】
【発明の効果】
以上説明したように、本発明の検索装置は、パターン中の同一記号列の最後尾の記号から、パターンとテキストとの記号照合を行い、記号照合が失敗した場合には、同一記号の数だけパターンをシフトする構成とした。これにより、無駄な記号照合を避けることができ、高速に検索を行うことが可能になる。
【図面の簡単な説明】
【図１】検索装置の原理図である。
【図２】ＫＭＰ法を説明するための図である。
【図３】ＢＭ法を説明するための図である。
【図４】ＢＭ法を説明するための図である。
【図５】検索手順を示す図である。
【図６】従来のＢＭ法の検索手順を示す図である。
【図７】検索装置の構成を示す図である。
【図８】同一記号列の最大長をカウントする際のプログラムを示す図である。
【図９】パターン検索方法の処理手順を示すフローチャートである。
【図１０】検索方法の処理手順を示すフローチャートである。
【図１１】パターン検索方法及びＢＭ法を組み合わせて検索効率を導入したときの検索方法の処理手順を示すフローチャートである。
【図１２】パターン検索方法及びＫＭＰ法を組み合わせて検索効率を導入したときの検索方法の処理手順を示すフローチャートである。
【図１３】パターン検索方法とＢＭ法とＫＭＰ法を組み合わせて検索効率を導入したときの検索方法の処理手順を示すフローチャートである。
【図１４】画像通信システムの構成を示す図である。
【符号の説明】
１０検索装置
１１同一記号列認識部
１２検索処理部

[0001]
[Technical Field of the Invention]
The present invention relates to a search device, and more particularly to a search device that searches for a pattern.
[0002]
[Prior art]
With the development of the information society in recent years, the amount of information handled has grown ever larger, and faster symbol string (character string) search processing is strongly demanded in order to make multimedia services more advanced and more widely available. In search processing, a specified symbol string called the pattern is searched for within a given specific symbol string called the text.
[0003]
The simplest conventional search algorithm shifts the pattern by one symbol and repeats the matching whenever a mismatch is detected during the search. Representative algorithms that enable faster processing are the KMP method proposed by Knuth, Morris, and Pratt and the BM method proposed by Boyer and Moore.
[0004]
As their basic search operation, the KMP method and the BM method move the pattern by a large amount, based on certain conditions, when a mismatch is detected while matching the symbols of the pattern (the KMP method matches from the head of the pattern in the front-to-back direction, and the BM method matches from the tail of the pattern in the back-to-front direction). Searching in this way improves search performance compared with the simple method, which always shifts the pattern by one symbol when a symbol mismatch occurs.
[0005]
[Problems to be solved by the invention]
In the field of information communication, searches are often performed for patterns whose structure includes consecutive identical symbols. Examples of patterns in which the same symbol continues include the HDLC flag and headers present in streams compressed with MPEG.
[0006]
When search processing is performed on patterns having such a specific structure with the conventional KMP method or BM method described above, useless symbol matching occurs, and search performance cannot be improved.
[0007]
Even the KMP method and the BM method, which are representative fast search algorithms, are not effective for patterns of every structure, and their search performance for patterns containing a same symbol string is not necessarily the best.
[0008]
The present invention has been made in view of these points, and an object of the present invention is to provide a search device that performs a search process for a pattern having a structure including consecutive identical symbols at high speed.
[0009]
[Means for Solving the Problems]
In the present invention, in order to solve the above problem, there is provided a search device 10 for searching for a pattern, as shown in FIG. 1, characterized by having a same symbol string recognition unit 11 that recognizes a same symbol string consisting of consecutive identical symbols in the pattern, and a search processing unit 12 that performs symbol matching between the pattern and the text.
[0010]
Here, the search processing unit 12 assigns the numbers 1 to n from the head to the tail of the same symbol string and starts symbol matching from the last symbol of the same symbol string toward its head. When symbol matching fails on the symbol numbered k (1 ≦ k ≦ n) within the same symbol string, the search processing unit 12 shifts the pattern by k symbols from the failed position toward the tail and performs symbol matching again from the last symbol of the same symbol string.
[0011]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, embodiments of the present invention will be described with reference to the drawings. FIG. 1 is a diagram showing the principle of the search device. The search device 10 searches for a pattern in text (or binary data).
[0012]
The same symbol string recognition unit 11 recognizes a same symbol string, that is, a string of consecutive identical symbols within the pattern. The search processing unit 12 starts symbol matching (also simply called matching) between the pattern and the text from the last symbol of the same symbol string. If symbol matching of the last symbol fails, the pattern is shifted by the number of identical symbols and symbol matching is performed again to continue the search.
[0013]
For example, if the pattern is “000B”, the same symbol string of this pattern is “000”. The search starts with the left ends of the text and the pattern aligned, and symbol matching against the text begins with the last symbol “0” of “000”. If symbol matching of the last symbol “0” fails (the text symbol in the figure is “C”, so it does not match “0”), the pattern is shifted by the length of the same symbol string (here, three symbols) and symbol matching is performed again. The detailed operation will be described later with reference to FIG. 5.
[0014]
Next, an outline of the KMP method and the BM method will be described. FIG. 2 is a diagram for explaining the KMP method. To find the position where the pattern appears in the text, the left ends of the text and the pattern are first aligned, and the pattern is then compared with the text one symbol at a time from the head (left end) of the pattern toward the back.
[0015]
Then, when a symbol mismatch occurs (○ and △ in the figure), if the same substring as the one lying to the left of that position in the pattern (thick solid frame in the figure) also exists at the head of the pattern, the pattern is shifted to the right by the amount shown in the figure. The longest such substring is chosen.
[0016]
For example, when searching for the pattern “EFGHEF2” in the text of the figure, suppose matching succeeds through “EFGHEF” and a mismatch occurs at the final “2”. The same substring as the substring “EF” to the left of “2” also exists at the head of the pattern, so for the next pattern position the head of the pattern can be shifted to position P1.
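The shift in the “EFGHEF2” example can be reproduced with the textbook KMP prefix function. The following is a minimal Python sketch under stated assumptions, not the implementation of this document: the function names `prefix_function` and `kmp_shift` and the 0-based indexing are introduced here for illustration only.

```python
def prefix_function(pat):
    """pi[i] = length of the longest proper prefix of pat[:i+1]
    that is also a suffix of pat[:i+1] (textbook KMP table)."""
    pi = [0] * len(pat)
    k = 0
    for i in range(1, len(pat)):
        # Fall back to shorter borders until the next symbol extends one.
        while k > 0 and pat[i] != pat[k]:
            k = pi[k - 1]
        if pat[i] == pat[k]:
            k += 1
        pi[i] = k
    return pi


def kmp_shift(pat, j):
    """Shift applied when pat[j] (0-based) mismatches after pat[:j] matched.

    The matched part pat[:j] keeps its longest border aligned with the text,
    so the pattern moves by j minus the border length.
    """
    if j == 0:
        return 1
    return j - prefix_function(pat)[j - 1]


# Mismatch at the final "2" (index 6) after "EFGHEF" matched:
# the border "EF" has length 2, so the pattern moves 6 - 2 = 4 symbols.
print(kmp_shift("EFGHEF2", 6))
```

With the pattern “EFGHEF2”, the call prints 4, which places the leading “EF” of the pattern under the “EF” already matched in the text.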
[0017]
FIGS. 3 and 4 are diagrams for explaining the BM method. To find the position where the pattern appears in the text, the left ends of the text and the pattern are first aligned, and the pattern is then compared with the text one symbol at a time from the tail (right end) of the pattern toward the front.
[0018]
In the BM method, there are two algorithms for how much the pattern is shifted to the right when a symbol mismatch occurs (circles and triangles in the figure). FIG. 3 shows the first algorithm, and FIG. 4 shows the second algorithm.
[0019]
In FIG. 3, the first algorithm resembles the KMP method: if the same substring as the one lying between the right end of the pattern and the position just before the mismatch (thick solid frame in the figure) also exists further to the left, the pattern is shifted to the right so that this substring overlaps the corresponding substring of the text.
[0020]
For example, when searching for the pattern “NEFGHEF2” in the text of the figure, suppose matching succeeds through “EF2” and a mismatch occurs at the next symbol “H”. The same substring as the substring “EF” at the position before “H” also exists on the left side of the pattern, so the pattern can be shifted to the position where these substrings overlap (the head of the pattern comes to position P2).
[0021]
In FIG. 4, the second algorithm looks at the text symbol at the mismatch: when the same symbol exists to the left in the pattern, the pattern is shifted so that this symbol overlaps that text symbol.
[0022]
For example, when searching for the pattern “NKFGHEF2” in the text of the figure, suppose matching succeeds through “EF2” and a mismatch occurs at the next symbol “H”, where the text symbol at the mismatch is “K”. Since “K” also exists on the left side of the pattern, the pattern can be shifted to the position where the “K” of the pattern and the “K” of the text overlap (the head of the pattern comes to position P3).
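The second BM algorithm as described here can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `bm_second_shift` and the 0-based indexing are assumptions, and the case in which the text symbol does not occur to the left of the mismatch (the pattern is moved completely past the mismatched text symbol) is taken from the four-character shift described for FIG. 6. Textbook BM implementations usually precompute a table instead of scanning.

```python
def bm_second_shift(pat, j, text_symbol):
    """Shift when pat[j] (0-based) mismatches `text_symbol`.

    Scan leftwards from the mismatch position for the nearest occurrence
    of the text symbol and align it with the text. If the symbol does not
    occur to the left, move the pattern past the mismatched text symbol.
    """
    for i in range(j - 1, -1, -1):
        if pat[i] == text_symbol:
            return j - i
    return j + 1


# Pattern "NKFGHEF2": mismatch at "H" (index 4), text symbol "K".
# "K" sits at index 1, so the pattern moves 4 - 1 = 3 symbols.
print(bm_second_shift("NKFGHEF2", 4, "K"))
```

For the “NKFGHEF2” example the call prints 3. For the pattern “000B” with a mismatch at “B” against the text symbol “A”, the same function returns 4.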
[0023]
In general, it is known that the search speed of the BM method is higher than that of the KMP method. In the BM method, it is known that the search speed of the second algorithm is higher than that of the first algorithm.
[0024]
Next, the search operation of the search device 10 will be described in detail in comparison with the conventional BM method. Hereinafter, the search operation of the search device 10 is also referred to as the pattern search method.
[0025]
FIG. 5 is a diagram showing the search procedure. The text is the symbol string shown in the figure, and the pattern is “000B”. First, attention is paid to the longest run of consecutive identical symbols in the pattern, the same symbol string “000”. Then, starting with the left ends of the text and the pattern aligned, matching against the text begins with the last symbol “0” of “000” in the pattern “000B”.
[0026]
If a symbol mismatch is found as a result of the matching, only the character “0”, on which matching has just failed, exists toward the head of the pattern, so it is clear that shifting by one or two characters and matching again would not produce a match.
[0027]
Therefore, the pattern is shifted by three characters (the length of the same symbol string), and the same symbol matching is repeated. In the figure, the rightmost character of “000” matches at the third shift. Next, therefore, the second character of “000” is compared with the text.
[0028]
If a symbol mismatch is found here, it is clear that shifting by one character and matching again would not produce a match, so the pattern is shifted by two characters and symbol matching is resumed. In general, when the numbers 1 to n are assigned from the head to the tail of the same symbol string and symbol matching of the k-th symbol (2 ≦ k ≦ n) fails, the pattern is shifted by k symbols.
[0029]
By performing such search processing, the number of shifts can be greatly reduced. In the example shown in the figure, the pattern “000B” is detected in the text at the third shift (at this third shift position, matching proceeds in the order of the last “0” of “000”, the middle “0”, and the leading “0”, and then “B” is matched, whereupon the pattern is detected).
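The pattern search method described above can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the program of this document: the names `longest_run` and `run_search` are hypothetical, the symbols outside the same symbol string are matched left to right after the run, and a shift of one symbol is assumed both when a symbol outside the run mismatches and after a full match. The text used in the example is likewise invented, since the text of FIG. 5 is not reproduced here.

```python
def longest_run(pat):
    """Return (start, length) of the longest run of identical symbols."""
    best_start, best_len = 0, 0
    i = 0
    while i < len(pat):
        j = i
        while j < len(pat) and pat[j] == pat[i]:
            j += 1
        if j - i > best_len:
            best_start, best_len = i, j - i
        i = j
    return best_start, best_len


def run_search(text, pat):
    """Return every position at which pat occurs in text."""
    start, n = longest_run(pat)
    # Matching order: the run from its tail to its head, then the rest.
    order = list(range(start + n - 1, start - 1, -1))
    order += [i for i in range(len(pat)) if not start <= i < start + n]
    # A failure on the k-th symbol of the run (1-based) shifts by k.
    shift_on_fail = {start + k - 1: k for k in range(1, n + 1)}

    hits = []
    s = 0
    while s + len(pat) <= len(text):
        for idx in order:
            if text[s + idx] != pat[idx]:
                # Assumed: symbols outside the run shift by one symbol.
                s += shift_on_fail.get(idx, 1)
                break
        else:
            hits.append(s)
            s += 1  # assumed: continue one symbol further after a match
    return hits


# Hypothetical text: shifts of 3 and 3 reach the match at position 6.
print(run_search("C0C00A000B", "000B"))
```

The call prints `[6]`: the last “0” of “000” fails against “C” and then against “A”, each time shifting the pattern by three symbols, and the third position matches in the order tail, middle, head, “B”. A shift of k is safe because any smaller shift would again place a symbol of the run on the text symbol that has just failed.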
[0030]
FIG. 6 is a diagram showing the search procedure of the conventional BM method. The symbol strings of the text and the pattern are the same as in FIG. 5. The figure shows the case of the second algorithm of the BM method. In the BM method, the search starts from the last character of the pattern.
[0031]
If the symbols do not match, the mismatched symbol in the text is examined, and if the same symbol exists to the left in the pattern, the pattern is shifted so that this symbol comes to the position where the mismatch was found. Here, the mismatched symbol in the text is “0”, so the pattern is shifted so that this “0” overlaps the second symbol from the end of “000B”, which is “0”.
[0032]
Such processing is performed from the 0th shift to the 2nd shift and from the 3rd shift to the 6th shift. In going from the 2nd shift to the 3rd shift, “A” in the text and “B” in the pattern do not match, and “A” does not exist to the left in the pattern, so the pattern is shifted by four characters.
[0033]
As shown in the figure, the BM method requires a total of seven shifts. Thus, in the case of patterns including the same symbol string, it can be seen that the search process by the search device 10 is faster.
[0034]
Next, a search device that introduces search efficiency will be described. FIG. 7 is a diagram showing the configuration of the search device. The search device 20 includes a search efficiency calculation unit 21 and a search processing unit 22. Hereinafter, the search operation of the search device 20 is also simply referred to as the search method.
[0035]
The search efficiency calculation unit 21 calculates the search efficiency for each symbol of the pattern. The search efficiency is the ratio of the expected value of the shift amount to the number of symbol matches, where the expected value of the shift amount is the sum, over the values the shift amount can take when symbol matching fails, of each shift amount multiplied by the probability that it occurs. The search processing unit 22 performs the search by matching symbols in descending order of search efficiency.
[0036]
Next, the search efficiency and the operation of the search device 20 will be described in detail. To increase the search speed, the shift amount must be made large so that useless symbol matching is eliminated. This requires an index that indicates from which symbol in the pattern matching should start so that a large shift amount is obtained when matching fails. This index is the search efficiency. The search efficiency is (expected value of the shift amount) / (number of symbol matches).
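Written in symbols, the definition above reads as follows. The notation is an assumption introduced here for illustration and does not appear in the original: Σ is the set of symbols that can appear in the text, P(c) is the probability that symbol c appears in the text, s_i(c) is the shift amount when pattern symbol (i) is compared with text symbol c (0 when the symbols match), and n_i is the number of symbol matches performed up to and including symbol (i).

```latex
E[(i)] \;=\; \frac{\displaystyle\sum_{c \in \Sigma} P(c)\, s_i(c)}{n_i}
```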
[0037]
First, a case where search efficiency is introduced into the pattern search method will be described. As a specific example, consider a case where the text is composed of three types of symbols A, B, and C, and the pattern is “AAABCAB”. Further, numbers (1), (2),..., (7) are assigned to the symbols in the pattern from the beginning of the pattern.
[0038]
The pattern search method searches by focusing on the same symbol string, so the search efficiencies of symbols (3), (2), and (1) of “AAA” are obtained. Writing the search efficiency as E[(n)] (where n is the number assigned to the pattern symbol), E[(3)] to E[(1)] are given by the following equations (E[(3)] is the search efficiency for the last “A” of “AAA”, E[(2)] is the search efficiency for the middle “A” of “AAA”, and E[(1)] is the search efficiency for the leftmost “A” of “AAA”).
[0039]
[Expression 1]
E[(3)] = (P(A) × 0 + P(B) × 3 + P(C) × 3) / 1 = 2 … (1a)
E[(2)] = (P(A) × 0 + P(B) × 2 + P(C) × 2) / 2 = 4/6 … (1b)
E[(1)] = (P(A) × 0 + P(B) × 1 + P(C) × 1) / 3 = 2/9 … (1c)
In each equation, P(A) is the probability that the symbol “A” appears in the text, P(B) is the probability that the symbol “B” appears in the text, and P(C) is the probability that the symbol “C” appears in the text. Here, it is assumed that P(A) = P(B) = P(C) = 1/3.
[0040]
The numerator of each equation for E[(n)] is the expected value of the shift amount, expressed as a weighted average in which each value of the shift amount, a random variable, is weighted by its probability. For example, consider P(A) × 0 in Equation (1a): when the last “A” of “AAA” is matched against the text, the symbols match if the text symbol is “A”. In this case no shift operation takes place, so the shift amount is 0. That is, for the shift amount to be 0 when the last “A” of “AAA” is matched against the text, “A” must appear in the text at that position, and the probability of this is P(A).
[0041]
Likewise, consider P(B) × 3 and P(C) × 3 in Equation (1a): when the last “A” of “AAA” is matched against the text, the symbols do not match if the text symbol is “B” or “C”. In this case the shift operation moves the pattern by three symbols, as described above with reference to FIG. 5, so the shift amount is 3. That is, for the shift amount to be 3 when the last “A” of “AAA” is matched against the text, “B” or “C” must appear in the text at that position, and the probabilities of this are P(B) and P(C).
[0042]
Therefore, since the expected value is the sum of the value of the random variable multiplied by the probability that each value is realized, the numerator (the expected value of the shift amount) of Equation (1a) is P (A) × 0 + P (B) × 3 + P (C) × 3. The same concept applies to formula (1b) and formula (1c).
[0043]
The denominator of each equation, on the other hand, is the number of symbol matches between the pattern and the text. The last “A” of “AAA” is matched first, so the denominator in Equation (1a) is 1.
[0044]
Similarly, the middle “A” of “AAA” is matched after the last “A” has been matched, so two symbol matches between the pattern and the text are required in order to match the middle “A”. The denominator in Equation (1b) is therefore 2. Further, the leftmost “A” of “AAA” is matched after the last “A” and the middle “A” have been matched, so three symbol matches between the pattern and the text are required in order to match the leftmost “A”. The denominator in Equation (1c) is therefore 3.
[0045]
Comparing the search efficiency values calculated in this way gives (1) < (2) < (3). A larger search efficiency value indicates that a larger shift amount can be obtained when matching fails, so symbol matching should be performed in descending order of value, (3), (2), (1) (that is, in the order of the last, middle, and leftmost symbols of “AAA”).
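Equations (1a) to (1c) can be checked with exact fractions. The following Python sketch is illustrative only: the function name `run_efficiencies` and its parameters are assumptions, and it covers only the symbols of the same symbol string, matched from the tail to the head.

```python
from fractions import Fraction


def run_efficiencies(run_len, probs, run_symbol):
    """Search efficiency of each symbol k = n..1 of a same symbol string.

    probs maps each text symbol to its probability of appearing.
    Symbol k is the (n - k + 1)-th symbol to be matched, and a mismatch
    on it shifts the pattern by k symbols (a match shifts by 0).
    """
    p_mismatch = sum(p for s, p in probs.items() if s != run_symbol)
    result = {}
    for n_matches, k in enumerate(range(run_len, 0, -1), start=1):
        expected_shift = p_mismatch * k
        result[k] = expected_shift / n_matches
    return result


third = Fraction(1, 3)
eff = run_efficiencies(3, {"A": third, "B": third, "C": third}, "A")
for k in (3, 2, 1):
    print(f"E[({k})] = {eff[k]}")
```

The loop prints 2, 2/3, and 2/9 for (3), (2), and (1). `Fraction` reduces 4/6 to 2/3, so the values agree with Equations (1a) to (1c) and give the same ordering (1) < (2) < (3).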
[0046]
Next, a case where the search efficiency is introduced into the second algorithm of the BM method will be described. As in the example above, the text is composed of three types of symbols, A, B, and C, and the pattern is “AAABCAB”. Numbers (1), (2), ..., (7) are assigned from the beginning of the pattern. The search efficiencies E[(7)] to E[(1)] based on the BM method for each symbol of the pattern are expressed by the following expressions.
[0047]
[Expression 2]
E[(7)] = (P(A) × 1 + P(B) × 0 + P(C) × 2) / 1 = 1      (2a)
E[(6)] = (P(A) × 0 + P(B) × 2 + P(C) × 1) / 2 = 1/2    (2b)
E[(5)] = (P(A) × 2 + P(B) × 1 + P(C) × 0) / 3 = 1/3    (2c)
E[(4)] = (P(A) × 1 + P(B) × 0 + P(C) × 4) / 4 = 5/12   (2d)
E[(3)] = (P(A) × 0 + P(B) × 3 + P(C) × 3) / 5 = 2/5    (2e)
E[(2)] = (P(A) × 0 + P(B) × 2 + P(C) × 2) / 6 = 4/18   (2f)
E[(1)] = (P(A) × 0 + P(B) × 1 + P(C) × 1) / 7 = 2/21   (2g)
Here, Expression (2c) is explained as an example. For P(C) × 0, when the fifth symbol “C” from the beginning of “AAABCAB” is collated with the text and the text symbol is “C”, the result is a match. In this case no shift operation is performed, so the shift amount is 0. That is, when the fifth “C” from the beginning of “AAABCAB” is collated with the text, the shift amount is 0 if “C” appears in the text at that position, and the probability of this is P(C).
[0048]
For P(A) × 2, when the fifth “C” from the beginning of “AAABCAB” is collated with the text and the text symbol is “A”, the result is a mismatch. In this case, the pattern is shifted by two symbols as described above with reference to FIG. 4 (so that “A” in (3) is aligned with the mismatched position), and the shift amount is 2. That is, when the fifth “C” from the beginning of “AAABCAB” is collated with the text, the shift amount is 2 if “A” appears in the text at that position, and the probability of this is P(A).
[0049]
Further, for P(B) × 1, when the fifth “C” from the beginning of “AAABCAB” is collated with the text and the text symbol is “B”, the result is a mismatch. In this case, the pattern is shifted by one symbol as described above with reference to FIG. 4 (so that “B” in (4) is aligned with the mismatched position), and the shift amount is 1. That is, when the fifth “C” from the beginning of “AAABCAB” is collated with the text, the shift amount is 1 if “B” appears in the text at that position, and the probability of this is P(B).
[0050]
On the other hand, regarding the denominator of Expression (2c), the fifth “C” from the beginning of “AAABCAB” is the third symbol collated counting from the end of the pattern, so three symbol collations between the pattern and the text are required. The denominator is therefore 3. The other expressions follow the same concept.
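As a cross-check of Expressions (2a) to (2g), the following Python sketch recomputes each value from the shift amounts that appear in the expressions. The shift table is copied from the expressions rather than derived from the BM method, the uniform probability 1/3 is an assumption, and the names `BM_SHIFTS` and `bm_efficiency` are hypothetical.

```python
from fractions import Fraction

PATTERN = "AAABCAB"

# Shift amount for each pattern position (1-based) and each text symbol,
# read directly from Expressions (2a) to (2g). A shift of 0 means a match.
BM_SHIFTS = {
    7: {"A": 1, "B": 0, "C": 2},
    6: {"A": 0, "B": 2, "C": 1},
    5: {"A": 2, "B": 1, "C": 0},
    4: {"A": 1, "B": 0, "C": 4},
    3: {"A": 0, "B": 3, "C": 3},
    2: {"A": 0, "B": 2, "C": 2},
    1: {"A": 0, "B": 1, "C": 1},
}

# Assumed uniform symbol probabilities.
P = {s: Fraction(1, 3) for s in "ABC"}


def bm_efficiency(position):
    """Expected shift amount divided by the number of collations."""
    expected_shift = sum(P[s] * d for s, d in BM_SHIFTS[position].items())
    # The BM method collates from the tail, so position 7 is collated first.
    collations = len(PATTERN) - position + 1
    return expected_shift / collations


bm = {pos: bm_efficiency(pos) for pos in BM_SHIFTS}
```

Using `Fraction` keeps the results exact, so they can be compared directly with the fractions in the expressions (for example, 4/18 reduces to 2/9).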
[0051]
Here, the search efficiency values of Expressions (1a) to (1c) for the pattern search method are compared with those of Expressions (2a) to (2g) for the BM method. Expression (1a) = 2 is the largest, and Expression (2a) = 1 is the second largest.
[0052]
Therefore, when the pattern search method and the BM method are combined, the search performance can be improved by first collating (3) of the pattern “AAABCAB” by the pattern search method and then collating the symbols by the BM method in the order (7), (6), (5), (4), (2), (1).
[0053]
Next, a case where the pattern search method and the KMP method are combined and the search efficiency is introduced into them will be described. As an example, the text consists of three types of symbols, A, B, and C, and the pattern is “ABCABAAA”. Numbers (1), (2), ..., (8) are assigned from the beginning of the pattern. For each symbol of the same symbol string in the pattern, the search efficiencies E[(8)] to E[(6)] based on the pattern search method are expressed by the following expressions.
[0054]
[Expression 3]
E[(8)] = (P(A) × 0 + P(B) × 3 + P(C) × 3) / 1 = 2      (3a)
E[(7)] = (P(A) × 0 + P(B) × 2 + P(C) × 2) / 2 = 4/6    (3b)
E[(6)] = (P(A) × 0 + P(B) × 1 + P(C) × 1) / 3 = 2/9    (3c)
The concept behind each expression of the pattern search method has been described above, so its description is omitted here. The search efficiencies E[(1)] to E[(8)] based on the KMP method, on the other hand, are expressed by the following expressions.
[0055]
[Expression 4]
E[(1)] = (P(A) × 0 + P(B) × 1 + P(C) × 1) / 1 = 2/3    (4a)
E[(2)] = (P(A) × 1 + P(B) × 0 + P(C) × 1) / 2 = 2/6    (4b)
E[(3)] = (P(A) × 1 + P(B) × 1 + P(C) × 0) / 3 = 2/9    (4c)
E[(4)] = (P(A) × 0 + P(B) × 1 + P(C) × 1) / 4 = 2/12   (4d)
E[(5)] = (P(A) × 1 + P(B) × 0 + P(C) × 1) / 5 = 2/15   (4e)
E[(6)] = (P(A) × 0 + P(B) × 3 + P(C) × 3) / 6 = 2/6    (4f)
E[(7)] = (P(A) × 0 + P(B) × 1 + P(C) × 1) / 7 = 2/21   (4g)
E[(8)] = (P(A) × 0 + P(B) × 1 + P(C) × 1) / 8 = 2/24   (4h)
Here, Expression (4f) is explained as an example. For P(A) × 0, when the sixth symbol “A” from the beginning of “ABCABAAA” is collated with the text and the text symbol is “A”, the result is a match. In this case no shift operation is performed, so the shift amount is 0. That is, when the sixth “A” from the beginning of “ABCABAAA” is collated with the text, the shift amount is 0 if “A” appears in the text at that position, and the probability of this is P(A).
[0056]
For P(B) × 3 and P(C) × 3 in Expression (4f), when the sixth “A” from the beginning of “ABCABAAA” is collated with the text and the text symbol is “B” or “C”, the result is a mismatch. In this case, the pattern is shifted by three symbols as described above with reference to FIG. 2 (the partial string “AB” lies immediately to the left of the mismatched position, and this partial string also appears at the head of the pattern, so the pattern is shifted by three symbols), and the shift amount is 3. That is, when the sixth “A” from the beginning of “ABCABAAA” is collated with the text, the shift amount is 3 if “B” or “C” appears in the text at that position, and the corresponding probabilities are P(B) and P(C).
[0057]
On the other hand, regarding the denominator of Expression (4f), the sixth “A” from the head of “ABCABAAA” is the sixth symbol collated counting from the head of the pattern, so six symbol collations between the pattern and the text are required. The denominator is therefore 6. The other expressions follow the same concept.
[0058]
Here, the search efficiency values of Expressions (3a) to (3c) for the pattern search method are compared with those of Expressions (4a) to (4h) for the KMP method. Expression (3a) = 2 is the largest, and Expression (3b) = Expression (4a) = 2/3 is the next largest.
[0059]
Therefore, the search performance can be improved by first collating (8) and (7) of the pattern “ABCABAAA” by the pattern search method and then collating the symbols by the KMP method in the order (1), (2), (3), (4), (5), (6).
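The comparison of Expressions (3a) to (3c) with Expressions (4a) to (4h) can be reproduced with a short Python sketch. The shift amounts and collation counts are copied from the expressions, the probability 1/3 per symbol is an assumption, and the names `PATTERN_SEARCH`, `KMP`, and `ranked` are hypothetical.

```python
from fractions import Fraction

P = Fraction(1, 3)  # assumed probability of each of the symbols A, B, C

# position -> ((shift for text symbol A, B, C), number of collations),
# read from Expressions (3a) to (3c) and (4a) to (4h).
PATTERN_SEARCH = {8: ((0, 3, 3), 1), 7: ((0, 2, 2), 2), 6: ((0, 1, 1), 3)}
KMP = {
    1: ((0, 1, 1), 1), 2: ((1, 0, 1), 2), 3: ((1, 1, 0), 3), 4: ((0, 1, 1), 4),
    5: ((1, 0, 1), 5), 6: ((0, 3, 3), 6), 7: ((0, 1, 1), 7), 8: ((0, 1, 1), 8),
}


def efficiency(shifts, collations):
    """Expected shift amount divided by the number of collations."""
    return sum(P * d for d in shifts) / collations


candidates = [("pattern search", pos, efficiency(*v)) for pos, v in PATTERN_SEARCH.items()]
candidates += [("KMP", pos, efficiency(*v)) for pos, v in KMP.items()]

# Highest search efficiency first; ties keep their listed order.
ranked = sorted(candidates, key=lambda item: item[2], reverse=True)
```

The first entry of `ranked` is position (8) under the pattern search method, followed by the tie at 2/3 between position (7) under the pattern search method and position (1) under the KMP method.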
[0060]
Next, a case where the pattern search method is implemented in software will be described. FIG. 8 is a diagram showing a program for counting the maximum length of the same symbol string. The program 200 counts the maximum length of the same symbol string. For example, it returns 3 when the pattern is “AAAB” and 4 when the pattern is “AABBBBCC”.
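The program of FIG. 8 is not reproduced in the text, so the following Python function is only a sketch of a routine with the same behavior. The name `max_run_length` is hypothetical.

```python
def max_run_length(pattern):
    """Return the length of the longest run of identical consecutive symbols."""
    longest = 0
    current = 0
    previous = None
    for symbol in pattern:
        # Extend the current run, or start a new run of length 1.
        current = current + 1 if symbol == previous else 1
        longest = max(longest, current)
        previous = symbol
    return longest
```

For the two examples given above, `max_run_length("AAAB")` evaluates to 3 and `max_run_length("AABBBBCC")` evaluates to 4.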
[0061]
FIG. 9 is a flowchart showing the processing procedure of the pattern search method. Txt indicates text and Pat indicates a pattern.
[S1] The order in which symbols are collated is stored in the table T[i]. For example, for the pattern “000B”, numbers (0) to (3) are assigned from the head of the pattern. Since collation starts from the tail of the same symbol string, T[0] = (2), T[1] = (1), T[2] = (0), and T[3] = (3).
[S2] The shift amount to apply when symbol collation fails is stored in the table S[T[i]]. For example, for the pattern “000B”, if the tail of the same symbol string does not match, the pattern is shifted by three symbols, so S[T[0]] = 3.
[S3] As initial settings, i = 0 and shift = 0.
[S4] The loop of steps S5 to S7 below is repeated until i reaches the number of symbols in the pattern (the pattern length).
[S5] The text symbol Txt[shift + T[i]] at the position offset by shift is compared with the pattern symbol Pat[T[i]]. If they match, go to step S7.
[S6] The pattern is shifted by the shift amount stored in advance (shift = shift + S[T[i]], i = 0).
[S7] Move to the next symbol to be collated (i++).
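Steps S1 to S7 can be sketched in Python as follows. This is an interpretation rather than the patent's program, and it makes several assumptions that the flowchart text does not state: only the first longest same symbol string is used, symbols outside it are collated from left to right and shift the pattern by one symbol on a mismatch, collation restarts from T[0] after every shift, and the search returns -1 when the text is exhausted. The function names are hypothetical.

```python
def longest_run(pat):
    """Return (start, length) of the first longest run of identical symbols."""
    best_start, best_len = 0, 0
    i = 0
    while i < len(pat):
        j = i
        while j < len(pat) and pat[j] == pat[i]:
            j += 1
        if j - i > best_len:
            best_start, best_len = i, j - i
        i = j
    return best_start, best_len


def build_tables(pat):
    """S1 and S2: collation order T and shift amounts S (indexed by position)."""
    start, n = longest_run(pat)
    run = list(range(start + n - 1, start - 1, -1))   # tail of the run first
    rest = [p for p in range(len(pat)) if not start <= p < start + n]
    T = run + rest
    S = [1] * len(pat)                # assumed shift of 1 outside the run
    for k in range(1, n + 1):         # k-th run symbol (from the head) shifts by k
        S[start + k - 1] = k
    return T, S


def search(txt, pat):
    """Return the first position of pat in txt, or -1 if it does not occur."""
    T, S = build_tables(pat)                      # S1, S2
    i, shift = 0, 0                               # S3
    while i < len(pat):                           # S4
        if shift + len(pat) > len(txt):           # text exhausted (assumption)
            return -1
        if txt[shift + T[i]] == pat[T[i]]:        # S5
            i += 1                                # S7
        else:
            shift += S[T[i]]                      # S6: shift for the failed symbol
            i = 0                                 # then restart the collation
    return shift
```

For the pattern “000B”, `build_tables` yields T = [2, 1, 0, 3] and S[T[0]] = 3, matching the example in steps S1 and S2. Shifting by k at the k-th symbol of the run is safe because the k - 1 smaller shifts would each align another symbol of the same run with the mismatched text symbol.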
[0062]
Next, the search method will be described with reference to a flowchart. FIG. 10 is a flowchart showing the processing procedure of the search method.
[S11] For symbol collation between the pattern and the text, the search efficiency is calculated. The search efficiency is the ratio of the expected value of the shift amount to the number of symbol collations, where the expected value is the sum, over the shift amounts that the shift operation can take when symbol collation fails, of each shift amount multiplied by the probability that it occurs.
[S12] A search is performed by collating symbols in descending order of search efficiency.
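The quantity calculated in step S11 can be written compactly as follows. The notation is introduced here only for illustration and does not appear in the original: Σ is the set of symbols, P(c) is the probability that symbol c appears in the text, d_s(c) is the shift amount when pattern symbol s is collated against text symbol c (0 on a match), and n_s is the number of symbol collations performed up to and including s.

```latex
E[s] = \frac{\sum_{c \in \Sigma} P(c)\, d_s(c)}{n_s}
\qquad\text{e.g.}\qquad
E[(3)] = \frac{\tfrac{1}{3}\cdot 0 + \tfrac{1}{3}\cdot 3 + \tfrac{1}{3}\cdot 3}{1} = 2
```

The worked value is the tail “A” of “AAA” under the assumption P(A) = P(B) = P(C) = 1/3, which agrees with the value 2 given for Expression (1a).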
[0063]
FIG. 11 is a flowchart showing the processing procedure of the search method when the search efficiency is introduced by combining the pattern search method and the BM method.
[S21] The same symbol string is specified from the pattern.
[S22] The search efficiency is calculated for the symbols in the same symbol string.
[S23] The search efficiency of the BM method is calculated for all symbols.
[S24] The symbol collation order is determined from the search efficiency.
[S25] A search is executed based on the symbol collation order.
[0064]
FIG. 12 is a flowchart showing the processing procedure of the search method when the search efficiency is introduced by combining the pattern search method and the KMP method.
[S31] The position of the same symbol string is specified from the pattern.
[S32] The search efficiency is calculated for the symbols in the same symbol string.
[S33] The search efficiency of the KMP method is calculated for all symbols.
[S34] The symbol collation order is determined from the search efficiency.
[S35] A search is executed based on the symbol collation order.
[0065]
FIG. 13 is a flowchart showing the processing procedure of the search method when the search efficiency is introduced by combining the pattern search method, the BM method, and the KMP method.
[S41] The position of the same symbol string is specified from the pattern.
[S42] The search efficiency is calculated for the symbols in the same symbol string.
[S43] The search efficiency of the BM method is calculated for all symbols.
[S44] The search efficiency of the KMP method is calculated for all symbols.
[S45] The symbol collation order is determined from the search efficiency.
[S46] A search is executed based on the symbol collation order.
[0066]
As described above, according to the search device 20 and the search method, when searching text or binary data, the symbol matching procedure is dynamically changed using the search efficiency to avoid useless symbol matching. This makes it possible to search for a pattern at a higher speed than when searching only by the conventional KMP method or BM method.
[0067]
Next, the image receiving apparatus will be described. FIG. 14 is a diagram showing the configuration of the image communication system. The image communication system 100 includes an image transmitting apparatus 120 and an image receiving apparatus 110 that are connected via a network 101.
[0068]
In the image receiving apparatus 110, the stream data receiving unit 111 receives and stores stream data, such as MPEG4 data, transmitted from the image transmitting apparatus 120. The header search unit 112 has the function of the search device 10 or the search device 20, searches the stream data for header information (which includes a same symbol string), and outputs header position information.
[0069]
The information separation unit 113 separates the header information and the image information from the stream signal based on the header position information. The header processing unit 114 processes the header information, and the decoder 115 decodes the image information and displays it on the screen.
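The data flow between the header search unit 112 and the information separation unit 113 can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the header pattern is taken to be the 3-byte start code prefix 00 00 01 used by MPEG-4 video streams, the "header" is reduced to a 4-byte start code, and the built-in `bytes.find` stands in for the search device, so the patent's search procedure itself is not shown. The function names are hypothetical.

```python
# Start code prefix used as the header pattern (contains the run 00 00).
START_CODE_PREFIX = b"\x00\x00\x01"


def find_headers(stream, pattern=START_CODE_PREFIX):
    """Stand-in for header search unit 112: return all header positions."""
    positions = []
    pos = stream.find(pattern)
    while pos != -1:
        positions.append(pos)
        pos = stream.find(pattern, pos + 1)
    return positions


def separate(stream, positions, header_len=4):
    """Stand-in for information separation unit 113.

    Splits the stream into (header, payload) pairs using the header
    position information. header_len=4 is an assumed simplification.
    """
    units = []
    for idx, start in enumerate(positions):
        end = positions[idx + 1] if idx + 1 < len(positions) else len(stream)
        header = stream[start:start + header_len]
        payload = stream[start + header_len:end]
        units.append((header, payload))
    return units


stream = b"\xff\x00\x00\x01\xb0ab\x00\x00\x01\xb6cd"
positions = find_headers(stream)
units = separate(stream, positions)
```

In a real receiver the headers would go to the header processing unit 114 and the payloads to the decoder 115; here they are simply returned as byte strings.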
[0070]
In the above description, the invention is applied to an image communication system, but it can also be applied to search systems and search software in various other fields, such as database or file searches and synchronization flag detection in communications such as HDLC.
[0071]
(Supplementary note 1) A search device for searching for a pattern, comprising:
a same symbol string recognition unit that recognizes a same symbol string consisting of consecutive identical symbols in the pattern; and
a search processing unit that performs symbol collation between the pattern and a text starting from the last symbol of the same symbol string and, when the symbol collation of the last symbol fails, shifts the pattern by the number of the identical symbols and performs symbol collation again to carry out the search.
[0072]
(Supplementary note 2) The search device according to supplementary note 1, wherein the search processing unit assigns numbers 1 to n from the head to the tail of the same symbol string and, when performing symbol collation from the last symbol of the same symbol string toward the head, shifts the pattern by k symbols and performs symbol collation again when the symbol collation of the k-th (2 ≦ k ≦ n) symbol fails.
[0073]
(Supplementary note 3) A search device for searching for a pattern, comprising:
a search efficiency calculation unit that, for symbol collation between a pattern and a text, calculates a search efficiency that is the ratio of an expected value of a shift amount to the number of symbol collations, the expected value being the sum of each shift amount that the shift operation can take when symbol collation fails multiplied by the probability that the shift amount occurs; and
a search processing unit that performs a search by collating symbols in descending order of the search efficiency value.
[0074]
(Supplementary note 4) The search device according to supplementary note 3, wherein the search efficiency calculation unit introduces the search efficiency into at least one of a pattern search method that searches from the tail of a same symbol string consisting of consecutive identical symbols in the pattern, the BM method, and the KMP method.
[0075]
(Supplementary note 5) The search device described above, wherein the search processing unit arbitrarily combines the pattern search method, the BM method, and the KMP method and performs a search in descending order of the search efficiency value of the symbols.
[0076]
(Supplementary note 6) A pattern search method for searching for a pattern containing a same symbol string, comprising:
recognizing a same symbol string consisting of consecutive identical symbols in the pattern;
performing symbol collation between the pattern and a text starting from the last symbol of the same symbol string; and
when the symbol collation of the last symbol fails, shifting the pattern by the number of the identical symbols and performing symbol collation again to carry out the search.
[0077]
(Supplementary note 7) The pattern search method according to supplementary note 6, wherein numbers 1 to n are assigned from the head to the tail of the same symbol string and, when symbol collation is performed from the last symbol of the same symbol string toward the head, the pattern is shifted by k symbols and symbol collation is performed again when the symbol collation of the k-th (2 ≦ k ≦ n) symbol fails.
[0078]
(Supplementary note 8) A search method for searching for a pattern, comprising:
for symbol collation between a pattern and a text, calculating a search efficiency that is the ratio of an expected value of a shift amount to the number of symbol collations, the expected value being the sum of each shift amount that the shift operation can take when symbol collation fails multiplied by the probability that the shift amount occurs; and
performing a search by collating symbols in descending order of the search efficiency value.
[0079]
(Supplementary note 9) The search method according to supplementary note 8, wherein the search efficiency is introduced into at least one of a pattern search method that searches from the tail of a same symbol string consisting of consecutive identical symbols in the pattern, the BM method, and the KMP method.
[0080]
(Supplementary note 10) The search method according to supplementary note 9, wherein the pattern search method, the BM method, and the KMP method are arbitrarily combined and a search is performed in descending order of the search efficiency value of the symbols.
[0081]
(Supplementary note 11) An image receiving apparatus that performs image reception control, comprising:
a stream data receiving unit that receives image-compressed stream data; and
a header search unit composed of a same symbol string recognition unit that recognizes a same symbol string consisting of consecutive identical symbols in a pattern, and a search processing unit that performs symbol collation between the pattern and the stream data starting from the last symbol of the same symbol string and, when the symbol collation of the last symbol fails, shifts the pattern by the number of the identical symbols and performs symbol collation again to search for a header.
[0082]
(Supplementary note 12) The image receiving apparatus according to supplementary note 11, wherein the search processing unit assigns numbers 1 to n from the head to the tail of the same symbol string and, when performing symbol collation from the last symbol of the same symbol string toward the head, shifts the pattern by k symbols and performs symbol collation again when the symbol collation of the k-th (2 ≦ k ≦ n) symbol fails.
[0083]
(Supplementary note 13) An image receiving apparatus that performs image reception control, comprising:
a stream data receiving unit that receives image-compressed stream data; and
a header search unit composed of a search efficiency calculation unit that, for symbol collation between a pattern and the stream data, calculates a search efficiency that is the ratio of an expected value of a shift amount to the number of symbol collations, the expected value being the sum of each shift amount that the shift operation can take when symbol collation fails multiplied by the probability that the shift amount occurs, and a search processing unit that searches for a header by collating symbols in descending order of the search efficiency value.
[0084]
(Supplementary note 14) The image receiving apparatus according to supplementary note 13, wherein the search efficiency calculation unit introduces the search efficiency into at least one of a pattern search method that searches from the tail of a same symbol string consisting of consecutive identical symbols in the pattern, the BM method, and the KMP method.
[0085]
(Supplementary note 15) The image receiving apparatus according to supplementary note 14, wherein the search processing unit arbitrarily combines the pattern search method, the BM method, and the KMP method and performs a search in descending order of the search efficiency value of the symbols.
[0086]
[Effects of the Invention]
As described above, the search device according to the present invention performs symbol collation between a pattern and a text starting from the last symbol of the same symbol string in the pattern, and shifts the pattern when the collation fails. As a result, useless symbol collation can be avoided, and a search can be performed at high speed.
[Brief description of the drawings]
FIG. 1 is a principle diagram of the search device.
FIG. 2 is a diagram for explaining a KMP method.
FIG. 3 is a diagram for explaining a BM method.
FIG. 4 is a diagram for explaining a BM method.
FIG. 5 is a diagram showing the search procedure.
FIG. 6 is a diagram showing a search procedure of a conventional BM method.
FIG. 7 is a diagram illustrating a configuration of a search device.
FIG. 8 is a diagram showing a program for counting the maximum length of the same symbol string.
FIG. 9 is a flowchart showing a processing procedure of a pattern search method.
FIG. 10 is a flowchart showing a processing procedure of a search method.
FIG. 11 is a flowchart showing a processing procedure of a search method when search efficiency is introduced by combining a pattern search method and a BM method.
FIG. 12 is a flowchart showing a processing procedure of a search method when search efficiency is introduced by combining a pattern search method and a KMP method.
FIG. 13 is a flowchart showing a processing procedure of a search method when a search efficiency is introduced by combining a pattern search method, a BM method, and a KMP method.
FIG. 14 is a diagram illustrating a configuration of an image communication system.
[Explanation of symbols]
10 Search device
11 Same symbol string recognition unit
12 Search processing unit

Claims

A search device that searches for a pattern, comprising:
a same symbol string recognition unit that recognizes a same symbol string consisting of consecutive identical symbols in the pattern; and
a search processing unit that performs symbol collation between the pattern and a text,
wherein the search processing unit
assigns numbers 1 to n from the head to the tail of the same symbol string, starts symbol collation from the last symbol of the same symbol string toward the head of the same symbol string, and,
when the symbol collation of the symbol assigned the number k (1 ≦ k ≦ n) among the symbols in the same symbol string fails, shifts the pattern by k symbols from the failed position toward the tail and performs symbol collation again from the last symbol of the same symbol string.