JP4649512B2

JP4649512B2 - Character string search method and apparatus

Info

Publication number: JP4649512B2
Application number: JP2008500385A
Authority: JP
Inventors: 直広古川; 尚司池田; 康介小西; 健永崎
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2006-02-14
Filing date: 2006-02-14
Publication date: 2011-03-09
Anticipated expiration: 2026-02-14
Also published as: JPWO2007094078A1; WO2007094078A1

Description

The present invention relates to a method and an apparatus for searching character strings composed of character codes and handwritten strokes.

Character string search is the process of finding a character string (pattern), or search key, designated by a user or the like within a character string to be searched, such as text written in a document. The character strings used in character string search are generally represented as sequences of character codes conforming to standards such as SJIS and Unicode. In recent years, the spread of touch panels that can scan the movement of a pen on the screen, and of digital pens that can capture strokes written on paper (see, for example, International Publication No. 01/71473), has made it easy to digitize handwritten character strings. A character string search method is therefore needed for cases where the search target or the pattern character string (search key) is expressed as handwritten strokes.
As a character string search method, Japanese Patent Application Laid-Open No. 2005-251222 discloses Method 1, which collates the sampling point sequences that make up character string strokes against each other by dynamic programming.
Japanese Patent Application Laid-Open No. 10-055409 discloses Method 2, in which strokes are converted into sequences of fine segments based on line direction features and the like, and the segment sequences are collated with each other.
Further, Japanese Patent Laid-Open No. 2002-259912 discloses Method 3, in which subsets of strokes that may constitute a character (character segmentation candidates) are cut out from the stroke set constituting a character string, character identification means identifies a suitable character code from the shape of each candidate, and a candidate character lattice consisting of a sequence of pairs of character segmentation candidates and character identification candidates is created and collated.

However, Conventional Method 1 disclosed in Japanese Patent Laid-Open No. 2005-251222 and Conventional Method 2 disclosed in Japanese Patent Laid-Open No. 10-055409 depend on the order of the stroke sampling points, so collation becomes difficult when the stroke order differs, even for the same character shape. Search accuracy also drops when the search target and the handwritten keyword were written by different users. Furthermore, because these methods are stroke-based, they cannot search when the search target or the pattern character string is expressed in character codes rather than handwritten strokes.
Further, in Conventional Method 3 disclosed in Japanese Patent Laid-Open No. 2002-259912, the character segmentation candidates and character identification candidates are not always correct, and when they contain errors the character string cannot be searched correctly.
The present invention has been made in view of such problems.
That is, an object of the present invention is to provide a character string search method that can be applied even when the writer and the writing order are different between the search target and the pattern.
Another object of the present invention is to provide a character string search method that tolerates errors in the character segmentation candidates and character identification candidates.
A further object of the present invention is to provide a character string search method that permits any combination of character codes and handwritten strokes as the representation of the search target and pattern character strings.
In order to solve the above-described problems, typical inventions disclosed in the present application are as follows.
A character string search system comprising: a pattern character string input unit that receives a character string to be searched for from a user in either or both of two representation formats, character codes and handwritten strokes; a document management unit that manages information on documents containing character strings in either or both of those two formats; a character string collation unit that converts both the character strings in the document to be searched and the input character string into character segmentation graphs, and collates the graphs with each other to extract the locations where the pattern character string appears in the search target character strings; and a display unit that displays documents or character string search results.
According to the present invention, a user can find the character string they are looking for in a document with high accuracy. The user can also find writing made by people other than themselves. Furthermore, all character strings in a document can be searched without distinguishing between character codes and handwritten strokes.

FIG. 1 is a configuration diagram of a character string search system according to an embodiment of the present invention.
FIG. 2 is a diagram showing the types of character string search that can be realized in the embodiment of the present invention.
FIG. 3 is a diagram showing an example of a document handled in the embodiment of the present invention.
FIGS. 4A and 4B are diagrams showing the character segmentation graphs of the search target and the pattern, respectively, according to the embodiment of the present invention.
FIG. 5 is a diagram explaining the character segmentation graph matching technique according to the embodiment of the present invention.
FIGS. 6A and 6B are diagrams showing the data structure of character code information according to the embodiment of the present invention.
FIGS. 7A and 7B are diagrams showing the data structure of stroke information according to the embodiment of the present invention.
FIG. 8 is a diagram showing the data structure of document information according to the embodiment of the present invention.
FIGS. 9A to 9D are diagrams showing the data structure of the character segmentation graph according to the embodiment of the present invention.
FIG. 10 is a diagram showing the data structure of a character string search result according to the embodiment of the present invention.
FIG. 11 is a diagram showing the data structure of the character string search candidates according to the embodiment of the present invention.
FIG. 12 is an explanatory diagram of the preprocessing according to the embodiment of the present invention.
FIG. 13 is an explanatory diagram of the character segmentation graph creation processing according to the embodiment of the present invention.
FIG. 14 is an explanatory diagram of the character string search processing according to the embodiment of the present invention.
FIG. 15 is an explanatory diagram of the character segmentation graph matching processing according to the embodiment of the present invention.

First, a configuration example of the character string search system of the present invention is described. Next, the character string search processing flow, which finds a pattern character string designated by the user in the document information managed by the character string search system, is described.
As shown in FIG. 1, the character string search system of the present invention consists of the following parts. The pattern character string input unit 101 of the character string search device 100 is an input interface that can be connected to a keyboard, a digital pen 105, or the like, and is used to input the character string to be searched for (the pattern character string). The document management unit 102 is realized by a storage unit such as a hard disk and a control unit that controls reading from and writing to the storage unit, and manages documents consisting of character codes and information on strokes entered with a pen or the like.
The character string collation unit 103 is realized by having an arithmetic unit execute the program modules stored in the storage unit. It finds the places where the pattern character string appears in a document by generating character segmentation graphs from the documents stored in the storage unit and from the character string input through the pattern character string input unit, and by collating those graphs. Documents and character string search results are displayed by the display unit 104.
The pattern character string input unit 101 receives a pattern character string in one of two types of expression formats: a character code format by keyboard input or the like, and a stroke format by digital pen or touch panel input.
The document management unit 102 manages the information of each document with the data structure shown in the table 800 of FIG. 8. Item 801 is an ID for identifying the document information. Items 802 and 803 indicate the total number of character codes and the total number of strokes included in the document, respectively. Items 805 and 806 to 807 store the IDs of the character codes included in the document. Items 808 and 809 to 810 store the IDs of the strokes included in the document. In addition, the document holds character segmentation graphs (for example, FIGS. 4A and 4B) created from the strokes in the document by the preprocessing shown in FIG. 12, which is described later. Item 804 indicates the total number of character segmentation graphs included in the document. Items 811 and 812 to 813 store the IDs of the character segmentation graphs included in the document. The data structures of the character code, the stroke, and the character segmentation graph are described later.
The character string collation unit 103 compares the input pattern character string with the character strings in the document. The comparison is described in detail later with reference to FIGS. 5 and 14.
The display unit 104 displays a document accumulated in the document management unit, a character string search result, and the like on a monitor screen or the like for the user. The document may be printed on paper or the like.
An example of a document handled in the present invention is shown in FIG. 3. The document 300 of this example consists of a character string 301 made up of character codes and strokes 302 and 303, each written with a pen 310. In this embodiment, the digital pen disclosed in International Publication No. 01/71473 is used as the stroke acquisition means. The digital pen digitizes the strokes written on the paper surface.
Consider the strokes 303. The first half of the strokes can be interpreted in two ways: the first character may be read as “#” (sharp) or as the kanji “井” (the character for “well”). If the first character is read as “#”, the characters that follow can be regarded as digits, and here too there are two readings: the single digit “7” or the two digits “17”. If the first character is instead read as “井”, the following character is presumably a kanji or kana, so the second character can be read as “口” (mouth).
Thus interpretations such as (1) #7, (2) #17, and (3) 井口 (Iguchi) are all plausible, and even a human cannot tell which is intended from this portion alone. Character recognition is therefore inherently imperfect. Consequently, a method that recognizes a handwritten character string, converts it once into a character code string, and then collates character strings cannot search handwritten character strings with high accuracy.
The character code information is held in the data structure shown in the table 600 of FIG. 6B. For example, the character code 651 “エ” on the document 650 in FIG. 6A is represented by the information in items 601 to 607.
An item 601 is an ID for identifying the character code. An item 602 indicates a character code value. Items 603 to 607 are character attribute information, and items 603 and 604 indicate the coordinates of the upper left point of the character rectangle on the document in millimeters. An item 605 indicates a font type, an item 606 indicates a font size, and an item 607 indicates a style such as italic or bold.
The stroke information is held in the data structure shown in the table 700 of FIG. 7B. For example, the stroke 751 on the document 750 shown in FIG. 7A is represented by information described in the items 701 to 704. An item 701 is an ID for identifying the stroke. An item 702 indicates a stroke entry start time.
An item 703 indicates the number of sampling points existing in the stroke. Information on each sampling point is held in a table 730. Item 704 indicates a pointer pointing to the head of the sampling point set corresponding to the stroke. Each sampling point has XY coordinate values 731 and 732 on the document, and the difference between the sampling point entry time and the stroke entry start time described in the item 702 is held in 733. Information about each of the other strokes is also stored in a data structure as shown in Tables 710 and 720. Further, even when handwritten characters are input by an input device other than a digital pen, the stroke information can be managed by a table 730 that disassembles each stroke and holds the stroke ID, the number of sampling points, the pointer, and the coordinates.
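The character code and stroke records described above can be sketched as plain data classes. This is only an illustrative sketch: the class and field names are hypothetical, the mapping of items 603 and 604 to X and Y is an assumption, and the time unit is not stated in the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CharCode:
    """Character code record (table 600); field names are hypothetical."""
    code_id: int     # item 601: ID of the character code
    value: str       # item 602: character code value
    x_mm: float      # item 603: upper-left X in millimeters (assumed to be X)
    y_mm: float      # item 604: upper-left Y in millimeters (assumed to be Y)
    font: str        # item 605: font type
    font_size: float # item 606: font size
    style: str       # item 607: style such as italic or bold


@dataclass
class SamplePoint:
    """One sampling point of a stroke (table 730)."""
    x: float   # item 731: X coordinate on the document
    y: float   # item 732: Y coordinate on the document
    dt: float  # item 733: offset from the stroke entry start time


@dataclass
class Stroke:
    """Stroke record (table 700); the list replaces the count/pointer pair."""
    stroke_id: int     # item 701: ID of the stroke
    start_time: float  # item 702: stroke entry start time
    points: List[SamplePoint] = field(default_factory=list)  # items 703 and 704

    @property
    def num_points(self) -> int:
        return len(self.points)


def absolute_time(stroke: Stroke, index: int) -> float:
    """Entry time of a sampling point, rebuilt from the stored offset."""
    return stroke.start_time + stroke.points[index].dt


example_stroke = Stroke(1, 100.0, [SamplePoint(10.0, 20.0, 0.0),
                                   SamplePoint(11.0, 21.5, 0.02)])
```

Storing each sampling point's time as an offset from the stroke start, as the text describes, keeps the per-point values small; the absolute time is recovered by adding the offset to item 702.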
The character string search system 100 executes the preprocessing shown in FIG. 12 when a document to be searched is input. In step 1202, a document is input. The stroke information in the input document is acquired (step 1203), and the stroke layout analysis process of step 1204 divides the stroke set in the document into units of figures and character strings. For example, the strokes on the document 300 in FIG. 3 are divided into a stroke set 302, which forms an encircling mark and a callout line, and a stroke set 303, which forms an annotation character string.
Next, step 1205 is executed for each of the divided stroke sets to obtain a character segmentation graph. Because it is difficult to determine the correct character segmentation uniquely, a character segmentation graph represents multiple possible segmentation hypotheses as a single directed acyclic graph.
For example, for the portions 411, 412, and 413 in FIG. 4A, the right-hand part of the strokes 401 can be interpreted either as one character or as two characters. Because the graph 410 contains both the edge 411, which represents the one-character interpretation, and the two edges 412 and 413, which represent the two-character interpretation, it can express these multiple hypotheses. The resulting character segmentation graph is stored in the document management unit 102 in the data structure shown in FIGS. 9B to 9D, and its ID is stored in items 811 and 812 to 813 of the document data structure 800.
The data structure of the character segmentation graph is shown in the table 900 of FIG. 9D. The character segmentation graph shown as graph 950 is used as the example here. Item 901 is an ID for identifying the character segmentation graph. Item 902 indicates the ID of the document that contains the graph. Items 903 and 904 indicate the total numbers of nodes and edges of the graph, respectively. Items 905 to 908 hold the information of the first edge: item 905 is the number of the edge's start node and item 906 is the number of its end node. Each edge has a character segmentation candidate consisting of a set of strokes and a character identification candidate for that segmentation candidate, and items 907 and 908 hold their respective IDs. This edge information is repeated as many times as the number given in item 904.
The data structure of a character segmentation candidate is shown in the table 920. Item 921 is an ID for identifying the character segmentation candidate. Item 922 indicates the number of strokes included in the candidate. Items 923 to 924 indicate the IDs of the strokes included in the candidate.
The data structure of a character identification candidate is shown in the table 930. Item 931 is an ID for identifying the character identification candidate. Item 932 indicates the number of character identification results output by the character identification process. Items 933 and 934 hold the first-ranked character identification result: item 933 is the identified character code and item 934 is its similarity. In this example, the first-ranked result is “#” (sharp) and the second-ranked result is the kanji “井”, which trails by a narrow similarity margin of 0.02.
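As a rough sketch of how tables 900, 920, and 930 relate to each other, the records can be nested so that each edge carries its segmentation candidate and its ranked identification results. All class names, field names, IDs, stroke numbers, and similarity values below are hypothetical; only the 0.02 gap between “#” and “井” comes from the text.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SegCandidate:
    """Character segmentation candidate (table 920)."""
    cand_id: int           # item 921
    stroke_ids: List[int]  # items 923 onward; item 922 is len(stroke_ids)


@dataclass
class IdentCandidate:
    """Character identification candidate (table 930)."""
    cand_id: int                      # item 931
    results: List[Tuple[str, float]]  # (character code, similarity), best first


@dataclass
class Edge:
    start: int             # item 905: start node number
    end: int               # item 906: end node number
    seg: SegCandidate      # referenced by ID in item 907
    ident: IdentCandidate  # referenced by ID in item 908


@dataclass
class SegGraph:
    """Character segmentation graph (table 900)."""
    graph_id: int     # item 901
    doc_id: int       # item 902
    num_nodes: int    # item 903
    edges: List[Edge] # item 904 is len(edges)


def top_result(edge: Edge) -> Tuple[str, float]:
    """First-ranked identification result of an edge."""
    return edge.ident.results[0]


# Hypothetical first edge: "#" ranked first, "井" second, 0.02 behind.
first_edge = Edge(
    start=0, end=1,
    seg=SegCandidate(cand_id=1, stroke_ids=[1, 2, 3, 4]),
    ident=IdentCandidate(cand_id=1, results=[("#", 0.80), ("井", 0.78)]),
)
graph = SegGraph(graph_id=1, doc_id=1, num_nodes=2, edges=[first_edge])
```

Keeping the whole ranked list on the edge, rather than only the top result, is what lets a later matching step still consider “井” even though “#” scored slightly higher.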
The processing that creates the character segmentation graph in step 1205 of FIG. 12 is described in detail with reference to FIG. 13. After the stroke set is input in step 1302, step 1303 cuts it into subsets that may each constitute a character. This is equivalent to creating the character segmentation candidates for the edges of the character segmentation graph.
At this point it may be difficult to determine the segmentation uniquely. For example, it may be hard to tell whether the right-hand part of the stroke set 401 on the document 400 in FIG. 4A is the single character “7” or the two characters “1” and “7”. In such a case, creating both the edge 411 and the pair of edges 412 and 413 allows a single segmentation graph to express the one-character reading “7” and the two-character reading “1” “7” at the same time. Using a character segmentation graph therefore enables character string search that tolerates imperfect character segmentation.
In the next step 1304, a character code is identified from the stroke shape of each character segmentation candidate. The resulting character identification candidates are held in the data structure shown in the table 930 of FIG. 9C. For example, for the edge 951 in FIG. 9A, it is difficult to decide between “#” (sharp) and “井” from the shape of the segmentation candidate alone, but the data structure of the table 930 can hold both hypotheses at once. Using a character segmentation graph therefore also enables character string search that tolerates imperfect character identification.
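To see how one graph holds both kinds of ambiguity at once, the following sketch enumerates every reading encoded by a small graph. The edge layout and similarity values are hypothetical and are not taken from the figures; they are chosen only so that the readings #7, #17, and 井口 mentioned earlier all appear.

```python
from itertools import product

# Each edge: (start node, end node, [(character code, similarity), ...]).
# Layout and similarities are hypothetical illustration values.
EDGES = [
    (0, 1, [("#", 0.80), ("井", 0.78)]),  # identification ambiguity
    (1, 3, [("7", 0.70), ("口", 0.60)]),  # remainder read as one character
    (1, 2, [("1", 0.65)]),                # remainder read as two characters
    (2, 3, [("7", 0.75)]),
]


def paths(edges, node, goal):
    """Yield every edge sequence from node to goal (the graph is acyclic)."""
    if node == goal:
        yield []
        return
    for edge in edges:
        if edge[0] == node:
            for rest in paths(edges, edge[1], goal):
                yield [edge] + rest


def readings(edges, start, goal):
    """Collect every character string the graph can be read as."""
    found = set()
    for path in paths(edges, start, goal):
        # Pick one identification result per edge along the path.
        for combo in product(*(cands for _, _, cands in path)):
            found.add("".join(ch for ch, _ in combo))
    return found


all_readings = readings(EDGES, 0, 3)
```

Enumerating readings like this grows quickly with graph size, so it is useful for inspection only; the matching described in the embodiment works on the graphs directly instead of expanding them.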
Finally, in step 1305, the character segmentation candidate and character identification candidate information is stored in association with each edge, and the character segmentation graph is output.
Note that a character segmentation graph can be created not only for the strokes in a document but also for character codes. Converting each character code in a character code string into one edge yields a single-path character segmentation graph. In this case, the character identification candidate of each edge consists of a single result, the corresponding character code, with a similarity of 1.0.
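The conversion of a character code string into a single-path graph is simple enough to sketch directly. The function name and the tuple layout of an edge are assumptions made for illustration.

```python
def graph_from_codes(text):
    """Build a single-path segmentation graph from a character code string.

    Returns (number of nodes, edges), where each edge is
    (start node, end node, [(character code, similarity)]).
    """
    # One edge per character; its only identification result has similarity 1.0.
    edges = [(i, i + 1, [(ch, 1.0)]) for i, ch in enumerate(text)]
    return len(text) + 1, edges


num_nodes, edges = graph_from_codes("#17")
```

Because every node has at most one outgoing edge, such a graph has no branches, which is why a typed string and a handwritten one can be handled by the same matching step.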
The character segmentation graphs obtained in the preprocessing are then used to search for the input pattern character string (FIG. 14).
First, in step 1402, the pattern character string to be searched for is input. The pattern character string may be expressed as character codes or as handwritten strokes. Next, in step 1403, a character segmentation graph is created from the pattern character string. This graph is hereinafter called the pattern character segmentation graph.
The pattern character segmentation graph is created from the character codes or handwritten strokes by the same procedure as described for steps 1203 to 1205 of FIG. 12 and for FIG. 13.
Next, if a document managed by the document management unit 102 has a character segmentation graph, that graph is acquired (steps 1404 and 1405). It is called the search target character segmentation graph. The pattern character segmentation graph is collated with the search target character segmentation graph to detect occurrences of the pattern (step 1406). This processing is described in detail later.
The result of collating the character segmentation graphs, that is, the character string search result, is held in the data structure shown in the table 1000 of FIG. 10.
An item 1001 indicates an ID for identifying the character string search result. An item 1002 is an index value of the result, and a larger value means a more reliable search result. Items 1003 and 1004 are IDs of the search target character cutout graph and the pattern character cutout graph, respectively.
An item 1005 indicates the number of matching portions of each edge of the search target character cutout graph and the pattern character cutout graph. Items 1006 and 1007 indicate information on the first matching portion, and indicate the corresponding edge number in the search target character cutout graph and the corresponding edge number in the pattern character cutout graph, respectively.
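The items of the character string search result described above could be held, for example, in a record like the following. This is only a sketch: the class name and field names are hypothetical, and only the item numbers in the comments come from the description.

```python
from dataclasses import dataclass, field

@dataclass
class SearchResult:
    result_id: int           # item 1001: ID of the search result
    score: float             # item 1002: larger means more reliable
    target_graph_id: int     # item 1003: search target cutout graph ID
    pattern_graph_id: int    # item 1004: pattern cutout graph ID
    # items 1006, 1007, ...: (target edge number, pattern edge number)
    matches: list = field(default_factory=list)

    @property
    def match_count(self):
        """Item 1005: number of matching portions."""
        return len(self.matches)

result = SearchResult(1, 3.7, 10, 20, [(0, 0), (3, 1)])
```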
In step 1407, the character string search result obtained in step 1406 is registered as a character string search candidate. The data structure of the character string search candidates is shown in Table 1100 of FIG. The IDs of the character string search results are registered in this table in order of score. Accordingly, the ID 1103 of character string search result #1 in table 1100 always refers to the character string search result with the highest score. In practice, an upper limit, such as 10 candidates, may be set on the total number of character string search results 1102. In that case, character string search results ranked below the limit are removed from the list.
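One possible way to keep the candidate list ordered by score and capped at a maximum size is sketched below. The function name `register` and the representation of a candidate as a `(score, result_id)` tuple are assumptions made for illustration.

```python
import bisect

MAX_CANDIDATES = 10  # example upper limit mentioned in the description

def register(candidates, result_id, score, limit=MAX_CANDIDATES):
    """Insert a result so that `candidates` stays sorted by descending score."""
    keys = [-s for s, _ in candidates]           # ascending keys for bisect
    pos = bisect.bisect_right(keys, -score)      # ties go after existing entries
    candidates.insert(pos, (score, result_id))
    del candidates[limit:]                       # drop results ranked below the limit

candidates = []
register(candidates, 1, 0.5, limit=2)
register(candidates, 2, 0.9, limit=2)
register(candidates, 3, 0.7, limit=2)
```

With a limit of 2, the result with score 0.5 falls off the list once two better results have been registered.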
The processing of steps 1405 to 1407 is executed for all the search target character cutout graphs. When all the processing is completed, a character string search candidate is output (1408).
A character string search result is obtained by the processing described above. Because both the pattern character string and the search target character string are converted into character cutout graphs before collation, the search can be performed regardless of whether each was originally expressed as character codes or as handwritten strokes (FIG. 2).
Finally, character cutout graph matching will be described with reference to FIGS. In the present embodiment, a description will be given of an application of dynamic programming as a graph matching method.
First, the character cutout graphs of the search target and the pattern are input (step 1502). Let the nodes of the search target character cutout graph be T0, T1, ..., Tm, and the nodes of the pattern character cutout graph be P0, P1, ..., Pn. Next, a collation table of (n + 1) rows and (m + 1) columns is created, and every entry is initialized to 0 (step 1503). Each entry of the collation table corresponds to the score of a node PiTj (0 ≦ i ≦ n, 0 ≦ j ≦ m) in the matrix 500 of FIG.
Next, the score of each node in the matrix 500 is calculated row by row for each Pi (step 1504). Within each row, the nodes are calculated one by one starting from the left column. For example, assume that the matrix node P1T2 (501) is being calculated. Its score is obtained by adding a transition score to the score of an already calculated matrix node connected to P1T2.
A transition source matrix node PaTb is a node satisfying 0 ≦ a ≦ 1 and 0 ≦ b ≦ 2 for which an edge (PaTb, P1T2), with PaTb as its start point and P1T2 as its end point, exists in the matrix 500. The matrix edge (PaTb, P1T2) exists in the matrix 500 when the edge (Pa, P1) exists in the pattern character cutout graph 560 and the edge (Tb, T2) exists in the search target character cutout graph 550.
In this example, the transition source nodes are P0T0 and P0T1. In addition, gap transitions (vertical or horizontal transitions) are allowed to account for extra strokes (insertions). That is, in this example, P0T2, for an insertion in the pattern character cutout graph, and P1T1, for an insertion in the search target character cutout graph, are also transition source nodes. The score S(P1T2) of P1T2 is then calculated from the score S(PaTb) of each transition source node PaTb and the transition score T(PaTb, P1T2) from PaTb to P1T2, and can be expressed as
S(P1T2) = max (S(PaTb) + T(PaTb, P1T2)) (Equation 1)
where the maximum is taken over all transition source nodes PaTb.
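The enumeration of transition source nodes for a matrix node can be sketched as follows. The two small graphs are hypothetical stand-ins chosen so that P1T2 has the sources named in the example; they are not the graphs 550 and 560 of the figures, and the function name is an assumption.

```python
def transition_sources(pattern_edges, target_edges, i, j):
    """Return (diagonal sources, gap sources) of matrix node PiTj.

    Edges are (start, end) node-index pairs of each cutout graph.
    """
    # A matrix edge (PaTb, PiTj) exists when (Pa, Pi) is a pattern edge
    # and (Tb, Tj) is a search target edge.
    diagonal = [(a, b)
                for (a, p_end) in pattern_edges if p_end == i
                for (b, t_end) in target_edges if t_end == j]
    gaps = []
    if j > 0:
        gaps.append((i, j - 1))   # horizontal gap: extra stroke in the target
    if i > 0:
        gaps.append((i - 1, j))   # vertical gap: extra stroke in the pattern
    return diagonal, gaps

pattern_edges = [(0, 1), (1, 2)]                 # hypothetical pattern graph
target_edges = [(0, 1), (1, 2), (0, 2), (2, 3)]  # hypothetical target graph
diagonal, gaps = transition_sources(pattern_edges, target_edges, 1, 2)
```

For P1T2 this gives the diagonal sources P0T0 and P0T1 and the gap sources P1T1 and P0T2.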
Several methods of calculating the transition score T(PaTb, P1T2) are conceivable. In this embodiment, one method using the information of the character identification candidates is shown. Let s(Pa, P1, C) be the character identification similarity when the character identification result of the edge (Pa, P1) is the character code C. If C is not among the character identification results of that edge, s(Pa, P1, C) = γ << 1.0. The transition score is then as follows.
T(PaTb, P1T2) = max (s(Pa, P1, C) + s(Tb, T2, C)) (Equation 2)
Here, the maximum is taken over the character codes C.
In Equation 2, the sum of the similarities is used, but a product may be used.
T (PaTb, P1T2) = max (s (Pa, P1, C) × s (Tb, T2, C)) (Equation 3)
With a product, a low value for either of the two similarities pulls the result down more strongly than with a sum. The product therefore favors cases in which the two edges share a character code with high similarity.
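The difference between the sum form (Equation 2) and the product form (Equation 3) can be seen in a small sketch. The value chosen for γ, the function name, and the sample candidate lists are assumptions for illustration.

```python
GAMMA = 0.01  # assumed value for "gamma << 1.0"

def transition_score(pattern_cands, target_cands, combine="sum"):
    """Best combined similarity over all character codes C.

    Each argument maps character code -> similarity for one edge.
    """
    def merge(sp, st):
        return sp + st if combine == "sum" else sp * st

    # A code listed by neither edge scores gamma on both sides.
    best = merge(GAMMA, GAMMA)
    for code in set(pattern_cands) | set(target_cands):
        sp = pattern_cands.get(code, GAMMA)
        st = target_cands.get(code, GAMMA)
        best = max(best, merge(sp, st))
    return best

shared = ({"a": 0.9, "o": 0.6}, {"o": 0.8, "c": 0.7})   # "o" is common
disjoint = ({"a": 0.9}, {"b": 0.9})                     # no common code

sum_shared = transition_score(*shared)
prod_shared = transition_score(*shared, combine="product")
sum_disjoint = transition_score(*disjoint)
prod_disjoint = transition_score(*disjoint, combine="product")
```

With these sample values, the sum still gives a fairly high score to the pair with no common code (about 0.91 against 1.4), whereas the product separates the two cases much more sharply (about 0.009 against 0.48).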
The transition score for a gap transition is defined as follows:
T (PiTj-1, PiTj) = α << 1.0 (Equation 4)
T (Pi-1Tj, PiTj) = β << 1.0 (i ≠ 1, n), 1.0 (i = 1 or n) (Equation 5)
Here, the gap score for i = 1 or n in the horizontal gap transition (the initial gap) is set to the full score of 1.0 because the pattern character string does not necessarily appear at the beginning or end of the search target character string.
In this way, the score of each node in the matrix is calculated in order from left to right and from the top row to the bottom row, and the collation table is filled in. After the collation table has been calculated, the collation result of the character cutout graphs, that is, the character string search result, is obtained from the collation table (step 1505). Starting from the matrix node PnTm at the lower right corner, the table is traced back toward the upper left, in the reverse of the calculation order.
Assume that PiTj is the node currently of interest. Among the transition source nodes PaTb of PiTj, the node with the highest score becomes the next node of interest, and the edge numbers of the edges (Tb, Tj) and (Pa, Pi) are extracted and temporarily stored. When P0T0 is finally reached, the pairs of search target edge numbers and pattern edge numbers are registered in items 1006 and 1007, 1008 and 1009, and so on, of table 1000, which completes the character string search result 1000.
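The table filling and traceback described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the embodiment itself: the values of α, β, and γ are arbitrary, the special boundary case of Equation 5 is omitted, the sum form of Equation 2 is used, and the function names and sample graphs are hypothetical.

```python
GAMMA = 0.01  # assumed value for "gamma << 1.0"

def transition(pc, tc):
    """Equation 2: best summed similarity over all character codes C."""
    return max(pc.get(c, GAMMA) + tc.get(c, GAMMA) for c in set(pc) | set(tc))

def match(pattern, n, target, m, alpha=0.1, beta=0.1):
    """pattern/target: lists of (start, end, candidates); nodes 0..n / 0..m."""
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):            # top row to bottom row
        for j in range(m + 1):        # left column to right column
            if i == 0 and j == 0:
                continue
            options = []              # (score, source i, source j, edge pair)
            if j > 0:                 # horizontal gap (Equation 4)
                options.append((score[i][j - 1] + alpha, i, j - 1, None))
            if i > 0:                 # vertical gap (Equation 5, interior case only)
                options.append((score[i - 1][j] + beta, i - 1, j, None))
            for pk, (a, p_end, pc) in enumerate(pattern):
                if p_end != i:
                    continue
                for tk, (b, t_end, tc) in enumerate(target):
                    if t_end == j:    # matrix edge (PaTb, PiTj) exists
                        s = score[a][b] + transition(pc, tc)
                        options.append((s, a, b, (tk, pk)))
            best = max(options, key=lambda o: o[0])
            score[i][j], back[i][j] = best[0], best[1:]
    # Trace back from PnTm to P0T0, collecting matched edge numbers.
    pairs, i, j = [], n, m
    while (i, j) != (0, 0):
        i, j, pair = back[i][j]
        if pair is not None:
            pairs.append(pair)        # (target edge number, pattern edge number)
    return score[n][m], pairs[::-1]

# Pattern "ab" given as character codes (single chain, nodes 0..2).
pattern = [(0, 1, {"a": 1.0}), (1, 2, {"b": 1.0})]
# Hypothetical handwritten target with a branch (nodes 0..3).
target = [(0, 1, {"a": 0.8}), (1, 2, {"l": 0.5}),
          (2, 3, {"o": 0.4}), (1, 3, {"b": 0.9})]
total, pairs = match(pattern, 2, target, 3)
```

In this sample, the traceback selects target edge 0 for the first pattern edge and the branching target edge 3 (nodes 1 to 3) for the second, which corresponds to reading the two strokes "l" and "o" together as a single "b".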
Some properties of the matrix 500 are noted here. Every edge (PaTb, PiTj) in the matrix satisfies a ≦ i and b ≦ j. The scores can therefore be calculated in order from the top row to the bottom row, which is why dynamic programming can be applied.
If the search target character string consists only of character codes, the character cutout graph 550 is a single chain without branching, and therefore j−1 ≦ b ≦ j. That is, each edge in the matrix advances by at most one column in the horizontal direction.
Similarly, if the pattern character string consists only of character codes, the character cutout graph 560 is a single chain without branching, and therefore i−1 ≦ a ≦ i. That is, each edge in the matrix advances by at most one row in the vertical direction.
Only when both the search target character string and the pattern character string are handwritten do edges that advance by more than one in both the vertical and horizontal directions exist, as indicated by edges 511 and 512. Creating such edges in the matrix is what makes it possible to collate the graphs.
Finally, the character string search result is output (step 1506), and the character cutout graph collation is terminated.
The above is the description of the embodiment according to the present invention.
When performing a search, a character string search may be performed by limiting the range of documents to be searched or by limiting the character string to be searched to only a character code or a handwritten stroke.
In addition, because the pattern character string is converted into a character cutout graph before searching, the search can be performed even when the pattern character string is a combination of character codes and strokes. This is because a pattern character cutout graph can be created by converting each part into a character cutout graph and joining the end node of one graph to the start node of the next. Similarly, a search target character string formed from a combination of character codes and strokes can be searched by gathering its parts into a single search target character cutout graph using coordinate information or the like.
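Joining two character cutout graphs at their end and start nodes might look like the following sketch. The edge representation `(start, end, candidates)`, the function name, and the sample graphs are assumptions introduced here.

```python
def concat_graphs(first, first_node_count, second):
    """Join two cutout graphs by merging the end node of `first`
    with the start node (index 0) of `second`."""
    offset = first_node_count - 1    # index of the end node of `first`
    shifted = [(s + offset, e + offset, c) for (s, e, c) in second]
    return first + shifted

# A character code part "a" (2 nodes) followed by a handwritten part
# with a branch (3 nodes); the similarities are made-up sample values.
code_part = [(0, 1, {"a": 1.0})]
stroke_part = [(0, 1, {"l": 0.5}), (1, 2, {"o": 0.4}), (0, 2, {"b": 0.9})]
combined = concat_graphs(code_part, 2, stroke_part)
```

The combined graph has nodes 0 to 3, and the branching of the handwritten part is preserved after the shift.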
In the above-described embodiment, the similarity of the character identification result is used in the transition score formula (Equation 2 or Equation 3). Alternatively, the rank of the character identification result may be used, or feature amounts such as line-segment direction features may be extracted from the character cutout candidate patterns and an inner product computed between the feature amounts.
When a character cutout graph is created from character codes, however, no character cutout candidates exist. In that case, for example, a character cutout candidate feature amount dictionary is prepared in which each character code is registered together with the feature amount of a character cutout candidate extracted from the standard glyph of that code. By acquiring the feature amount corresponding to the character code from the dictionary and substituting it, the transition score can be calculated using character cutout candidate feature amounts.
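A feature-based transition score with a dictionary fallback for character codes could be sketched as below. Everything specific here is an assumption: the 4-bin direction histogram, the dictionary contents, the edge representation as a dict, and the normalization of the vectors to unit length before the inner product.

```python
import math

# Hypothetical dictionary: character code -> feature amount extracted from
# the standard glyph (here a 4-bin line-direction histogram).
FEATURE_DICT = {
    "-": [1.0, 0.0, 0.0, 0.0],
    "|": [0.0, 0.0, 1.0, 0.0],
}

def unit(v):
    """Scale a vector to unit length (an assumption of this sketch)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def edge_features(edge):
    """Use stroke features when present; otherwise substitute the
    dictionary entry for the character code."""
    if "features" in edge:
        return edge["features"]
    return FEATURE_DICT[edge["code"]]

def feature_transition_score(pattern_edge, target_edge):
    """Inner product between the normalized feature amounts of two edges."""
    p = unit(edge_features(pattern_edge))
    t = unit(edge_features(target_edge))
    return sum(x * y for x, y in zip(p, t))

stroke_edge = {"features": [0.9, 0.1, 0.0, 0.0]}   # mostly horizontal stroke
score_h = feature_transition_score({"code": "-"}, stroke_edge)
score_v = feature_transition_score({"code": "|"}, stroke_edge)
```

The mostly horizontal stroke scores close to 1.0 against the code "-" and 0.0 against the code "|".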

The present invention is applicable to document management systems. It is particularly effective in a document management system that manages documents consisting of character codes together with annotations handwritten on them.

Claims

A character string search method in a character string search system having a pattern character string input unit, a document management unit, and a character string collation unit, the method comprising:
a pattern character string input step in which the pattern character string input unit receives from a user a pattern character string, which is the character string to be searched for, expressed as character codes, handwritten strokes, or both;
a search target character string reading step of reading a search target character string from the document management unit, which manages, as search target character string information, information on documents having character strings input as character codes, handwritten strokes, or both;
a character string collating step of converting each of the search target character string and the pattern character string into a character cutout graph that represents a plurality of possible character cutout hypotheses as a single directed acyclic graph, and extracting the portion of the document in which the pattern character string appears by collating the character cutout graphs with each other while calculating the degree of coincidence of the cut-out characters using dynamic programming; and
a display step of displaying the location at which the extracted pattern character string appears as a character string search result.

The character string search method according to claim 1,
wherein, in the character string collating step, a character identification result is used for the score calculation in the dynamic programming.

The character string search method according to claim 1 or 2,
wherein, in the character string collating step, a geometric feature amount generated from a character cutout candidate is used for the score calculation in the dynamic programming.

  The character string search method according to any one of claims 1 to 3, further comprising:
  preparing in advance character code graph correspondence information in which a character code is associated with the character cutout graph corresponding to that character code,
  wherein, in the character string collating step, the pattern character string or the search target character string input as character codes is converted into a character cutout graph using the character code graph correspondence information.

  A character string search system comprising:
  a pattern character string input unit that receives a pattern character string, which is the character string to be searched for, expressed as character codes, handwritten strokes, or both;
  a document management unit that manages, as search target character string information, information on documents having character strings input as character codes, handwritten strokes, or both;
  a character string collating unit that converts each of the search target character string and the pattern character string into a character cutout graph that represents a plurality of possible character cutout hypotheses as a single directed acyclic graph, and extracts the portion of the document in which the pattern character string appears by collating the character cutout graphs with each other while calculating the degree of coincidence of the cut-out characters using dynamic programming; and
  a display unit that displays the location at which the extracted pattern character string appears as a character string search result.

The character string search system according to claim 5,
wherein the character string collating unit uses a character identification result for the score calculation in the dynamic programming.

The character string search system according to claim 5 or 6,
wherein the character string collating unit uses a geometric feature amount generated from a character cutout candidate for the score calculation in the dynamic programming.

  The character string search system according to any one of claims 5 to 7,
  wherein the character string search system stores in advance character code graph correspondence information in which a character code is associated with the character cutout graph corresponding to that character code, and
  the character string collating unit converts the pattern character string or the search target character string input as character codes into a character cutout graph using the character code graph correspondence information.

  A character string search system comprising:
  a digital pen that converts handwritten character input into stroke information in the form of electronic data and outputs it; an input device that accepts input of the stroke information from the digital pen; a storage device that stores document information including a search target character string; and an arithmetic unit,
  wherein the input device receives, from the digital pen, stroke information of handwritten characters added to the document or stroke information of a pattern character string, which is the character string to be searched for, and transmits it to the arithmetic unit, and
  the arithmetic unit:
  converts the stroke information of the handwritten characters added to the document, or the search target character string included in the document stored in the storage device, into a search target character cutout graph that represents a plurality of possible character cutout hypotheses as a single directed acyclic graph, and converts the pattern character string, input as stroke information or character codes, into a pattern character cutout graph that represents a plurality of possible character cutout hypotheses of the pattern character string as a single directed acyclic graph; and
  searches the search target character string, input as handwritten characters, character codes, or both, for the pattern character string, input as handwritten characters, character codes, or both, by collating the search target character cutout graph with the pattern character cutout graph while calculating the degree of coincidence of the cut-out characters using dynamic programming.