JP2003242144A - Similar kanji retrieval method

Info

Publication number: JP2003242144A
Application number: JP2002082079A
Authority: JP
Inventors: Junichi Aoe; 順一青江; Kazuhiko Tsuda; 和彦津田; Fukutsugu Nin; 福継任; Masao Fuchida; 正雄泓田; Kazuhiro Morita; 和弘森田
Original assignee: Individual
Current assignee: Individual
Priority date: 2002-02-18
Filing date: 2002-02-18
Publication date: 2003-08-29

Abstract

PROBLEM TO BE SOLVED: To retrieve similar kanji (Chinese characters) in a short time from arbitrary kanji information alone, by generating a retrieval vector from several pieces of input kanji information and running a similarity search against a vector database built in advance, thereby eliminating the time and effort of selecting a radical or counting the strokes of a kanji.

SOLUTION: For each registered kanji obtained from input means 1, generation means 2 generates kanji component information, and generation means 3 newly generates and adds similar kanji components that make retrieval from the components easier. Means 4 merges these pieces of element information, and means 5 generates and stores a vector database whose vectors have those elements as their main components for each registered kanji. At retrieval time, means 7 acquires from the database the vectors of the kanji information entered through input means 1, and means 8 merges the input vectors to broaden the retrieval information. The similarity between the merged vector and the vector space of the vector database is then calculated, and ranked kanji candidates are output in a short time.

COPYRIGHT: (C)2003,JPO
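The retrieval flow summarized in the abstract can be illustrated with a minimal Python sketch. Everything concrete in it is an assumption made for illustration: the toy table `COMPONENTS`, the weight `1 / frequency` (one possible reading of "inverse ratio of the frequency"), cosine similarity as the similarity measure, and the choice to leave the input kanji out of the ranking. The source does not fix any of these.

```python
import math
from collections import Counter

# Hypothetical toy component sets (decomposition plus related elements).
COMPONENTS = {
    "泓": {"泓", "さんずい", "弘", "弓", "ム"},
    "弘": {"弘", "弓", "ム", "ゆみ"},
    "清": {"清", "さんずい", "青", "王", "月"},
    "肱": {"肱", "月", "ム"},
    "弸": {"弸", "弓", "月"},
}

def build_database(components):
    """Means 5 (sketch): weight each element by 1/frequency, so rare elements weigh more."""
    freq = Counter(e for elems in components.values() for e in elems)
    return {k: {e: 1.0 / freq[e] for e in elems} for k, elems in components.items()}

def merge(vectors):
    """Means 8 (sketch): union of the input vectors, viewed as sets of weighted elements."""
    merged = {}
    for vec in vectors:
        merged.update(vec)
    return merged

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts (assumed measure)."""
    dot = sum(w * v[e] for e, w in u.items() if e in v)
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def search(db, inputs):
    """Means 7 and 9 (sketch): fetch input vectors, merge them, rank the other kanji."""
    query = merge(db[k] for k in inputs)
    scored = [(k, cosine(query, vec)) for k, vec in db.items() if k not in inputs]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

db = build_database(COMPONENTS)
ranked = search(db, ["弘", "清"])
print(ranked)
```

With this toy data, entering “弘” and “清” ranks “泓” first, because it shares both the water radical with “清” and the parts of “弘”. In this sketch the inputs must themselves be registered kanji; inputs such as radical readings would need their own entries.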

Description

[Detailed description of the invention]

[0001] [Technical field of the invention] The present invention relates to a similar-kanji retrieval support device for word processors, portable terminals, and the like. As part of input support such as kana-kanji conversion and speech recognition for entering characters and documents, it allows difficult kanji whose readings are unknown to be entered in a short time.

[0002] [Prior art] With the spread of word processors and portable terminals, there are more and more occasions to enter difficult kanji whose readings are unknown, and international communication within Asia is also increasing the number of such kanji. Because the reading of a difficult kanji is often unknown, the following two input support methods have been put to practical use.

[0003] The first conventional method is palette input: the user enters a radical or a stroke count from a palette provided by the system to narrow down the matching kanji. However, most radicals are unfamiliar and not easy to enter, and counting strokes takes time, so quick kanji retrieval is difficult.

[0004] The second conventional method is kanji recognition from handwriting input. It improves on the drawbacks of the first method, but it is hard to write a difficult kanji with many strokes quickly using a mouse or similar device, and the method is hard to use for elderly people and people with disabilities affecting their hands. In addition, it cannot readily be combined with input support based on speech recognition, which has advanced in recent years.

[0005] In document retrieval, there is a technique that collects statistics on the words and character strings in a document, builds a vector for the document from weights calculated for those components, and retrieves similar documents by vector similarity. The present invention applies this vector similarity technique to kanji retrieval.

[0006] [Problem to be solved by the invention] The present invention was made in view of these circumstances. Its object is to eliminate the effort of selecting radicals and stroke counts in the palette method, and to solve the constraints of handwriting input in the kanji recognition method and its lack of integration with speech recognition input. Furthermore, conventional methods always require exact selection of radicals and stroke counts and exact input, so the search information has no flexibility, which is inconvenient for users. The present invention therefore introduces similarity and relatedness of meaning and shape into the retrieval of kanji, which are ideographs, and is characterized by a device that accepts a wider range of search information.

[0007] [Means for solving the problem] The data construction method for kanji retrieval according to the claim is a similar-kanji retrieval method and device in which search information for the difficult kanji sought (similar kanji, radicals, readings of radicals, and so on) is entered freely, the similarity between the input vector for that search information and a vector database built in advance from the components of kanji information is calculated, and kanji candidates are output in descending order of similarity. It is characterized in that the component set obtained directly from the kanji information is extended by newly adding related or similar component information, and in that kanji vectors, whose components are component weights based on the inverse ratio of the frequency of every component, are constructed together with their database.

[0008] [Embodiment of the invention] The kanji retrieval method is characterized by means for merging the input kanji vectors for several pieces of input kanji information, and by a similarity search of the vector database for kanji similar to the merged vector, with ranked output. As a result, the user no longer has to select kanji components from a screen; simply by entering information about the kanji sought (similar kanji, radicals, readings of radicals, and so on), the user can retrieve a list of the intended kanji with priority information attached.

[0009] [Example] The present invention is described concretely below with reference to the drawing showing its embodiment.

[0010] FIG. 1 is a block diagram showing the configuration of the similarity retrieval device according to the present invention (hereinafter, "the present invention"). In the figure, component set generation means 2 generates a kanji component set for each registered kanji obtained from input means 1 (keyboard, voice input, handwriting input, file input, range designation, and so on). For example, for the kanji “泓”, the components “さんずい” (sanzui, the water radical) and “弘” are determined, and the components “弓” and “ム” are further generated from “弘”, giving the element set {“泓”, “さんずい (radical)”, “ム”, “弓”}. These components are the elements of a decomposition of the kanji into parts. They are not data that depends on a new technique, so they can easily be built by hand.

[0011] The present invention adds new information to this decomposition into parts. For example, for the difficult kanji “鼈”, similar component set generation means 3 generates related components that resemble it, such as “亀”, “申”, and “かめ” (the reading kame), and adds them as new information. Component merging means 4 merges the two element sets, the kanji parts and the new information. Vector generation and storage means 5 generates and stores a weight vector for each registered kanji from the full set of elements of all registered kanji. As an example of how the vector components are weighted, a common method is to tally the frequency of every component across the component sets of all kanji and to give frequent components a small weight and rare components a large weight. For a kanji K, the following individual kanji vector VEC(K) is generated:

VEC(K) = (W(x1), W(x2), ..., W(xn))

[0012] Here xi (0 < i < n+1) is a component and W(xi) is its calculated weight. The set of these vectors is stored in the database as the overall vector.

[0013] When the database is created, kanji are registered in ascending order of stroke count, and the vector database is built up step by step. For example, suppose “土” is registered as a three-stroke kanji and its components are expressed as “十” and “一” (horizontal bar). Because “十” has already been registered as a two-stroke kanji, its parts “一” (horizontal bar) and “丨” (vertical bar) can be generated, and the components of “土” are determined to be “十”, “一” (horizontal bar), and “丨” (vertical bar). Next, the three-stroke kanji “士” yields the same components as “土” in the same way. Kanji whose component sets have a high inclusion ratio, such as “土” and “士”, can be judged to be highly similar components, and this can be used to generate similar component sets automatically. In the same sense, if the components of “工” are “丁”, “一” (horizontal bar), and “丨” (vertical bar), and “十” and “丁” are already similar components, then “土”, “士”, and “工” can be regarded as similar components. In this way, the similarity of components is determined by the inclusion relations between element sets.

[0014] For example, when searching for the difficult kanji “壥”, palette input allowed a search only by the radical “土”. In the present invention, thanks to the similar elements, entering any of “土”, “士”, or “工” works as search information, so the range of usable input is extended.

[0015] In the similar-kanji retrieval device, one or more kanji similar to the kanji sought are entered through search condition input means 11 (keyboard, speech recognition, range designation, and so on). Kanji vector acquisition means 7 retrieves the vector of each input from the database. Kanji vector merging means 8 generates a search vector by merging those vectors (if a vector is regarded as a set of components, merging means taking the union). Similar kanji search means 9 performs the search by matching the merged vector against the vector database. For example, if “泓” is the kanji sought, possible inputs include (“弘”, “清”), (さんずい, 弘), (弓, ム), and (弘). For the first input pair, the main components of “弘” are “弓”, “ム”, “ゆみ” (the reading yumi), and so on, and the main components of “清” are “さんずい”, “青”, “王”, “月”, and so on. The similarity between the search vector that merges these components and the database is calculated, and kanji with high similarity (for example “泓”, “肱”, and “弸”) are output in order.

[0016] [Effect of the invention] As described above, the device of the present invention stores in a storage device, as the overall vector, the set of kanji vectors whose components are component weights, built from the kanji retrieval components and their similar components. It takes the individual kanji vectors for the several input kanji out of the vector database and merges them, and it uses the merged vector as the search vector for a similarity search of the vector database. This removes the effort of selecting radicals and stroke counts, and similar kanji can be retrieved in a short time simply by entering whatever kanji information comes intuitively to mind. In particular, for elderly people and people with disabilities who find handwritten kanji input inconvenient, linking the device to voice input makes it a still better kanji input device. Moreover, with palette input and handwriting input it was difficult for users to change the database provided by the system, whereas in the present invention kanji information can easily be added through the registration procedure of FIG. 1, so the device is also highly extensible.

[Brief description of the drawings]
FIG. 1 is a block diagram showing the configuration of the device of the present invention.

[Explanation of reference numerals]
1 Input means
2 Component set generation means
3 Similar component set generation means
4 Component set merging means
5 Vector generation and storage means
6 Vector database
7 Kanji vector acquisition means
8 Kanji vector merging means
9 Similarity search means
10 Output means
11 Search condition input means
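Paragraph [0013] of the description registers kanji in ascending stroke order and judges similarity from the inclusion between element sets. The Python sketch below illustrates that idea under stated assumptions: the table `DIRECT_PARTS` is hypothetical, “一” and “丨” stand for the horizontal and vertical bars, “丁” is approximated as the same two bars, the element sets leave out the kanji itself, and the inclusion ratio is taken to be the shared elements divided by the size of the larger set. The source names the ratio but does not define it.

```python
# Hypothetical direct decompositions, listed in ascending stroke order.
DIRECT_PARTS = [
    ("十", ["一", "丨"]),
    ("丁", ["一", "丨"]),  # assumption: the hooked stroke is treated as a vertical bar
    ("土", ["十", "一"]),
    ("士", ["十", "一"]),
    ("工", ["丁", "一"]),
]

def register_all(direct_parts):
    """Build element sets step by step, reusing the sets of already registered parts."""
    registry = {}
    for kanji, parts in direct_parts:
        elems = set(parts)
        for part in parts:
            elems |= registry.get(part, set())
        registry[kanji] = elems
    return registry

def inclusion_ratio(a, b):
    """Assumed definition: shared elements divided by the size of the larger set."""
    return len(a & b) / max(len(a), len(b))

def similar_pairs(registry, threshold=1.0):
    """Find pairs of kanji whose element sets match once known-similar elements are unified."""
    alias = {}  # maps a kanji to an earlier kanji judged similar to it
    pairs = []
    names = list(registry)
    for i, kanji in enumerate(names):
        for earlier in names[:i]:
            a = {alias.get(e, e) for e in registry[kanji]}
            b = {alias.get(e, e) for e in registry[earlier]}
            if inclusion_ratio(a, b) >= threshold:
                pairs.append((earlier, kanji))
                alias.setdefault(kanji, alias.get(earlier, earlier))
    return pairs

registry = register_all(DIRECT_PARTS)
pairs = similar_pairs(registry)
print(pairs)
```

Because “丁” is first unified with “十”, the element set of “工” becomes identical to those of “土” and “士”, which reproduces the chain of reasoning in the paragraph. A lower `threshold` would admit partial overlaps as well.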

Continuation of front page: (72) Inventor: Kazuhiro Morita, 2-3-56 Kitayaso-cho, Tokushima-shi. F-terms (reference): 5B009 LB03 LC01; 5B075 ND03 PP28 PR06 QM08 UU02

Claims

Claims: 1. A similar kanji retrieval method characterized by comprising: kanji component set generation means for generating a set of kanji components for kanji obtained from input means such as a keyboard, voice recognition, or range designation; similar element set generation means for generating a set of elements related or similar to each component; component merging means for merging the two sets; vector generation and storage means for generating, from the full set of elements for all the kanji generated above, kanji vectors whose components are component weights, and storing them in a vector database; vector acquisition means for acquiring from the vector database the vectors of a plurality of pieces of kanji information entered through the input means for an intended kanji; vector merging means for merging the plurality of kanji information vectors; and similarity search means for retrieving and outputting ranked kanji candidates by calculating the similarity between the merged vector and the vector database.
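The first three means named in the claim (component set generation, similar element set generation, and merging) could be sketched in Python as follows. The tables `PARTS` and `RELATED` are hypothetical hand-built data that mirror the examples in the description. Keeping the intermediate part “弘” in the resulting set is a choice made for this sketch, not something the source specifies.

```python
# Hypothetical hand-built tables; the entries mirror the examples in the description.
PARTS = {"泓": ["さんずい", "弘"], "弘": ["弓", "ム"]}
RELATED = {"鼈": ["亀", "申", "かめ"], "弓": ["ゆみ"]}

def component_set(kanji, parts=PARTS):
    """Component set generation: the kanji plus every part reached by repeated decomposition."""
    elems = {kanji}
    for part in parts.get(kanji, []):
        elems |= component_set(part, parts)
    return elems

def similar_set(elems, related=RELATED):
    """Similar element set generation: elements related or similar to any element in the set."""
    return {r for e in elems for r in related.get(e, [])}

def merged_set(kanji):
    """Component merging: union of the decomposition and the related elements."""
    base = component_set(kanji)
    return base | similar_set(base)

print(merged_set("泓"))
print(merged_set("鼈"))
```

The merged sets are what the vector generation and storage means would then turn into weighted kanji vectors.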