JP2003242144A - Similar kanji retrieval method - Google Patents

Similar kanji retrieval method

Info

Publication number
JP2003242144A
JP2003242144A JP2002082079A JP2002082079A JP2003242144A JP 2003242144 A JP2003242144 A JP 2003242144A JP 2002082079 A JP2002082079 A JP 2002082079A JP 2002082079 A JP2002082079 A JP 2002082079A JP 2003242144 A JP2003242144 A JP 2003242144A
Authority
JP
Japan
Prior art keywords
kanji
vector
input
similar
retrieval
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
JP2002082079A
Other languages
Japanese (ja)
Inventor
Junichi Aoe
順一 青江
Kazuhiko Tsuda
和彦 津田
Fukutsugu Nin
福継 任
Masao Fuchida
正雄 泓田
Kazuhiro Morita
和弘 森田
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2002-02-18
Filing date
2002-02-18
Publication date
2003-08-29
Application filed by Individual
Priority to JP2002082079A
Publication of JP2003242144A
Legal status: Pending

Abstract

PROBLEM TO BE SOLVED: To enable similar kanji (Chinese characters) to be retrieved in a short time simply by entering arbitrary kanji information, by generating a retrieval vector from a plurality of pieces of input kanji information and performing a similarity search against a vector database constructed in advance, thereby eliminating the time and effort of selecting the radical or the number of strokes of a kanji.
SOLUTION: For a registered kanji obtained from an input means 1, kanji constituent element information is generated by a generation means 2, and similar constituent elements that make retrieval easier are newly generated from those elements and added by a generation means 3. These pieces of element information are merged by a means 4, and a vector database whose vectors have the elements of each registered kanji as their main components is generated and stored by a means 5. In kanji retrieval, the vectors of the kanji information entered through the input means 1 are acquired from the database by a means 7, and the input vectors are merged by a means 8 to extend the retrieval information. The similarity between the merged vector and the vector space of the vector database is then calculated, and ranked kanji candidates are output in a short time.
COPYRIGHT: (C)2003,JPO

Description

【発明の詳細な説明】 【0001】 【発明の属する技術分野】本発明は,ワードプロセッ
サ,携帯端末などにおいて,文字や文書を入力する仮名
漢字変換,音声認識などの入力支援における読みの分か
らない難しい漢字入力を短時間で行える類似漢字検索支
援装置に関する。 【0002】 【従来の技術】ワードプロセッサ,携帯端末などの普及
により,読みの分からない難漢字を入力する機会は増大
しており,また,アジア圏の国際コミュニケーションに
より難漢字の数も多くなってきている。難漢字はその読
みも不明な場合が多いので,以下の2種類の入力支援方
法が実用化されている。 【0003】第1の従来法では,システムが準備したパ
レットから部首や画数を入力して該当する漢字を絞り込
み検索するパレット入力方式があるが,部首の大半は馴
染みが薄く,容易に入力できないこと,また画数を数え
るのは時間がかかり,短時間の漢字検索は困難である。 【0004】第2の従来法では,手書き入力による漢字
認識方式であり、第1の従来法の欠点を改善している
が、画数の多い難しい漢字を短時間にマウスなどで入力
するのは難しいこと,また手の不自由な身障者や高齢者
に対しては利用が難しいなどの問題点がある。さらに,
近年進歩してきている音声認識を利用した入力支援にも
容易に対応できない欠点がある。 【0005】文書検索では,文書中に存在する単語や文
字列の統計データをとり,それらの成分に対する重みを
計算した文書に対するベクトルを構成し,そのベクトル
類似手法により,類似文書を検索する手法があり,本発
明はこのベクトル類似手法を漢字検索に組み合わせる。 【0006】 【発明が解決しようとする課題】本発明は斯かる事情に
鑑みてなされたものであり,パレット方式における部首
や画数の選択の手間を省くこと,また漢字認識方式の手
書き入力の制約と音声認識入力との非連動性の問題を解
決することである。さらに従来法では正確な部首や画数
の選択,正確な入力が常に要求され,検索情報に自由度
がないので利用する側も不便があったので,本発明で
は,表意文字である漢字検索に,意味や形の類似と関連
性を導入し,より広い検索情報が入力指定できる装置を
特徴とする。 【0007】 【課題を解決するための手段】請求項に係る漢字検索の
データ構築方法は,検索対象となる難漢字に対する検索
情報(類似漢字,部首,部首の読みなど)を自由に入力
し,その検索情報に対する入力ベクトルと,漢字情報の
構成要素からあらかじめ構築されたベクトルデータベー
スとで類似度を計算し,類似度の高い順に漢字候補を出
力できる類似漢字検索方法及び装置。漢字情報から直接
得られる構成要素集合に関連・類似する構成情報を新た
に加えて拡張し,全ての構成要素の頻度の逆比を利用し
た構成要素の重みを成分とする漢字ベクトルとそのデー
タベースを構築することを特徴とする。 【0008】 【発明の実施の形態】漢字検索方法は,複数の入力漢字
情報に対する入力漢字ベクトルを併合する手段と,併合
ベクトルの類似漢字をベクトルデータベースから類似度
検索して,ランキング出力することを特徴とする。以上
により,入力者は漢字構成要素を画面などから選択する
手間がなくなり,検索したい漢字情報(類似漢字,部
首,部首の読みなど)を入力するだけで,意図する漢字
一覧を優先度情報付きで検索できる。 【0009】 【実施例】以下,本発明をその実施の形態を示す図面を
参照して具体的に説明する。 【0010】図1は,本発明に係る類似検索装置(以
下,本発明という)の構成を示すブロック図である。図
中1は,キーボード,音声入力,手書き入力,ファイル
入力,範囲指定などの入力手段1から得られた登録漢字
に対して,構成要素集合生成手段2で漢字構成要素集合
が生成される。例えば,“泓”なる漢字に対しては,
“さんずい(部首)”と“弘”の構成要素が決定される
が,さらに“弘”から“弓”と“ム”の構成要素が生成
されるので,要素集合{“泓”,“さんずい(部
首)”,“ム”“弓”}が得られる。この構成要素と
は,漢字の部品(パーツ)分解の要素であり,新しい技
術によるデータではないので,人手で容易に構築でき
る。 【0011】本発明では,漢字の部品分解に新情報を加
える。例えば,“鼈”なる難漢字対して,類似している
関連構成要素“亀”,“申”,“かめ”などを生成し,
新情報として追加するのが類似漢字集合生成手段3であ
る。これら漢字部品と新情報の二つの要素集合を併合す
るのが構成要素併合手段4であり,全登録漢字の要素の
全体集合から登録漢字に対する重みベクトル生成し格納
するのがベクトル生成格納手段5である。このベクトル
の成分の重み計算の例としては,各漢字の構成要素集合
から全ての構成要素の頻度を集計し,頻度の多い構成要
素には小さな重みを,頻度の少ない構成要素には大きな
重みを計算する方法が一般的であり,漢字Kに対して次
の個別漢字ベクトルVEC(K)を生成する。 VEC(K)=(W(x1),W(x2), ...
,W(xn)) 【0012】ここで,xi(0<i<n+1)は構成要
素であり,w(xi)は計算された重みである。これら
のベクトルの集合は,全体ベクトルとしてデータベース
に格納される。 【0013】データベースを作成する際の漢字登録は,
画数の少ない漢字から順に登録し,段階的にベクトルデ
ータベースを構築する。例えば,“土”が3画で登録さ
れたとすると,それぞれの構成要素は“十”,“?(横
棒)”で表現される。このとき,“十”は2画漢字とし
て先に登録済みであるので,“十”の部品“?(横
棒)”と“1(縦棒)は生成でき,漢字“土”の構成要
素は“十”,“?(横棒)”と“1(縦棒)と決定でき
る。次に,3画の漢字“士”も同様にして,“土”と同
じ構成要素が生成できるが,ここで構成要素集合の包含
率の高い漢字同士“土”と“士”は,類似度の高い構成
要素と判定でき,類似構成要素集合の自動生成に利用で
きる。この意味で,”工“の構成要素が”丁“,?(横
棒)”,“1(縦棒)”で,“十”と“丁”が既に類似
構成要素であるならば,“土”は“土”と“工”は類似
性のある構成要素として考えられる。このように構成要
素の類似性は要素集合の包含関係で決定される。 【0014】例えば,“壥”なる難漢字を検索する場
合,パレット入力であれば,部首“土”しか検索できな
かったが,本発明では,類似要素により入力情報として
““土”,“土”,“工”のいずれを入力しても検索情
報として,検索することが可能となるので,入力できる
情報が拡張できる。 【0015】類似漢字検索装置では,検索したい漢字に
類似する一つ以上の漢字をキーボード,音声認識,範囲
指定などの検索条件入力手段11より入力し,それぞれ
のベクトルをデータベースから検索するが漢字ベクトル
取得手段7で,それら複数のベクトルを併合(ベクトル
を構成要素の集合と考えると,和集合をとることを意味
する)した検索ベクトルを生成するのが漢字ベクトル併
合手段8で,併合ベクトルとベクトルデータベースの照
合を行い検索するのが類似漢字検索手段9である。例え
ば,“泓”を検索したい漢字とすると,類似漢字の入力
例として,(“弘”,“清”),(さんずい,弘),
(弓,ム),(弘)などが考えられる。最初の入力組に
対して,“弘”の主要な構成要素は,“弓”,“ム”,
“ゆみ“などであり,”清“の主要な構成要素は“さん
ずい,”青“,”王“,”月“などである。これらの構
成要素を併合した検索ベクトルとデータベースの類似度
計算により,その類似度の高い漢字(例えば,“泓”,
“肱”,“弸”など)が順に出力される。 【0016】 【発明の効果】以上のように本発明装置は漢字検索の構
成要素とその類似構成要素の集合から構成要素の重みを
成分とする漢字ベクトルの集合を全体ベクトルとして記
憶装置に格納しておき,入力された複数の漢字に対する
それぞれの個別漢字ベクトルをベクトルデータベースか
ら取り出して併合し,その併合ベクトルを検索漢字ベク
トルとして,ベクトルデータベースから類似検索を行う
ので,漢字の部首や画数を選択する手間をなくし,頭の
中で直感的に思い浮かぶことができる任意の漢字情報を
入力するだけで,類似漢字が短時間で検索できるという
優れた効果がある。特に,手書きによる漢字入力に不便
を感じる身障者や高齢者は,音声入力と連動させること
で,より優れた漢字入力装置となる。また,パレット入
力や手書き入力では,システム側で準備されたデータベ
ースを利用者が変更するのは困難であったが,本発明で
は,漢字情報が図1の登録手法により容易に追加できる
ので,拡張性においても優れたものとなる。
[Detailed description of the invention]
[0001] [Technical field of the invention] The present invention relates to a similar-kanji search support device that allows difficult kanji whose readings are unknown to be entered in a short time, as input support for kana-kanji conversion, speech recognition and similar methods of entering characters and documents on word processors, portable terminals and the like.
[0002] [Prior art] With the spread of word processors and portable terminals, occasions for entering difficult kanji whose readings are unknown are increasing, and international communication within Asia is also increasing the number of difficult kanji. Because the reading of a difficult kanji is often unknown, the following two kinds of input support methods have been put to practical use.
[0003] The first conventional method is palette input, in which a radical or stroke count is entered from a palette provided by the system to narrow down the candidate kanji. However, most radicals are unfamiliar and cannot be entered easily, and counting strokes takes time, so finding a kanji quickly is difficult.
[0004] The second conventional method is kanji recognition from handwritten input, which remedies the drawbacks of the first method. However, it is difficult to write a complex kanji with many strokes quickly using a mouse or similar device, and the method is hard to use for people with limited use of their hands and for the elderly. A further drawback is that it cannot easily be adapted to input support based on speech recognition, which has advanced in recent years.
[0005] In document retrieval, there is a technique that collects statistics on the words and character strings in a document, constructs a vector for the document whose components are weights calculated for them, and retrieves similar documents by vector similarity. The present invention combines this vector similarity technique with kanji search.
[0006] [Problem to be solved by the invention] The present invention has been made in view of these circumstances. Its object is to eliminate the effort of selecting radicals and stroke counts in the palette method, and to overcome the constraints of handwritten input in the kanji recognition method and its lack of linkage with speech recognition input. Furthermore, the conventional methods always require exact selection of the radical or stroke count and exact input, leaving no flexibility in the search information, which is inconvenient for the user. The present invention is therefore characterized by a device that introduces similarity and relatedness of meaning and shape into the search for kanji, which are ideographic characters, so that a wider range of search information can be entered.
[0007] [Means for solving the problem] The claimed method and device for similar-kanji search let the user freely enter search information for the difficult kanji being sought (similar kanji, radicals, readings of radicals, etc.), calculate the similarity between the input vector for that search information and a vector database built in advance from the constituent elements of kanji information, and output kanji candidates in descending order of similarity. The data construction method is characterized in that the set of constituent elements obtained directly from kanji information is extended by newly adding related or similar constituent information, and in that kanji vectors whose components are element weights based on the inverse ratio of the frequency of every constituent element are constructed together with a database of those vectors.
[0008] [Embodiment of the invention] The kanji search method is characterized by means for merging the input kanji vectors for a plurality of pieces of input kanji information, and by retrieving kanji similar to the merged vector from the vector database by similarity and outputting them as a ranked list. As a result, the user no longer needs to select kanji constituent elements from a screen. Simply by entering kanji information for the kanji being sought (similar kanji, radicals, readings of radicals, etc.), the user can retrieve a list of the intended kanji with priority information attached.
[0009] [Example] The present invention is described concretely below with reference to the drawing showing an embodiment.
[0010] FIG. 1 is a block diagram showing the configuration of the similarity search device according to the present invention (hereinafter referred to as the present invention). For a registered kanji obtained from the input means 1 (keyboard, voice input, handwriting input, file input, range selection, etc.), the constituent element set generating means 2 generates a set of kanji constituent elements. For example, for the kanji "泓", the constituent elements "さんずい (sanzui, radical)" and "弘" are determined, and "弘" in turn yields the elements "弓" and "ム", so the element set {"泓", "さんずい (radical)", "ム", "弓"} is obtained. These constituent elements are simply the parts into which a kanji decomposes. They are not data produced by a new technique, so they can easily be compiled by hand.
[0011] The present invention adds new information to this decomposition into parts. For example, for the difficult kanji "鼈", the similar kanji set generating means 3 generates similar or related constituent elements such as "亀", "申" and "かめ" and adds them as new information. The constituent element merging means 4 merges these two element sets (the kanji parts and the new information), and the vector generation and storage means 5 generates and stores a weight vector for each registered kanji over the universal set of elements of all registered kanji. A common way to calculate the component weights is to tally the frequency of every constituent element across the element sets of all kanji, assigning small weights to frequent elements and large weights to infrequent ones. For a kanji K, the following individual kanji vector VEC(K) is generated:
VEC(K) = (W(x1), W(x2), ..., W(xn))
[0012] Here xi (0 < i < n+1) is a constituent element and W(xi) is its calculated weight. The set of these vectors is stored in the database as the overall vector.
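The construction in paragraphs [0010] to [0012] can be sketched roughly as follows. This is only an illustration under stated assumptions: the description says frequent elements get small weights and rare ones large weights but gives no formula, so the `1 + log(N / f)` weight below is a swapped-in, IDF-style choice. The names `PARTS`, `RELATED`, `merge_element_sets`, `element_weights` and `build_vectors` are hypothetical, and the element sets are toy data.

```python
import math
from collections import Counter

# Illustrative data only. PARTS plays the role of the output of means 2
# (decomposition into parts); RELATED plays the role of means 3
# (related/similar elements, here the ones the text lists for "鼈").
PARTS = {
    "泓": {"泓", "さんずい", "弓", "ム"},
    "清": {"清", "さんずい", "青", "王", "月"},
    "鼈": {"鼈", "敝", "黽"},
}
RELATED = {"鼈": {"亀", "申", "かめ"}}


def merge_element_sets(parts, related):
    """Means 4: union of the part set and the related-element set per kanji."""
    return {k: set(p) | related.get(k, set()) for k, p in parts.items()}


def element_weights(element_sets):
    """Assumed weighting: 1 + log(N / f), so frequent elements weigh less."""
    n_kanji = len(element_sets)
    freq = Counter(e for s in element_sets.values() for e in s)
    return {e: 1.0 + math.log(n_kanji / f) for e, f in freq.items()}


def build_vectors(element_sets):
    """Means 5: sparse vector VEC(K) as a dict {element: weight} per kanji."""
    weights = element_weights(element_sets)
    return {k: {e: weights[e] for e in s} for k, s in element_sets.items()}


element_sets = merge_element_sets(PARTS, RELATED)
vectors = build_vectors(element_sets)
```

Storing each vector as a sparse dictionary keeps only the non-zero components, which seems a natural fit when every kanji has a handful of elements out of a large universal set.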
[0013] When the database is created, kanji are registered in ascending order of stroke count, so that the vector database is built up in stages. For example, suppose "土" is registered as a three-stroke kanji, with its constituent elements expressed as "十" and a horizontal bar. Since "十" has already been registered as a two-stroke kanji, its parts (a horizontal bar and a vertical bar) can be generated, and the constituent elements of "土" are determined to be "十", the horizontal bar and the vertical bar. Next, the three-stroke kanji "士" yields the same constituent elements as "土" in the same way. Kanji whose element sets have a high inclusion ratio, such as "土" and "士", can be judged to be highly similar constituent elements, and this can be used to generate similar constituent element sets automatically. In the same sense, if the constituent elements of "工" are "丁", the horizontal bar and the vertical bar, and "十" and "丁" are already similar constituent elements, then "土" and "工" can also be regarded as similar constituent elements. In this way, the similarity of constituent elements is determined by the inclusion relations between element sets.
[0014] For example, when searching for the difficult kanji "壥", palette input allows a search only through the radical "土". In the present invention, thanks to the similar elements, entering "土" or a similar element such as "工" works as search information, so the range of information that can be entered is extended.
[0015] In the similar kanji search device, one or more kanji similar to the kanji being sought are entered through the search condition input means 11 (keyboard, speech recognition, range selection, etc.). The kanji vector acquisition means 7 retrieves the vector of each from the database. The kanji vector merging means 8 generates a search vector by merging these vectors (if a vector is regarded as a set of constituent elements, this corresponds to taking the union). The similar kanji search means 9 then searches by matching the merged vector against the vector database. For example, if "泓" is the kanji being sought, possible inputs of similar kanji include ("弘", "清"), (さんずい, 弘), (弓, ム) and (弘). For the first input pair, the main constituent elements of "弘" are "弓", "ム", "ゆみ" and so on, and the main constituent elements of "清" are "さんずい", "青", "王", "月" and so on. By calculating the similarity between the search vector that merges these elements and the database, kanji with high similarity (for example "泓", "肱" and "弸") are output in order.
[0016] [Effects of the invention] As described above, the device of the present invention stores in a storage device, as the overall vector, a set of kanji vectors whose components are element weights derived from the sets of kanji constituent elements and their similar elements. It retrieves from the vector database the individual kanji vector of each input kanji, merges them, and uses the merged vector as the search kanji vector for a similarity search of the vector database. This removes the effort of selecting a kanji's radical or stroke count, and has the excellent effect that similar kanji can be found in a short time simply by entering any kanji information that comes intuitively to mind. In particular, for people with disabilities and elderly people who find handwritten kanji input inconvenient, linking the device with voice input makes it an even better kanji input device. Moreover, with palette input or handwriting input it was difficult for users to modify the database prepared by the system, whereas in the present invention kanji information can easily be added by the registration procedure of FIG. 1, so the device is also highly extensible.

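The retrieval flow of paragraph [0015] (acquire the vector of each input kanji, merge them, rank the database by similarity) might look like the sketch below. Several points are assumptions rather than anything the description fixes: cosine similarity is used as the similarity measure, overlapping components are merged by taking the larger weight, the input kanji themselves are left out of the ranking, and the weights in `DB` are made-up toy values. The names `DB`, `merge`, `cosine` and `search` are hypothetical.

```python
import math

# Hypothetical vector database: kanji -> {element: weight}.
# Element choices follow the examples in the text; weights are illustrative.
DB = {
    "泓": {"さんずい": 1.0, "弓": 1.5, "ム": 1.5},
    "弘": {"弓": 1.5, "ム": 1.5, "ゆみ": 2.0},
    "清": {"さんずい": 1.0, "青": 2.0, "王": 2.0, "月": 1.0},
    "肱": {"月": 1.0, "ム": 1.5},
    "土": {"十": 1.5, "一": 1.0},
}


def merge(vectors):
    """Means 8: union of components; on overlap keep the larger weight (assumed)."""
    merged = {}
    for vec in vectors:
        for elem, w in vec.items():
            merged[elem] = max(w, merged.get(elem, 0.0))
    return merged


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(e, 0.0) for e, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query_kanji, db, top_k=3):
    """Means 7 and 9: look up input vectors, merge them, rank the other kanji."""
    query = merge(db[k] for k in query_kanji if k in db)
    scored = [(k, cosine(query, v)) for k, v in db.items() if k not in query_kanji]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


results = search(["弘", "清"], DB)
```

With this toy data, the input pair ("弘", "清") ranks "泓" first because it shares さんずい with "清" and both 弓 and ム with "弘", which mirrors the example in the description.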
【図面の簡単な説明】 【図1】本発明装置の構成を示すブロック図である。 【符号の説明】 1 入力手段 2 構成要素集合生成手段 3 類似構成要素集合生成手段 4 構成要素集合併合手段 5 ベクトル生成格納手段 6 ベクトルデータベース 7 漢字ベクトル取得手段 8 漢字ベクトル併合手段 9 類似検索手段 10 出力手段 11 検索条件入力手段
[Brief description of the drawings]
FIG. 1 is a block diagram showing the configuration of the device of the present invention.
[Explanation of reference numerals]
1 Input means
2 Constituent element set generating means
3 Similar constituent element set generating means
4 Constituent element set merging means
5 Vector generation and storage means
6 Vector database
7 Kanji vector acquisition means
8 Kanji vector merging means
9 Similarity search means
10 Output means
11 Search condition input means

フロントページの続き (72)発明者 森田 和弘 徳島市北矢三町2丁目3番56号 Fターム(参考) 5B009 LB03 LC01 5B075 ND03 PP28 PR06 QM08 UU02
Continuation of front page: (72) Inventor Kazuhiro Morita, 2-3-56 Kitayaso-cho, Tokushima-shi. F-term (reference): 5B009 LB03 LC01 5B075 ND03 PP28 PR06 QM08 UU02

Claims (1)

【特許請求の範囲】 【請求項1】 キーボード,音声認識,範囲指定などの
入力手段から漢字を構成する要素集合を生成する漢字構
成要素集合生成手段と,それぞれの構成要素に関連・類
似する要素集合を生成する類似要素集合生成手段と,そ
の二つの集合を併合する構成要素集合併合手段と,以上
で生成された全漢字に対する全要素集合から構成要素の
重みを成分とする漢字ベクトルを生成し,そのベクトル
データベースを格納するベクトル生成格納手段と,入力
手段から入力された意図する漢字に対する複数の漢字情
報のベクトルをベクトルデータベースから取得するベク
トル取得手段と,複数漢字情報のベクトルを併合するベ
クトル併合手段と,併合ベクトルとベクトルデータベー
スとの類似計算により順位付きで漢字候補を検索出力す
る類似検索手段とを具備することを特徴とする類似漢字
検索手法。
[Claims] [Claim 1] A similar kanji search method comprising: kanji constituent element set generating means for generating the set of elements constituting a kanji obtained from input means such as a keyboard, speech recognition or range selection; similar element set generating means for generating a set of elements related or similar to each constituent element; constituent element set merging means for merging the two sets; vector generation and storage means for generating, from the full element set of all kanji generated as above, kanji vectors whose components are the weights of the constituent elements, and for storing them as a vector database; vector acquisition means for acquiring from the vector database the vectors of a plurality of pieces of kanji information entered through the input means for an intended kanji; vector merging means for merging the vectors of the plurality of pieces of kanji information; and similarity search means for retrieving and outputting ranked kanji candidates by calculating the similarity between the merged vector and the vector database.
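One way the similar element set generating means of the claim could be driven, following the inclusion-based idea in paragraph [0013] of the description, is sketched below. The description does not define the inclusion ratio numerically, so the definition used here (size of the intersection divided by the size of the smaller set) and the 0.6 threshold are assumptions. The glyphs "一" and "丨" are stand-ins for the horizontal and vertical bar, and the names `ELEMENT_SETS`, `inclusion_ratio` and `similar_elements` are hypothetical.

```python
# Element sets as in the 土 / 士 / 工 example of the description.
# "一" and "丨" stand in for the horizontal and vertical bar (assumed glyphs).
ELEMENT_SETS = {
    "土": {"十", "一", "丨"},
    "士": {"十", "一", "丨"},
    "工": {"丁", "一", "丨"},
    "口": {"口"},
}


def inclusion_ratio(a, b):
    """Assumed definition: shared elements divided by the smaller set's size."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def similar_elements(element_sets, threshold=0.6):
    """For each kanji, collect the other kanji whose element sets overlap enough."""
    similar = {k: set() for k in element_sets}
    for k1, s1 in element_sets.items():
        for k2, s2 in element_sets.items():
            if k1 != k2 and inclusion_ratio(s1, s2) >= threshold:
                similar[k1].add(k2)
    return similar


SIMILAR = similar_elements(ELEMENT_SETS)
```

Dividing by the smaller set means a one-element set is fully included in any kanji that contains that element, so a real system would probably need a minimum set size or a stricter measure. The description also treats "十" and "丁" as similar elements in their own right, which this sketch does not model.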
JP2002082079A 2002-02-18 2002-02-18 Similar kanji retrieval method Pending JP2003242144A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2002082079A JP2003242144A (en) 2002-02-18 2002-02-18 Similar kanji retrieval method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2002082079A JP2003242144A (en) 2002-02-18 2002-02-18 Similar kanji retrieval method

Publications (1)

Publication Number Publication Date
JP2003242144A true JP2003242144A (en) 2003-08-29

Family

ID=27785389

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2002082079A Pending JP2003242144A (en) 2002-02-18 2002-02-18 Similar kanji retrieval method

Country Status (1)

Country Link
JP (1) JP2003242144A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101310004B1 (en) 2011-04-08 2013-09-24 샤프 가부시키가이샤 Scanning signal line drive circuit and display device equipped with same

Similar Documents

Publication Publication Date Title
US20110316796A1 (en) Information Search Apparatus and Information Search Method
JP3220886B2 (en) Document search method and apparatus
CN108073576A (en) Intelligent search method, searcher and search engine system
CN102844755A (en) Method of extracting named entity
JP2005135113A (en) Electronic equipment, related word extracting method, and program
JP2018181148A (en) Information output program, information output method, and information processing apparatus
JP6363547B2 (en) Information processing apparatus and sentence imaging program
JP2010092357A (en) Facility-related information retrieval method and facility-related information retrieval system
JP2010272075A (en) Emotional information extraction device, emotion retrieval device, method thereof, and program
JP6811087B2 (en) Search device, search method, and program
JP2003242144A (en) Similar kanji retrieval method
CN115438048A (en) Table searching method, device, equipment and storage medium
JP5057516B2 (en) Document distance calculation device and program
JP6676698B2 (en) Information retrieval method and apparatus using relevance between reserved words and attribute language
JP2002318812A (en) Similar image retrieval device, similar image retrieval method and similar image retrieval program
TW201822031A (en) Method of creating chart index with text information and its computer program product capable of generating a virtual chart message catalog and schema index information to facilitate data searching
JP3862059B2 (en) Search expression expansion method and search system
JP5035848B2 (en) Item determination apparatus, item determination program, recording medium, and item determination method
JP3875510B2 (en) Information retrieval apparatus, method thereof, program thereof, and recording medium on which program is recorded
JP3233803B2 (en) Hard-to-read kanji search device
KR102500725B1 (en) Electronic apparatus that generates a summary of an electronic document based on key keywords and operating method thereof
KR100952077B1 (en) Apparatus and method for choosing entry using keywords
RU2679967C1 (en) Information by the keywords searching device
JP2009271772A (en) Text mining method, text mining apparatus and text mining program
Ni et al. Handwriting input system of chinese guqin notation