JP2015179385A - Material search device, material search system, material search method, and program

Info

Publication number: JP2015179385A
Application number: JP2014056283A
Authority: JP
Inventors: Naoyuki Ito (伊藤 直之); Shigeharu Togashi (富樫 茂春)
Original assignee: Dai Nippon Printing Co Ltd
Current assignee: Dai Nippon Printing Co Ltd
Priority date: 2014-03-19
Filing date: 2014-03-19
Publication date: 2015-10-08
Anticipated expiration: 2034-03-19
Also published as: JP6303669B2

Abstract

PROBLEM TO BE SOLVED: To provide a material search device and the like that search for books and materials matching a user's interests on the basis of a document image containing comments. SOLUTION: A material search device 1 stores in advance a feature word database 15 relating to the appearance frequency of feature words contained in the materials to be searched. A document input part 11 takes in an image of an input document 33, brought in by the user, that contains comments. A character recognition/comment extraction part 12 converts the printed characters and handwritten characters 39 of the input document 33 into text data and extracts the type and position of each comment mark 37. A feature word extraction part 13 extracts the feature words contained in the text data. A feature word weighting part 14 counts the appearance frequency of each feature word, weighting it according to the type of comment mark 37, and creates feature data 47. A relevance computation part 16 computes the relevance between each search index 31 and the feature data 47, and a search result display part 17 presents highly relevant materials to the user.

Description

The present invention relates to a technique for searching for materials, and more particularly to a material search technique for finding books and materials that are highly relevant to a material presented by a user.

Conventionally, there are techniques that let a user seeking knowledge search for related books by entering a keyword, or search for books containing a keyword by selecting that keyword in displayed text.

For example, Patent Document 1 proposes a system that efficiently presents books containing a search keyword entered by a user at a terminal, in descending order of importance.

Patent Document 1: JP 2013-206388 A

However, the technique of Patent Document 1 requires the user to enter keywords, which is cumbersome for the user. In addition, depending on the combination of keywords entered, the number of books may not be narrowed down, or appropriate books may not be presented.

The present invention has been made in view of the above problems. Its object is to provide a material search device and the like that can present books and materials that relate to the content of a document, containing writing, brought in by a user and that match the user's interests.

To achieve the above object, a first invention is a material search device that searches for materials based on their degree of relevance to feature word data, comprising: text extraction means for performing character recognition processing on a document image containing writing and extracting text data; writing extraction means for extracting the type and position of the writing; storage means for storing a search index that includes first feature words of search target materials and their importance; feature word extraction means for extracting second feature words from the text data; feature data creation means for calculating the importance of the second feature words using the type and position of the writing and creating the feature word data of the text data; and relevance calculation means for calculating the degree of relevance between the search index and the feature word data.

According to the first invention, for search target materials consisting of books and materials held by a library or the like, a search index consisting of the importance of the first feature words contained in each search target material is stored in advance in the storage means. The degree of relevance between this index and the second feature words contained in a scanned document that includes the user's writing is then obtained, so that search target materials with a high degree of relevance can be presented.

At that time, the writing extraction means extracts the position and type of the writing made on the document, and the feature data creation means weights the importance of the second feature words at the position of the writing according to the type of writing, so that search target materials matching the user's interests can be presented.
The types of writing are, for example, underlines, marker highlights, enclosures, X marks, and handwritten characters. Providing multiple types of writing lets the user easily express the presence or absence of interest.

The writing extraction means performs character recognition processing and adds the recognition result to the text data.
This makes it possible to add the user's handwritten notes to the text data and to use the second feature words in the notes for the search.

It is desirable that the feature data creation means change the importance of the corresponding second feature words according to the type of writing.
This makes it possible to create feature word data that reflects the user's interests by raising or lowering the importance depending on the type of writing, and thus to present more suitable materials.

It is also desirable that the feature data creation means delete the corresponding second feature words according to the type of writing.
This makes it possible to drop words from the second feature words depending on the type of writing, and thus to present materials that better match the user's interests.

It is desirable to further include index creation means for creating the search index for the search target materials.
This makes it possible to keep the search index updated as new search target materials are added.

It is desirable to further include image reading means for reading the document image.
By having a document brought in by the user read with, for example, a scanner, materials suited to the content of the document can be presented without the user entering search keywords or the like, which reduces the burden on the user.
In addition, for example, a document image that the user has photographed with the camera function of a mobile terminal or the like can be taken into the material search device by the image reading means via a network such as the Internet, so that materials suited to the content of the document can be presented.

As described above, according to the first invention, the user can search for materials suited to the content of a document simply by having the document taken into the material search device as an image, without entering search keywords into the device, which reduces the burden on the user.
Furthermore, by changing the importance of the document's feature words according to the user's writing, materials that better fit the user's interests can be found.

A second invention is a material search system that searches for materials based on their degree of relevance to feature word data, comprising: an image reading device that reads a document image containing writing and transmits the read image; an extraction device that includes text extraction means for performing character recognition processing on the document image and extracting text data, and writing extraction means for extracting the type and position of the writing, and that transmits the extracted data; and a server that includes storage means for storing a search index that includes first feature words of search target materials and their importance, feature word extraction means for extracting second feature words from the text data, feature data creation means for calculating the importance of the second feature words using the type and position of the writing and creating the feature word data of the text data, and relevance calculation means for calculating the degree of relevance between the search index and the feature word data.

According to the second invention, materials suited to the content of a document can be presented simply by having the user's document image read, without the user entering search keywords into the system, which reduces the burden on the user.
Furthermore, by changing the importance of the document's feature words according to the user's writing, materials that better fit the user's interests can be found.

A third invention is a material search method performed by a material search device that searches for materials based on their degree of relevance to feature word data, the method including: a text extraction step of performing character recognition processing on a document image containing writing and extracting text data; a writing extraction step of extracting the type and position of the writing; a storage step of storing a search index that includes first feature words of search target materials and their importance; a feature word extraction step of extracting second feature words from the text data; a feature data creation step of calculating the importance of the second feature words using the type and position of the writing and creating the feature word data of the text data; and a relevance calculation step of calculating the degree of relevance between the search index and the feature word data.

A fourth invention is a program for causing a computer to function as a material search device that searches for materials based on their degree of relevance to feature word data, the program causing the computer to function as: text extraction means for performing character recognition processing on a document image containing writing and extracting text data; writing extraction means for extracting the type and position of the writing; storage means for storing a search index that includes first feature words of search target materials and their importance; feature word extraction means for extracting second feature words from the text data; feature data creation means for calculating the importance of the second feature words using the type and position of the writing and creating the feature word data of the text data; and relevance calculation means for calculating the degree of relevance between the search index and the feature word data.

By installing the program according to the fourth invention on a general-purpose computer, the material search device according to the first invention can be obtained and the material search method according to the third invention can be executed.

The material search device and the like of the present invention make it possible to present books and materials related to the content of a document, containing writing, that the user brings in.

FIG. 1 is a block diagram showing the hardware configuration of the material search device 1 according to the present embodiment.
FIG. 2 is a block diagram showing the functional configuration of the material search device 1 according to the present embodiment.
FIG. 3 is a diagram showing a configuration example of the feature word database 15 of search target materials.
FIG. 4 is a diagram showing an example of the input document 33.
FIG. 5 is a flowchart showing the flow of processing of the material search device 1 according to the present embodiment.
FIG. 6 is a diagram showing an example of the writing mark data 41.
FIG. 7 is a diagram showing an example of the feature words extracted from the input document 33.
FIG. 8 is a diagram showing an example of the weighting factors 45 for writing marks.
FIG. 9 is a diagram showing an example of the feature data of the input document 33.
FIG. 10 is a diagram explaining the degree of relevance between the feature word data and the search indexes of the search target materials.
FIG. 11 is a diagram showing an example of the search result output screen 55.
FIG. 12 is a diagram showing a system configuration example of the material search system 10.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

FIG. 1 is a block diagram showing an example of the hardware configuration of the material search device 1. As shown in FIG. 1, the material search device 1 can be implemented as a computer system. The configuration in FIG. 1 is only an example, and various configurations can be adopted depending on the application and purpose.

The material search device 1 is configured, for example, by connecting a control unit 21, a storage unit 22, a media input/output unit 23, a communication control unit 24, an input unit 25, a display unit 26, a peripheral device I/F (interface) unit 27, and the like via a bus 28.

The control unit 21 includes a CPU (Central Processing Unit), a ROM (Read Only Memory), a RAM (Random Access Memory), and the like.

The CPU loads programs stored in the ROM, the storage unit 22, and the like into a work memory area on the RAM and executes them, drives and controls each device connected via the bus 28, and thereby realizes the processing performed by the computer.
The ROM is a non-volatile memory that permanently holds the computer's boot program, programs such as the BIOS, data, and the like.
The RAM is a volatile memory that temporarily holds programs, data, and the like loaded from the storage unit 22, the ROM, a storage medium, or the like, and provides a work area that the control unit 21 uses to perform various kinds of processing.

The storage unit 22 stores the programs executed by the control unit 21, the data required to execute them, an OS (Operating System), and the like. The storage unit 22 also stores the feature word database 15 of search target materials and the dictionary database 18, both described later, as well as the programs and data of the material search device 1 of the present embodiment.

The media input/output unit 23 (drive device) inputs and outputs data and has media input/output devices such as a CD drive (-ROM, -R, -RW, etc.) and a DVD drive (-ROM, -R, -RW, etc.).
The communication control unit 24 has a communication control device, a communication port, and the like, and controls communication with other devices via a network. The network may be wired or wireless.

The input unit 25 is used to input data and has input devices such as a keyboard, pointing devices such as a mouse and a touch panel, and a numeric keypad.
The display unit 26 is a display device such as a CRT monitor or a liquid crystal panel, and displays the search results and the like of the material search device 1 of the present embodiment.

The peripheral device I/F (interface) unit 27 is a port for connecting peripheral devices and is implemented with USB, IEEE 1394, RS-232C, or the like. The connection may be wired or wireless.
For example, a scanner can be connected via the peripheral device I/F unit 27 to capture image data of a document brought in by the user.
The bus 28 is a path that mediates the exchange of control signals, data signals, and the like between the devices.

The material search device 1 may additionally include a camera (not shown) for inputting image data, and a scanner (not shown) may be connected to the peripheral device I/F (interface) unit 27.

FIG. 2 is a block diagram showing an example of the functional configuration of the material search device 1 according to the embodiment of the present invention.
The material search device 1 includes a document input unit 11, a character recognition/writing extraction unit 12, a feature word extraction unit 13, a feature word weighting unit 14, a feature word database 15 of search target materials, a relevance calculation unit 16, a search result display unit 17, a dictionary database 18, and the like.

The feature word database 15 of search target materials is a database that stores search indexes used by the material search device 1 of the present embodiment to search, for example, the books and materials held by a library at a university or the like.
As described in detail later, a search index is data on the importance of the words (feature words) that characterize each book or material. It is created in advance by extracting feature words from the bibliographic data or full text of each book or material using the dictionary database 18 and determining their importance.

The dictionary database 18 is, for example, a database that stores the headwords recorded in several dictionaries, and is used to extract feature words. The headwords stored in the dictionary database 18 may be nouns only, but other parts of speech (verbs, adjectives, etc.) may also be stored and used.

The document input unit 11 can be implemented with, for example, a scanner or a camera.
The document input unit 11 takes in a document brought in by the user as image data.

The document is, for example, a syllabus or handout for a class at a university or the like, related material, lecture notes, a page from a book, or a newspaper or magazine article. It is printed matter consisting of printed characters, but it may also contain writing added by hand by the user.
The writing is, for example, an underline, marker highlighting, an enclosure, handwritten characters, or an X mark for excluding unneeded portions.

The character recognition/writing extraction unit 12 performs character recognition processing on the image data of the document taken into the material search device 1 by the document input unit 11, converts it into text data, and extracts the type and position of the writing added by hand by the user.
When the writing consists of handwritten characters, character recognition processing is also performed on those characters, and they too are converted into text data.

The feature word extraction unit 13 extracts feature words from the text data produced by the character recognition/writing extraction unit 12, with reference to the dictionary database 18.
Feature words are, for example, nouns and unknown words that are not in the dictionary database 18.

The feature word extraction unit 13 first performs morphological analysis on the text data, looks up each noun in the dictionary database 18, and extracts the matching words (for example, "history", "women", "feminism", and "Japan") as feature words. A word that the morphological analysis identifies as a noun but that has no match in the dictionary database 18 (for example, "Abenomics") is added to the feature words as an unknown word.
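The extraction rule described above can be sketched as follows. This is only an illustrative sketch: it assumes that morphological analysis has already been done by some upstream analyzer and that its output is a list of (surface form, part of speech) pairs. The names `extract_feature_words`, `Token`, and the part-of-speech label "noun" are hypothetical and do not come from the specification.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical token format: (surface form, part of speech) as produced
# by an upstream morphological analyzer that is not shown here.
Token = Tuple[str, str]


def extract_feature_words(
    tokens: Iterable[Token], dictionary: Set[str]
) -> List[Tuple[str, bool]]:
    """Return (word, is_unknown) for every noun token, in document order.

    A noun found in the dictionary is a regular feature word; a noun that
    is missing from the dictionary is kept as an unknown-word feature.
    """
    features = []
    for surface, pos in tokens:
        if pos != "noun":
            continue  # only nouns are considered in this sketch
        features.append((surface, surface not in dictionary))
    return features


# Example input, with headwords standing in for the dictionary database 18.
dictionary = {"history", "women", "feminism", "Japan"}
tokens = [
    ("Japan", "noun"),
    ("of", "particle"),
    ("Abenomics", "noun"),
    ("discuss", "verb"),
    ("history", "noun"),
]
feature_words = extract_feature_words(tokens, dictionary)
```

Keeping the unknown-word flag alongside each word is a design choice of this sketch; the specification only requires that unknown nouns be added to the feature words.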

Next, the feature word weighting unit 14 determines the importance of each feature word extracted by the feature word extraction unit 13 and weights that importance according to the content of the user's writing.
Various importance calculation methods can be used, for example the appearance frequency of the feature word or its TF-IDF (Term Frequency-Inverse Document Frequency) value.

When appearance frequency is used as the importance, writing such as an underline, marker highlighting, an enclosure, or handwritten characters is taken to mean that the user marked the portion as important or as being of interest. The feature words contained in that portion are therefore counted with a positive weighting coefficient, for example a factor of 2.
On the other hand, for a portion of the document marked with an X, the feature words contained in that portion may be left uncounted.
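The weighted counting described above can be illustrated with a short sketch. It assumes that each occurrence of a feature word has already been paired with the type of writing mark covering it, or with `None` when it is unmarked. The factor of 2 is the example given in the text; applying that same factor to every positive mark type, and the names `MARK_WEIGHTS`, `EXCLUDING_MARKS`, and `build_feature_data`, are assumptions made only for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple

# Illustrative weights: the text gives "2x" as an example for marks that
# signal interest; using 2.0 for every such mark is an assumption.
MARK_WEIGHTS = {
    "marker": 2.0,
    "underline": 2.0,
    "enclosure": 2.0,
    "handwritten": 2.0,
}
# Marks whose covered feature words are not counted at all.
EXCLUDING_MARKS = {"cross"}


def build_feature_data(
    occurrences: Iterable[Tuple[str, Optional[str]]]
) -> Dict[str, float]:
    """Sum weighted occurrences of each feature word.

    `occurrences` yields (feature word, mark type or None) pairs.
    """
    data: Counter = Counter()
    for word, mark in occurrences:
        if mark in EXCLUDING_MARKS:
            continue  # X-marked portions are skipped
        # Unmarked words and unrecognized mark types count once.
        data[word] += MARK_WEIGHTS.get(mark, 1.0)
    return dict(data)


occurrences = [
    ("women", None),
    ("women", "marker"),
    ("history", "underline"),
    ("sports", "cross"),
    ("Japan", None),
]
feature_data = build_feature_data(occurrences)
```

In this example "women" is counted once unmarked and once under a marker, so it receives 1 + 2 = 3, while "sports" is dropped because it appears only under an X mark.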

Because the feature word weighting unit 14 weights the importance of the feature words in the portions with writing, an appropriate material search matching the user's interests becomes possible.

The relevance calculation unit 16 calculates the degree of relevance between the feature data for the user's document (the feature words and their importance) obtained by the feature word weighting unit 14 and each search target material stored in the feature word database 15 of search target materials.

The search result display unit 17 displays the titles and other details of the search target materials with a high degree of relevance, based on the degrees of relevance calculated by the relevance calculation unit 16.
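The relevance calculation and the selection of highly relevant materials can be sketched as follows. The passage above does not fix a particular relevance measure, so cosine similarity between the two word-to-importance mappings is used here purely as an assumed stand-in. The function names `relevance` and `rank_materials` and all importance values are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def relevance(feature_data: Dict[str, float], index: Dict[str, float]) -> float:
    """Cosine similarity between two sparse word -> importance vectors.

    Cosine similarity is an assumption of this sketch, not a measure
    prescribed by the text.
    """
    dot = sum(w * index.get(word, 0.0) for word, w in feature_data.items())
    norm_q = math.sqrt(sum(w * w for w in feature_data.values()))
    norm_d = math.sqrt(sum(w * w for w in index.values()))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0  # an empty vector is treated as unrelated
    return dot / (norm_q * norm_d)


def rank_materials(
    feature_data: Dict[str, float],
    indexes: Dict[int, Dict[str, float]],
    top_n: int = 3,
) -> List[Tuple[int, float]]:
    """Return the top_n (material ID, relevance) pairs, most relevant first."""
    scored = [(mid, relevance(feature_data, idx)) for mid, idx in indexes.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical feature data and search indexes keyed by material ID.
query = {"women": 3.0, "history": 2.0}
indexes = {
    1: {"women": 0.8, "occupation": 0.5},
    2: {"sports": 0.9},
    3: {"women": 3.0, "history": 2.0},
}
ranked = rank_materials(query, indexes, top_n=2)
```

Material 3 has the same feature words and proportions as the query and ranks first, material 1 shares one feature word and ranks second, and material 2 shares none and is cut off by `top_n`.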

Next, the feature word database 15 of search target materials will be described with reference to FIG. 3.
FIG. 3 is a diagram showing a configuration example of the feature word database 15 of search target materials.

The feature word database 15 of search target materials consists of a search index 31 for each book or material to be searched. For P search target materials, P search indexes 31-1 to 31-P are stored in the feature word database 15.
Whenever the holdings of the library or the like grow, a search index 31 is created and added, and the feature word database 15 is updated.

A search index 31 consists of, for example, a search target material ID and the headwords of the feature words together with their importance.
As shown in FIG. 3, for example, the search index 31-1 for search target material ID "1" consists of feature words such as "women", "occupation", "culture", and "Japan" and their importance.
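One possible in-memory layout for the search indexes and the database that holds them is sketched below. The class names `SearchIndex` and `FeatureWordDatabase`, their methods, and the numeric importance values are all hypothetical; FIG. 3 specifies only that each index holds a material ID and feature words with their importance.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class SearchIndex:
    """One search index 31: a material ID plus feature word -> importance."""

    material_id: int
    importance: Dict[str, float] = field(default_factory=dict)


class FeatureWordDatabase:
    """Minimal stand-in for the feature word database 15."""

    def __init__(self) -> None:
        self._indexes: Dict[int, SearchIndex] = {}

    def add(self, index: SearchIndex) -> None:
        # Adding an index for an existing ID replaces the old one, which
        # models the database being updated as holdings change.
        self._indexes[index.material_id] = index

    def get(self, material_id: int) -> Optional[SearchIndex]:
        return self._indexes.get(material_id)

    def __len__(self) -> int:
        return len(self._indexes)


db = FeatureWordDatabase()
# The importance values below are made up for illustration.
db.add(SearchIndex(1, {"women": 0.42, "occupation": 0.31, "culture": 0.12, "Japan": 0.08}))
db.add(SearchIndex(2, {"history": 0.55, "Japan": 0.20}))
```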

The importance of a feature word in the search index 31 is based on the frequency with which the feature word appears in the bibliographic data or full text of the search target material, but it is desirable that it be weighted by, for example, the TF-IDF method.

The TF-IDF method is a well-known technique, so a detailed description is omitted. In brief, it derives an appearance frequency from two values: a value (TF) relating to the proportion of a specific word's occurrences among the occurrences of all words in a specific document, and a value (IDF) relating to the proportion of documents containing that word among all documents. With the TF-IDF method, the score of words that appear frequently in every document, such as "this" and "that", is suppressed, while the score of words that appear frequently only in specific documents increases.
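As a concrete illustration, one common TF-IDF variant is sketched below: TF is a word's count divided by the total word count of the document, and IDF is the natural logarithm of the number of documents divided by the number of documents containing the word. The text does not state which variant is intended, so this particular formula, the function name `tf_idf`, and the sample documents are assumptions.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(documents: Dict[int, List[str]]) -> Dict[int, Dict[str, float]]:
    """Compute TF-IDF scores per document.

    TF  = occurrences of the word / total words in the document
    IDF = ln(number of documents / documents containing the word)
    """
    n_docs = len(documents)
    doc_freq: Counter = Counter()
    for words in documents.values():
        doc_freq.update(set(words))  # count each word once per document

    scores: Dict[int, Dict[str, float]] = {}
    for doc_id, words in documents.items():
        counts = Counter(words)
        total = len(words)
        scores[doc_id] = {
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
    return scores


# Two tiny hypothetical documents, given as lists of words.
docs = {
    1: ["this", "women", "women", "occupation"],
    2: ["this", "history"],
}
scores = tf_idf(docs)
```

Here "this" appears in both documents, so its IDF is ln(2/2) = 0 and its score vanishes, whereas "women" appears twice in document 1 only and receives the highest score in that document.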

FIG. 4 is a diagram showing an example of the input document 33 that the user inputs to the material search device 1.
The input document 33 contains articles printed in printed characters 35, together with writing marks 37 and handwritten characters 39 added by hand by the user.

In the example input document 33 in FIG. 4, articles A to D are printed, and the user has added writing marks 37a to 37d and handwritten characters 39.
The writing marks 37 are, for example, marker highlighting 37a (writing mark a), an underline 37b (writing mark b), an enclosure 37c (writing mark c), and an X mark 37d (writing mark d).

The marking 37a, the underline 37b, and the enclosure 37c are written on portions that the user considers important, so the feature words contained in those portions should be weighted so that their importance increases.
The X mark 37d is written on a portion that the user considers unnecessary, so the feature words contained in that portion should be excluded.

The handwritten characters 39 are regarded as a sentence or passage that the user considers important. The feature words they contain are therefore important and are weighted so that their importance increases.

Next, the flow of processing of the material search device 1 according to the present embodiment will be described.
FIG. 5 is a flowchart showing the flow of processing of the material search device 1.

First, the control unit 21 of the material search device 1 takes in an image of the input document 33 (step 101).
For example, the image of the input document 33 is read by a scanner connected to the peripheral device I/F unit 27 and stored in the storage unit 22.

The input document 33 need not be captured with a scanner. For example, the user may photograph the input document 33 with the camera of a mobile terminal or the like and send the image to the material search device 1 over a network; the device then receives the image via the communication control unit 24 and stores it in the storage unit 22.

Next, the control unit 21 performs character recognition processing on the captured image data and converts the printed characters 35 and the handwritten characters 39 of the input document 33 into text data (step 102).
For the character recognition processing, a known technique such as OCR (Optical Character Recognition) may be used.
Text data obtained by character recognition of handwritten characters should be given a flag indicating that it originates from handwriting.

Next, the control unit 21 extracts the writing marks 37 written by the user and determines their types and positions (step 103).
The extraction from the image of the input document 33 assumes a fixed set of writing types (marking 37a, underline 37b, enclosure 37c, X mark 37d).

For example, for the marking 37a, a substantially rectangular shape overlapping a text portion is extracted.
For the underline 37b, a substantially straight line that does not overlap the text is extracted.
For the enclosure 37c, a closed curve that is not part of a character is extracted.
For the X mark 37d, two diagonal straight lines that cross each other are extracted.
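The shape cues listed above could be combined into a simple rule-based classifier. The following is only a minimal sketch under stated assumptions, not the method of the embodiment: the `Shape` record and all of its fields are hypothetical descriptors assumed to be produced by an earlier image-analysis stage, and the order in which the rules are tested is an illustrative choice.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Shape:
    """Hypothetical descriptor of one non-character shape found in the page image."""
    overlaps_text: bool        # shape region overlaps recognized text
    is_closed: bool            # outline forms a closed curve
    roughly_rectangular: bool  # approximately rectangular region
    roughly_straight: bool     # single approximately straight stroke
    crossing_diagonals: bool   # two diagonal strokes that intersect


def classify_mark(shape: Shape) -> Optional[str]:
    """Map shape cues to a writing mark type 'a', 'b', 'c', 'd', or None."""
    if shape.crossing_diagonals:
        return "d"  # X mark 37d
    if shape.roughly_rectangular and shape.overlaps_text:
        return "a"  # marking 37a
    if shape.roughly_straight and not shape.overlaps_text:
        return "b"  # underline 37b
    if shape.is_closed:
        return "c"  # enclosure 37c
    return None     # not one of the assumed writing types
```

The X mark is tested first in this sketch because it usually lies on top of text and could otherwise be confused with the other types.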

The type and position data of the writing marks 37 extracted in step 103 are stored in the storage unit 22 as writing mark data 41.
FIG. 6 is a diagram illustrating a configuration example of the writing mark data 41.
The writing mark data 41 consists of an input document ID, which is the identification number of the input document 33; a mark No., which identifies each writing contained in that input document 33; a writing mark ID, which indicates the type of the writing mark 37; the position data of the writing mark; and the like.

The position data are, for example, two-dimensional coordinates whose origin is the upper-left corner of the input document 33.
The position data can be the coordinates of diagonally opposite vertices of the approximate rectangle for the marking 37a; the coordinates of both ends of the line for the underline 37b; (minimum x coordinate, minimum y coordinate) and (maximum x coordinate, maximum y coordinate) of the closed curve for the enclosure 37c; and (minimum x coordinate, minimum y coordinate) and (maximum x coordinate, maximum y coordinate) of the two lines for the X mark 37d.
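One way to picture a row of the writing mark data 41 is as a small record holding two points. The sketch below is illustrative only: the field names, the document ID format, the `covers` helper, and its `line_height` parameter are assumptions introduced here, since the description does not specify how a word is judged to lie at the position of a mark.

```python
from dataclasses import dataclass
from typing import Tuple

Point = Tuple[float, float]               # (x, y), origin at the top-left of the page
Box = Tuple[float, float, float, float]   # (min_x, min_y, max_x, max_y)


@dataclass(frozen=True)
class WritingMark:
    """One row of the writing mark data 41 (field names are hypothetical)."""
    input_document_id: str
    mark_no: int
    mark_id: str   # 'a' marking, 'b' underline, 'c' enclosure, 'd' X mark
    p1: Point      # first corner, line end, or (min x, min y)
    p2: Point      # opposite corner, other line end, or (max x, max y)

    def bounding_box(self) -> Box:
        """Return (min_x, min_y, max_x, max_y) regardless of point order."""
        (x1, y1), (x2, y2) = self.p1, self.p2
        return (min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2))


def covers(mark: WritingMark, word_box: Box, line_height: float = 0.0) -> bool:
    """True if the word's box intersects the mark's region.

    For an underline ('b') the region is extended upward by `line_height`,
    an assumed parameter, so that it reaches the text above the line.
    """
    mx1, my1, mx2, my2 = mark.bounding_box()
    if mark.mark_id == "b":
        my1 -= line_height
    wx1, wy1, wx2, wy2 = word_box
    return mx1 <= wx2 and wx1 <= mx2 and my1 <= wy2 and wy1 <= my2
```

Because the y axis points downward from the top-left origin, extending an underline's region upward means subtracting from its minimum y coordinate.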

Next, the control unit 21 performs morphological analysis on the text data (step 104).
That is, the text data is divided into meaningful words, and the part of speech of each word is identified using the dictionary database 18.

Next, the control unit 21 extracts, as feature words, the nouns among the morphologically analyzed words and the unknown words that cannot be found in the dictionary database 18 (step 105).
FIG. 7 is a diagram illustrating an example of the feature words 43 extracted from the text data of the input document 33.
As shown in FIG. 7, the feature words contained in the text data are extracted.
Feature words extracted from the text data of the handwritten characters 39 are given a flag to that effect.
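The selection rule of step 105 (keep nouns and unknown words, and carry the handwriting flag along) can be sketched as follows. This is a simplified illustration: it assumes the text has already been segmented into tokens by a morphological analyzer, and it models the dictionary database 18 as a plain mapping from surface form to part of speech. The token layout and the sample words are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FeatureWord:
    surface: str
    handwritten: bool  # True if the token came from the handwritten characters 39


def extract_feature_words(tokens: List[dict], dictionary: Dict[str, str]) -> List[FeatureWord]:
    """Keep nouns and unknown words (tokens absent from the dictionary)."""
    result = []
    for tok in tokens:
        pos = dictionary.get(tok["surface"])  # None means the word is unknown
        if pos is None or pos == "noun":
            result.append(FeatureWord(tok["surface"], tok.get("handwritten", False)))
    return result


# Toy data, invented for illustration only.
toy_dictionary = {"the": "determiner", "printing": "noun", "is": "verb"}
toy_tokens = [
    {"surface": "the"},
    {"surface": "printing"},
    {"surface": "is"},
    {"surface": "xyzzy", "handwritten": True},  # not in the dictionary: unknown word
]
feature_words = extract_feature_words(toy_tokens, toy_dictionary)
```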

Next, the control unit 21 counts the weighted importance of each feature word and creates the feature data of the input document 33 (step 106).
The weighted importance is obtained by counting the appearance frequency of each feature word extracted from the text data, weighted on the basis of the writing mark data 41 shown in FIG. 6 and the weighting magnification 45 shown in FIG. 8.

FIG. 8 is a diagram illustrating an example of the weighting magnification 45.
For example, when the writing mark 37 is the marking 37a or the underline 37b, each appearance of a feature word located at the position of the marking 37a or the underline 37b is counted as 2.0 appearances; when the writing mark 37 is the enclosure 37c, each appearance of a feature word located inside the enclosure is counted as 1.7 appearances.
When the writing mark 37 is the X mark 37d, the weighting magnification of the feature words within the range of the X mark is set to 0 so that they are not counted.
Furthermore, for the handwritten characters 39, the appearance frequency of feature words flagged as handwritten is counted with a weight of, for example, 2.5.
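The weighted counting can be illustrated with a short sketch that uses the example magnifications quoted above (2.0, 2.0, 1.7, 0, and 2.5). Several points are assumptions made for this sketch and are not stated in the description: an unmarked occurrence counts as 1.0, each occurrence is covered by at most one mark, the handwriting flag takes precedence, and words whose total is 0 are dropped from the result.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

# Example magnifications from the description of FIG. 8.
WEIGHTS = {"a": 2.0, "b": 2.0, "c": 1.7, "d": 0.0}
HANDWRITTEN_WEIGHT = 2.5


def occurrence_weight(mark_id: Optional[str], handwritten: bool) -> float:
    """Weight contributed by a single occurrence of a feature word."""
    if handwritten:
        return HANDWRITTEN_WEIGHT
    if mark_id is None:
        return 1.0  # assumption: unmarked text counts once
    return WEIGHTS[mark_id]


def weighted_frequencies(
    occurrences: Iterable[Tuple[str, Optional[str], bool]]
) -> Dict[str, float]:
    """Sum per-occurrence weights; words whose total is 0 are dropped."""
    totals: Dict[str, float] = defaultdict(float)
    for word, mark_id, handwritten in occurrences:
        totals[word] += occurrence_weight(mark_id, handwritten)
    return {word: total for word, total in totals.items() if total > 0}


# Hypothetical occurrences: (feature word, covering mark type or None, handwritten flag).
sample = [
    ("printing", None, False),
    ("printing", "a", False),
    ("ink", "c", False),
    ("paper", "d", False),
    ("search", None, True),
]
freqs = weighted_frequencies(sample)
```

In this sample, "printing" appears twice but receives a weighted frequency of 3.0 (1.0 + 2.0), while "paper", which appears only under an X mark, is excluded.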

FIG. 9 is a diagram illustrating an example of the feature data 47 created in step 106. It shows the case where the appearance frequency is used as the importance.
The feature data 47 of the input document 33 consists of feature words and their weighted appearance frequencies (weighted importance).
Because of the weighting, the weighted appearance frequency differs from the actual appearance frequency of the feature word in the input document 33 (the value in parentheses); it is either increased or decreased.

As described above, the feature data 47 in the material search device 1 of this embodiment is weighted according to the user's writing marks 37 and handwritten characters 39. It therefore represents the user's interests and the feature words the user considers important more accurately, which enables a more accurate material search.

Next, the control unit 21 calculates the relevance between the feature data 47 of the input document 33 and the search index 31 of each search target material in the feature word database 15 (step 107).
The relevance may be calculated using, for example, cosine similarity, which is a known technique.

FIG. 10 is a diagram illustrating the relevance between the feature data 47 and the search index 31 of a search target material.
The cosine similarity is determined by the angle θ formed by the vector 51 of the search index 31 and the vector 53 of the feature data 47: it is the cosine of θ, so the smaller the angle θ, the higher the similarity, that is, the higher the relevance between the two vectors.

For simplicity, FIG. 10 uses three-dimensional vectors for three feature words as an example; in practice, the elements of the vectors 51 and 53 are the (weighted) importances of the many feature words contained in the search target material and in the input document 33, respectively.
The angle θ indicating the relevance is obtained from the inner product of the vector 51 of the search index 31 and the vector 53 of the feature data 47 of the input document 33; dividing the inner product by the product of the two vector lengths gives cos θ.
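As a concrete illustration of this calculation, the standard cosine similarity cos θ = (u · v) / (|u| |v|) can be computed over sparse vectors keyed by feature word. The sketch below is not taken from the embodiment: the dictionary representation and the convention of returning 0.0 for an empty vector are assumptions made here.

```python
import math
from typing import Dict


def cosine_similarity(u: Dict[str, float], v: Dict[str, float]) -> float:
    """cos(theta) = (u . v) / (|u| |v|) for sparse word -> importance vectors."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_u * norm_v)
```

With non-negative importances the result lies between 0 (no feature word in common) and 1 (the vectors point in the same direction), so a larger value corresponds to a smaller angle θ.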

Next, the control unit 21 compares the relevance values calculated in step 107, looks up the material names and the like from the identification numbers of the search target materials with high relevance, and displays them on the display unit 26 (step 108).
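Step 108 amounts to sorting the materials by relevance and resolving their identification numbers to names. A minimal sketch follows; the function name, the cut-off `k`, and the sample identifiers are hypothetical, since the description does not say how many results are displayed.

```python
from typing import Dict, List, Tuple


def top_materials(
    relevance: Dict[str, float], names: Dict[str, str], k: int = 5
) -> List[Tuple[str, float]]:
    """Return (material name, relevance) for the k most relevant materials."""
    ranked = sorted(relevance.items(), key=lambda item: item[1], reverse=True)
    return [(names[material_id], score) for material_id, score in ranked[:k]]


# Invented identifiers and scores, for illustration only.
scores = {"M1": 0.2, "M2": 0.9, "M3": 0.5}
titles = {"M1": "Book A", "M2": "Book B", "M3": "Book C"}
results = top_materials(scores, titles, k=2)
```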

FIG. 11 is a diagram illustrating an example of the search result output screen 55.
The display unit 26 of the material search device 1 displays the names and other details of the documents, books, and materials that are highly relevant to the input document 33 presented by the user.

As described above, the material search device 1 according to this embodiment reads the document 33 brought by the user with a scanner, and can thereby search for and display search target materials that are highly relevant to that document, using the weighted appearance frequency of feature words as the measure.

In addition, when the user writes writing marks 37 or handwritten characters 39 on the document, the material search device 1 according to this embodiment weights the appearance frequency of the feature words in those portions according to the type of the writing mark 37, which makes it possible to retrieve search target materials that better match the user's interests.

In the above description, the material search device 1 according to this embodiment has been described as a single device; however, it may instead be configured as a material search system 10 consisting of an image reading device 110 such as a scanner, an extraction device 120 such as an OCR device, and a server 130 that performs the search processing.

FIG. 12 is a diagram illustrating a system configuration example of the material search system 10.
As shown in FIG. 12, the material search system 10 has a configuration in which the image reading device 110, the extraction device 120, and the server 130 are communicably connected, for example via a network 140.

The image reading device 110 can be, for example, a scanner, and reads the document 33 containing writing that the user brings.
The document image data that has been read is sent to the extraction device 120 via the network 140.

The extraction device 120 can be, for example, an OCR device.
The extraction device 120 receives the document image data, performs recognition processing on the printed characters 35 and on the handwritten characters 39 to create text data, and extracts the writing marks 37 written on the document 33 to create the writing mark data 41.
The created text data and writing mark data 41 are sent to the server 130 via the network 140.

The server 130 includes the feature word database 15 created from the search target materials and the dictionary database 18.
The server 130 can be a general-purpose computer or the like, and executes the processing of steps 104 to 108 in the flowchart of FIG. 5.

That is, the server 130 extracts feature words from the text data received from the extraction device 120 using the dictionary database 18 (steps 104 and 105); calculates the weighted importance of the extracted feature words based on the writing mark data 41 received from the extraction device 120 and creates the feature data (step 106); calculates the relevance between each search index in the feature word database 15 and the feature data (step 107); and presents the highly relevant search target materials to the user (step 108).

In the configuration described above, the user may also send an image of the document 33 from a mobile terminal, a personal computer, or the like to the material search system 10 via a network such as the Internet. The document image is then processed by the extraction device 120 and the server 130, and the search result is sent back over the network to the mobile terminal or personal computer and shown on its display unit.

In the above description, nouns and unknown words are used as feature words, but words of other parts of speech may also be used.

The appearance frequencies of feature words in the feature word database 15 of the search target materials are preferably weighted by the TF-IDF method; alternatively, instead of appearance frequencies, a binary vector indicating whether or not each feature word appears may be used.
Instead of the appearance frequency of feature words, a vector whose elements are the co-occurrence frequencies of adjacently appearing feature words (word N-grams) may also be used.
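The three vector representations mentioned here can be compared side by side in a short sketch. The description does not specify a TF-IDF formula, so the variant below (raw term frequency multiplied by log(N / df)) is only one common choice and is an assumption, as are the function names and the toy documents; N = 2 is used for the word N-grams purely as an example.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """tf(word, doc) * log(N / df(word)), one common TF-IDF variant."""
    n_docs = len(docs)
    df = Counter(word for doc in docs for word in set(doc))
    return [
        {word: tf * math.log(n_docs / df[word]) for word, tf in Counter(doc).items()}
        for doc in docs
    ]


def binary_vector(doc: List[str]) -> Dict[str, int]:
    """1 for every feature word that appears, regardless of how often."""
    return {word: 1 for word in set(doc)}


def bigram_counts(doc: List[str]) -> "Counter[Tuple[str, str]]":
    """Co-occurrence counts of adjacent feature words (word 2-grams)."""
    return Counter(zip(doc, doc[1:]))


# Toy feature word sequences for two search target materials (invented).
docs = [["ink", "paper", "ink"], ["paper", "press"]]
weights = tf_idf(docs)
```

In this toy corpus "paper" appears in every document, so its TF-IDF weight is 0, whereas "ink" and "press" each appear in only one document and keep a positive weight.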

The embodiments of the present invention have been described above with reference to the accompanying drawings, but the technical scope of the present invention is not limited by the embodiments described above. It is obvious that a person skilled in the art could conceive of various changes or modifications within the scope of the technical idea described in the claims, and these are naturally understood to belong to the technical scope of the present invention.

DESCRIPTION OF SYMBOLS
1 ... Material search device
10 ... Material search system
11 ... Document input unit
12 ... Character recognition / writing extraction unit
13 ... Feature word extraction unit
14 ... Feature word weighting unit
15 ... Feature word database of search target materials
16 ... Relevance calculation unit
17 ... Search result display unit
18 ... Dictionary database
31 ... Search index
33 ... Input document
35 ... Printed characters
37 ... Writing mark
39 ... Handwritten characters
41 ... Writing mark data
45 ... Weighting magnification
47 ... Feature data of input document 33

Claims

1. A material search device that retrieves materials based on their relevance to feature word data, the device comprising:
text extraction means for extracting text data by performing character recognition processing on a document image including writing;
writing extraction means for extracting the type and position of the writing;
storage means for storing a search index including a first feature word of the search target material and its importance;
feature word extraction means for extracting a second feature word from the text data;
feature data creation means for calculating the importance of the second feature word using the type and position of the writing and creating the feature word data of the text data; and
relevance calculation means for calculating the relevance between the search index and the feature word data.

2. The material search device according to claim 1, wherein the writing extraction means performs character recognition processing and adds the recognition result to the text data.

3. The material search device according to claim 1, wherein the feature data creation means changes the importance of the second feature word in accordance with the type of the writing.

4. The material search device according to claim 1, wherein the feature data creation means deletes the second feature word in accordance with the type of the writing.

5. The material search device according to claim 1, further comprising index creation means for creating the search index for the search target material.

6. The material search device according to claim 1, further comprising image reading means for reading the document image.

7. A material search system that retrieves materials based on their relevance to feature word data, the system comprising:
an image reading device that reads a document image including writing and transmits the read image;
an extraction device comprising text extraction means for performing character recognition processing on the document image and extracting text data, and writing extraction means for extracting the type and position of the writing, the extraction device transmitting the extracted data; and
a server comprising storage means for storing a search index including a first feature word of the search target material and its importance, feature word extraction means for extracting a second feature word from the text data, feature data creation means for calculating the importance of the second feature word using the type and position of the writing and creating the feature word data of the text data, and relevance calculation means for calculating the relevance between the search index and the feature word data.

8. A material search method performed by a material search device that retrieves materials based on their relevance to feature word data, the method comprising:
a text extraction step of extracting text data by performing character recognition processing on a document image including writing;
a writing extraction step of extracting the type and position of the writing;
a storage step of storing a search index including a first feature word of the search target material and its importance;
a feature word extraction step of extracting a second feature word from the text data;
a feature data creation step of calculating the importance of the second feature word using the type and position of the writing and creating the feature word data of the text data; and
a relevance calculation step of calculating the relevance between the search index and the feature word data.

9. A program for causing a computer to function as a material search device that retrieves materials based on their relevance to feature word data, the program causing the computer to function as:
text extraction means for extracting text data by performing character recognition processing on a document image including writing;
writing extraction means for extracting the type and position of the writing;
storage means for storing a search index including a first feature word of the search target material and its importance;
feature word extraction means for extracting a second feature word from the text data;
feature data creation means for calculating the importance of the second feature word using the type and position of the writing and creating the feature word data of the text data; and
relevance calculation means for calculating the relevance between the search index and the feature word data.