JP2023111074A

JP2023111074A - Search support apparatus, search support method, and program

Info

Publication number: JP2023111074A
Application number: JP2022012712A
Authority: JP
Inventors: 貴彦深澤; Takahiko Fukazawa; 雄大朝井; Takehiro Asai
Original assignee: PFU Ltd
Current assignee: PFU Ltd
Priority date: 2022-01-31
Filing date: 2022-01-31
Publication date: 2023-08-10

Abstract

To provide a search support apparatus that reduces the time and effort a user spends on searching. SOLUTION: A search support apparatus includes: a character extraction unit that extracts character strings based on input image data; a selection unit that selects, from among the group of character strings extracted by the character extraction unit, a part of the character strings displayed in the image data as search character strings; and a query output unit that outputs the search character strings selected by the selection unit in the form of a search query. Preferably, the selection unit selects as many search character strings as possible, up to the upper limit on the number of characters of a search system, and the query output unit outputs the selected search character strings in the form of a search query. SELECTED DRAWING: Figure 6

Description

The present invention relates to a search support device, a search support method, and a program.

For example, Patent Document 1 discloses a search system comprising: search keyword input means for inputting a search keyword; image information storage means for storing a plurality of pieces of image information; recognition target image reading means for reading out, from among the pieces of image information stored in the image information storage means, the image information to be recognized; character recognition means for sequentially recognizing characters included in the image information read out by the recognition target image reading means; and search means for, when the recognition result of the character recognition means includes the search keyword input by the search keyword input means, stopping the sequential recognition by the character recognition means and outputting the image information read out by the recognition target image reading means or information for identifying that image information.

Further, Patent Document 2 discloses a device that scans text on image media, generates corresponding text data, generates a phrase list from the text data, and initiates an information search corresponding to the phrase list.

Patent Document 1: Japanese Patent No. 3486168
Patent Document 2: U.S. Patent No. 7,151,864

An object of the present invention is to provide a search support device that reduces the user's effort in searching.

A search support device according to the present invention includes: a character extraction unit that extracts character strings based on input image data; a selection unit that selects, from among the group of character strings extracted by the character extraction unit, a part of the character strings displayed in the image data as search character strings; and a query output unit that outputs the search character strings selected by the selection unit in the form of a search query.

Preferably, the selection unit selects as many search character strings as possible, up to the upper limit on the number of characters of a search system, and the query output unit outputs the selected search character strings in the form of a search query.

Preferably, the selection unit preferentially selects proper nouns, numeric strings, or alphanumeric strings from among the group of character strings extracted by the character extraction unit.

Preferably, the character extraction unit is an OCR processing unit, and the selection unit selects the search character strings based on the OCR score output from the OCR processing unit.

Preferably, the selection unit selects the search character strings using predetermined selection logic, and the device further includes an operation detection unit that detects a user's editing operation on the search query output by the query output unit, and a logic update unit that updates the selection logic used by the selection unit based on the detection result of the operation detection unit.

Preferably, the selection unit selects an item value as a search character string based on combinations of items and item values of a form.

Preferably, the device further includes an excluded character string definition unit that defines character strings with a high appearance frequency as excluded character strings, and the selection unit selects the search character strings while excluding the excluded character strings defined by the excluded character string definition unit.

Preferably, the selection unit determines the selection priority based on the position of each character string in the image.

Further, a search support method according to the present invention includes: a character extraction step of extracting character strings based on input image data; a selection step of selecting, from among the group of character strings extracted in the character extraction step, a part of the character strings displayed in the image data as search character strings; and a query output step of outputting the search character strings selected in the selection step in the form of a search query.

Further, a program according to the present invention causes a computer to execute: a character extraction step of extracting character strings based on input image data; a selection step of selecting, from among the group of character strings extracted in the character extraction step, a part of the character strings displayed in the image data as search character strings; and a query output step of outputting the search character strings selected in the selection step in the form of a search query.

The user's effort in searching can be reduced.

FIG. 1 is a diagram illustrating the overall configuration of the file management system 1.
FIG. 2 is a diagram illustrating a search screen of the storage service 9.
FIG. 3 is a diagram illustrating the hardware configuration of the search support device 2.
FIG. 4 is a diagram illustrating the functional configuration of the search support device 2.
FIG. 5 is a flowchart explaining the search processing (S10) in the file management system 1.
FIG. 6 is a flowchart explaining in more detail the search query generation processing (S20) of FIG. 5.
FIG. 7 is a diagram illustrating the search support program 32 in a modification.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.
FIG. 1 is a diagram illustrating the overall configuration of the file management system 1.
As illustrated in FIG. 1, the file management system 1 includes a storage service 9 that stores data files and a computer terminal 60 that accesses the storage service 9, which are connected to each other via a communication line such as the Internet.
The storage service 9 stores data files in association with search keywords, and searches for data files according to the search query entered in the search window illustrated in FIG. 2.
The computer terminal 60 is a computer terminal operated by a user. The user may use the storage service 9 with a smartphone 62 instead of the computer terminal 60.

In the above configuration, suppose for example that a user has a printed form at hand and wants to retrieve the original data file of that form from the storage service 9. The user must pick out wording written in the form and enter it so that the file can be found. However, words and phrases that also appear in other data files do not narrow the search and are therefore useless, so a person has to judge which words are effective for the search. In addition, manually entering multiple words to improve search accuracy takes time, and strings of alphanumeric characters such as form numbers are prone to transcription errors.

Therefore, the file management system 1 of this embodiment further includes a search support device 2 and a scanner 4, and uses the scanner 4 as the input means for searching. More specifically, the form to be searched for is scanned by the scanner 4, and the search support device 2 performs OCR (Optical Character Recognition) processing on the scanned content, extracts words and phrases that are effective as a search query, and performs the search using them.
As a result, the user can search for the original with high accuracy simply by setting the form on the scanner 4, without the trouble of manual input.
The search support device 2 is a computer terminal connected to the scanner 4 and generates a search query from the image data read by the scanner 4. The search support device 2 of this example is connected to the computer terminal 60, the smartphone 62, and the storage service 9 via a communication line such as the Internet. In response to a request from the computer terminal 60 or the smartphone 62, it submits the search query to the storage service 9 on their behalf and returns the search results from the storage service 9 to the computer terminal 60 or the smartphone 62.
The scanner 4 is an image reading device that optically reads an image from a document.

FIG. 3 is a diagram illustrating the hardware configuration of the search support device 2.
As illustrated in FIG. 3, the search support device 2 has a CPU 200, a memory 202, an HDD 204, a network interface 206 (network IF 206), a display device 208, and an input device 210, which are connected to each other via a bus 212.
CPU 200 is, for example, a central processing unit.
The memory 202 is, for example, a volatile memory and functions as a main memory.
The HDD 204 is, for example, a hard disk drive device, and stores computer programs (eg, search support program 3 in FIG. 4) and other data files as a non-volatile recording device.
The network IF 206 is an interface for wired or wireless communication, and realizes connection to the scanner 4, the computer terminal 60, and the storage service 9, for example.
The display device 208 is, for example, a liquid crystal display.
Input device 210 is, for example, a keyboard and mouse.

FIG. 4 is a diagram illustrating the functional configuration of the search support device 2.
As illustrated in FIG. 4, a search support program 3 is installed and runs on the search support device 2 of this example. The search support program 3 is stored on a recording medium such as a CD-ROM, for example, and is installed in the search support device 2 via this recording medium.
The search support program 3 has a scanner control unit 300, an OCR processing unit 310, an excluded character string definition unit 320, an item value extraction unit 330, a selection unit 340, and a query output unit 350.
Part or all of the search support program 3 may be realized by hardware such as an ASIC, or may be realized by partially using functions of an OS (Operating System).

In the search support program 3, the scanner control unit 300 controls the scanner 4 and acquires the image data read by the scanner 4.
The OCR processing unit 310 performs OCR processing on the image data acquired by the scanner control unit 300, and extracts the character strings displayed on the document.

The excluded character string definition unit 320 defines character strings with a high appearance frequency as excluded character strings to be excluded from the search query.
Also, the excluded character string definition unit 320 defines a character string shorter than the predetermined number of characters (N characters) as an excluded character string.
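The two exclusion rules above could be combined as in the following minimal sketch. The source does not say how "high appearance frequency" is measured, so the document-frequency threshold `max_doc_ratio`, the function name, and the sample data are all assumptions made for illustration.

```python
from collections import Counter

def define_excluded_strings(documents, min_length=4, max_doc_ratio=0.5):
    """Return the set of strings to exclude from search queries.

    documents: list of lists of strings, one inner list per stored document.
    A string is excluded if it is shorter than min_length characters (N), or
    if it appears in more than max_doc_ratio of the documents (assumed rule).
    """
    doc_freq = Counter()
    for strings in documents:
        doc_freq.update(set(strings))  # count each string once per document
    excluded = set()
    for s, freq in doc_freq.items():
        if len(s) < min_length or freq / len(documents) > max_doc_ratio:
            excluded.add(s)
    return excluded

# Hypothetical corpus of strings already extracted from three stored forms.
docs = [
    ["invoice", "ABC12345678", "total", "No"],
    ["invoice", "XYZ00000001", "total"],
    ["invoice", "estimate", "QRS55555555"],
]
excluded = define_excluded_strings(docs)
```

Here "invoice" and "total" are excluded because they appear in most documents, and "No" because it is shorter than four characters, while the form numbers remain available as search strings.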

The item value extraction unit 330 extracts combinations of items and item values of the form from the OCR result of the OCR processing unit 310. The item value extraction unit 330 of this example extracts key-value combinations from the OCR result of the OCR processing unit 310.
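The source does not describe how key-value combinations are found. As one possible illustration, the sketch below assumes that an item and its value appear on the same OCR line separated by a colon; the pattern, function name, and sample lines are hypothetical.

```python
import re

# Assumed layout: "<item name>: <item value>" on a single OCR line.
# Both the ASCII colon and the full-width colon are accepted.
KEY_VALUE_PATTERN = re.compile(r"^\s*(?P<key>[^:：]+?)\s*[:：]\s*(?P<value>.+?)\s*$")

def extract_key_values(ocr_lines):
    """Return a dict mapping item names to item values."""
    pairs = {}
    for line in ocr_lines:
        m = KEY_VALUE_PATTERN.match(line)
        if m:
            pairs[m.group("key")] = m.group("value")
    return pairs

lines = [
    "Invoice No: ABC12345678",
    "Billed to: ABC Corporation",
    "Thank you for your business",  # no colon, so no pair is extracted
]
pairs = extract_key_values(lines)
```

Real forms often place the value in a neighboring table cell rather than on the same line, so a layout-aware extractor would be needed in practice.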

The selection unit 340 selects, from among the group of character strings extracted by the OCR processing unit 310, a part of the character strings displayed in the image data as search character strings. More specifically, the selection unit 340 selects as many character strings as possible as search character strings from among the group of character strings extracted by the OCR processing unit 310, up to the upper limit on the number of characters of the search system of the storage service 9, while excluding the excluded character strings defined by the excluded character string definition unit 320.
In addition, the selection unit 340 preferentially selects, as search character strings, character strings corresponding to nouns such as proper nouns, numeric strings, and alphanumeric strings from among the group of character strings extracted by the OCR processing unit 310. Furthermore, the selection unit 340 selects search character strings based on the OCR score output from the OCR processing unit 310. The OCR score is an index indicating the certainty of character recognition in OCR processing.
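A minimal sketch of this selection behavior is shown below. The priority rule, the function names, and the sample data are assumptions; proper-noun detection is omitted, and unlike the flow of FIG. 6 this sketch skips a string that would push the query over the limit instead of stopping once the limit has been exceeded.

```python
import re

def priority(text):
    """Lower value means earlier selection (assumed ordering)."""
    if re.fullmatch(r"[0-9]+", text):
        return 0  # numeric string
    if re.fullmatch(r"[A-Za-z0-9\-]+", text) and re.search(r"[0-9]", text):
        return 0  # alphanumeric string such as a form number
    return 1      # everything else

def select_search_strings(candidates, excluded, char_limit):
    """candidates: list of (text, ocr_score) pairs."""
    ordered = sorted(
        (c for c in candidates if c[0] not in excluded),
        key=lambda c: (priority(c[0]), -c[1]),  # priority first, then OCR score
    )
    selected, used = [], 0
    for text, _score in ordered:
        cost = len(text) + (1 if selected else 0)  # +1 for the separating space
        if used + cost > char_limit:
            continue  # does not fit; a shorter string may still fit
        selected.append(text)
        used += cost
    return selected

candidates = [
    ("invoice", 0.99),
    ("ABC12345678", 0.90),
    ("Yamada Trading", 0.95),
    ("total", 0.97),
]
selected = select_search_strings(candidates, {"invoice", "total"}, char_limit=30)
```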

The query output unit 350 outputs the search character strings selected by the selection unit 340 in the form of a search query. The query output unit 350 of this example outputs them in the form of a search query for the storage service 9.
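The query format of the storage service 9 is not specified in the source. Purely as an illustration, the sketch below assumes a web search endpoint that takes the query in a `q` parameter; the URL and the parameter name are hypothetical.

```python
from urllib.parse import urlencode

def build_search_url(base_url, search_strings):
    """Join the selected strings with spaces and URL-encode them."""
    query = " ".join(search_strings)
    return f"{base_url}?{urlencode({'q': query})}"

url = build_search_url(
    "https://storage.example.com/search",  # hypothetical endpoint
    ["ABC12345678", "Yamada Trading"],
)
```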

FIG. 5 is a flowchart explaining the search processing (S10) in the file management system 1.
As illustrated in FIG. 5, at step 100 (S100), when the user sets the form to be searched for on the scanner 4 and instructs a scanner-based search from the computer terminal 60, the scanner control unit 300 (FIG. 4) of the search support device 2 sends a scan instruction to the scanner 4 in response to the instruction from the computer terminal 60.
At step 105 (S105), the scanner control unit 300 acquires the image data scanned by the scanner 4.
At step 20 (S20), the search support program 3 executes search query generation processing based on the image data acquired from the scanner 4.
At step 115 (S115), the search support program 3 enters the generated search query into the search window (FIG. 2) of the storage service 9 and causes the search to be executed.

FIG. 6 is a flowchart explaining in more detail the search query generation processing (S20) of FIG. 5.
As shown in FIG. 6, at step 200 (S200), the OCR processing unit 310 performs OCR processing on the image data acquired by the scanner control unit 300 and outputs the OCR result and the OCR score.

At step 205 (S205), the selection unit 340 performs morphological analysis on the OCR result output from the OCR processing unit 310.
At step 210 (S210), the selection unit 340 extracts only nouns as candidate character strings from the OCR processing results output from the OCR processing unit 310 based on the morphological analysis results.

At step 215 (S215), the item value extraction unit 330 performs key-value extraction on the OCR result, and the selection unit 340 selects, from the key-value extraction result, an alphanumeric string corresponding to the form number. The alphanumeric string corresponding to the form number may instead be extracted by named entity extraction.
At step 220 (S220), the query output unit 350 sets the alphanumeric string selected by the selection unit 340 as part of the search query.

At step 225 (S225), the selection unit 340 determines whether the number of characters selected for the search query has exceeded the character limit of the search system of the storage service 9. If the limit has not been exceeded, the process proceeds to S230; if it has been exceeded, the search query generation processing ends.
At step 230 (S230), the selection unit 340 determines whether or not there is a character string that has not been selected as part of the search query in the candidate character string group. If so, the process proceeds to S235, and if there is no character string that has not been selected, the search query generation process ends.

At step 235 (S235), the selection unit 340 selects, from the result of key-value extraction by the item value extraction unit 330, character strings corresponding to proper nouns such as company names, personal names, and addresses. Character strings corresponding to proper nouns may instead be extracted by named entity extraction or word dictionary extraction.
At step 240 (S240), the query output unit 350 sets the character string selected by the selection unit 340 as part of the search query.

At step 245 (S245), the selection unit 340 determines whether the number of characters selected for the search query has exceeded the character limit of the search system of the storage service 9. If the limit has not been exceeded, the process proceeds to S250; if it has been exceeded, the search query generation processing ends.
At step 250 (S250), the selection unit 340 determines whether or not there is a character string that has not been selected as part of the search query in the candidate character string group. If so, the process proceeds to S255, and if there is no character string that has not been selected, the search query generation process ends.

At step 255 (S255), the excluded character string definition unit 320 defines character strings of fewer than N characters (for example, N=4) as excluded character strings, and the selection unit 340 removes the character strings of fewer than N characters from the candidate character string group.
At step 260 (S260), the selection unit 340 sorts the candidate character string group in descending order of the OCR score output from the OCR processing unit 310.

At step 265 (S265), the selection unit 340 selects, based on the sort result, the character string with the highest OCR score from the candidate character string group.
At step 270 (S270), the query output unit 350 sets the character string selected by the selection unit 340 as part of the search query.

At step 275 (S275), the selection unit 340 determines whether the number of characters selected for the search query has exceeded the character limit of the search system of the storage service 9. If the limit has not been exceeded, the process proceeds to S280; if it has been exceeded, the search query generation processing ends.
At step 280 (S280), the selection unit 340 determines whether or not there is a character string that has not been selected as part of the search query in the candidate character string group. If so, the process returns to S265, and if there is no character string that has not been selected, the search query generation process ends.

The search query generation processing (S20) broadly consists of three extraction processes: form number extraction, proper noun extraction, and recognition score order extraction.
Because the extraction accuracy of these processes is not 100%, and the relevant information may not be written on the form in the first place, the processes are performed one after another regardless of whether each extraction succeeds. The extracted character strings are listed separated by spaces and set as the search query (e.g., "ABC Corporation invoice ABC12345678").
In each extraction process, extraction ends as soon as the character limit for the search query is exceeded or the candidate character string group is exhausted.
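The three stages and their shared termination condition can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the function name and sample data are hypothetical, the first two stages are passed in as already-extracted lists, and the sketch stops before the limit would be exceeded, whereas the flow of FIG. 6 checks after each string has been added.

```python
def generate_query(form_numbers, proper_nouns, scored_candidates,
                   char_limit, min_length=4):
    """Build a space-separated query from the three extraction stages.

    form_numbers, proper_nouns: lists of strings from the first two stages
    (empty when extraction fails). scored_candidates: (text, ocr_score) pairs.
    """
    selected = []

    def add_all(strings):
        """Add strings in order; return True once the limit is reached."""
        for s in strings:
            if s in selected:
                continue
            current = len(" ".join(selected))
            if current + len(s) + (1 if selected else 0) > char_limit:
                return True
            selected.append(s)
        return False

    # Stage 3: drop strings shorter than N, then order by OCR score (descending).
    by_score = sorted(scored_candidates, key=lambda c: -c[1])
    stage3 = [text for text, _ in by_score if len(text) >= min_length]

    # Run every stage in turn, whether or not the previous one found anything.
    for stage in (form_numbers, proper_nouns, stage3):
        if add_all(stage):
            break
    return " ".join(selected)

query = generate_query(
    ["ABC12345678"],
    ["ABC Corporation"],
    [("invoice", 0.98), ("No", 0.99), ("payment", 0.90)],
    char_limit=40,
)
```

With a 40-character limit, "payment" no longer fits and "No" is dropped as too short, so the query contains the form number, the company name, and "invoice".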

The first process, form number extraction (S215 and S220), extracts the form number. The form number is a unique character string assigned when the form is issued, so extracting it greatly improves search accuracy. However, uniqueness is guaranteed only within the scope managed by the issuing company, and the possibility of a collision with another company's number is not zero, so the later processes also need other information.
Extraction methods include key-value extraction and named entity extraction.
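As a rough illustration of key-value based form number extraction, the sketch below looks up a few item names and falls back to a pattern search. The item names and the assumed shape of a form number (six or more uppercase letters, digits, or hyphens, with at least one digit) are hypothetical, and the regular expression is only a simple stand-in for named entity extraction.

```python
import re

# Hypothetical item names under which a form number might be printed.
FORM_NUMBER_KEYS = ("invoice no", "form no", "document no")
# Hypothetical ID shape: 6+ uppercase letters, digits, or hyphens, with a digit.
ALNUM_ID = re.compile(r"\b(?=[A-Z0-9-]*[0-9])[A-Z0-9-]{6,}\b")

def extract_form_number(key_values, ocr_text):
    """Prefer a key-value hit; otherwise fall back to a pattern search."""
    for key, value in key_values.items():
        if key.strip().lower().rstrip(".") in FORM_NUMBER_KEYS:
            return value
    m = ALNUM_ID.search(ocr_text)
    return m.group(0) if m else None

from_key = extract_form_number({"Invoice No.": "ABC12345678"}, "")
from_text = extract_form_number({}, "Ref ABC12345678 due")
```

The fallback pattern would also match dates or phone numbers written with hyphens, which is one reason the later stages still add other strings to the query.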

The second process, proper noun extraction (S235 and S240), extracts proper nouns written in the form. Even pieces of information that cannot individually identify a form, such as a company name, a personal name, or an address, improve search accuracy when several of them are extracted together.
Extraction methods include key-value extraction, named entity extraction, and extraction using a word dictionary, as in the form number extraction process.

The third process, recognition score order extraction (S255 to S270), extracts character strings with high confidence based on the recognition score from OCR processing. It can be expected to pick up character strings missed by the preceding extraction processes, as well as other character strings written in the form.
As an extraction method, first, the group of candidate character strings is narrowed down to words of N characters or more (for example, N=4). As a result, words that are too short to make sense and words that are noise due to misrecognition are excluded from the search query.
After that, sorting is performed according to the OCR recognition score, and character strings with high recognition scores are sequentially extracted.

As described above, the file management system 1 of this embodiment can automatically generate a search query from image data, so that forms such as quotations, invoices, and receipts can be searched for easily. In addition, as many character strings as the search system allows are entered into the search query automatically, so the user can tune the search query simply by deleting unnecessary keywords (character strings) as appropriate.
Further, according to the file management system 1, it becomes easy to search for presentation materials, sales promotion materials (flyers), papers, patent documents, contracts, and the like.

Next, a modification of the above embodiment will be described.
FIG. 7 is a diagram illustrating the search support program 32 in the modification. Among the components illustrated in this figure, those substantially the same as the components shown in FIG. 4 are given the same reference numerals.
As illustrated in FIG. 7, the search support program 32 of the modification has a configuration in which an operation detection unit 360 and a logic update unit 370 are added to the search support program 3 of FIG. 4.
The operation detection unit 360 detects a user's editing operation on the search query output by the query output unit 350. For example, the operation detection unit 360 detects when the user performs a deletion operation on the search query entered into the search window (FIG. 2) by the query output unit 350.
The logic update unit 370 updates the selection logic used by the selection unit 340 based on the detection result of the operation detection unit 360. For example, the logic update unit 370 may change whether each of the form number extraction, proper noun extraction, and recognition score order extraction processes is used, may change their order, or may change the value of the number of characters N.
In other words, the search support device 2 learns the success or failure of the extraction content by providing an interaction that allows the user to confirm the result of the search query extraction process. This makes it possible to prioritize the words to be extracted and improve the extraction accuracy.
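The source leaves the update rule open. The sketch below shows one hypothetical rule: it counts user deletions per extraction stage, moves a stage to the end of the order once its output has been deleted `threshold` times, and raises N when a string of exactly N characters from the score-order stage is deleted. The class, field names, and threshold are all assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class SelectionLogic:
    min_length: int = 4  # the value N
    stage_order: list = field(
        default_factory=lambda: ["form_number", "proper_noun", "ocr_score"]
    )
    deletions: dict = field(default_factory=dict)  # stage -> deletion count

def update_logic(logic, deleted, threshold=3):
    """deleted: list of (text, stage) pairs the user removed from the query."""
    for text, stage in deleted:
        logic.deletions[stage] = logic.deletions.get(stage, 0) + 1
        if stage == "ocr_score" and len(text) == logic.min_length:
            logic.min_length += 1  # strings of exactly N characters were unhelpful
    # Stable sort: stages deleted at least `threshold` times move to the end.
    logic.stage_order.sort(key=lambda s: logic.deletions.get(s, 0) >= threshold)
    return logic

logic = SelectionLogic()
update_logic(logic, [("date", "ocr_score")])
update_logic(logic, [("Tokyo", "proper_noun")] * 3)
```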

The storage service 9 may take any form, such as an in-house server, as long as it can store content and provides a search function.
When proper nouns are extracted using a word dictionary, partial matches may be accepted in addition to exact matches. This makes it possible to correct OCR recognition errors when the extracted words are set in the search query.
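One possible way to tolerate OCR errors during dictionary lookup is sketched below. The source does not specify the matching method; approximate matching with the Python standard library's `difflib` is substituted here as an example, and the dictionary entries, the function name `match_proper_noun`, and the cutoff value are assumptions.

```python
import difflib

# Hypothetical word dictionary of proper nouns (illustrative entries only).
PROPER_NOUNS = ["PFU Limited", "Yokohama", "Kanazawa"]


def match_proper_noun(ocr_text, dictionary=PROPER_NOUNS, cutoff=0.8):
    """Return the dictionary entry closest to ocr_text, or None.

    difflib.get_close_matches ranks entries by similarity ratio, so a
    string containing a small OCR error can still map to a dictionary
    entry, and the dictionary spelling can be placed in the search query.
    """
    matches = difflib.get_close_matches(ocr_text, dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

For instance, `match_proper_noun("Yokohana")` returns the dictionary spelling "Yokohama", whereas a string unrelated to every entry returns `None`. A lower cutoff accepts more recognition errors at the cost of more false matches.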
Further, when extracting a query, the selection unit 340 may use the position of each word in the form, its font attributes (size, bold), and the like. This makes it possible to preferentially extract more important words as search keywords.
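A minimal sketch of such layout-based prioritization follows. The `Word` fields, the normalized vertical coordinate, and the weights in `importance` are assumptions made for illustration; the source only states that position and font information may be used.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    y: float          # vertical position: 0.0 = top of page, 1.0 = bottom (assumed)
    font_size: float  # in points
    bold: bool


def importance(word, body_size=10.0):
    """Heuristic score with assumed weights: larger, bolder, higher-up words score more."""
    score = word.font_size / body_size   # relative font size
    score += 0.5 if word.bold else 0.0   # bonus for bold text
    score += 1.0 - word.y                # bonus for appearing near the top
    return score


def rank_words(words):
    """Return the word texts ordered from most to least important."""
    return [w.text for w in sorted(words, key=importance, reverse=True)]
```

With these weights, a large bold title near the top of a form outranks small body text near the bottom, so it would be chosen first when the query length is limited.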
In the above embodiment, the purpose is to identify the original document. However, by appropriately changing the priority order and the content of query extraction, it also becomes possible to search for similar content rather than the original itself (for example, by deleting the last N digits of the form number) or to perform a category search (for example, "invoice" and a date).
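The three search purposes mentioned above can be illustrated with a small query builder. The function name `build_query`, the mode names, and the sample form number format are hypothetical, and the sketch assumes the search system matches on a form number prefix.

```python
def build_query(form_number, doc_type, date, mode="original", n=2):
    """Build a query string for one of three assumed search modes.

    original: search for the exact form number.
    similar:  drop the last n characters of the form number to broaden the search.
    category: search by document type and date instead of the form number.
    """
    if mode == "original":
        return form_number
    if mode == "similar":
        if not 0 < n < len(form_number):
            raise ValueError("n must be between 1 and len(form_number) - 1")
        return form_number[:-n]
    if mode == "category":
        return f"{doc_type} {date}"
    raise ValueError(f"unknown mode: {mode}")
```

For example, `build_query("A2022-001234", "invoice", "2022-01-31", mode="similar", n=2)` yields the shortened term "A2022-0012".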
The search may be a real-time search targeting a single form or a batch search targeting multiple forms. Search results can also be output as text, for example as a CSV file.
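A batch search with CSV output might look like the following sketch. The `search` callable stands in for the search function of the storage service 9, whose interface the source does not describe; `fake_search`, the column names, and the sample data are assumptions.

```python
import csv
import io


def batch_search(queries, search, out):
    """Run search(query) for each (form_id, query) pair and write CSV rows to out."""
    writer = csv.writer(out)
    writer.writerow(["form", "query", "hits"])
    for form_id, query in queries:
        hits = search(query)
        # Join multiple hits into a single cell.
        writer.writerow([form_id, query, ";".join(hits)])


def fake_search(query):
    """Stand-in for the storage service's search function (hypothetical)."""
    index = {"INV-001 Acme": ["invoice_001.pdf"]}
    return index.get(query, [])


# Usage: two scanned forms are searched and the results collected as CSV text.
buffer = io.StringIO()
batch_search([("scan_1", "INV-001 Acme"), ("scan_2", "PO-77")], fake_search, buffer)
print(buffer.getvalue())
```

A real-time search corresponds to calling `search` once for a single form; the batch variant simply iterates over the queries created for multiple forms.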
Further, in the above embodiment, the scanner 4 is used as the means for reading a document, but the scanner 4 may be replaced by an imaging device such as a smartphone. For example, the search support device 2 creates a search query based on image data captured by a camera built into a smartphone.

1 ... File management system
2 ... Search support device
3 ... Search support program
4 ... Scanner
9 ... Storage service

Claims

1. A search support device comprising:
a character extraction unit that extracts character strings based on input image data;
a selection unit that selects, as a search character string, a part of the character strings displayed in the image data from the character string group extracted by the character extraction unit; and
a query output unit that outputs the search character string selected by the selection unit in the form of a search query.

2. The search support device according to claim 1, wherein
the selection unit selects as many search character strings as possible up to the upper limit of the number of characters of the search system, and
the query output unit outputs as many of the selected search character strings as possible in the form of a search query.

3. The search support device according to claim 2, wherein the selection unit preferentially selects proper nouns, numeric strings, or alphanumeric strings from among the character strings extracted by the character extraction unit.

4. The search support device according to claim 3, wherein
the character extraction unit is an OCR processing unit, and
the selection unit selects a search character string based on the OCR score output from the OCR processing unit.

5. The search support device according to claim 4, wherein the selection unit selects a search character string using a predetermined selection logic, the search support device further comprising:
an operation detection unit that detects a user's editing operation on the search query output by the query output unit; and
a logic update unit that updates the selection logic used by the selection unit based on the detection result of the operation detection unit.

6. The search support device according to claim 3, wherein the selection unit selects an item value as a search character string based on a combination of items and item values of a form.

7. The search support device according to claim 6, further comprising an excluded character string definition unit that defines frequently occurring character strings as excluded character strings,
wherein the selection unit selects the search character string while excluding the excluded character strings defined by the excluded character string definition unit.

8. The search support device according to claim 7, wherein the selection unit determines the priority of selection based on the position of the character string in the image.

9. A search support method comprising:
a character extraction step of extracting character strings based on input image data;
a selection step of selecting, as a search character string, a part of the character strings displayed in the image data from the character string group extracted in the character extraction step; and
a query output step of outputting the search character string selected in the selection step in the form of a search query.

10. A program for causing a computer to execute:
a character extraction step of extracting character strings based on input image data;
a selection step of selecting, as a search character string, a part of the character strings displayed in the image data from the character string group extracted in the character extraction step; and
a query output step of outputting the search character string selected in the selection step in the form of a search query.