JP2018190030A

JP2018190030A - Information processing server, control method for the same, and program, and information processing system, control method for the same, and program

Info

Publication number: JP2018190030A
Application number: JP2017089575A
Authority: JP
Inventors: Itsuki Shimokooriyama (下郡山 敬己)
Original assignee: Canon Marketing Japan Inc; Canon IT Solutions Inc
Current assignee: Canon Marketing Japan Inc; Canon IT Solutions Inc
Priority date: 2017-04-28
Filing date: 2017-04-28
Publication date: 2018-11-29
Anticipated expiration: 2037-04-28
Also published as: JP6916437B2

Abstract

PROBLEM TO BE SOLVED: To appropriately correct a correction character string (a character string targeted for correction) included in a character string received from a user, and to search document data.
SOLUTION: An information processing server is provided that can retrieve appropriate document data from a plurality of document data on the basis of a character string received from a user. The information processing server is configured to: store the plurality of document data; receive a plurality of character strings for retrieving the document data, including at least a correction character string targeted for correction; search the stored document data using the received character strings; correct the correction character string on the basis of character strings included in the retrieved document data; and search the stored document data again using the character strings including the corrected correction character string.
SELECTED DRAWING: Figure 3

Description

The present invention relates to an information processing server capable of appropriately correcting a correction character string included in a character string received from a user and searching document data, a control method and program therefor, and an information processing system, a control method and program therefor.

In recent years, personal computers have become widespread in both businesses and homes, and the Internet has become a familiar part of daily life, so opportunities to enter character strings from a keyboard have increased.

However, mastering the keyboard takes considerable practice, and even skilled users often make input errors (spelling mistakes).

In recent years, speech recognition, in which computer software converts a person's voice into a character string without a keyboard, has also come into wide use. However, because speech recognition is not 100% accurate either, spelling mistakes still occur. Various techniques have therefore been developed to deal with spelling mistakes efficiently.

Patent Document 1 describes a technique in which an electronically stored dictionary is prepared in advance, the misrecognized portion of a character string received through speech recognition is identified, and, for the misrecognized character string, correction candidate character strings are found in the dictionary and the similarity between the misrecognized character string and each correction candidate is calculated.

Patent Document 2 describes a technique in which a search system stores queries (character strings) previously entered by the user as a log and, when the user enters a new query, performs a spelling check based on that query log.

Patent Document 1: JP 2012-063545 A
Patent Document 2: JP 2005-267638 A

However, Patent Document 1 requires a dictionary to be prepared in advance. The words in this dictionary must be general-purpose and applicable to any field. This incurs man-hours for preparing the dictionary and for maintenance, such as updating it with new words. Moreover, even when the character string newly entered by the user belongs to a specific field, correction candidates are looked up in a general-purpose dictionary, so a large number of candidates exist. Even if the similarity between character strings is calculated with high accuracy, irrelevant candidates end up being presented to the user with high priority.

Patent Document 2 stores a specific user's query log and uses it for spelling checks. That may suffice when the user frequently uses only particular queries, but when the system is used for general purposes, it takes a long time to collect enough logs for spelling checks. Because of that generality, irrelevant candidates end up being presented to the user with high priority, as in Patent Document 1.

An object of the present invention is to provide a technique that can appropriately correct a correction character string included in a character string received from a user and search document data.

To achieve the above object, the present invention is an information processing server capable of retrieving appropriate document data from a plurality of document data on the basis of a character string received from a user, comprising: storage means for storing the plurality of document data; receiving means for receiving a plurality of character strings for searching the document data, including at least a correction character string targeted for correction; search means for searching the plurality of document data stored in the storage means using the character strings received by the receiving means; and correction means for correcting the correction character string on the basis of character strings included in the document data retrieved by the search means, wherein the search means searches the plurality of document data stored in the storage means using the character strings including the correction character string corrected by the correction means.

According to the present invention, a correction character string included in a character string received from a user can be appropriately corrected and document data can be searched.

FIG. 1 is a diagram showing an example of a functional configuration according to an embodiment of the present invention.
FIG. 2 is a block diagram showing an example of the hardware configuration of an information processing apparatus according to the embodiment of the present invention.
FIG. 3 is a diagram showing an example of a flowchart of search processing according to the embodiment of the present invention.
FIG. 4 is a diagram showing an example of a flowchart of spelling check processing according to the embodiment of the present invention.
FIG. 5 is a diagram showing an example of documents to be searched according to the embodiment of the present invention.
FIG. 6 is a diagram for explaining an example of the extracted word storage unit and of the storage of co-occurrence information between words included in the search condition and extracted words according to the embodiment of the present invention.
FIG. 7 is a diagram showing an example of a candidate word list related to a spelling mistake in the word dictionary storage unit according to the embodiment of the present invention.
FIG. 8 is a diagram for explaining an example of the correction candidate storage unit according to the embodiment of the present invention.

Embodiments of the present invention are described in detail below with reference to the drawings.

FIG. 1 is a diagram showing an example of a functional configuration according to an embodiment of the present invention.

The input character string receiving unit 101 receives an input character string that serves as a search condition for searching text documents (corresponding to document data). In this embodiment, text documents are used as the example of document data, but the document data is not limited to text documents as long as it contains a document, and may be document data such as PDF files. When the information processing apparatus of the present invention is a computer functioning as a client device, the input character string receiving unit 101 may be a functional unit that receives input directly from the user of that computer, or one that receives input from another application program running on that computer. When the information processing apparatus of this embodiment is a server on a network, it may be a functional unit that receives communication information from another information processing apparatus via the network.

The document search unit 102 searches the text documents stored in the document storage unit 121 on the basis of the input character string obtained as described above. The search may be executed against an index generated in advance from the text documents rather than against the text documents themselves. Techniques for searching text documents and generating indexes are well known, so details are omitted.

The word extraction unit 103 analyzes the character strings contained in the text documents returned by the document search unit 102 and extracts words from them.

The dictionary search unit 104 searches the word dictionary storage unit 122 (corresponding to management means), which stores and manages word information prepared in advance (corresponding to correction data). It searches on the basis of a part of the input character string received by the input character string receiving unit 101 (a partial character string), that is, a word obtained by analyzing the input character string, which may contain a spelling mistake. Because it is not known in advance whether the word being looked up is spelled correctly, the search does not return only words that match exactly; allowing for spelling mistakes, it returns character strings judged to be similar. Dictionary search for spell checking is a well-known technique, so details are omitted.

The similarity determination unit 105 determines how similar a word extracted by the word extraction unit 103, or a word found by the dictionary search unit 104, is to a word that is a partial character string of the input character string (a character string that may contain a spelling mistake). The result is calculated as a numerical value.

The correction candidate storage unit 106 temporarily stores, in a storage unit, the words judged to be similar to the word that is a partial character string of the input character string, together with their similarities. The correction candidate presenting unit 107 presents these correction candidates to the user or to an application that uses the system of the present invention. The correction result receiving unit 108 receives the selection made by the user or application and corrects the word obtained from the input character string. The corrected result becomes the new search condition, and the document search unit 102 searches the document storage unit 121 again.

FIG. 2 is a block diagram showing an example of the hardware configuration of an information processing apparatus (corresponding to the information processing server) included in the information processing system of the present invention.

As shown in FIG. 2, the information processing apparatus 100 and the application server 140 each have a configuration in which a CPU (Central Processing Unit) 201, a RAM (Random Access Memory) 202, a ROM (Read Only Memory) 203, an input controller 205, a video controller 206, a memory controller 207, a communication I/F controller 208, and the like are connected via a system bus 204. The CPU 201 centrally controls the devices and controllers connected to the system bus 204.

The ROM 203 or the external memory 211 stores the BIOS (Basic Input/Output System) and OS (Operating System), which are control programs of the CPU 201, as well as the various programs described later that are needed to realize the functions executed by each server or PC. It also stores information needed to carry out the present invention. The external memory may be a database.

The RAM 202 functions as the main memory, work area, and the like of the CPU 201. When executing processing, the CPU 201 loads the necessary programs and the like from the ROM 203 or the external memory 211 into the RAM 202 and executes the loaded programs to realize various operations.

The input controller 205 controls input from a keyboard (KB) 209 and from a pointing device such as a mouse (not shown).

The video controller 206 controls display on a display device such as the display 210. The display device may be, for example, a liquid crystal display. These are used by an administrator as needed.

The memory controller 207 controls access to the external memory 211 (corresponding to storage means), which stores a boot program, various applications, font data, user files, edit files, various data, and the like. Examples of the external memory 211 include an external storage device (hard disk (HD)), a flexible disk (FD), and a CompactFlash (registered trademark) memory connected via an adapter to a PCMCIA (Personal Computer Memory Card International Association) card slot.

The communication I/F controller 208 connects to and communicates with external devices via a network and executes communication control processing on the network. For example, communication using TCP/IP (Transmission Control Protocol/Internet Protocol) is possible.

The CPU 201 can display content on the display 210 by, for example, rasterizing outline fonts into a display information area in the RAM 202. The CPU 201 also enables user instructions through a mouse cursor (not shown) or the like on the display 210.

The various programs described later for realizing the present invention are recorded in the external memory 211 and are executed by the CPU 201 after being loaded into the RAM 202 as needed.

FIG. 3 is a diagram showing an example of a flowchart of search processing according to the embodiment of the present invention. Each step of the flowchart in FIG. 3 is executed by the CPU 201 of the information processing apparatus 100.

In step S301, the input character string receiving unit 101 receives an input character string as the condition with which the document search unit 102 searches text documents (corresponding to receiving means). As an example for this description, assume that the character string “人工知能、機械学習、ビッグデー？” (“artificial intelligence, machine learning, ビッグデー？”) is received. The trailing “？” indicates that the word “ビッグデータ” (“big data”) was entered with a spelling mistake; this term corresponds to the correction character string and is the correction target. For example, the input may be “ビッグデー”, which is one character short, or “ビッグデーラ”, in which a wrong character was entered.

In step S302, the document search unit 102 retrieves (searches for) a list of text documents matching the condition from the document storage unit 121, using the input character string described above as the search condition (corresponding to search means). An example of the document storage unit 121 is described with reference to FIG. 5.

In FIG. 5, each item 501 is one text document subject to search. In the example of FIG. 5, there are text documents 501a to 501n. Of these text documents 501, those containing the search conditions “artificial intelligence” and “machine learning” form the search result; in the example of FIG. 5, these are the four documents 501a to 501d. Words that match a search condition are underlined. For example, in text document 501a, both “artificial intelligence” and “machine learning” match.

The word “ビッグデータ” (“big data”) also appears, but because the search condition is the misspelled “ビッグデー？”, it is not regarded as a match.
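As a rough illustration of the first search in step S302, the following Python sketch matches query terms against a small in-memory corpus. The corpus contents, the `search` function, the any-term matching, and the English stand-ins for the Japanese example strings (`"big dat"` for the misspelled term) are all assumptions made for illustration; the source leaves the search method to well-known techniques.

```python
# Hypothetical corpus: document ids follow FIG. 5, but the word lists are invented.
documents = {
    "501a": ["artificial intelligence", "machine learning", "big data"],
    "501b": ["artificial intelligence", "bit data"],
    "501n": ["digital"],
}


def search(terms, docs):
    """Return {doc_id: matched terms} for documents containing at least one term."""
    hits = {}
    for doc_id, words in docs.items():
        matched = [t for t in terms if t in words]  # exact string match only
        if matched:
            hits[doc_id] = matched
    return hits


# "big dat" stands in for the misspelled query term.
query = ["artificial intelligence", "machine learning", "big dat"]
first_hits = search(query, documents)
```

In this sketch, document 501a is returned because of its two correctly spelled terms, while its “big data” entry contributes nothing to the match, since only exact string equality is tested.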

In step S303, the word extraction unit 103 analyzes the text documents in the search result and extracts the words contained in any of them.

The extracted words are represented as a list, as in the extracted word storage unit 311 in FIG. 6 (the vertical column from “big data” to “digital”). FIG. 6 is described next.

As noted above, FIG. 6 is a two-dimensional table formed by the list in the extracted word storage unit 311 and, for each document, the search-condition words found in it (601, the horizontal row). Each weighting cell 602, where an entry of the extracted word storage unit 311 intersects a document 601, holds a weight.

For example, in the top-left cell, the text document 501 contains the word “big data”, so “2” is entered in the cell immediately to the right of the extracted word “big data”. This indicates that two words from the search condition, “artificial intelligence” and “machine learning”, appear in that text document; in other words, “big data” co-occurs with two of the words included in the search condition.
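The table of FIG. 6 could be built along the following lines. This is a minimal sketch: the function name `cooccurrence_table`, the dictionary-of-dictionaries layout, and the document contents are hypothetical, chosen only so that the totals agree with the figures quoted in the description (“big data” co-occurring with four search conditions across three documents).

```python
def cooccurrence_table(search_terms, hit_docs):
    """Map each extracted word to {doc_id: number of search terms in that document}."""
    table = {}
    for doc_id, words in hit_docs.items():
        matched = sum(1 for term in search_terms if term in words)
        for word in words:
            if word in search_terms:
                continue  # search terms themselves are not listed as extracted words
            table.setdefault(word, {})[doc_id] = matched
    return table


# Hypothetical contents of the four hit documents.
hit_docs = {
    "501a": ["artificial intelligence", "machine learning", "big data"],
    "501b": ["artificial intelligence", "bit data"],
    "501c": ["artificial intelligence", "big data"],
    "501d": ["machine learning", "big data"],
}
table = cooccurrence_table(["artificial intelligence", "machine learning"], hit_docs)
big_data_total = sum(table["big data"].values())
```

Here the cell for “big data” in document 501a holds 2, matching the top-left cell described above.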

This weighting is only an example. Search systems, for instance, usually assign a “weight” to each word appearing in a text document: a word that appears many times within one text document gets a higher weight, whereas a word that appears across many different text documents gets a lower weight. The values in the weighting cells 602 of FIG. 6 may be calculated with such weights taken into account.
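One common scheme with these two properties is TF-IDF. The source does not name a specific scheme, so the sketch below is only one possible reading; the `tf_idf` function and the sample documents are assumptions.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return {doc_id: {word: term frequency * inverse document frequency}}."""
    n_docs = len(docs)
    # Number of documents in which each word appears at least once.
    doc_freq = Counter(word for words in docs.values() for word in set(words))
    weights = {}
    for doc_id, words in docs.items():
        term_freq = Counter(words)
        weights[doc_id] = {
            word: count * math.log(n_docs / doc_freq[word])
            for word, count in term_freq.items()
        }
    return weights


# Hypothetical documents: "big data" repeats in d1, "digital" appears in both.
weights = tf_idf({"d1": ["big data", "big data", "digital"], "d2": ["digital"]})
```

A word repeated within one document (“big data” in `d1`) receives a positive weight, while a word present in every document (“digital”) receives zero.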

In FIG. 6, only two search-condition words appear in each text document, but there may be more. For example, with four words A1, A2, B1, and B2, if words A1 and A2 co-occur frequently, the weight of extracted words that co-occur with A1 and A2 may be raised, whereas if words B1 and B2 co-occur infrequently, the weight of extracted words that co-occur with B1 and B2 may be lowered. The method of calculating the values in the weighting cells 602 is therefore arbitrary and a matter of design, and the present invention encompasses any such calculation method.

The extracted word storage unit 311 does not contain the two words “artificial intelligence” and “machine learning”; this is because they are treated in the description below as containing no spelling mistake. If that determination has not yet been made at step S303, they may be provisionally included in the extracted word storage unit 311.

Similarly, the extracted word storage unit 311 in FIG. 6 lists every word appearing in the text documents except “artificial intelligence” and “machine learning”. However, any character string that can be judged in advance not to be even a candidate for the spelling check of “ビッグデー？” in this example need not be included in the extracted word storage unit 311.

Returning to the flowchart, in step S304 a spelling check is performed on the words included in the input character string that serves as the search condition. Spelling checks are a well-known technique; the parts that relate to the features of the present invention are described in detail using the flowchart of FIG. 4.

Steps S401 to S406 are processing repeated while focusing, one at a time, on each word extracted from the input character string of the search condition (including words with spelling mistakes).

In step S402, it is determined whether the extracted word storage unit 311 contains the same word as the word in focus, for example, whether it contains “ビッグデー？”. If there is a word that matches as a character string (YES), the process proceeds to step S406 and the focus moves to the next word. If there is no match (NO), the process proceeds to step S403.

In step S403, the similarity determination unit 105 selects words similar to the word in focus from two sources: the words contained in the search-result text documents, which are stored in the extracted word storage unit 311, and the words contained in the word dictionary storage unit 122. It then calculates their similarity. The similarity is calculated, for example, on the basis of how closely the two words match as character strings. Spelling check and similarity calculation are well-known techniques, so details are omitted. FIG. 7 is described as an example of the word dictionary storage unit.
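A string-match similarity of this kind can be sketched with Python's standard `difflib.SequenceMatcher`. The source does not specify the measure, so this choice, the `score_candidates` helper, the word lists, and the English stand-in strings are all assumptions; the scores it produces will not equal the values shown in FIG. 8.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity in [0, 1] based on matching character blocks."""
    return SequenceMatcher(None, a, b).ratio()


def score_candidates(target, extracted_words, dictionary_words):
    """Score words from both sources, recording where each came from."""
    scored = []
    for source, words in (("extracted", extracted_words), ("dictionary", dictionary_words)):
        for word in words:
            scored.append((word, similarity(target, word), source))
    # Highest similarity first.
    return sorted(scored, key=lambda item: item[1], reverse=True)


# "big dat" stands in for the misspelled term; "bigger" is an invented dictionary entry.
ranked = score_candidates("big dat", ["big data", "bit data", "digital"], ["bigger"])
```

Keeping the source alongside each score makes it possible to distinguish candidates drawn from the search results from those drawn from the dictionary later on.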

FIG. 7 shows only words whose headwords are similar to “ビッグデー？”, but other words are registered as well in practice. It shows the state in which words with similar headwords (702) have been listed for the word in focus 701.

The problems of using a dictionary as in Patent Document 1 were described above, but they arise when a general-purpose dictionary is prepared. When, for example, a company develops a question-answering system for a specific product and registers only the technical terms specific to that product, few words are mistakenly judged similar to the word in focus and the maintenance man-hours are small. A dictionary can therefore be used effectively without causing the same problems.

An example of the list of words judged to be similar is described with reference to FIG. 8, which illustrates the correction candidate storage unit 312. The result of step S403 is stored in the correction candidate storage unit 312.

For each spelling mistake candidate 801, the correction candidate storage unit 312 stores the notation 802 of a word judged to be similar and its similarity 803. A source 804 may also be stored so that the user or application can decide, for example, not to use correction candidates obtained from the word dictionary storage unit 122. When the search system does not make this decision by default and instead presents candidates to the user or application, the source 804 may be presented together with each correction candidate as a basis for the decision.

In step S404, it is determined whether a character string similar to the word in focus was found in the extracted word storage unit 311 and, if necessary, the word dictionary storage unit 122. Specifically, a similarity threshold is set and stored in a storage unit such as the program code or a configuration file. For example, the threshold may be set to “0.5”, and candidates whose similarity is lower than that value may be judged not to be correction candidates for the spelling mistake.

If it is determined that there is a word similar to the word in focus (YES), the process proceeds to step S405. If it is determined that there is none (NO), the process proceeds to step S406 and the focus moves to the next word.
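The loop over steps S401 to S406, up to the threshold test of S404, might look like the sketch below. The `spelling_check` function, the use of `difflib.SequenceMatcher` as the similarity measure, the word lists, and the English stand-in strings are assumptions; only the 0.5 threshold comes from the description. The dictionary lookup and the similarity adjustment of step S405 are omitted for brevity.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.5  # example threshold value given in the description


def spelling_check(query_terms, extracted_words, threshold=THRESHOLD):
    """Return {term: [(candidate, similarity), ...]} for terms that need correction."""
    suggestions = {}
    for term in query_terms:                      # S401/S406: focus on one term at a time
        if term in extracted_words:               # S402: exact match, nothing to correct
            continue
        scored = [(w, SequenceMatcher(None, term, w).ratio())
                  for w in extracted_words]       # S403: score every extracted word
        similar = [(w, s) for w, s in scored if s >= threshold]  # S404: apply threshold
        if similar:
            suggestions[term] = sorted(similar, key=lambda p: p[1], reverse=True)
    return suggestions


# Hypothetical word list; the two correctly spelled terms are provisionally included.
extracted = ["artificial intelligence", "machine learning", "big data", "bit data", "digital"]
suggestions = spelling_check(
    ["artificial intelligence", "machine learning", "big dat"], extracted
)
```

Only the misspelled stand-in term produces suggestions; the two correctly spelled terms are skipped at the S402 check.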

In step S405, the similarity is changed for each word judged to be similar to the word in focus.

An example follows. FIG. 6 was already described as a table that assigns weights by calculating the degree of co-occurrence between the extracted words and the words included in the search condition (corresponding to calculation means); here, co-occurrence refers to the number of times words occur together in the same linguistic environment, such as an utterance, sentence, or context. For example, according to the table of FIG. 6, “big data”, whose similarity is 0.8, co-occurs with a total of four search conditions across three text documents, so its similarity may be quadrupled to “3.2”. By contrast, “bit data” (ビットデータ), whose similarity is 0.6, co-occurs with only one, so its similarity may remain “0.6”. This method of calculating the degree of co-occurrence is only an example, and any calculation method may be used; the present invention encompasses such methods as well.
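The arithmetic of this example can be written out as a short sketch. The function name `adjust_similarity` is hypothetical, and the handling of a zero co-occurrence count is an assumption, since the description does not cover that case.

```python
def adjust_similarity(similarity, cooccurrence_total):
    """Scale a similarity by the total number of co-occurring search conditions."""
    if cooccurrence_total <= 0:
        # Assumption: leave the similarity unchanged when nothing co-occurs.
        return similarity
    return similarity * cooccurrence_total


# (similarity, co-occurrence total) pairs taken from the example in the description.
candidates = {"big data": (0.8, 4), "bit data": (0.6, 1)}
adjusted = {
    word: adjust_similarity(sim, total) for word, (sim, total) in candidates.items()
}
```

With these inputs, “big data” ends up well ahead of “bit data”, even though their raw similarities differ only slightly.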

This completes the description of the flowchart of FIG. 4; the description now returns to the flowchart of FIG. 3.

In step S305, the correction candidate presenting unit 107 presents correction candidates for each word in the search condition judged to contain a spelling mistake. For example, of the notations 802 in FIG. 8, it may present only those whose source 804 is marked “extracted”, or only those whose similarity 803 is 0.7 or higher.
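The two presentation filters mentioned here can be sketched over a record type that mirrors fields 801 to 804 of FIG. 8. The class and function names, the English stand-in strings, and the dictionary-sourced entry are hypothetical; the similarity values 0.8 and 0.6 follow the example in the description.

```python
from dataclasses import dataclass


@dataclass
class CorrectionCandidate:
    misspelled: str    # spelling mistake candidate (801)
    notation: str      # notation of the similar word (802)
    similarity: float  # similarity (803)
    source: str        # source (804): "extracted" or "dictionary"


def by_source(candidates, source):
    """Keep only candidates that came from the given source."""
    return [c for c in candidates if c.source == source]


def by_min_similarity(candidates, minimum):
    """Keep only candidates whose similarity is at least `minimum`."""
    return [c for c in candidates if c.similarity >= minimum]


stored = [
    CorrectionCandidate("big dat", "big data", 0.8, "extracted"),
    CorrectionCandidate("big dat", "bit data", 0.6, "extracted"),
    CorrectionCandidate("big dat", "big date", 0.75, "dictionary"),  # invented entry
]
extracted_only = [c.notation for c in by_source(stored, "extracted")]
high_similarity = [c.notation for c in by_min_similarity(stored, 0.7)]
```

Either filter narrows the list before it is shown; which one applies is left to the system or to the caller.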

提示された修正候補から、ユーザあるいは検索システムを呼び出したアプリケーションが、適切なものを１つ選択することで修正候補を特定する（特定手段に相当する）。例えば、“ビッグデータ”が選択されたとする。 From the presented correction candidates, the user, or the application that called the search system, selects the appropriate one, thereby specifying the correction candidate (corresponding to the specifying means). For example, assume that “big data” is selected.

ステップＳ３０６においては、修正結果受付部１０８が、ユーザあるいはアプリケーションの選択結果を受け付け、入力文字列の中でスペリングミスがあった単語を選択された結果に置き換えて修正する（修正手段に相当する）。具体的には、“ビッグデー？”を“ビッグデータ”に置き換えて修正する。 In step S306, the correction result receiving unit 108 receives the selection made by the user or the application and corrects the input character string by replacing the misspelled word with the selected candidate (corresponding to the correction means). Specifically, the misspelled “ビッグデー？” is replaced with “ビッグデータ” (“big data”).
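The replacement in step S306 amounts to swapping one word in the search conditions. A minimal sketch follows, assuming the search conditions are held as a list of words; the function name `apply_correction` is hypothetical, while the three condition strings are the ones used in this example.

```python
def apply_correction(conditions, misspelled, chosen):
    """Return a new list of search conditions with the misspelled word replaced."""
    return [chosen if word == misspelled else word for word in conditions]


# Search conditions as entered, including the misspelled word.
conditions = ["人工知能", "機械学習", "ビッグデー？"]
corrected = apply_correction(conditions, "ビッグデー？", "ビッグデータ")
print(corrected)
```

Returning a new list rather than modifying the input keeps the original conditions available, which is convenient if the first search result also needs to be shown.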

ステップＳ３０７においては、文書検索部１０２が、置き換えられた検索条件で、再度、文書記憶部１２１を検索する（検索手段に相当する）。具体的には、“人工知能”、“機械学習”、“ビッグデータ”の３条件で再度検索する。この結果、テキスト文書５０１ｂは順位が下がり、５０１ｃ、５０１ｄの結果の順位が上がる。 In step S307, the document search unit 102 searches the document storage unit 121 again using the corrected search conditions (corresponding to the search means). Specifically, the search is performed again with the three conditions “artificial intelligence”, “machine learning”, and “big data”. As a result, text document 501b moves down in the ranking, and text documents 501c and 501d move up.

また、不図示ではあるが、３条件のうち“ビッグデータ”のみを含んでいるテキスト文書５０１は、ステップＳ３０２の検索ではヒットしないが、ステップＳ３０７の再検索ではヒットすることになる。 Although not shown, the text document 501 including only “big data” among the three conditions is not hit in the search in step S302, but is hit in the re-search in step S307.

ステップＳ３０８においては、順位が入れ替わった、あるいは１回目でヒットしなかったテキスト文書５０１の一覧が、ステップＳ３０６にて修正された検索条件に基づいて再検索されたステップＳ３０７の結果として、提示される。 In step S308, the list of text documents 501, including those whose ranks have changed and those that were not hit the first time, is presented as the result of the re-search in step S307, which was performed with the search conditions corrected in step S306.
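The effect of the re-search in steps S307 and S308 can be illustrated with a toy search. This is a sketch under stated assumptions: the scoring rule (number of conditions found in the document text), the function name `search`, and the three documents are all hypothetical and do not reproduce the text documents 501 of FIG. 5. It only shows how correcting one condition can reorder results and bring in a document that was previously not hit.

```python
def search(documents, conditions):
    """Return (doc_id, score) pairs for documents matching at least one
    condition, ordered by the number of matched conditions (best first)."""
    hits = []
    for doc_id, text in documents.items():
        score = sum(1 for condition in conditions if condition in text)
        if score > 0:
            hits.append((doc_id, score))
    hits.sort(key=lambda hit: (-hit[1], hit[0]))
    return hits


# Hypothetical corpus.
documents = {
    "doc_b": "人工知能と機械学習の入門",
    "doc_c": "人工知能、機械学習、ビッグデータの活用",
    "doc_e": "ビッグデータ基盤の構築",
}

first = search(documents, ["人工知能", "機械学習", "ビッグデー？"])
second = search(documents, ["人工知能", "機械学習", "ビッグデータ"])
print(first)
print(second)
```

In this toy corpus, doc_c overtakes doc_b after the correction, and doc_e, which contains only the corrected word, appears only in the second result.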

以上、ステップＳ３０５とステップＳ３０６で、スペリングミスのある単語に対する修正候補の中から、ユーザやアプリケーションにより正しいスペリングの単語が選択されるとの説明をした。ただし、これはあくまで例である。本発明の検索システムにおいて、最も類似度が高いものをデフォルトで選択しても構わない。 The description of steps S305 and S306 above assumed that the user or the application selects the correctly spelled word from the correction candidates for a misspelled word. However, this is only an example. The search system of the present invention may instead select the candidate with the highest similarity by default.

例えば、ユーザに対して１回目の検索結果もスペリングチェックの結果も提示せずに再検索を実施し、ステップＳ３０８まで再検索の結果を提示すれば、ユーザからは、入力文字列の中にスペリングミスをした単語があるにもかかわらず、正しいスペリングで検索したように見せることが可能となる。以上で、図３のフローチャートの説明を完了する。 For example, if the re-search is performed without presenting either the first search result or the spelling check result to the user, and the process runs through step S308 to present only the re-search result, then from the user's point of view the search appears to have been performed with the correct spelling even though the input character string contained a misspelled word. This completes the description of the flowchart of FIG. 3.
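The default selection described above, in which the candidate with the highest similarity is chosen without asking the user, could look like the following minimal sketch. The function name `pick_default` and the dictionary of candidates are assumptions; how ties are broken is not stated in the text, so this sketch simply returns the first candidate with the maximum value.

```python
def pick_default(candidates):
    """Return the candidate word with the highest similarity, or None if empty."""
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


# Similarities after co-occurrence weighting, as in the earlier example.
candidates = {"ビッグデータ": 3.2, "ビットデータ": 0.6}
default_choice = pick_default(candidates)
print(default_choice)
```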

なお、上述した各種データの構成及びその内容はこれに限定されるものではなく、用途や目的に応じて、様々な構成や内容で構成されることは言うまでもない。 The structure and contents of the various data described above are not limited to those shown. Needless to say, they may take various structures and contents depending on the application and purpose.

以上、いくつかの実施形態について示したが、本発明は、例えば、システム、装置、方法、コンピュータプログラムもしくは記録媒体等としての実施態様をとることが可能であり、具体的には、複数の機器から構成されるシステムに適用しても良いし、また、一つの機器からなる装置に適用しても良い。 Several embodiments have been described above. The present invention can be embodied as, for example, a system, an apparatus, a method, a computer program, or a recording medium. Specifically, it may be applied to a system composed of a plurality of devices, or to an apparatus consisting of a single device.

また、本発明におけるコンピュータプログラムは、図３〜図４に示すフローチャートの処理方法をコンピュータが実行可能なコンピュータプログラムであり、本発明の記憶媒体は図３〜図４の処理方法をコンピュータが実行可能なコンピュータプログラムが記憶されている。なお、本発明におけるコンピュータプログラムは図３〜図４の各装置の処理方法ごとのコンピュータプログラムであってもよい。 The computer program according to the present invention is a computer program that enables a computer to execute the processing methods of the flowcharts shown in FIGS. 3 and 4, and the storage medium of the present invention stores a computer program that enables a computer to execute the processing methods of FIGS. 3 and 4. The computer program according to the present invention may also be a separate computer program for each processing method of each apparatus in FIGS. 3 and 4.

以上のように、前述した実施形態の機能を実現するコンピュータプログラムを記録した記録媒体を、システムあるいは装置に供給し、そのシステムあるいは装置のコンピュータ（またはＣＰＵやＭＰＵ）が記録媒体に格納されたコンピュータプログラムを読出し実行することによっても、本発明の目的が達成されることは言うまでもない。 As described above, it goes without saying that the object of the present invention is also achieved by supplying a system or apparatus with a recording medium on which a computer program implementing the functions of the above-described embodiments is recorded, and having the computer (or CPU or MPU) of that system or apparatus read and execute the computer program stored in the recording medium.

この場合、記録媒体から読み出されたコンピュータプログラム自体が本発明の新規な機能を実現することになり、そのコンピュータプログラムを記憶した記録媒体は本発明を構成することになる。 In this case, the computer program itself read from the recording medium realizes the novel function of the present invention, and the recording medium storing the computer program constitutes the present invention.

コンピュータプログラムを供給するための記録媒体としては、例えば、フレキシブルディスク、ハードディスク、光ディスク、光磁気ディスク、ＣＤ−ＲＯＭ、ＣＤ−Ｒ、ＤＶＤ−ＲＯＭ、磁気テープ、不揮発性のメモリカード、ＲＯＭ、ＥＥＰＲＯＭ、シリコンディスク、ソリッドステートドライブ等を用いることができる。 As a recording medium for supplying a computer program, for example, a flexible disk, a hard disk, an optical disk, a magneto-optical disk, a CD-ROM, a CD-R, a DVD-ROM, a magnetic tape, a nonvolatile memory card, a ROM, an EEPROM, Silicon disks, solid state drives, etc. can be used.

また、コンピュータが読み出したコンピュータプログラムを実行することにより、前述した実施形態の機能が実現されるだけでなく、そのコンピュータプログラムの指示に基づき、コンピュータ上で稼働しているＯＳ（オペレーティングシステム）等が実際の処理の一部または全部を行い、その処理によって前述した実施形態の機能が実現される場合も含まれることは言うまでもない。 Needless to say, the present invention covers not only the case where the functions of the above-described embodiments are realized by the computer executing the computer program it has read, but also the case where an OS (operating system) or the like running on the computer performs part or all of the actual processing based on the instructions of the computer program, and the functions of the above-described embodiments are realized by that processing.

さらに、記録媒体から読み出されたコンピュータプログラムが、コンピュータに挿入された機能拡張ボードやコンピュータに接続された機能拡張ユニットに備わるメモリに書き込まれた後、そのコンピュータプログラムコードの指示に基づき、その機能拡張ボードや機能拡張ユニットに備わるＣＰＵ等が実際の処理の一部または全部を行い、その処理によって前述した実施形態の機能が実現される場合も含まれることは言うまでもない。 Furthermore, after the computer program read from the recording medium is written in a memory provided in a function expansion board inserted into the computer or a function expansion unit connected to the computer, the function is based on the instructions of the computer program code. It goes without saying that the CPU or the like provided in the expansion board or the function expansion unit performs part or all of the actual processing and the functions of the above-described embodiments are realized by the processing.

また、本発明は、複数の機器から構成されるシステムに適用しても、１つの機器からなる装置に適用してもよい。また、本発明は、システムあるいは装置にコンピュータプログラムを供給することによって達成される場合にも適応できることは言うまでもない。この場合、本発明を達成するためのコンピュータプログラムを格納した記録媒体を該システムあるいは装置に読み出すことによって、そのシステムあるいは装置が、本発明の効果を享受することが可能となる。 Further, the present invention may be applied to a system composed of a plurality of devices or an apparatus composed of a single device. It goes without saying that the present invention can also be applied to a case where the present invention is achieved by supplying a computer program to a system or apparatus. In this case, by reading the recording medium storing the computer program for achieving the present invention into the system or apparatus, the system or apparatus can enjoy the effects of the present invention.

さらに、本発明を達成するためのコンピュータプログラムをネットワーク上のサーバ、データベース等から通信プログラムによりダウンロードして読み出すことによって、そのシステムあるいは装置が、本発明の効果を享受することが可能となる。 Furthermore, by downloading and reading out a computer program for achieving the present invention from a server, database, etc. on a network using a communication program, the system or apparatus can enjoy the effects of the present invention.

なお、上述した各実施形態およびその変形例を組み合わせた構成も全て本発明に含まれるものである。 In addition, all the structures which combined each embodiment mentioned above and its modification are also included in this invention.

DESCRIPTION OF SYMBOLS
１００情報処理装置 100 Information processing apparatus
１０１入力文字列受付部 101 Input character string receiving unit
１０２文書検索部 102 Document search unit
１０３単語抽出部 103 Word extraction unit
１０４辞書検索部 104 Dictionary search unit
１０５類似度判定部 105 Similarity determination unit
１０６修正候補格納部 106 Correction candidate storing unit
１０７修正候補提示部 107 Correction candidate presenting unit
１０８修正結果受付部 108 Correction result receiving unit
１２１文書記憶部 121 Document storage unit
１２２単語辞書記憶部 122 Word dictionary storage unit
３１１抽出単語記憶部 311 Extracted word storage unit
３１２修正候補記憶部 312 Correction candidate storage unit
６００共起情報記憶部 600 Co-occurrence information storage unit

Claims

An information processing server capable of retrieving appropriate document data from a plurality of document data based on a character string received from a user, the information processing server comprising:
storage means for storing the plurality of document data;
accepting means for accepting a plurality of character strings for searching the document data, the character strings including at least a correction character string to be corrected;
search means for searching the plurality of document data stored in the storage means using the character strings accepted by the accepting means; and
correction means for correcting the correction character string based on a character string included in the document data found by the search means,
wherein the search means searches the plurality of document data stored in the storage means using the character strings including the correction character string corrected by the correction means.

The information processing server according to claim 1, further comprising management means for managing correction data for correcting the correction character string,
wherein the correction means corrects the correction character string based on the correction data.

The information processing server according to claim 1, further comprising specifying means for specifying correction candidates for the correction character string from the character strings included in the document data found by the search means,
wherein the correction means corrects the correction character string with any one of the correction candidates specified by the specifying means.

The information processing server according to claim 3, further comprising calculation means for obtaining, for the correction candidate specified by the specifying means, a co-occurrence degree indicating the number of times the correction character string and the document data co-occur,
wherein the correction means corrects the correction character string based on the correction candidate determined from the co-occurrence degree calculated by the calculation means.

A control method for an information processing server that comprises storage means for storing a plurality of document data and is capable of retrieving appropriate document data from the plurality of document data based on a character string received from a user, the control method comprising:
an accepting step of accepting a plurality of character strings for searching the document data, the character strings including at least a correction character string to be corrected;
a search step of searching the plurality of document data stored in the storage means using the character strings accepted in the accepting step; and
a correction step of correcting the correction character string based on a character string included in the document data found in the search step,
wherein the search step searches the plurality of document data stored in the storage means using the character strings including the correction character string corrected in the correction step.

A program executable by an information processing server that comprises storage means for storing a plurality of document data and is capable of retrieving appropriate document data from the plurality of document data based on a character string received from a user, the program causing
the information processing server to function as:
accepting means for accepting a plurality of character strings for searching the document data, the character strings including at least a correction character string to be corrected;
search means for searching the plurality of document data stored in the storage means using the character strings accepted by the accepting means; and
correction means for correcting the correction character string based on a character string included in the document data found by the search means,
wherein the program causes the search means to search the plurality of document data stored in the storage means using the character strings including the correction character string corrected by the correction means.

An information processing system including an information processing server capable of retrieving appropriate document data from a plurality of document data based on a character string received from a user, the information processing system comprising:
storage means for storing the plurality of document data;
accepting means for accepting a plurality of character strings for searching the document data, the character strings including at least a correction character string to be corrected;
search means for searching the plurality of document data stored in the storage means using the character strings accepted by the accepting means; and
correction means for correcting the correction character string based on a character string included in the document data found by the search means,
wherein the search means searches the plurality of document data stored in the storage means using the character strings including the correction character string corrected by the correction means.

A control method for an information processing system that comprises storage means for storing a plurality of document data and includes an information processing server capable of retrieving appropriate document data from the plurality of document data based on a character string received from a user, the control method comprising:
an accepting step of accepting a plurality of character strings for searching the document data, the character strings including at least a correction character string to be corrected;
a search step of searching the plurality of document data stored in the storage means using the character strings accepted in the accepting step; and
a correction step of correcting the correction character string based on a character string included in the document data found in the search step,
wherein the search step searches the plurality of document data stored in the storage means using the character strings including the correction character string corrected in the correction step.

A program executable in an information processing system that comprises storage means for storing a plurality of document data and includes an information processing server capable of retrieving appropriate document data from the plurality of document data based on a character string received from a user, the program causing
the information processing system to function as:
accepting means for accepting a plurality of character strings for searching the document data, the character strings including at least a correction character string to be corrected;
search means for searching the plurality of document data stored in the storage means using the character strings accepted by the accepting means; and
correction means for correcting the correction character string based on a character string included in the document data found by the search means,
wherein the program causes the search means to search the plurality of document data stored in the storage means using the character strings including the correction character string corrected by the correction means.