JP2020154790A

JP2020154790A - Information processing device, information processing method, and program

Info

Publication number: JP2020154790A
Application number: JP2019053170A
Authority: JP
Inventors: 俊平大倉; Shumpei Okura
Original assignee: Yahoo Japan Corp
Current assignee: Yahoo Japan Corp
Priority date: 2019-03-20
Filing date: 2019-03-20
Publication date: 2020-09-24
Anticipated expiration: 2039-03-20
Also published as: JP7139271B2

Abstract

To extract named entities from a document with high accuracy. SOLUTION: An information processing device includes a division unit that divides a text into character strings each containing at least one character, a calculation unit that calculates a score for each character string produced by the division unit on the basis of a plurality of queries input by users, and an extraction unit that extracts a named entity from the text on the basis of the scores calculated by the calculation unit. SELECTED DRAWING: Figure 2

Description

The present invention relates to an information processing device, an information processing method, and a program.

Research is being conducted on using named entities (for example, proper nouns) contained in a text as features, so that the content of the text can be converted into a representation usable for machine learning. In this connection, a technique is known that divides an input text into phrases by morphological analysis; generates, using an impression dictionary in which impression elements and their scores are associated with phrases in advance, a phrase list associating an impression element and a score with each phrase of the input text; calculates the likelihood of each impression element from all the phrases of the input text; calculates an objective impression in which that likelihood is associated as the score of each impression element; and calculates impression difference information by comparing the score of an impression element based on a subjective impression with the score of that impression element based on the objective impression (see, for example, Patent Document 1).

Japanese Unexamined Patent Application Publication No. 2017-84015 (JP-A-2017-84015)

New words that have not been used before, such as neologisms and coined terms, sometimes become popular. For example, when new content with a distinctive title is released and becomes a topic of conversation, that title spreads as a new word. With conventional techniques, however, it is often difficult to update a dictionary frequently enough to keep up with such trends, and moreover, it has not been sufficiently examined what kind of text should be searched for named entities to register in the dictionary. For these reasons, conventional techniques have in some cases been unable to extract named entities from a document accurately.

The present invention has been made in view of the above problems, and an object of the present invention is to provide an information processing device, an information processing method, and a program capable of accurately extracting named entities from a document.

One aspect of the present invention is an information processing device including: a division unit that divides a text into character strings each containing at least one character; a calculation unit that calculates a score for each of the character strings produced by the division unit on the basis of a plurality of queries input by users; and an extraction unit that extracts a named entity from the text on the basis of the scores calculated by the calculation unit.

According to one aspect of the present invention, named entities can be accurately extracted from a document.

FIG. 1 is a diagram showing an example of an information processing system 1 including an information processing device 100 according to the first embodiment.
FIG. 2 is a diagram showing an example of the configuration of the information processing device 100 according to the first embodiment.
FIG. 3 is a diagram showing an example of the search log 132.
FIG. 4 is a flowchart showing the flow of a series of processes of the control unit 110 according to the first embodiment.
FIGS. 5 to 8 are diagrams schematically showing a method of extracting named entities from a three-character text.
FIGS. 9 to 16 are diagrams schematically showing a method of extracting named entities from a four-character text.
The remaining three figures are diagrams for explaining a method of determining the patterns for which scores are calculated.

Hereinafter, an information processing device, an information processing method, and a program to which the present invention is applied will be described with reference to the drawings.

[Overview]
The information processing device is implemented by one or more processors. It divides a text into character strings each containing at least one character, and calculates a score for each character string on the basis of a search log, which is a history of queries input by users. It then extracts named entities from the text on the basis of the calculated scores. This makes it possible to extract named entities from a document accurately and, as a result, to obtain, for example, a distributed representation that accurately reflects the content of the text.

<First Embodiment>
[Overall Configuration]
FIG. 1 is a diagram showing an example of an information processing system 1 including an information processing device 100 according to the first embodiment. The information processing system 1 in the first embodiment includes, for example, one or more terminal devices 10, a service providing device 20, and the information processing device 100. Some or all of these devices are connected to each other via a network NW. Some of these devices may be contained in another device as virtual devices. For example, some or all of the functions of the service providing device 20 may be a virtual machine realized by the functions of the information processing device 100, or conversely, some or all of the functions of the information processing device 100 may be a virtual machine realized by the functions of the service providing device 20.

Each device shown in FIG. 1 transmits and receives various kinds of information via the network NW. The network NW includes, for example, radio base stations, Wi-Fi access points, communication lines, providers, and the Internet. Not every combination of the devices shown in FIG. 1 needs to be able to communicate with each other, and the network NW may partly include a local network.

The terminal device 10 is a terminal device that includes an input device, a display device, a communication device, a storage device, and a computing device, for example a mobile phone such as a smartphone, a tablet terminal, or any of various personal computers. The communication device includes a network card such as a NIC (Network Interface Card), a wireless communication module, and the like. On the terminal device 10, a UA (User Agent) such as a web browser or an application program is started and transmits requests corresponding to the user's input to the service providing device 20. The terminal device 10 on which the UA has been started also causes the display device to display various images on the basis of information acquired from the service providing device 20.

The service providing device 20 is, for example, a web server that provides web pages to the terminal device 10 in response to requests from a web browser started as a UA. The web pages may be, for example, pages constituting various websites such as shopping sites, auction sites, and flea market sites. The service providing device 20 may also provide the terminal device 10 with web pages that offer various services such as a search site, an SNS (Social Networking Service), and a mail service. Alternatively, the service providing device 20 may be an application server that provides services similar to those of various websites, such as sales sites, by providing content to the terminal device 10 in response to requests from an application started as a UA.

The information processing device 100 acquires a search log from the service providing device 20 and uses it to extract named entities from text. Named entities in the present embodiment include not only a single word such as a noun, but also a single phrase in which nouns are connected by another part of speech (for example, a particle), and a single sentence containing various parts of speech such as nouns, verbs, particles, and auxiliary verbs. In other words, any wording that people use as a specific expression can be a named entity, however long it is.

[Configuration of the Information Processing Device]
FIG. 2 is a diagram showing an example of the configuration of the information processing device 100 according to the first embodiment. As illustrated, the information processing device 100 includes, for example, a communication unit 102, a control unit 110, and a storage unit 130.

The communication unit 102 includes, for example, a communication interface such as a NIC (Network Interface Card) and a DMA (Direct Memory Access) controller. The communication unit 102 communicates with the service providing device 20 and other web servers via the network NW.

The control unit 110 includes, for example, an acquisition unit 112, a text segmentation unit 114, a phrase score calculation unit 116, and a named entity extraction unit 118. The components of the control unit 110 are realized, for example, by a processor such as a CPU (Central Processing Unit) or a GPU (Graphics Processing Unit) executing a program stored in the storage unit 130. Some or all of the components of the control unit 110 may instead be realized by hardware (circuitry) such as an LSI (Large Scale Integration), an ASIC (Application Specific Integrated Circuit), or an FPGA (Field-Programmable Gate Array), or by software and hardware working together.

The storage unit 130 is realized by, for example, an HDD (Hard Disc Drive), a flash memory, an EEPROM (Electrically Erasable Programmable Read Only Memory), a ROM (Read Only Memory), a RAM (Random Access Memory), or the like. The storage unit 130 stores the search log 132 in addition to various programs such as firmware and application programs.

FIG. 3 is a diagram showing an example of the search log 132. As in the illustrated example, the search log 132 is history information in which, for each aggregation period, each query that users entered into a search engine is associated with, among other things, the number of times that query was entered. The number of inputs may be, for example, the number of unique browser cookies. In that case, even if the same query is entered many times through the same browser, it is counted as one input.
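The unique-cookie counting described above can be sketched as follows. This is a minimal illustration, not the layout of the search log 132 itself: the event tuple format `(period, cookie_id, query)`, the function name `count_unique_inputs`, and the sample values are all assumptions made for the example.

```python
from collections import defaultdict


def count_unique_inputs(events):
    """Count, per (period, query), the number of distinct browser cookies.

    `events` is a hypothetical list of (period, cookie_id, query) tuples.
    """
    cookies_seen = defaultdict(set)
    for period, cookie_id, query in events:
        cookies_seen[(period, query)].add(cookie_id)
    return {key: len(cookies) for key, cookies in cookies_seen.items()}


# Hypothetical raw query events, chosen only for illustration.
events = [
    ("2019-03", "cookie-1", "ABC"),
    ("2019-03", "cookie-1", "ABC"),  # same browser repeats the same query
    ("2019-03", "cookie-2", "ABC"),
    ("2019-03", "cookie-2", "AB"),
]
search_log = count_unique_inputs(events)
print(search_log)
```

Because a set is kept per period and query, the repeated "ABC" query from the same cookie contributes only once to that query's input count.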

[Processing Flow]
The flow of a series of processes of the control unit 110 in the first embodiment is described below with reference to a flowchart. FIG. 4 is a flowchart showing the flow of a series of processes of the control unit 110 according to the first embodiment. The processing of this flowchart may be repeated, for example, at a predetermined cycle.

First, the acquisition unit 112 acquires a text, which is one type of content, from the service providing device 20 via the communication unit 102 (S100). The text is, for example, one that users can find using a search engine: specifically, an article posted on an official site operated by an organization, group, celebrity, or the like itself, or an article posted on a general (unofficial) site operated by a third party unrelated to that organization, group, or celebrity.

Next, the acquisition unit 112 acquires the search log 132 from the service providing device 20 via the communication unit 102 (S102). For example, the acquisition unit 112 acquires a search log 132 containing the history of queries that users entered during the last few months.

Next, the text segmentation unit 114 divides the text acquired by the acquisition unit 112 into one or more phrases (S104). A phrase in the present embodiment may be a single character, such as the particles "wa" (は), "ga" (が), and "o" (を), or a character string of multiple characters. That is, in this description the term "phrase" is used not in its dictionary sense of a group of words but in a somewhat broader sense covering a single character, a single word, a single phrase, or a single sentence.

For example, where N is the number of characters in the text, the text segmentation unit 114 divides the text according to 2^(N-1) patterns of phrase combinations. The text is divided in one pattern if N = 1, two patterns if N = 2, four patterns if N = 3, and eight patterns if N = 4.
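The 2^(N-1) count follows from the fact that each of the N-1 gaps between adjacent characters is either a phrase boundary or not. A minimal sketch of such an enumeration is shown below; the function name `enumerate_segmentations` is hypothetical, and this is only one possible way to generate the patterns, not necessarily how the text segmentation unit 114 is implemented.

```python
from itertools import product


def enumerate_segmentations(text):
    """Return every way to split `text` into contiguous, non-empty phrases."""
    if not text:
        return []
    patterns = []
    # Each of the len(text) - 1 gaps between characters is cut or not cut.
    for cuts in product((False, True), repeat=len(text) - 1):
        phrases, start = [], 0
        for position, cut in enumerate(cuts, start=1):
            if cut:
                phrases.append(text[start:position])
                start = position
        phrases.append(text[start:])
        patterns.append(phrases)
    return patterns


for pattern in enumerate_segmentations("ABC"):
    print(pattern)
```

Since the number of patterns doubles with every added character, exhaustive enumeration of this kind is practical only for short texts.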

Next, on the basis of the search log 132, the phrase score calculation unit 116 calculates, for each phrase produced by the text segmentation unit 114, an index value quantifying the likelihood that the phrase is a named entity (hereinafter referred to as a phrase score) (S106). For example, the phrase score calculation unit 116 calculates the phrase score on the basis of formula (1).

S = β × α^L ... (1)

In the formula, S is the phrase score; β is the number of times a query matching the phrase being scored was entered (the number of searches made with that query); α is a real number greater than 1 (for example, 10); and L is the length of the phrase being scored, that is, the number of characters it contains.

For example, when the phrase being scored matches one of the queries in the search log 132, that is, when it has been entered as a query one or more times, the phrase score calculation unit 116 makes the phrase score S larger the larger the input count β and the phrase length L are, and smaller the smaller β and L are. When the phrase being scored matches none of the queries in the search log 132, that is, when it has never been entered as a query, the phrase score calculation unit 116 sets the phrase score S to 0.
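The scoring rule can be sketched as below, assuming formula (1) has the form S = β × α^L, which matches the later description of the score as the product of the input count β and a base α raised to the phrase length L. The function name `phrase_score`, the `query_counts` dictionary (query string to input count), and the sample counts are assumptions for illustration.

```python
def phrase_score(phrase, query_counts, alpha=10.0):
    """Score a phrase as beta * alpha ** L (assumed form of formula (1)).

    beta is the input count of the query that exactly matches the phrase,
    or 0 when no query matches, which makes the score 0 as well.
    """
    beta = query_counts.get(phrase, 0)
    return beta * alpha ** len(phrase)


# Hypothetical input counts, chosen only for illustration.
query_counts = {"ABC": 2, "C": 30}
print(phrase_score("ABC", query_counts))
print(phrase_score("C", query_counts))
print(phrase_score("AB", query_counts))
```

Treating an unmatched phrase as β = 0 gives the zero score described above without a separate branch.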

Next, the named entity extraction unit 118 extracts named entities from the text on the basis of the phrase score calculated for each phrase by the phrase score calculation unit 116 (S108). This completes the processing of this flowchart.

FIGS. 5 to 8 are diagrams schematically showing a method of extracting named entities from a three-character text. They show the three-character text "ABC" divided into one or more phrases. Since N = 3, the text is divided in 4 (= 2^2) patterns.

For example, FIG. 5 shows pattern 1, in which the single text "ABC" is divided into the one-character phrase "A", the one-character phrase "B", and the one-character phrase "C". For pattern 1, the phrase score calculation unit 116 calculates the phrase score S_A for the phrase "A", the phrase score S_B for the phrase "B", and the phrase score S_C for the phrase "C".

FIG. 6 shows pattern 2, in which the single text "ABC" is divided into the two-character phrase "AB" and the one-character phrase "C". The phrase score calculation unit 116 calculates the phrase score S_AB for the phrase "AB" and the phrase score S_C for the phrase "C".

FIG. 7 shows pattern 3, in which the single text "ABC" is divided into the one-character phrase "A" and the two-character phrase "BC". The phrase score calculation unit 116 calculates the phrase score S_A for the phrase "A" and the phrase score S_BC for the phrase "BC".

FIG. 8 shows pattern 4, in which the single text "ABC" is kept as one phrase. The phrase score calculation unit 116 calculates the phrase score S_ABC for the phrase "ABC".

Having calculated the phrase score S of each phrase in each pattern as described above, the phrase score calculation unit 116 calculates the sum of the phrase scores S for each pattern. The sum is (S_A + S_B + S_C) for pattern 1 illustrated in FIG. 5, (S_AB + S_C) for pattern 2 illustrated in FIG. 6, (S_A + S_BC) for pattern 3 illustrated in FIG. 7, and (S_ABC) for pattern 4 illustrated in FIG. 8.

From these four patterns, the named entity extraction unit 118 selects the pattern with the largest sum of phrase scores S and extracts the phrases that the pattern represents as named entities. For example, when the sum S_ABC for pattern 4 is the largest, the named entity extraction unit 118 extracts the single phrase "ABC" as a named entity. When the sum (S_AB + S_C) for pattern 2 is the largest, it extracts the phrase "AB" and the phrase "C" each as a named entity.
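The selection step can be sketched end to end as follows. This is an illustrative sketch under stated assumptions rather than the device's actual implementation: the score is assumed to be β × α^L, the helper names `segmentations` and `best_pattern` are hypothetical, the text is assumed to be non-empty, and the input counts are invented for the example.

```python
from itertools import product


def segmentations(text):
    """Yield every way to split a non-empty text into contiguous phrases."""
    for cuts in product((False, True), repeat=len(text) - 1):
        phrases, start = [], 0
        for position, cut in enumerate(cuts, start=1):
            if cut:
                phrases.append(text[start:position])
                start = position
        phrases.append(text[start:])
        yield phrases


def best_pattern(text, query_counts, alpha=10.0):
    """Return the pattern whose phrase scores have the largest sum."""
    def score(phrase):
        # Assumed scoring rule: input count times alpha to the phrase length.
        return query_counts.get(phrase, 0) * alpha ** len(phrase)

    return max(segmentations(text), key=lambda pattern: sum(score(p) for p in pattern))


# Hypothetical input counts, chosen only for illustration.
print(best_pattern("ABC", {"ABC": 10}))
print(best_pattern("ABC", {"ABC": 3, "AB": 50, "C": 200}))
```

With the first log only "ABC" has a non-zero score, so pattern 4 wins. With the second log the sum for "AB" plus "C" (5000 + 2000) exceeds the score of "ABC" (3000), so pattern 2 wins. When several patterns tie, `max` returns the first one encountered; the source does not say how ties are resolved.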

FIGS. 9 to 16 are diagrams schematically showing a method of extracting named entities from a four-character text. They show the four-character text "ABCD" divided into one or more phrases. Since N = 4, the text is divided in 8 (= 2^3) patterns.

For example, FIG. 9 shows pattern 1-1, in which the single text "ABCD" is divided into one-character phrases. FIG. 10 shows pattern 1-2, in which "ABCD" is divided into the three phrases "A", "B", and "CD". FIG. 11 shows pattern 2-1, in which "ABCD" is divided into the three phrases "AB", "C", and "D". FIG. 12 shows pattern 2-2, in which "ABCD" is divided into the two phrases "AB" and "CD". FIG. 13 shows pattern 3-1, in which "ABCD" is divided into the three phrases "A", "BC", and "D". FIG. 14 shows pattern 3-2, in which "ABCD" is divided into the two phrases "A" and "BCD". FIG. 15 shows pattern 4-1, in which "ABCD" is divided into the two phrases "ABC" and "D". FIG. 16 shows pattern 4-2, in which "ABCD" is kept as one phrase. As above, the phrase score calculation unit 116 calculates the phrase score S of each phrase in each pattern and then the sum of the phrase scores S for each pattern. The named entity extraction unit 118 then selects, from these eight patterns, the pattern with the largest sum of phrase scores S and extracts the phrases that the pattern represents as named entities.

According to the first embodiment described above, a text is divided into one or more phrases in each of a number of combination patterns determined by the number of characters N in the text; a phrase score is calculated for each phrase in each pattern; the sum of the phrase scores is calculated for each pattern; and the phrases of the pattern with the largest sum are extracted as named entities. This makes it possible to extract named entities from a document accurately.

Conventionally, named entities have been extracted from text using a dictionary in which named entities are registered in advance. However, named entities such as new words appear every day, so the dictionary must be updated frequently, and updating a dictionary daily is difficult in practice. In addition, niche terms that have only begun to be used as new words in some communities are unlikely to be registered in the dictionary as named entities.

Consequently, suppose for example that the title of a new piece of content is a sentence-like string such as "〇〇〇 Official Guidebook ・ How to Walk from ◇◇ to △△". If a dictionary is applied to a text containing this title to extract named entities, multiple words such as "〇〇〇", "official", "guidebook", "◇◇", "△△", and "how to walk" are extracted as named entities, and the full title "〇〇〇 Official Guidebook ・ How to Walk from ◇◇ to △△", which is what should be extracted, is not.

Consider, on the other hand, the nature of queries. When the title of a new piece of content is as long as a sentence, users can be expected to copy the character string of the title from an official site or a third-party website and paste it into the input field of a search site. In that case, queries that match the content title, which is a named entity, word for word are collected in the search log 132. In particular, a very recent search log 132, such as one covering the last few months, is likely to contain currently popular new words as queries. In the present embodiment, therefore, comparing the phrases into which a text is divided with the queries of the search log 132 makes it possible to extract named entities from the text accurately even when a named entity is long or brand new.

It is also conceivable to extract, as a named entity, a part of the text enclosed by punctuation marks such as brackets, apostrophes, or primes. With this approach, however, a person's spoken lines or quotations may be extracted as named entities, and it is not possible to distinguish a named entity used as a single noun from a mere line of dialogue or quotation. Moreover, although names of characters in content and names of people are named entities, they are usually not enclosed in brackets or the like and therefore cannot be extracted from the text.

In contrast, the present embodiment can extract named entities without relying on punctuation marks. Movie and book titles often have subtitles, and a subtitle may be enclosed in punctuation marks. Even if a named entity is enclosed in punctuation marks, as long as a user has entered the named entity enclosed in those punctuation marks as a query, the method of the present embodiment can also extract the named entity including the punctuation marks.

Furthermore, if the phrases of the text were simply compared with the queries in the search log 132, relatively short phrases such as "は" (wa), "を" (o), "です" (desu), and "ます" (masu), which appear frequently in text, would coincidentally match queries, and their phrase scores S would tend to be large. In contrast, in the present embodiment, the phrase score S is the product of an arbitrary base α raised to the power of the phrase length L and the input count β. Consequently, even for a phrase with a small input count β, the phrase score S becomes large if the phrase length L is large, and even for a phrase with a large input count β, the phrase score S becomes small if the phrase length L is small. As a result, phrases and sentences in which multiple nouns are connected by particles or the like can be accurately extracted as a single named entity, while the extraction of particles and the like as named entities is suppressed.
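
The scoring rule described above can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the function name `phrase_score`, the dictionary `query_counts`, the base α = 2, and all counts are assumptions chosen for the example. The description only states that α is an arbitrary base, L is the phrase length, and β is the number of times a matching query was input.

```python
from typing import Mapping


def phrase_score(phrase: str, query_counts: Mapping[str, int], alpha: float = 2.0) -> float:
    """Return S = alpha ** L * beta for one phrase.

    L is the phrase length in characters and beta is the number of times a
    query identical to the phrase appears in the (hypothetical) search log.
    alpha = 2.0 is an assumed value; the description leaves the base arbitrary.
    """
    beta = query_counts.get(phrase, 0)  # phrases never entered as a query get beta = 0
    return (alpha ** len(phrase)) * beta


# Hypothetical search-log counts: a one-character particle entered very often,
# and a ten-character title entered only a few times.
query_counts = {"は": 1000, "ABCDEFGHIJ": 3}

particle_score = phrase_score("は", query_counts)        # 2**1  * 1000
title_score = phrase_score("ABCDEFGHIJ", query_counts)   # 2**10 * 3
```

With these assumed numbers the long, rarely entered title outscores the short, frequently entered particle, which is the behavior the paragraph above describes. How strongly length is favored depends on the choice of α.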

<Second Embodiment>
The second embodiment is described below. In the first embodiment described above, the sum of phrase scores is calculated for every pattern, that is, every combination of phrases, and the phrases of the pattern with the largest sum are extracted as named entities. The second embodiment differs from the first in that, instead of calculating the sum of phrase scores for all patterns, it calculates the sum only after rationally determining which patterns should be verified. The following description focuses on the differences from the first embodiment and omits points common to both. In the description of the second embodiment, parts identical to those of the first embodiment are denoted by the same reference numerals.

FIGS. 17 to 19 are diagrams for explaining a method of determining the patterns for which scores are calculated. For example, when the four-character text "ABCD" is given, the text segmentation unit 114 according to the second embodiment, as illustrated in FIG. 17, divides the text between the first character "A" at the beginning of the text and the second character "B" that follows it, generating the phrase "A" and the phrase "B", and also generates the two-character phrase "AB" without dividing between these characters. Using the search log 132, the phrase score calculation unit 116 according to the second embodiment calculates the sum (S_A + S_B) of the phrase score S_A of the phrase "A" and the phrase score S_B of the phrase "B", and also calculates the phrase score S_AB of the phrase "AB". The text segmentation unit 114 compares these and removes the patterns derived from the pattern with the smaller score from the pattern candidates to be verified next. The first character "A" is an example of a "first character string", the second character "B" is an example of a "second character string", and the phrase "AB" is an example of a "third character string".

In the example of FIG. 17, S_AB is larger than (S_A + S_B). In this case, the patterns of phrase combinations that include the third character "C" following the second character are the four patterns described above: pattern 1, pattern 2, pattern 3, and pattern 4. Of these four patterns, at least patterns 1 and 2 give the same result as before the third character "C" was included. For example, the sum of the phrase scores of pattern 1 is (S_A + S_B + S_C) and the sum of the phrase scores of pattern 2 is (S_AB + S_C). Both sums add the same term S_C, so if the previous result was that S_AB is larger than (S_A + S_B), the relative order of the sums for patterns 1 and 2 does not change. Therefore, the phrase score calculation unit 116 does not calculate a score for pattern 1, in which the text is divided between the first character "A" at the beginning of the text and the second character "B" that follows it.

Next, as illustrated in FIG. 18, the text segmentation unit 114 compares the scores of the three remaining patterns. In the example of FIG. 18, the phrase score S_ABC of pattern 4 is the largest. Accordingly, the phrase score calculation unit 116 calculates scores only for a total of four patterns: pattern 3-1, which divides the text between the first character "A" and the following three-character combination "BCD"; pattern 2-2, which divides the text between the first two-character combination "AB" and the following two-character combination "CD"; pattern 4-1, which divides the text between the first three-character combination "ABC" and the following character "D"; and pattern 4-2, which leaves the text undivided as a single phrase. In this way, when the k-th character at the end of the character string is examined, this method compares k patterns.

Next, the phrase score calculation unit 116 calculates the sum of the phrase scores for each of the above four patterns (3-1, 2-2, 4-1, and 4-2). For example, when the sum of the phrase scores of pattern 4-1 (S_ABC + S_D) is the largest, the named entity extraction unit 118 extracts the phrase "ABC" and the phrase "D" each as a named entity. In this way, characters are combined in order from the first character of the text, and the candidate patterns for each combination are kept or discarded according to the size of their scores, so that the optimal combination can be found.
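
One way to read the procedure above is as a standard dynamic-programming search over prefixes of the text: for each end position k, the best-scoring division of the first k characters is kept and the alternatives are discarded. The sketch below follows that reading; it is not taken from the embodiment. The function name `best_segmentation` and the toy score table are assumptions made for illustration.

```python
from typing import Callable, List, Tuple


def best_segmentation(text: str, score: Callable[[str], float]) -> Tuple[float, List[str]]:
    """Return (largest sum of phrase scores, phrases) over divisions of `text`.

    best[k] holds the best result for the prefix text[:k]. When position k is
    examined, k candidates are compared: for each j < k, the best division of
    text[:j] followed by the single phrase text[j:k].
    """
    best: List[Tuple[float, List[str]]] = [(0.0, [])]  # empty prefix
    for k in range(1, len(text) + 1):
        candidates = []
        for j in range(k):
            prefix_score, prefix_phrases = best[j]
            phrase = text[j:k]
            candidates.append((prefix_score + score(phrase), prefix_phrases + [phrase]))
        # Keep only the highest-scoring candidate; the rest are not extended further.
        best.append(max(candidates, key=lambda c: c[0]))
    return best[len(text)]


# Toy phrase scores (hypothetical values, not derived from any search log).
toy_scores = {"A": 1.0, "B": 1.0, "C": 1.0, "D": 1.0, "AB": 5.0, "ABC": 12.0}

total, phrases = best_segmentation("ABCD", lambda p: toy_scores.get(p, 0.0))
```

With these toy scores the search keeps "AB" over "A" + "B" at the second character, keeps "ABC" at the third, and ends with the phrases "ABC" and "D", mirroring the example in the text.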

When the phrase score of a pattern is 0, the derived patterns obtained by adding one character to that pattern need not be considered thereafter. A phrase score of 0 means that the input count β of queries matching the phrase is 0. In other words, no user has ever entered the phrase represented by that pattern as a query, which indicates that the phrase is extremely unlikely to be a named entity.

According to the second embodiment described above, characters are combined in order from the first character of the text, the phrase scores of the patterns are compared at each step, and patterns with smaller phrase scores are excluded from subsequent processing.

For example, some content such as movies, dramas, and anime has as its title a proper noun in which a particle such as "の" (no) or "と" (to) is placed between a word A and a word B. Specifically, these are titles of the form "〇〇と□□" ("〇〇 and □□") or "〇〇の△△" ("△△ of 〇〇"). If phrase scores were calculated for all patterns of such a title, scores would also be calculated for phrases in which a particle such as "の" or "と" appears at the beginning. In the real world, however, a particle very rarely appears at the beginning of a phrase, and such a phrase can be regarded as not existing at all. Therefore, proper nouns can be extracted efficiently by combining characters in order from the first character of the text, comparing the phrase scores of the patterns at each step, and excluding patterns with smaller phrase scores from subsequent processing.

If every way of dividing the text is tried, as in the first embodiment described above, the number of score calculations grows as 2^(N-1), where N is the number of characters in the text. In contrast, the second embodiment reduces the number of patterns by taking the continuity of characters into account, so the number of score calculations can be kept to N^2.
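
The difference in growth can be checked with a short sketch. It enumerates every division of a text, as the exhaustive approach would, and compares that count with the number of candidate phrases a prefix-by-prefix search examines. The helper names are hypothetical, and the count N(N+1)/2 assumes that k candidates are compared at the k-th character; that is one reading of the method and stays within the N^2 bound stated above.

```python
from itertools import product
from typing import Iterator, List


def all_segmentations(text: str) -> Iterator[List[str]]:
    """Yield every way of dividing `text` into contiguous, non-empty phrases.

    Each of the len(text) - 1 gaps between characters is either cut or not,
    which gives 2 ** (len(text) - 1) patterns for a non-empty text.
    """
    n = len(text)
    if n == 0:
        return
    for cuts in product([False, True], repeat=n - 1):
        phrases, start = [], 0
        for i, cut in enumerate(cuts, start=1):
            if cut:
                phrases.append(text[start:i])
                start = i
        phrases.append(text[start:])
        yield phrases


def prefix_search_candidates(n: int) -> int:
    """Candidates examined when k patterns are compared at the k-th character."""
    return n * (n + 1) // 2


exhaustive_patterns = list(all_segmentations("ABCD"))
```

For a four-character text the two counts are close (8 patterns versus 10 candidates), but for 20 characters the exhaustive approach has 2^19 = 524,288 patterns while the prefix-by-prefix count is 210.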

<Hardware Configuration>
The information processing device 100 of the embodiments described above is realized by, for example, a hardware configuration such as that shown in FIG. 19. FIG. 19 is a diagram showing an example of the hardware configuration of the information processing device 100 of the embodiments.

In the information processing device 100, a NIC 100-1, a CPU 100-2, a RAM 100-3, a ROM 100-4, a secondary storage device 100-5 such as a flash memory or an HDD, and a drive device 100-6 are connected to one another by an internal bus or a dedicated communication line. A portable storage medium such as an optical disc is mounted in the drive device 100-6. The control unit 110 is realized when a program stored in the secondary storage device 100-5, or in a portable storage medium mounted in the drive device 100-6, is loaded into the RAM 100-3 by a DMA controller (not shown) or the like and executed by the CPU 100-2. The program referred to by the control unit 110 may be downloaded from another device via the network NW.

Although modes for carrying out the present invention have been described above using embodiments, the present invention is in no way limited to these embodiments, and various modifications and substitutions can be made without departing from the gist of the present invention.

1 ... information processing system, 10 ... terminal device, 20 ... service providing device, 100 ... information processing device, 102 ... communication unit, 110 ... control unit, 112 ... acquisition unit, 114 ... text segmentation unit, 116 ... phrase score calculation unit, 118 ... named entity extraction unit, 130 ... storage unit

Claims

An information processing device comprising:
a division unit that divides a sentence into character strings each containing at least one character;
a calculation unit that calculates a score for each of the character strings divided by the division unit, based on a plurality of queries input by a user; and
an extraction unit that extracts a named entity from the sentence based on the score calculated by the calculation unit.

The information processing device according to claim 1, wherein
the calculation unit calculates the score based on the number of times a query matching the character string was input and on the length of the character string.

The information processing device according to claim 1 or 2, wherein
the extraction unit extracts, as the named entity, one or more of the character strings obtained when the division unit divides the sentence at a position where the sum of the scores of the character strings is maximized.

The information processing device according to any one of claims 1 to 3, wherein
the division unit divides the sentence at a first position,
the calculation unit calculates a score of a first character string containing at least one character appearing immediately before the first position, a score of a second character string containing at least one character appearing immediately after the first position, and a score of a third character string that is a combination of the first character string and the second character string, and
the extraction unit compares the sum of the score of the first character string and the score of the second character string with the score of the third character string, and excludes the character string with the smaller score from the targets of named entity extraction.

An information processing method in which a computer:
divides a sentence into character strings each containing at least one character;
calculates a score for each of the divided character strings based on a plurality of queries input by a user; and
extracts a named entity from the sentence based on the calculated score.

A program that causes a computer to execute:
a process of dividing a sentence into character strings each containing at least one character;
a process of calculating a score for each of the divided character strings based on a plurality of queries input by a user; and
a process of extracting a named entity from the sentence based on the calculated score.