JP5834772B2

JP5834772B2 - Information processing apparatus and program

Info

Publication number: JP5834772B2
Application number: JP2011236417A
Authority: JP
Inventors: 山口　倫治; 佐藤　勝彦
Original assignee: Casio Computer Co Ltd
Current assignee: Casio Computer Co Ltd
Priority date: 2011-10-27
Filing date: 2011-10-27
Publication date: 2015-12-24
Anticipated expiration: 2031-10-27
Also published as: JP2013097395A

Description

The present invention relates to an information processing apparatus and a program.

Translation devices that translate a word string (source text) containing a plurality of words into another language are known. When translating the source text, such a device estimates at which inter-word positions (the gaps between adjacent words) the source text is delimited, and then executes translation processing.

Regarding methods for estimating how a document or word string is delimited, Patent Document 1 proposes a technique that estimates how a document is delimited by using a parser programmed in advance with the grammar rules of the language to which the document belongs.
Patent Document 2 proposes a technique for dividing a character string that is written without spaces between words into individual words.

Patent Document 1: JP-A-6-309310
Patent Document 2: JP-A-10-254874

The technique of Patent Document 1 uses a parser programmed with the grammar rules of the language of the source text in order to estimate between which words the source text is delimited. A parser therefore had to be created for each source language, at considerable development cost and time. Patent Document 2 discloses a technique for dividing an unsegmented character string into words, but it does not disclose a method for determining between which words the character string is delimited.

One conceivable way to estimate between which words a character string is delimited without using a parser is to obtain the likelihood of a delimitation between words from teacher data that shows how word strings in the same category as the source text are delimited.

However, to estimate from teacher data how each sequence of n words (n-gram) contained in the source text is delimited, a sufficient amount of teacher data is needed for every n-gram contained in the source text. The effort and computation required to collect and process such a volume of teacher data therefore become enormous.

The present invention has been made in view of the above circumstances. Its object is to provide an information processing apparatus and a program that require only a small amount of teacher data to obtain the likelihood that a word string to be analyzed is delimited between given words.

To achieve the above object, an information processing apparatus according to the present invention comprises:
a word string acquisition unit that acquires a word string having a plurality of words;
an extraction unit that extracts a plurality of partial word strings, each including one or more words contained in the word string acquired by the word string acquisition unit;
a probability coefficient acquisition unit that, for each partial word string extracted by the extraction unit, acquires delimiter patterns, each corresponding to one way of delimiting in which the word string either is or is not delimited at each inter-word position between the words constituting the word string, and that, for each acquired delimiter pattern, acquires a delimiter probability coefficient indicating the degree of likelihood that the partial word string is delimited in the way corresponding to that delimiter pattern, by extracting teacher word strings having the same pattern as the delimiter pattern from a teacher word string storage unit, which stores teacher word strings that belong to the same category as the word string and define whether or not the word string is delimited at each of its inter-word positions, and basing the coefficient on the number of teacher word strings extracted; and
a dividing unit that divides the word string acquired by the word string acquisition unit based on the delimiter probability coefficients acquired by the probability coefficient acquisition unit.

According to the present invention, it is possible to provide an information processing apparatus and a program that require only a small amount of teacher data to obtain the likelihood that a word string to be analyzed is delimited between given words.

A block diagram showing the configuration of the menu display device according to Embodiment 1 of the present invention.
A diagram in which (a) shows the relationship between a character string and teacher data according to Embodiment 1, and (b) shows the relationship among a word string, delimiter flags, trigrams, and delimiter patterns.
A block diagram showing the configuration of the probability coefficient output device according to Embodiment 1, in which (a) shows the physical configuration and (b) shows the functional configuration.
A diagram showing examples of the n-gram lists according to Embodiment 1, in which (a) shows a trigram list, (b) shows a bigram list, and (c) shows a monogram list.
A diagram outlining the probability coefficient calculation process according to Embodiment 1, in which (a) shows an example of calculating the delimiter probability coefficient of a trigram from the delimiter probability coefficients of bigrams, and (b) shows an example of calculating the delimiter probability coefficient of a trigram from the delimiter probability coefficients of monograms.
A flowchart showing the menu display process according to Embodiment 1.
A flowchart showing the menu division process according to Embodiment 1.
A flowchart showing the probability coefficient acquisition process according to Embodiment 1.
A flowchart showing the probability coefficient calculation process according to Embodiment 1.
A flowchart showing the per-delimiter-pattern calculation process according to Embodiment 1.
A diagram outlining the probability coefficient calculation process according to Embodiment 2 of the present invention, in which (a) shows an example of calculating the delimiter probability coefficient of a trigram from the delimiter probability coefficients of bigrams, (b) shows an example of calculating it from the delimiter probability coefficients of a monogram and a bigram, and (c) shows an example of calculating it from the delimiter probability coefficients of monograms.
A flowchart showing the probability coefficient calculation process according to Embodiment 2.
A flowchart showing the per-delimiter-pattern calculation process according to Embodiment 2.
A flowchart showing an n-gram (trigram) pattern probability coefficient list according to another embodiment of the present invention.

A menu display device and a probability coefficient output device (information processing apparatus) according to embodiments of the present invention are described below with reference to the drawings. Identical or corresponding parts in the drawings are given the same reference numerals.

(Embodiment 1)
The probability coefficient output device 40 according to Embodiment 1 is mounted in the menu display device 1 shown in FIG. 1. The menu display device 1 has, among others, the following functions:
i) an imaging function that photographs paper or the like on which a character string belonging to a specific category to be analyzed (a menu, a bill of fare, etc.) is written;
ii) a function that recognizes and extracts the character string to be analyzed from the photographed image;
iii) a function that analyzes the extracted character string and converts it into a word string;
iv) a function that outputs a coefficient (delimiter probability coefficient) indicating the probability that the menu is delimited at a given part of the character string (between words);
v) a function that delimits the word string based on the delimiter probability coefficients;
vi) a function that translates each of the delimited word strings; and
vii) a function that displays the translation result.
Of these functions, the probability coefficient output device 40 is responsible for outputting the coefficient (probability coefficient) that indicates the probability that the menu is delimited at a given part of the character string (between words).

The menu display device 1 includes an input unit 10; an information processing unit 70 that includes an OCR (Optical Character Reader) 20, a menu analysis unit 30, and a translation unit 50; the probability coefficient output device 40; and a display unit 60.

The input unit 10 consists of a camera and an image processing unit, and with this physical configuration acquires an image of a photographed menu. The input unit 10 passes the acquired image to the OCR 20.

The information processing unit 70 consists of a DSP (Digital Signal Processor) for image processing, a RAM (Random Access Memory) used as a work area, an EPROM (Erasable Programmable Read Only Memory) that stores a character recognition program and a language processing program, and the like. With this physical configuration, the information processing unit 70 functions as the OCR 20, the menu analysis unit 30, and the translation unit 50.

The OCR 20 recognizes the characters in the image passed from the input unit 10 and acquires the character strings written on the menu (dish names and the like). The OCR 20 passes the acquired character strings to the menu analysis unit 30.

The menu analysis unit 30 divides the character string passed from the OCR 20 into words and converts it into a word string.
The menu analysis unit 30 extracts partial word strings of n words (n-grams) that appear in the word string. It then generates the delimiter patterns described below from each n-gram and selects those delimiter patterns for which a probability coefficient needs to be acquired.

The character string to be analyzed (menu), the teacher data, the n-grams, and the delimiter patterns according to this embodiment are now described with reference to FIG. 2.
The character string to be analyzed in this embodiment is a character string representing a menu, as shown at the top of FIG. 2(a). The teacher data (bottom of FIG. 2(a)) is obtained by tagging the menu shown in FIG. 2, 「豚バラ肉の赤ワイン煮温野菜添え」 (pork belly braised in red wine, served with warm vegetables), and dividing it word by word and chunk by chunk. In the example of FIG. 2(a), the teacher data is 「<m><c><s><w>豚</w><w>バラ肉</w><w>の</w></s><s><w>赤ワイン</w><w>煮</w></s></c><c><s><w>温野菜</w><w>添え</w></s></c></m>」.

In this teacher data, the menu is divided by the word tags <w></w> into seven words: 「豚」 (pork), 「バラ肉」 (belly meat), ..., 「添え」 (served with). The tags <s></s>, which divide the string into units such as ingredient name and cooking method, further divide it into three parts: 「豚バラ肉の」, 「赤ワイン煮」, and 「温野菜添え」. The tags <c></c>, which divide the string into units of a single dish comprising an ingredient name, a cooking method, and other modifiers (for example, "Provence style" or "specially selected"), divide it into two parts: 「豚バラ肉の赤ワイン煮」 and 「温野菜添え」. The tags <m></m> delimit the character string into individual menus (bills of fare). Although the teacher data here delimits character strings with the tags <w>, <s>, <c>, and <m>, the format of the teacher data is not limited to this. The teacher data may be any character string that contains a unique mark (a half-width space is acceptable) delimiting a character string of the predetermined category into words, together with a unique mark delimiting it in at least one other way besides words. The teacher data is created by collecting in advance character strings that belong to a specific category (here, menus and dish names) of a specific language (here, Japanese) and tagging them by hand.
Tagging of the teacher data is not limited to manual work; any known tagging method, such as a parser, may be used.

FIG. 2(b) shows the relationship among the teacher data, the n-grams, and the delimiter patterns. An n-gram sequence is the set of n-word strings extracted from the word string of the teacher data by taking the 1st through n-th words, the 2nd through (n+1)-th words, and so on. Each n-word string that makes up the n-gram sequence is called an n-gram. An n-gram with n = 3 is called a trigram, one with n = 2 a bigram, and one with n = 1 a monogram.
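The sliding-window extraction described above can be sketched as follows. This is a minimal Python illustration rather than code from the patent; the function name `ngrams` and the representation of a word string as a Python list are assumptions made for the example.

```python
def ngrams(words, n):
    """Return every run of n consecutive words as a tuple, in order of appearance."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


# The seven words of the example menu in FIG. 2.
words = ["豚", "バラ肉", "の", "赤ワイン", "煮", "温野菜", "添え"]

trigrams = ngrams(words, 3)   # n = 3
bigrams = ngrams(words, 2)    # n = 2
monograms = ngrams(words, 1)  # n = 1
```

A word string of k words yields k - n + 1 n-grams, so the seven-word example produces five trigrams, from 「豚バラ肉の」 to 「煮温野菜添え」.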

From 「豚バラ肉の赤ワイン煮温野菜添え」, a trigram sequence consisting of the trigrams 「豚バラ肉の」, 「バラ肉の赤ワイン」, ..., 「煮温野菜添え」 is obtained (FIG. 2(b)). As shown in the upper part of FIG. 2(b), the word string of the menu is delimited in a tree shape by the tag structure. At a predetermined height of the tree fixed by the system design (corresponding to a predetermined tag of the teacher data), it can then be determined where between words the string is delimited.

In the example tree at the top of FIG. 2(b), the menu is delimited at each position (delimiter line) where a tag <m> or </m>, tags <s> and </s>, or tags <c> and </c> are present. The information that indicates, for each inter-word position of the word string, 1 if the string is delimited there and 0 if it is not, is called a delimiter flag.
The criterion for deciding which tags count as delimiting positions can be set freely. For example, the system may be configured to place delimiter flags only at positions where <s></s> tags are present.
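One way to derive the words and delimiter flags from a tagged teacher string is sketched below. It is an illustration under stated assumptions, not the patent's implementation: the teacher string is assumed to be well formed (the `TEACHER` constant is a tag layout consistent with the description of FIG. 2(a)), and the names `TOKEN`, `words_and_flags`, and `boundary_tags` are hypothetical. The `boundary_tags` parameter models the freely configurable choice of which tags count as delimiting positions.

```python
import re

# Matches either a tag such as <s> or </s>, or a run of text between tags.
TOKEN = re.compile(r"<(/?)([a-z]+)>|([^<]+)")

# Assumed well-formed tagging of the example menu (seven words, three <s>
# chunks, two <c> dishes, one <m> menu).
TEACHER = ("<m><c><s><w>豚</w><w>バラ肉</w><w>の</w></s>"
           "<s><w>赤ワイン</w><w>煮</w></s></c>"
           "<c><s><w>温野菜</w><w>添え</w></s></c></m>")


def words_and_flags(tagged, boundary_tags=("m", "c", "s")):
    """Return (words, flags) for a tagged string.

    flags has len(words) + 1 entries: flags[i] is 1 if a boundary tag lies
    just before words[i], and flags[-1] is 1 if one lies after the last word.
    """
    words, flags = [], []
    pending = 0  # becomes 1 once a boundary tag is seen since the last word
    for _slash, tag, text in TOKEN.findall(tagged):
        if text:                      # text only occurs inside <w>...</w>
            words.append(text)
            flags.append(pending)
            pending = 0
        elif tag in boundary_tags:    # opening or closing boundary tag
            pending = 1
    flags.append(pending)             # position after the last word
    return words, flags


words, flags = words_and_flags(TEACHER)
# flags is [1, 0, 0, 1, 0, 1, 0, 1]
```

Passing `boundary_tags=("c",)` treats only dish boundaries as delimiting positions, which gives `[1, 0, 0, 0, 0, 1, 0, 1]` for the same string.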

For a trigram (n-gram), a pattern that defines, by arranging the words and delimiter flags in sequence, whether or not the teacher data is delimited at each inter-word position of the n-gram is called a delimiter pattern.
For example, for the three words that make up a trigram (word A, word B, word C), the delimiter pattern for the case where the teacher data is not delimited at any inter-word position, including before word A and after word C, is "0A0B0C0", and the delimiter pattern for the case where it is delimited at every inter-word position is "1A1B1C1".

In the example of FIG. 2(b), the trigram 「豚バラ肉の」 has four inter-word positions: before the word 「豚」, between 「豚」 and 「バラ肉」, between 「バラ肉」 and 「の」, and after 「の」. Writing 1 where the menu is delimited and 0 where it is not, 2^4 = 16 delimiter patterns can be defined. The delimiter pattern corresponding to the teacher data is 「1豚0バラ肉0の1」.
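The pattern notation and the count of 2^(n+1) patterns per n-gram can be checked with a short sketch. This is an illustrative Python example only; the helper names `format_pattern` and `all_patterns` are hypothetical and do not appear in the patent.

```python
from itertools import product


def format_pattern(ngram, flags):
    """Interleave n + 1 delimiter flags with n words, e.g. '0A0B0C0'."""
    body = "".join(f"{flag}{word}" for flag, word in zip(flags, ngram))
    return body + str(flags[-1])  # flag after the last word


def all_patterns(ngram):
    """Enumerate every delimiter pattern of an n-gram (2 ** (n + 1) in total)."""
    return [format_pattern(ngram, flags)
            for flags in product((0, 1), repeat=len(ngram) + 1)]


patterns = all_patterns(("豚", "バラ肉", "の"))
# len(patterns) is 16, and "1豚0バラ肉0の1" is one of them.
```

Given the words and delimiter flags of a teacher string, the pattern of the n-gram starting at word index i is the slice of n + 1 flags beginning at index i, which is how the teacher data for the example menu maps to 「1豚0バラ肉0の1」.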

In what follows, a sufficient amount of teacher data is created from menus and dish names belonging to the language (for example, Japanese). A coefficient indicating the probability that teacher data containing a given n-gram is delimited according to a given delimiter pattern is called the probability coefficient (delimiter probability coefficient) of that delimiter pattern. The probability coefficient of a delimiter pattern corresponding to a given n-gram is also called the probability coefficient of that n-gram.
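As a rough illustration of a coefficient of this kind, the sketch below estimates it as a relative frequency: the number of times an n-gram occurs in the teacher data with a given flag pattern, divided by the number of times the n-gram occurs at all. That formula is an assumption made for the example, since this section only states that the coefficient is based on the number of matching teacher word strings. The function name `pattern_coefficients` and the toy corpus are likewise hypothetical.

```python
from collections import Counter


def pattern_coefficients(corpus, n):
    """Estimate a coefficient for each (n-gram, flag pattern) pair.

    corpus is a list of (words, flags) pairs, where flags has
    len(words) + 1 entries (1 = delimited, 0 = not delimited).
    """
    ngram_counts = Counter()
    pattern_counts = Counter()
    for words, flags in corpus:
        for i in range(len(words) - n + 1):
            gram = tuple(words[i:i + n])
            pattern = tuple(flags[i:i + n + 1])  # n + 1 flags around n words
            ngram_counts[gram] += 1
            pattern_counts[(gram, pattern)] += 1
    # Assumed estimator: relative frequency of the pattern among all
    # occurrences of the same n-gram.
    return {key: count / ngram_counts[key[0]]
            for key, count in pattern_counts.items()}


# Toy teacher data (invented for illustration, not taken from the patent).
corpus = [
    (["A", "B"], [1, 0, 1]),
    (["A", "B", "C"], [1, 0, 0, 1]),
]
coefficients = pattern_coefficients(corpus, 2)
```

In the toy corpus the bigram ("A", "B") occurs twice, once with flags (1, 0, 1) and once with (1, 0, 0), so each of those two patterns gets a coefficient of 0.5.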

Using the probability coefficients acquired from the probability coefficient output device 40, the menu analysis unit 30 estimates how the word string is delimited and outputs the word string to the translation unit 50 in association with the estimation result. The specific processing executed by the menu analysis unit 30 is described later.

The menu analysis unit 30 passes to the probability coefficient output device 40 the n words (n-gram) together with information indicating which of the n-gram's delimiter patterns require a probability coefficient, and acquires the probability coefficients of those delimiter patterns from the probability coefficient output device 40.

When the probability coefficient output device 40 receives from the menu analysis unit 30 the n words (n-gram) and the information indicating which of the n-gram's delimiter patterns require a probability coefficient, it passes to the menu analysis unit 30 the probability coefficient, obtained from the teacher data, that indicates the likelihood that the menu is delimited according to each such delimiter pattern.
The specific processing executed by the probability coefficient output device 40 and its configuration are described later.

The translation unit 50 divides the word string passed from the menu analysis unit 30 by the division method indicated by the division pattern of that word string and translates it into the language desired by the user.
The translation unit 50 may use any known translation method; here, the words contained in each divided word string are translated one by one using dictionary data.
The translation unit 50 passes the translation result to the display unit 60.

The display unit 60 consists of a liquid crystal display or the like and displays the information passed from the translation unit 50.

Next, the configuration of the probability coefficient output device 40 is described with reference to FIG. 3.
Physically, as shown in FIG. 3(a), the probability coefficient output device 40 consists of an information processing unit 401, a data storage unit 402, a program storage unit 403, an input/output unit 404, a communication unit 405, and an internal bus 406.

The information processing unit 401 consists of a CPU (Central Processing Unit), a DSP (Digital Signal Processor), and the like, and executes the processing of the probability coefficient output device 40 described later in accordance with the control program 407 stored in the program storage unit 403.

The data storage unit 402 consists of a RAM (Random-Access Memory) or the like and is used as the work area of the information processing unit 401.

The program storage unit 403 consists of nonvolatile memory such as flash memory or a hard disk, and stores the control program 407 that controls the operation of the information processing unit 401 and the data needed to execute the processing described below.

The communication unit 405 consists of a LAN (Local Area Network) device, a modem, or the like, and transmits the processing results of the information processing unit 401 to external devices connected via a LAN line or a communication line. It also receives information from external devices and passes it to the information processing unit 401.
The information processing unit 401, the data storage unit 402, the program storage unit 403, and the input/output unit 404 are connected to one another by the internal bus 406 and can exchange information.

The input/output unit 404 is an I/O unit that controls the input and output of information to and from external devices connected to the probability coefficient output device 40 via USB (Universal Serial Bus) or a serial port.

With the above physical configuration, the probability coefficient output device 40 functions, as shown in FIG. 3(b), as a word string acquisition unit 410, a determination unit 420, an (n-1)-gram generation unit 430, a probability coefficient acquisition unit 440, a probability coefficient calculation unit 450, an output unit 460, and a storage unit 470.

The word string acquisition unit 410 acquires from the menu analysis unit 30 an n-gram contained in the menu (the word string of interest) and information indicating the delimiter patterns generated from that n-gram for which a probability coefficient needs to be acquired. The word string acquisition unit 410 passes the acquired n-gram and information to the determination unit 420.

For the delimiter patterns of the n-gram passed from the word string acquisition unit 410, the determination unit 420 determines whether a probability coefficient indicating the likelihood that the menu is delimited in that way can be acquired. In this determination, the determination unit 420 refers to the n-gram list stored in the n-gram list storage unit 4710 of the storage unit 470. The n-gram list and the specifics of the determination process executed by the determination unit 420 are described later.
In this embodiment, under the assumption that the word string to be analyzed (menu) is delimited with the same probability as the teacher data is delimited by a given delimiter pattern, the likelihood that the menu is delimited according to a delimiter pattern of one of its n-grams is estimated from teacher data for (n-1)-grams down to monograms.

If the determination unit 420 determines that the probability coefficient of the n-gram can be acquired, it passes the n-gram to the probability coefficient acquisition unit 440.
If it determines that the probability coefficient of the n-gram cannot be acquired, it passes the n-gram to the (n-1)-gram generation unit 430.

On receiving an n-gram from the determination unit 420, the (n-1)-gram generation unit 430 generates an (n-1)-gram consisting of the first through (n-1)-th words of the n-gram (the front (n-1)-gram) and an (n-1)-gram consisting of the second through n-th words (the rear (n-1)-gram).
The (n-1)-gram generation unit 430 passes the two generated (n-1)-grams to the determination unit 420.

On receiving the two (n-1)-grams from the (n-1)-gram generation unit 430, the determination unit 420 determines, for each of them, whether a probability coefficient can be acquired. If a probability coefficient cannot be acquired for either of the two (n-1)-grams, it has the (n-1)-gram generation unit 430 generate three (n-2)-grams, and the determination process is repeated in the same way until monograms are reached. The specifics of the processing executed by the determination unit 420 and the (n-1)-gram generation unit 430 are described later.
When the determination unit 420 receives monograms from the (n-1)-gram generation unit 430, it passes them to the probability coefficient acquisition unit 440 without executing the determination process.
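The back-off loop between the determination unit 420 and the (n-1)-gram generation unit 430 can be sketched as follows. This is a simplified Python illustration, not the patent's procedure: the actual determination criterion is described later in the patent, and here a probability coefficient is simply assumed to be obtainable when the word sequence appears in a set of known n-grams. The names `back_off` and `known` are hypothetical.

```python
def back_off(ngram, known):
    """Shorten an n-gram until every piece is known or monograms are reached.

    At each size k the n-gram is split into its n - k + 1 contiguous k-word
    pieces: two (n-1)-grams (front and rear), three (n-2)-grams, and so on.
    Monograms are returned without any check.
    """
    n = len(ngram)
    for size in range(n, 0, -1):
        pieces = [tuple(ngram[i:i + size]) for i in range(n - size + 1)]
        # Assumed criterion: a piece is usable if it is in the known set.
        if size == 1 or all(piece in known for piece in pieces):
            return pieces
    return []  # only reached for an empty n-gram


known = {("A", "B"), ("B", "C")}
pieces = back_off(("A", "B", "C"), known)
# The trigram itself is unknown, but both of its bigrams are known,
# so pieces is [("A", "B"), ("B", "C")].
```

If only one of the two bigrams were known, the function would fall through to the three monograms, mirroring the rule that the whole level is abandoned when any piece at that level lacks a coefficient.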

When the probability coefficient acquisition unit 440 receives from the determination unit 420, as the word string, the n-gram acquired by the word string acquisition unit 410, it acquires the probability coefficients of the delimiter patterns indicated by the information acquired by the word string acquisition unit 410 and passes them to the output unit 460.
When it instead receives (n-1)-grams down to monograms (referred to here as j-grams) as the word string, it generates, for all the received j-grams, the delimiter patterns needed to obtain the probability coefficients of the delimiter patterns indicated by the information acquired by the word string acquisition unit 410, and acquires a probability coefficient for each of these delimiter patterns.
It then passes all the received j-grams, the generated delimiter patterns, and their probability coefficients to the probability coefficient calculation unit 450.
In acquiring probability coefficients, the probability coefficient acquisition unit 440 refers to the teacher data stored in the teacher data storage unit 4730. How the probability coefficient acquisition unit 440 acquires the probability coefficients is described in detail later.

When the probability coefficient calculation unit 450 receives from the probability coefficient acquisition unit 440 the delimiter patterns generated from all the j-grams and their probability coefficients, it calculates the probability coefficients of the n-gram's delimiter patterns from the received information. How the probability coefficient calculation unit 450 calculates these probability coefficients is described later.
After calculating the probability coefficients of the n-gram's delimiter patterns, the probability coefficient calculation unit 450 extracts the delimiter patterns that the word string acquisition unit 410 identified as requiring a probability coefficient, together with their probability coefficients, and passes them to the output unit 460.

The output unit 460 outputs the delimiter patterns and probability coefficients passed to it by the probability coefficient calculation unit 450 to the menu analysis unit 30.

The storage unit 470 receives information from the other units of the probability coefficient output device 40 and stores it. It also outputs stored information in response to commands from those units.

The storage unit 470 includes an n-gram list storage unit 4710 that stores n-gram lists, a setting storage unit 4720 that stores the setting parameters that the probability coefficient output device 40 uses to execute the processing described below, and a teacher data storage unit 4730 that stores teacher data.

An n-gram list is a list in which every n-gram appearing in the teacher data stored in the teacher data storage unit 4730 is registered.
An example of the n-gram lists stored in the n-gram list storage unit 4710 is described with reference to FIG. 4. In the example of FIG. 4, the n-gram list storage unit 4710 stores a trigram list (FIG. 4(a)), a bigram list (FIG. 4(b)), and a monogram list (FIG. 4(c)).
The trigram list stores every trigram appearing in the teacher data in association with a data count indicating how many teacher data entries contain that trigram. The same applies to the bigram list and the monogram list.
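As an illustration of the n-gram lists described above, the following minimal sketch counts, for each n-gram, how many teacher data entries contain it. The function name `build_ngram_list` and the sample entries are hypothetical and do not come from the specification; each entry is assumed to be a menu item already split into words.

```python
from collections import Counter


def build_ngram_list(teacher_data, n):
    """Count, for each n-gram, how many teacher data entries contain it."""
    counts = Counter()
    for words in teacher_data:
        # A set ensures an entry is counted once even if the n-gram repeats in it.
        grams = {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
        counts.update(grams)
    return counts


# Hypothetical teacher data (word lists only; delimiter flags are omitted here).
teacher_data = [
    ["牛肉", "の", "赤ワイン", "煮"],
    ["鶏肉", "の", "赤ワイン", "煮"],
    ["赤ワイン", "の", "ソース"],
]

trigram_list = build_ngram_list(teacher_data, 3)
bigram_list = build_ngram_list(teacher_data, 2)
monogram_list = build_ngram_list(teacher_data, 1)
```

A `Counter` returns 0 for an n-gram that was never seen, which is convenient for the later comparison against a threshold.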

Next, the processing by which the probability coefficient calculation unit 450 calculates the probability coefficient of an n-gram delimiter pattern from j-gram delimiter patterns and their probability coefficients is described with reference to FIG. 5.

FIG. 5 illustrates the case in which a given trigram (here "の赤ワイン煮", made up of the words "の", "赤ワイン" (red wine), and "煮" (simmered)) is not registered in the trigram list (FIG. 4(a)), or is registered with a data count below a predetermined threshold. In that case, the probability coefficient calculation unit 450 calculates the probability coefficient of the trigram delimiter pattern from the probability coefficients of bigram or monogram delimiter patterns.

The calculation for the target delimiter pattern "0の1赤ワイン0煮1" of the trigram "の赤ワイン煮" is described with reference to FIG. 5(a). Here, the first-half bigram of the trigram (the front bigram) is "の赤ワイン", and the second-half bigram (the rear bigram) is "赤ワイン煮". The bigram delimiter patterns that correspond to the target delimiter pattern (the corresponding delimiter patterns) are those whose delimiter flags match it at the corresponding word boundaries, namely "0の1赤ワイン0" and "1赤ワイン0煮1".

When the probability coefficient calculation unit 450 is passed the delimiter patterns of these bigrams (j-grams) and their probability coefficients by the probability coefficient acquisition unit 440, it compares delimiter flags to extract the corresponding delimiter patterns. It then takes the arithmetic mean of the probability coefficients of the corresponding delimiter patterns as the probability coefficient of the target delimiter pattern (FIG. 5(a)).

Similarly, when it obtains the delimiter patterns of the monograms and their probability coefficients from the probability coefficient acquisition unit 440, it extracts a corresponding delimiter pattern from each of the front, middle, and last monograms, and takes the arithmetic mean of their probability coefficients as the probability coefficient of the target delimiter pattern (FIG. 5(b)).
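The averaging of FIG. 5 can be sketched as follows. This is only an illustration under stated assumptions: a delimiter pattern of an n-gram is encoded as a tuple of n + 1 flags (one before each word and one after the last word), the function names are hypothetical, and the bigram coefficients 0.3 and 0.5 are invented values, not figures from the specification.

```python
def corresponding_patterns(words, flags, j):
    """Slice an n-gram delimiter pattern into its j-gram sub-patterns.

    `flags` holds n + 1 delimiter flags: one before each word and one
    after the last word (an assumed encoding for this sketch).
    """
    n = len(words)
    return [
        (tuple(words[i:i + j]), tuple(flags[i:i + j + 1]))
        for i in range(n - j + 1)
    ]


def averaged_coefficient(words, flags, j, jgram_coefficients):
    """Arithmetic mean of the coefficients of the corresponding j-gram patterns."""
    subs = corresponding_patterns(words, flags, j)
    return sum(jgram_coefficients[s] for s in subs) / len(subs)


# Target pattern "0の1赤ワイン0煮1" of the trigram "の赤ワイン煮".
words = ("の", "赤ワイン", "煮")
flags = (0, 1, 0, 1)

# Hypothetical bigram coefficients, keyed by (bigram, flags).
bigram_coefficients = {
    (("の", "赤ワイン"), (0, 1, 0)): 0.3,
    (("赤ワイン", "煮"), (1, 0, 1)): 0.5,
}

# Mean of the front and rear bigram coefficients: (0.3 + 0.5) / 2.
trigram_p = averaged_coefficient(words, flags, 2, bigram_coefficients)
```

Calling `corresponding_patterns(words, flags, 1)` yields the three monogram sub-patterns used in the monogram case.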

Next, the processing executed by the menu display device 1 is described with reference to the flowcharts of FIGS. 6 to 10. When the user performs an operation to photograph a menu, the information processing unit 70 of the menu display device 1 executes the menu display process shown in FIG. 6.

In the menu display process, an image of a printed menu is first acquired using the input unit 10 (step S101).
The OCR 20 then recognizes the characters in the acquired image and obtains a character string (step S102).

When the OCR 20 has obtained the character string and passed it to the menu analysis unit 30, the menu analysis unit 30 first executes a word segmentation process that divides the character string into words, converting the character string into a word string (step S103).
The menu analysis unit 30 may use any known method of extracting words from a character string for this word segmentation process; here, it uses the method exemplified in Patent Document 2.
If the menu to be analyzed is written in a language in which words are separated by spaces, such as English or French, the menu analysis unit 30 performs the word segmentation process by recognizing the spaces.

The menu analysis unit 30 then estimates at which positions in the word string the menu is delimited, and executes a process that divides the menu accordingly (the menu division process) (step S104).

The menu division process executed in step S104 is described with reference to FIG. 7.
When the menu division process starts, the menu analysis unit 30 first generates an n-gram sequence from the word string (step S201). Each n-gram in the n-gram sequence is a subsequence of the word string.
The value of n may be any predetermined default value; here, n = 3.
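Step S201 can be illustrated with a short sketch that slides a window of n words over the word string. The function name `ngram_sequence` and the sample word string are hypothetical; returning an empty list for a word string shorter than n is an assumption of this sketch, since the specification does not address that case here.

```python
def ngram_sequence(words, n=3):
    """Return every run of n consecutive words, in order of appearance."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


# Hypothetical word string obtained from the word segmentation step.
menu_words = ["牛肉", "の", "赤ワイン", "煮"]
trigrams = ngram_sequence(menu_words)
# trigrams == [("牛肉", "の", "赤ワイン"), ("の", "赤ワイン", "煮")]
```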

Next, the counter variable i is set to 1, and the first n-gram (trigram) of the n-gram sequence is taken as the target subsequence (the target trigram) (step S202).

For the first trigram (the target subsequence), the menu analysis unit 30 then sends the probability coefficient output device 40 a command requesting the probability coefficients of all of the delimiter patterns, out of the 16 possible, whose leading delimiter flag is 1 (eight patterns), and the probability coefficient acquisition process starts (step S203).
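The count of 16 patterns, eight of which have a leading flag of 1, can be checked with a small enumeration. The sketch assumes a trigram delimiter pattern carries four flags (one before each word and one after the last); `delimiter_patterns` is a hypothetical helper name.

```python
from itertools import product


def delimiter_patterns(n):
    """All 0/1 assignments to the n + 1 flag positions of an n-gram."""
    return list(product((0, 1), repeat=n + 1))


all_patterns = delimiter_patterns(3)                   # 2**4 = 16 patterns
leading_one = [p for p in all_patterns if p[0] == 1]   # 8 patterns
```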

The probability coefficient acquisition process executed in step S203 is described with reference to FIG. 8.
In step S203, when the word string acquisition unit 410 receives from the menu analysis unit 30 an n-gram and a command specifying the delimiter patterns whose probability coefficients are to be calculated, the probability coefficient output device 40 starts the probability coefficient acquisition process.
In the probability coefficient acquisition process, the determination unit 420 first refers to the trigram list stored in the n-gram list storage unit 4710 and obtains the number of teacher data entries that contain the target trigram (step S301).

It then determines whether enough teacher data exists to obtain the probability coefficients of the target trigram by comparing the data count of the target subsequence with the trigram threshold stored in the setting storage unit 4720 (step S302).

If the data count of the target subsequence is equal to or greater than the threshold (step S302; YES), enough teacher data can be judged to exist, so the probability coefficients of the delimiter patterns are acquired using the current target subsequence (n-gram) as is (steps S303 to S304).

First, the probability coefficient acquisition unit 440 generates the delimiter patterns specified by the menu analysis unit 30 (step S303) and acquires their probability coefficients (step S304). Specifically, it extracts, from the teacher data stored in the teacher data storage unit 4730, the teacher data entries that contain the target subsequence. Let n1 be the number of entries extracted. It then compares the delimiter flags of the corresponding part of each extracted entry with the delimiter flags of the delimiter pattern, and extracts the entries that are delimited in the same way. Let n2 be the number of entries extracted at this point. The probability coefficient p is given by the ratio of n2 to n1.
That is, p = n2/n1.
The method of obtaining p is not limited to this. Any formula may be used in which p becomes larger as n2 becomes larger and smaller as n1 becomes larger (for example, p = n2^2/n1^2).
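The ratio p = n2/n1 can be sketched as below. The data layout is an assumption made for illustration: each teacher data entry is a pair of a word list and a flag list with one more flag than words, and an entry is counted once using the first place the n-gram occurs in it. The function name and the sample entries are hypothetical.

```python
def probability_coefficient(teacher_data, words, flags):
    """Return p = n2 / n1, or None if no teacher entry contains the n-gram."""
    words, flags = tuple(words), tuple(flags)
    n = len(words)
    n1 = n2 = 0
    for t_words, t_flags in teacher_data:
        for i in range(len(t_words) - n + 1):
            if tuple(t_words[i:i + n]) == words:
                n1 += 1  # this entry contains the target subsequence
                if tuple(t_flags[i:i + n + 1]) == flags:
                    n2 += 1  # and is delimited the same way at that place
                break  # count each entry once (first occurrence only)
    return n2 / n1 if n1 else None


# Hypothetical teacher data: (words, delimiter flags).
teacher_data = [
    (["牛肉", "の", "赤ワイン", "煮"], [1, 0, 1, 0, 1]),
    (["鶏肉", "の", "赤ワイン", "煮"], [1, 0, 1, 0, 1]),
    (["鴨", "の", "赤ワイン", "煮"], [1, 0, 0, 0, 1]),
]

# Two of the three entries containing the trigram match the pattern: p = 2/3.
p = probability_coefficient(teacher_data, ("の", "赤ワイン", "煮"), (0, 1, 0, 1))
```

Returning `None` when n1 is zero is a choice of this sketch; in the described process that case is handled by the threshold check before a coefficient is requested.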

If, on the other hand, the data count of the target subsequence is below the threshold, or the target subsequence is not registered in the trigram list (step S302; NO), enough teacher data can be judged not to be stored, so a process that calculates the probability coefficients using (n-1)-grams (bigrams) down to monograms (a probability coefficient calculation process; here, probability coefficient calculation process 1) is executed (step S305).

Probability coefficient calculation process 1, executed in step S305, is described with reference to FIG. 9.
In probability coefficient calculation process 1, the (n-1)-gram generation unit 430 first generates two (n-1)-grams that are subsequences of the target word string (n-gram), here the front bigram and rear bigram of FIG. 5(a) (step S401).

The determination unit 420 then determines whether probability coefficients can be acquired for both the front bigram and the rear bigram by referring to the bigram list stored in the n-gram list storage unit 4710, in the same way as in step S302 of the probability coefficient acquisition process (FIG. 8). Specifically, it compares the data count of each bigram with a predetermined bigram threshold (step S402). The threshold for each n can be set arbitrarily, but a larger n preferably has a larger threshold. The number of delimiter patterns that can be defined for an n-gram is 2^n, which grows as n grows, and the amount of teacher data needed to obtain a probability coefficient for each of a large number of delimiter patterns grows correspondingly.

If the data counts of all of the bigrams are equal to or greater than the threshold (step S402; YES), probability coefficients can be judged to be obtainable for all of the (n-1)-grams (bigrams), so the process moves to step S406 to calculate the probability coefficient using bigrams (FIG. 5(a)).

If the data count of any bigram is below the threshold (step S402; NO), a probability coefficient can be judged not to be obtainable for that bigram, so the probability coefficient is calculated using monograms (FIG. 5(b)).
That is, it is determined whether the current n-1 is 1 (step S403). If it is not 1 (step S403; NO), n is decremented by 1 to 2 (step S404), and the process returns to step S401 to generate (n-1)-grams (here, monograms).

If n-1 is 1 (step S403; YES), n cannot be reduced further, so delimiter patterns are generated for each monogram whose probability coefficient cannot be acquired, and their probability coefficients are set to a default value (here, 0.5) (step S405).

Once steps S401 to S405 have determined a value of (n-1) for which all of the probability coefficients can be acquired, delimiter patterns are generated for all of the (n-1)-grams such that their delimiter flags match, at the corresponding word boundaries, those of the delimiter patterns specified by the menu analysis unit 30 (step S406). Probability coefficients are then acquired for the generated delimiter patterns in the same way as in step S304 of the probability coefficient acquisition process (FIG. 8) (step S407).

A process that calculates a probability coefficient for each delimiter pattern of the target subsequence (trigram) (per-pattern calculation process 1) is then executed (step S408), and probability coefficient calculation process 1 ends.

Per-pattern calculation process 1, executed in step S408, is described with reference to FIG. 10. In per-pattern calculation process 1, of all the delimiter patterns that can be generated from the target subsequence (target trigram), those whose probability coefficients were requested by the menu analysis unit 30 are first generated (step S501).

Then, with k as a counter variable, the k-th of the generated delimiter patterns is taken as the target delimiter pattern (step S502). In the example of FIG. 5, "0の1赤ワイン0煮1" is the target delimiter pattern.

Next, for each (n-1)-gram delimiter pattern generated in step S406 that corresponds to the target delimiter pattern (a corresponding delimiter pattern, whose delimiter flags match at the corresponding word boundaries), the probability coefficient acquired in step S407 is added to the probability coefficient of the target delimiter pattern (step S503).

It is then determined whether all of the delimiter patterns generated in step S501 have been processed, that is, whether the probability coefficients for every delimiter pattern have been summed (step S504).
If an unprocessed delimiter pattern remains (step S504; NO), k is incremented (step S505) and the processing from step S502 is repeated for the next delimiter pattern.

If all of the delimiter patterns have been processed (step S504; YES), the probability coefficient of each target delimiter pattern, which at this point is the sum of the (n-1)-gram probability coefficients, is divided by the number of coefficients summed (the current n) to give the arithmetic mean, and per-pattern calculation process 1 ends.
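Probability coefficient calculation process 1 as a whole (choose the largest usable j, fall back to a default for monograms without data, then average) might be sketched as follows. This is one reading of steps S401 to S408, not the specification's implementation: the j-grams are taken to be all j-word runs of the original n-gram, `counts` maps each j-gram to its data count, `thresholds` maps j to its threshold, and `lookup` is a hypothetical callback standing in for the teacher data lookup of step S407. All names and numbers are illustrative.

```python
DEFAULT_COEFFICIENT = 0.5  # default value used in step S405


def sub_grams(words, j):
    """All runs of j consecutive words inside the n-gram."""
    return [tuple(words[i:i + j]) for i in range(len(words) - j + 1)]


def choose_j(words, counts, thresholds):
    """Largest j < n for which every j-gram has enough teacher data, else 1."""
    for j in range(len(words) - 1, 1, -1):
        if all(counts.get(g, 0) >= thresholds[j] for g in sub_grams(words, j)):
            return j
    return 1


def backoff_coefficient(words, flags, counts, thresholds, lookup):
    """Mean of the corresponding j-gram coefficients for the chosen j."""
    j = choose_j(words, counts, thresholds)
    values = []
    for i, gram in enumerate(sub_grams(words, j)):
        sub_flags = tuple(flags[i:i + j + 1])
        if counts.get(gram, 0) >= thresholds[j]:
            values.append(lookup(gram, sub_flags))
        else:
            # Only reachable for monograms that lack teacher data.
            values.append(DEFAULT_COEFFICIENT)
    return sum(values) / len(values)


# Hypothetical counts: the rear bigram is too rare, so monograms are used.
counts = {("の", "赤ワイン"): 5, ("赤ワイン", "煮"): 1,
          ("の",): 10, ("赤ワイン",): 10}
thresholds = {2: 3, 1: 2}
table = {(("の",), (0, 1)): 0.9, (("赤ワイン",), (1, 0)): 0.7}

result = backoff_coefficient(
    ("の", "赤ワイン", "煮"), (0, 1, 0, 1),
    counts, thresholds, lambda g, f: table[(g, f)],
)
```

In this example "煮" has no data, so its monogram pattern receives the default 0.5 and the result is the mean of 0.9, 0.7, and 0.5. The sketch divides by the number of coefficients actually summed.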

Returning to FIG. 8, once the probability coefficients have been obtained in step S304 or step S305, the output unit 460 outputs them to the menu analysis unit 30 (step S306), and the probability coefficient acquisition process ends.

Returning to FIG. 7, when the probability coefficient acquisition process (step S203) ends, the pattern with the highest probability coefficient (the maximum likelihood pattern) is selected from among all of the delimiter patterns whose leading delimiter flag is 1 (step S204).

Next, it is determined whether an unprocessed n-gram remains (step S205). If the i-th n-gram is not the last n-gram of the word string being analyzed, an unprocessed n-gram is determined to remain (step S205; YES), and i is incremented (step S206).

The process then returns to step S202 and the next loop begins. In step S203 of the second and subsequent loops, a command is sent requesting the probability coefficients of the two delimiter patterns, out of the 16 delimiter patterns of the i-th n-gram, whose delimiter flags other than the last match the pattern selected in step S204 of the previous loop.
In other words, steps S202 to S204 of the second and subsequent loops decide, between the two patterns that share delimiter flags with the patterns selected up to the previous loop, whether or not the word string is delimited after the last word of the i-th n-gram.

If the i-th n-gram is the last n-gram of the word string being analyzed, no unprocessed n-gram is determined to remain (step S205; NO). The word string is then divided at the positions where the delimiter flag is 1 in the delimiter patterns selected in step S204 of the loops so far (step S207), and the menu division process ends.
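The loop of steps S201 to S207 amounts to a greedy, left-to-right choice of delimiter flags, which the following sketch illustrates. It assumes the word string has at least n words and that `coefficient(gram, flags)` is a caller-supplied function returning a probability coefficient; both the function names and the toy scoring function are hypothetical and stand in for the probability coefficient output device 40.

```python
from itertools import product


def split_menu(words, coefficient, n=3):
    """Greedy segmentation sketch of steps S201 to S207 (len(words) >= n)."""
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

    # First loop: leading flag fixed to 1, pick the best of the 2**n patterns.
    candidates = [(1,) + rest for rest in product((0, 1), repeat=n)]
    flags = list(max(candidates, key=lambda f: coefficient(grams[0], f)))

    # Later loops: every flag but the last was decided in earlier loops.
    for gram in grams[1:]:
        prefix = tuple(flags[-n:])
        best = max((prefix + (last,) for last in (0, 1)),
                   key=lambda f: coefficient(gram, f))
        flags.append(best[-1])

    # Step S207: start a new chunk wherever the flag before a word is 1.
    chunks = []
    for word, flag in zip(words, flags):
        if flag == 1 or not chunks:
            chunks.append([])
        chunks[-1].append(word)
    return chunks


# Toy scoring function: 1.0 for the "true" flags of this menu, else 0.0.
menu = ["牛肉", "の", "赤ワイン", "煮"]
truth = (1, 0, 1, 0, 1)


def toy_coefficient(gram, flags):
    start = menu.index(gram[0])
    return 1.0 if tuple(flags) == truth[start:start + len(gram) + 1] else 0.0


chunks = split_menu(menu, toy_coefficient)
# chunks == [["牛肉", "の"], ["赤ワイン", "煮"]]
```

Because each loop fixes its flags before the next n-gram is examined, an early choice is never revisited; that is a property of the greedy procedure as described, not an optimization added here.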

Returning to FIG. 6, when the menu division process ends, the translation unit 50 translates each of the words in the divided word string using a translation dictionary (step S105).

The display unit 60 then displays the translation result (step S106), and the menu display process ends.

As described above, with the probability coefficient output device 40 according to this embodiment, even when too little teacher data containing the n-gram of interest is available, or none at all, data on how the n-gram's subsequences are delimited can be acquired and used to obtain the probability coefficients of its delimiter patterns.
That is, under the assumption that the word string to be analyzed (the menu) is delimited with the same probability with which the teacher data is delimited by a given delimiter pattern, the likelihood that the menu is delimited according to a delimiter pattern of one of its n-grams can be estimated from teacher data for (n-1)-grams down to monograms.
Consequently, less teacher data is needed than when the delimiting positions of the menu are estimated by acquiring delimiter pattern probability coefficients only from teacher data that contains the n-gram itself.

In addition, in this embodiment the probability coefficient of an n-gram delimiter pattern is calculated from the probability coefficients of (n-1)-gram to monogram delimiter patterns whose delimiter flags match. The calculated result is therefore more accurate than when the coefficients used in the calculation are selected on the basis of shared words alone.

Furthermore, a value of j is chosen for which reliable probability coefficients can be judged obtainable for all of the j-grams, from (n-1)-grams down to monograms, that are subsequences of the n-gram, and the probability coefficient is calculated from delimiter patterns of that word count. As a result, the amount of information in the front j-gram and in the rear j-gram is balanced when the probability coefficient is calculated, and the n-gram probability coefficient can be calculated without the probability coefficient of one j-gram influencing the result more strongly than the other.

Moreover, in the probability coefficient output device 40 of this embodiment the teacher data is generated from character strings of a predetermined category (here, menus), so the probability coefficients obtained fit the category better than those obtained from teacher data covering a broad category (for example, Japanese as a whole).
Consequently, a menu is divided with high accuracy when it is divided using the menu display device 1 that includes the probability coefficient output device 40.

In the description above, the probability coefficient of an n-gram is obtained as the arithmetic mean of the probability coefficients of the corresponding patterns of the extracted subsequences (j-grams), but the method of obtaining it is not limited to this.
Any formula may be substituted in which the probability coefficient of the n-gram becomes larger as at least one of the probability coefficients of the corresponding j-gram patterns becomes larger. Examples include a weighted sum in which the probability coefficient of the frontmost corresponding pattern is given the greatest influence, and a power mean of the probability coefficients of the corresponding patterns.
The probability coefficient of an n-gram may also have a predetermined maximum value (for example, 0.8), with the maximum value used as the result whenever the calculated value is equal to or greater than it.
Alternatively, a table that associates the probability coefficients of corresponding patterns with calculated values may be stored in the storage unit 470, and the n-gram probability coefficient may be obtained by referring to this table rather than by a formula.
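The alternatives named above can be illustrated with three small helpers. These are sketches only: the function names, the choice to normalize the weighted sum by the total weight, and the sample weights and exponent are assumptions, since the specification leaves the exact formulas open.

```python
def weighted_mean(values, weights):
    """Weighted combination; a larger weight gives a pattern more influence."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def power_mean(values, exponent):
    """Power mean of the corresponding-pattern coefficients."""
    return (sum(v ** exponent for v in values) / len(values)) ** (1 / exponent)


def capped(value, maximum=0.8):
    """Clamp a calculated coefficient to a predetermined maximum."""
    return min(value, maximum)


# Hypothetical coefficients of the front and rear corresponding patterns.
coefficients = [0.3, 0.5]
front_heavy = weighted_mean(coefficients, [2, 1])  # front pattern counts double
quadratic = power_mean(coefficients, 2)
limited = capped(0.9)                              # 0.8
```

Each helper is non-decreasing in every input coefficient, which is the property the text requires of a substitute formula.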

(Embodiment 2)
Next, the menu display device 1 and the probability coefficient output device 40 according to Embodiment 2 of the present invention are described.

The menu display device 1 and the probability coefficient output device 40 of this embodiment have the same configuration as those of Embodiment 1 (FIGS. 1 and 3).

The probability coefficient output device 40 of this embodiment differs from that of Embodiment 1 in how it calculates probability coefficients.
The calculation method used by the probability coefficient output device 40 of this embodiment is described with reference to FIG. 11.

First, the method by which the probability coefficient output device 40 of this embodiment calculates the probability coefficient of a trigram delimiter pattern from the probability coefficients of the front bigram and the rear bigram is described with reference to FIG. 11(a).

To calculate the probability coefficient of a trigram delimiter pattern (here, "0の1赤ワイン0煮1"), the front bigram pattern with matching delimiter flags ("0の1赤ワイン0", probability coefficient p1 = 0.31) is first extracted as a corresponding pattern. Then, as corresponding patterns in the rear bigram, the two bigram patterns whose delimiter flags match those of the front bigram's corresponding pattern are extracted: "1赤ワイン0煮1" (probability coefficient p2 = 0.45) and "1赤ワイン0煮0" (probability coefficient p3 = 0.11).

The probability coefficient of the front bigram's corresponding pattern is then apportioned according to the probability coefficients of the corresponding patterns in the rear bigram to give the trigram probability coefficients. That is, the probability coefficient of the trigram delimiter pattern is calculated as the probability pa that "1赤ワイン0煮1" follows "0の1赤ワイン0": pa = p1 · (p2/(p2 + p3)). Similarly, the probability pb that "1赤ワイン0煮0" follows "0の1赤ワイン0" can be calculated as pb = p1 · (p3/(p2 + p3)).
This formula may be replaced by any formula that apportions the probability coefficient p1 of the corresponding pattern in the front (n-1)-gram according to the probability coefficients of the corresponding patterns in the rear (n-1)-gram (for example, pa = p1^2 · (p2^2/(p2 + p3)^2)).
Although the probability coefficient p1 of the corresponding pattern in the front (n-1)-gram is apportioned here according to the probability coefficients of the corresponding patterns in the rear (n-1)-gram, the front and rear roles may be reversed. The same applies below.
Alternatively, a table that associates the probability coefficients of the corresponding patterns of the front and rear (n-1)-grams with calculated values may be stored in the storage unit 470, and the n-gram probability coefficient may be obtained by referring to this table rather than by a formula.
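With the values given for FIG. 11(a), the apportioning formula can be worked through in a few lines. The function name `apportion` is hypothetical; the inputs p1 = 0.31, p2 = 0.45, and p3 = 0.11 are the figures quoted in the text.

```python
def apportion(p_front, p_rear_target, p_rear_all):
    """Share of p_front assigned to one rear pattern, proportional to its coefficient."""
    return p_front * p_rear_target / sum(p_rear_all)


p1, p2, p3 = 0.31, 0.45, 0.11

pa = apportion(p1, p2, [p2, p3])  # 0.31 * 0.45 / 0.56, roughly 0.249
pb = apportion(p1, p3, [p2, p3])  # 0.31 * 0.11 / 0.56, roughly 0.061
```

Since the rear shares sum to one, pa + pb equals p1: the front bigram's coefficient is split between the two trigram patterns rather than averaged with them.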

Similarly, when a trigram delimiter pattern is calculated from the probability coefficients of the front bigram and the rear monogram, the probability coefficient of the front bigram's corresponding pattern is apportioned according to the probability coefficients of the corresponding patterns in the rear monogram (FIG. 11(b)).

A method of calculating the trigram delimiter pattern using only monogram probability coefficients will be described with reference to FIG. 11(c). First, the probability coefficient p7 of the corresponding pattern in the preceding monogram is distributed according to the probability coefficients p8 and p9 of the corresponding patterns in the following monogram to obtain the probability coefficient p10 of the preceding bigram (c1). Then, using the probability coefficient p10 of the preceding bigram obtained in (c1), the trigram probability coefficient is calculated in the same manner as in FIG. 11(b).
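The two-stage calculation of FIG. 11(c) can be sketched as a chained application of the same distribution rule. The exact formula for p10 is not spelled out in the text, so the form p10 = p7 · (p8 / (p8 + p9)) is an assumption made here by analogy with FIG. 11(a). The helper `share`, the names `q_match` and `q_rival`, and all numeric values are hypothetical.

```python
def share(p_front: float, p_rear: float, p_rear_rival: float) -> float:
    """Distribute p_front over two following patterns and return the first one's share."""
    total = p_rear + p_rear_rival
    return p_front * p_rear / total if total else 0.0


# Stage (c1): build the preceding bigram coefficient from two monograms.
p7, p8, p9 = 0.5, 0.4, 0.1   # hypothetical monogram coefficients
p10 = share(p7, p8, p9)      # assumed form: p7 * p8 / (p8 + p9)

# Second stage: distribute p10 over the following monogram's patterns, as in FIG. 11(b).
q_match, q_rival = 0.7, 0.3  # hypothetical coefficients of the following monogram
p_trigram = share(p10, q_match, q_rival)
```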

The calculation process of FIG. 11 executed by the probability coefficient output device 40 of this embodiment will be described concretely with reference to the flowcharts of FIGS. 12 and 13.

When the user performs an operation to photograph a menu, the menu display device 1 and the probability coefficient output device 40 of this embodiment execute the menu display process of FIG. 6, the menu division process of FIG. 7, and the probability coefficient acquisition process of FIG. 8, as in the first embodiment.
In step S305 of the probability coefficient acquisition process (FIG. 8), the probability coefficient output device 40 of this embodiment executes probability coefficient calculation process 2 (FIG. 12).

In probability coefficient calculation process 2, the (n-1)-gram generation unit 430 first generates two (n-1)-grams that are subsequences of the target character string (the n-gram), here the preceding bigram and the following bigram of FIG. 11(a), and focuses on one of them, here the preceding bigram (step S601).

Next, the determination unit 420 determines whether a probability coefficient can be acquired for the target bigram (the preceding bigram). As in step S302 of the probability coefficient acquisition process (FIG. 8), it does so by comparing the number of data items for the corresponding bigram in the bigram list stored in the n-gram list storage unit 4710 with a predetermined threshold for bigrams (step S602).

If the number of data items for the preceding bigram is equal to or greater than the bigram threshold (step S602; YES), a probability coefficient can be acquired for the preceding bigram, so the corresponding pattern is identified for the target bigram (the preceding bigram) and its probability coefficient is acquired (step S603).

If the number of data items for the preceding bigram is smaller than the threshold (step S602; NO), a probability coefficient cannot be acquired for the preceding bigram, so the probability coefficient of the preceding bigram is calculated from monograms as in (c1) of FIG. 11(c). That is, it is determined whether the current n-1 is 1 (step S604). If it is not 1 (step S604; NO), n is decremented by 1 to become 2 (step S605), and probability coefficient calculation process 2 is executed recursively with the decremented n and the preceding bigram as the target subsequence, so that delimiter patterns are generated and their probability coefficients are acquired (step S606).

On the other hand, if n-1 is 1 (step S604; YES), n cannot be reduced any further. Therefore, for a monogram whose probability coefficient cannot be acquired, delimiter patterns are generated and their probability coefficients are set to a default value (here 0.5) (step S607).

Next, it is determined whether processing has finished for both the preceding and the following (n-1)-grams (step S608). If a probability coefficient has not yet been acquired for either of them (step S608; NO), the unprocessed (n-1)-gram is taken as the target (n-1)-gram (step S609), and the process is repeated from step S602.

On the other hand, if it is determined that probability coefficients have been acquired for both the preceding and the following (n-1)-grams (step S608; YES), the process of calculating a probability coefficient for each delimiter pattern (per-delimiter-pattern calculation process 2) is executed (step S610), and probability coefficient calculation process 2 ends.
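The control flow of FIG. 12 (threshold check, recursion to shorter subsequences, and the default value for monograms) can be sketched as a single recursive function. This is a simplified illustration, not the device's actual implementation. The teacher data layout (a list of `(words, flags)` pairs with one flag per gap, including the gaps before the first and after the last word), the threshold `MIN_COUNT`, and the use of relative frequency as the probability coefficient are all assumptions made for this sketch.

```python
from itertools import product

DEFAULT_COEFF = 0.5  # default value used in step S607
MIN_COUNT = 2        # hypothetical threshold; the text only says "predetermined"


def observed(words, teacher):
    """Collect the flag tuples with which `words` occurs in the teacher data."""
    n, found = len(words), []
    for t_words, t_flags in teacher:
        for i in range(len(t_words) - n + 1):
            if tuple(t_words[i:i + n]) == tuple(words):
                found.append(tuple(t_flags[i:i + n + 1]))
    return found


def coefficients(words, teacher):
    """Map every delimiter pattern (a tuple of n+1 flags) of `words` to a coefficient."""
    n = len(words)
    all_patterns = list(product((0, 1), repeat=n + 1))
    found = observed(words, teacher)
    if len(found) >= MIN_COUNT:      # enough data (S602; YES, then S603)
        return {p: found.count(p) / len(found) for p in all_patterns}
    if n == 1:                       # cannot shrink further (S604; YES, then S607)
        return {p: DEFAULT_COEFF for p in all_patterns}
    front = coefficients(words[:-1], teacher)  # recursion (S605, S606)
    rear = coefficients(words[1:], teacher)
    result = {}
    for p in all_patterns:
        # Following patterns that agree with the preceding one on the shared gaps.
        rivals = [r for r in rear if r[:-1] == p[1:-1]]
        total = sum(rear[r] for r in rivals)
        result[p] = front[p[:-1]] * rear[p[1:]] / total if total else 0.0
    return result


# Hypothetical teacher data: the trigram ("a", "b", "c") never occurs as a whole.
teacher = [
    (("a", "b"), (1, 0, 1)),
    (("a", "b"), (1, 0, 1)),
    (("b", "c"), (0, 0, 1)),
    (("b", "c"), (0, 1, 1)),
]
table = coefficients(("a", "b", "c"), teacher)
```

The sketch folds the availability check for the n-gram itself and the check of step S602 into one function, so the same code handles trigrams, bigrams, and monograms.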

Per-delimiter-pattern calculation process 2, executed in step S610, will be described with reference to FIG. 13. In this process, first, from among all the delimiter patterns that can be generated from the target subsequence (the target trigram), the delimiter patterns whose probability coefficients were requested by the menu analysis unit 30 are generated (step S701).

Then, with k as a counter variable, attention is focused on the k-th of the generated delimiter patterns (step S702). In the examples of FIGS. 11(a) and 11(b), "0 of 1 red wine 0 boiled 1" is the target delimiter pattern. In the example of (c1) in FIG. 11(c), "0 of 1 red wine 0" is the target delimiter pattern.

Next, the (n-1)-gram delimiter patterns that correspond to the target delimiter pattern as shown in FIG. 11 (the corresponding patterns) are extracted from the (n-1)-gram delimiter patterns generated in step S603, S606, or S607 of probability coefficient calculation process 2 (FIG. 12) (step S703). In the case of FIG. 11(a) or (c2) of FIG. 11(c), "0 of 1 red wine 0" and "1 red wine 0 boiled 1" are the corresponding patterns. In the case of FIG. 11(b), "0 of 1 red wine 0" and "0 boiled 1" are the corresponding patterns. In the example of (c1) in FIG. 11(c), "0 of 1" and "red wine 0" are the corresponding patterns.

Then, from the probability coefficients of the extracted corresponding patterns, the probability coefficient of the target delimiter pattern is calculated by the method shown in FIG. 11 (step S704).

Next, it is determined whether probability coefficients have been calculated for all the delimiter patterns generated in step S701 (step S705). If an unprocessed delimiter pattern remains (step S705; NO), k is incremented (step S706), and the process from step S702 is repeated for the next delimiter pattern.

On the other hand, if all the delimiter patterns have been processed (step S705; YES), per-delimiter-pattern calculation process 2 ends.
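The loop of FIG. 13 (steps S702 to S706) can be sketched as follows, using the red wine example from the text. This is an illustrative sketch only: the function names, the representation of a delimiter pattern as a tuple of flags, and the coefficient values in the two tables are hypothetical, and the distribution formula is the one given for FIG. 11(a).

```python
def corresponding(words, flags, rear_len):
    """Step S703: split a target pattern into preceding and following corresponding patterns."""
    front = (words[:-1], flags[:-1])
    rear = (words[-rear_len:], flags[-(rear_len + 1):])
    return front, rear


def per_pattern_coefficients(words, requested, front_table, rear_table, rear_len):
    """Steps S702 to S706: compute a coefficient for each requested flag tuple."""
    results = {}
    for flags in requested:  # S702 and S706: visit the k-th pattern in turn
        (_, f_flags), (_, r_flags) = corresponding(words, flags, rear_len)
        # Following patterns that agree with the target on the gaps shared with the front.
        rivals = [r for r in rear_table if r[:-1] == r_flags[:-1]]
        total = sum(rear_table[r] for r in rivals)
        p_front = front_table.get(f_flags, 0.0)
        p_rear = rear_table.get(r_flags, 0.0)
        results[flags] = p_front * p_rear / total if total else 0.0  # S704
    return results


words = ("of", "red wine", "boiled")
target = (0, 1, 0, 1)                      # "0 of 1 red wine 0 boiled 1"
front_table = {(0, 1, 0): 0.6}             # hypothetical p1
bigram_rear = {(1, 0, 1): 0.3, (1, 0, 0): 0.1}   # hypothetical p2, p3 (FIG. 11(a))
monogram_rear = {(0, 1): 0.2, (0, 0): 0.2}       # hypothetical values (FIG. 11(b))

case_a = per_pattern_coefficients(words, [target], front_table, bigram_rear, rear_len=2)
case_b = per_pattern_coefficients(words, [target], front_table, monogram_rear, rear_len=1)
```

For the target pattern above, `corresponding(words, target, 2)` yields "0 of 1 red wine 0" and "1 red wine 0 boiled 1", and `corresponding(words, target, 1)` yields "0 of 1 red wine 0" and "0 boiled 1", matching the corresponding patterns named in the text.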

As described above, according to the probability coefficient output device 40 of this embodiment, when sufficient teacher data containing the target n-gram cannot be obtained, the probability coefficient obtained for the preceding (n-1)-gram (the target (n-1)-gram) is distributed over the following (n-1)-gram to calculate the probability coefficient of the delimiter pattern in the n-gram (or vice versa).
That is, the teacher data is divided into (n-1)-gram sequences, and the probability of a given delimiter pattern is distributed according to the probabilities of the next delimiter patterns that are possible when the string is assumed to be delimited in the manner of that pattern, to calculate the n-gram probability coefficient. The probability coefficient can therefore be calculated from more information, and the calculation accuracy is high.

In addition, even when a probability coefficient cannot be obtained for some of the (n-1)-grams, the n-gram probability coefficient can be calculated using whatever (n-1)-gram probability coefficients are available. The loss of accuracy is therefore smaller than when probability coefficients of (n-2)-grams down to monograms are used uniformly.

(Modifications)
The embodiments of the present invention have been described above, but the embodiments are not limited to these, and various modifications are possible.
For example, in the first and second embodiments, when the number of data items for an n-gram is equal to or less than a predetermined threshold, the probability coefficient is calculated from the probability coefficients of (n-1)-grams down to monograms, but embodiments of the present invention are not limited to this. For example, in such a case, a value may be calculated from the probability coefficients of (n-1)-grams down to monograms, an n-gram probability coefficient may also be obtained from the data whose count is at or below the threshold, and the arithmetic mean of the two values may be used as the result.
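The averaging variation can be sketched as follows. This is only an illustration under assumptions: the function name and parameters are hypothetical, and the direct estimate is taken to be the relative frequency of the pattern among the few n-gram occurrences that are available, which the text does not specify.

```python
def averaged_coefficient(p_backoff: float, pattern_count: int, ngram_count: int) -> float:
    """Average the value derived from shorter grams with a direct estimate from sparse data."""
    if ngram_count == 0:
        return p_backoff  # assumption: with no occurrences, keep the derived value
    p_direct = pattern_count / ngram_count  # assumed: relative frequency
    return (p_backoff + p_direct) / 2


# Hypothetical numbers: derived value 0.45, pattern seen once in two occurrences.
value = averaged_coefficient(0.45, 1, 2)
```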

In the first and second embodiments, the teacher data is stored in the teacher data storage unit 4730, but the teacher data may instead be stored in an external device rather than inside the probability coefficient output device.
In that case, the probability coefficient output device accesses the external device through the communication unit 405 to acquire the teacher data.

In the above description, a delimiter pattern defines a unique delimiter flag for every inter-word position, and only patterns whose delimiter flags match completely are treated as corresponding delimiter patterns. However, a configuration is also possible in which some delimiter flags in a delimiter pattern are defined as unknown, and the unknown positions are ignored when delimiter flags are compared.
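Comparison with unknown flags can be sketched as follows. This is a minimal illustration, assuming that a delimiter pattern is held as a tuple of flags and that an unknown flag is represented by `None`; both choices, and the function name, are hypothetical.

```python
UNKNOWN = None  # hypothetical marker for a delimiter flag whose value is not defined


def flags_match(a, b) -> bool:
    """Compare two flag tuples, ignoring every position where either flag is unknown."""
    if len(a) != len(b):
        return False
    return all(x is UNKNOWN or y is UNKNOWN or x == y for x, y in zip(a, b))


same = flags_match((0, UNKNOWN, 1), (0, 1, 1))   # the unknown middle flag is skipped
differ = flags_match((0, UNKNOWN, 1), (1, 1, 1)) # the first flags disagree
```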

In the first and second embodiments, the probability coefficient of an n-gram delimiter pattern is obtained by computing, each time, the appearance probability of the delimiter pattern in the teacher data, which consists of tagged character strings. However, a configuration is also possible in which the probability coefficient output device or an external device stores a pattern probability coefficient list in which the probability coefficients of delimiter patterns are registered, and the probability coefficient is acquired by referring to this list. An example of such a list will be described with reference to FIG. 14. FIG. 14 is an example of a trigram pattern probability coefficient list in which a probability coefficient is registered for each combination of a trigram and its delimiter flags. For example, the value 0.02 registered in the column for the pattern "0010" and the row for "pork / belly meat / of" indicates that the probability coefficient of the delimiter pattern "0 pork 0 belly meat 1 of 0" is 0.02.
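A lookup in such a list can be sketched with a nested dictionary. The layout below is an assumption made for illustration (the text only describes rows of trigrams and columns of flag strings); the single entry is the 0.02 example given for FIG. 14, and the function names are hypothetical.

```python
# Rows are trigrams, columns are flag strings; only the example from FIG. 14 is filled in.
TRIGRAM_PATTERN_TABLE = {
    ("pork", "belly meat", "of"): {"0010": 0.02},
}


def render(words, flags: str) -> str:
    """Interleave flags and words, e.g. '0 pork 0 belly meat 1 of 0'."""
    parts = [flags[0]]
    for word, flag in zip(words, flags[1:]):
        parts += [word, flag]
    return " ".join(parts)


def lookup(words, flags: str, default=None):
    """Return the registered coefficient, or `default` if the entry is missing."""
    return TRIGRAM_PATTERN_TABLE.get(tuple(words), {}).get(flags, default)


pattern_text = render(("pork", "belly meat", "of"), "0010")
coefficient = lookup(("pork", "belly meat", "of"), "0010")
```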

In the first and second embodiments, the word strings to be analyzed are menus, but the present invention can be applied to word strings of any category other than menus. The word strings to be analyzed are preferably of a category in which the words that appear are limited and the rules for delimiting between words are limited. Besides menus, examples of such categories include addresses and the statements of efficacy and instructions for medicines.

In the first and second embodiments, the probability coefficient output device receives from the menu analysis unit 30 n words (an n-gram) together with information indicating which of the n-gram's delimiter patterns require probability coefficients, and calculates and outputs probability coefficients for those delimiter patterns. However, the processing executed by the probability coefficient output device according to the present invention is not limited to this. For example, the probability coefficient output device may receive only an n-gram from an external device, generate all the delimiter patterns that can be defined for that n-gram, and calculate and output the probability coefficients of all the generated delimiter patterns.
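Generating every delimiter pattern of an n-gram can be sketched as below. The sketch assumes that a pattern has one flag per gap, including the gaps before the first word and after the last word, which is consistent with the four-flag pattern "0010" given for a trigram; the function name is hypothetical.

```python
from itertools import product


def all_delimiter_patterns(words):
    """Return every flag string for `words`: one flag per gap, n + 1 flags in total."""
    return ["".join(bits) for bits in product("01", repeat=len(words) + 1)]


patterns = all_delimiter_patterns(("pork", "belly meat", "of"))
```

A trigram therefore has 2^4 = 16 candidate patterns, and the number doubles with each additional word, which is one reason a caller might request only the patterns it needs.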

The central part that performs the processing of the probability coefficient output device, which comprises the information processing unit 401, the data storage unit 402, the program storage unit 403, and so on, can be realized with an ordinary computer system rather than a dedicated system. For example, a computer program for executing the above operations may be stored on a computer-readable recording medium (a flexible disk, CD-ROM, DVD-ROM, or the like) and distributed, and an information terminal that executes the above processing may be configured by installing the program on a computer. Alternatively, the computer program may be stored in a storage device of a server on a communication network such as the Internet, and the information processing apparatus may be configured by an ordinary computer system downloading it.

When the functions of the probability coefficient output device are realized by dividing the work between an OS (operating system) and an application program, or by cooperation between the OS and the application program, only the application program portion may be stored on the recording medium or in the storage device.

It is also possible to superimpose the computer program on a carrier wave and distribute it over a communication network. For example, the computer program may be posted on a bulletin board system (BBS) on a communication network and distributed over the network. The above processing may then be executed by starting this computer program and running it under the control of the OS in the same way as other application programs.

Part of the processing to be executed may also be realized using a computer that is independent of the probability coefficient output device.

Preferred embodiments of the present invention have been described above, but the present invention is not limited to these specific embodiments and includes the inventions described in the claims and their equivalents. The inventions described in the claims as originally filed in the present application are appended below.

(Appendix 1)
An information processing apparatus comprising:
a word string acquisition unit that acquires a target word string;
an extraction unit that extracts a plurality of partial word strings, each including one or more words included in the target word string acquired by the word string acquisition unit;
a probability coefficient acquisition unit that acquires, for each partial word string extracted by the extraction unit, delimiter patterns each corresponding to a way of delimiting in which the word string is or is not delimited at each inter-word position between the words constituting the word string, and that acquires, for each acquired delimiter pattern, a delimitation probability coefficient indicating the likelihood that a teacher word string including the partial word string is delimited in the way corresponding to that delimiter pattern; and
a probability coefficient obtaining unit that obtains the delimitation probability coefficient of a delimiter pattern of the target word string based on the delimitation probability coefficients acquired by the probability coefficient acquisition unit.

(Appendix 2)
The information processing apparatus according to appendix 1, wherein
the teacher word string is a word string that belongs to the same category as the target word string and for which it is defined whether or not the word string is delimited at each of its inter-word positions,
the apparatus further comprises a determination unit that determines whether word strings including the partial word string extracted by the extraction unit can be acquired in a number sufficient to serve as teacher word strings for acquiring the delimitation probability coefficient, and
when the determination unit determines that word strings including the extracted partial word string cannot be acquired in a number sufficient to acquire the delimitation probability coefficient, the extraction unit further extracts partial word strings of the extracted partial word string.

(Appendix 3)
The information processing apparatus according to appendix 1 or 2, wherein
the probability coefficient acquisition unit acquires, from among the delimiter patterns that can be defined for the partial word string, a delimiter pattern that has the same way of delimiting, at the corresponding inter-word positions, as the delimiter pattern of the target word string obtained by the probability coefficient obtaining unit.

(Appendix 4)
The information processing apparatus according to any one of appendices 1 to 3, wherein
the delimitation probability coefficient of the delimiter pattern of the target word string obtained by the probability coefficient obtaining unit increases as at least one of the delimitation probability coefficients acquired by the probability coefficient acquisition unit increases.

(Appendix 5)
The information processing apparatus according to any one of appendices 1 to 4, wherein
the partial word strings extracted by the extraction unit each consist of the same number of words.

(Appendix 6)
The information processing apparatus according to any one of appendices 1 to 3, wherein
the extraction unit extracts at least a front partial word string, which is a partial word string including the first word of the target word string, and a rear partial word string, which is a partial word string including the last word, and
the probability coefficient obtaining unit obtains the delimitation probability coefficient of the delimiter pattern of the target word string by apportioning the delimitation probability coefficient of the delimiter pattern acquired from a target partial word string, which is one of the front partial word string and the rear partial word string, based on the delimitation probability coefficients of delimiter patterns that are acquired from the other of the front and rear partial word strings and that have the same way of delimiting as the delimiter pattern of the target partial word string at the corresponding inter-word positions.

(Appendix 7)
The information processing apparatus according to any one of appendices 1 to 6, wherein
the target word string and the teacher word string are word strings expressing a menu.

(Appendix 8)
A program causing a computer to execute:
a process of acquiring a target word string;
a process of extracting a plurality of partial word strings, each including one or more words included in the acquired target word string;
a process of acquiring, for each of the extracted partial word strings, delimiter patterns each corresponding to a way of delimiting in which the word string is or is not delimited at each inter-word position between the words constituting the word string;
a process of acquiring, for each of the acquired delimiter patterns, a delimitation probability coefficient indicating the likelihood that a teacher word string including the partial word string is delimited in the way corresponding to that delimiter pattern; and
a process of obtaining the delimitation probability coefficient of a delimiter pattern of the target word string based on the acquired delimitation probability coefficients.

Description of Symbols
1 ... Menu display device, 10 ... Input unit, 20 ... OCR, 30 ... Menu analysis unit, 40 ... Probability coefficient output device, 50 ... Translation unit, 60 ... Display unit, 70 ... Information processing unit, 401 ... Information processing unit, 402 ... Data storage unit, 403 ... Program storage unit, 405 ... Input/output unit, 405 ... Communication unit, 406 ... Internal bus, 407 ... Control program, 410 ... Word string acquisition unit, 420 ... Determination unit, 430 ... (n-1)-gram generation unit, 440 ... Probability coefficient acquisition unit, 450 ... Probability coefficient calculation unit, 460 ... Output unit, 470 ... Storage unit, 4710 ... n-gram list storage unit, 4720 ... Setting storage unit, 4730 ... Teacher data storage unit

Claims

An information processing apparatus comprising:
a word string acquisition unit that acquires a word string having a plurality of words;
an extraction unit that extracts a plurality of partial word strings, each including one or more words included in the word string acquired by the word string acquisition unit;
a probability coefficient acquisition unit that acquires, for each partial word string extracted by the extraction unit, delimiter patterns each corresponding to a way of delimiting in which the word string is or is not delimited at each inter-word position between the words constituting the word string, and that acquires, for each acquired delimiter pattern, a delimitation probability coefficient indicating the likelihood that the partial word string is delimited in the way corresponding to that delimiter pattern, by extracting teacher word strings having the same pattern as the delimiter pattern from a teacher word string storage unit and using the number of teacher word strings extracted, the teacher word string storage unit storing teacher word strings that belong to the same category as the word string and for which it is defined whether or not the word string is delimited at each inter-word position; and
a dividing unit that divides the word string acquired by the word string acquisition unit based on the delimitation probability coefficients acquired by the probability coefficient acquisition unit.

The information processing apparatus according to claim 1, further comprising
a determination unit that determines whether the teacher word string storage unit holds a number of teacher word strings sufficient to acquire the delimitation probability coefficient for the partial word string extracted by the extraction unit,
wherein the extraction unit further extracts partial word strings of the extracted partial word string when the determination unit determines that there is not a sufficient number of teacher word strings to acquire the delimitation probability coefficient.

The information processing apparatus according to claim 2, wherein,
when the number of teacher word strings stored in the teacher word string storage unit is not sufficient and the number of words in the extracted partial word string is 1, the probability coefficient acquisition unit sets a predetermined value as the delimitation probability coefficient.

The information processing apparatus according to any one of claims 1 to 3, wherein
the partial word strings extracted by the extraction unit each consist of the same number of words.

The information processing apparatus according to claim 1, wherein
the word string and the teacher word string are word strings expressing a menu.

A program causing a computer to function as:
a word string acquisition unit that acquires a word string having a plurality of words;
an extraction unit that extracts a plurality of partial word strings, each including one or more words included in the word string acquired by the word string acquisition unit;
a probability coefficient acquisition unit that acquires, for each partial word string extracted by the extraction unit, delimiter patterns each corresponding to a way of delimiting in which the word string is or is not delimited at each inter-word position between the words constituting the word string, and that acquires, for each acquired delimiter pattern, a delimitation probability coefficient indicating the likelihood that the partial word string is delimited in the way corresponding to that delimiter pattern, by extracting teacher word strings having the same pattern as the delimiter pattern from a teacher word string storage unit and using the number of teacher word strings extracted, the teacher word string storage unit storing teacher word strings that belong to the same category as the word string and for which it is defined whether or not the word string is delimited at each inter-word position; and
a dividing unit that divides the word string acquired by the word string acquisition unit based on the delimitation probability coefficients acquired by the probability coefficient acquisition unit.