JP5305971B2

JP5305971B2 - Abbreviation estimation apparatus and method

Info

Publication number: JP5305971B2
Application number: JP2009036879A
Authority: JP
Inventors: 裕美若木; 優鈴木; 一男住田; 寛子藤井
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2009-02-19
Filing date: 2009-02-19
Publication date: 2013-10-02
Anticipated expiration: 2029-02-19
Also published as: JP2010191804A

Description

The present invention relates to an abbreviation estimation apparatus and method for estimating which of an input word or set of words are abbreviations.

One known method generates abbreviations by applying abbreviation generation rules based on the character string of the original word (official name), and then calculates the utterance probability of each generated abbreviation from the likelihood of the abbreviation generation rules (see, for example, Patent Document 1). Another method determines whether an input word is an abbreviation or an official name. If the word is determined to be an official name, an abbreviation is generated from it by the abbreviation generation rules and used as a search term. If the word is determined to be an abbreviation candidate, a database is searched, and if it contains an abbreviation/official-name pair matching the word, the pair is used as search terms. Supplementing the search terms in this way prevents search omissions (see, for example, Patent Document 2).

Patent Document 1: Japanese Patent No. 03724649
Patent Document 2: Japanese Patent Laid-Open No. 11-25117

As described above, conventional methods require the original word in order to calculate how abbreviation-like a word is. However, for abbreviations of words whose notation can be converted between English and katakana, such as katakana loanwords, and of words whose official name is a coined word, such as a television program title, the generation rules become complex, and for low-frequency generation rules it is difficult to judge abbreviation likelihood from the likelihood of the generation rules. For example, consider the abbreviation "Mスマ" for the name "ミュージックスマッシュ" ("Music Smash"). The pronunciation of "Mスマ", "エムスマ" (emu-suma), is not a truncation of the original word "ミュージックスマッシュ", so notation conversion must be taken into account: the abbreviation has to be generated by combining notation conversion rules and truncation rules for each of the basic words "ミュージック" and "スマッシュ" obtained by splitting the name at its splittable position. In this case, the element word "M" is obtained by converting "ミュージック" to its English notation "Music" and then truncating it, while the element word "スマ" is obtained by truncating the original notation of "スマッシュ" without converting it to the English notation "Smash". The two are concatenated to form the word "Mスマ". Notation conversion rules and truncation rules therefore have to be combined as appropriate, which makes the calculation difficult.
Further, a method that requires the original word cannot calculate the abbreviation likelihood of a candidate when no original word is available. For example, when only a single word is input, such as a word set off in parentheses in a text or a word entered as a search string, abbreviation likelihood cannot be judged because there is no original word. In other words, abbreviation estimation cannot be separated from the generation rules. Furthermore, as abbreviation generation rules become more diverse, rarely applied rules receive low probabilities, and a large amount of training data is needed for the probability calculation. In a method that estimates based on probabilities assigned to predetermined abbreviation generation rules, the generation rules are used when measuring abbreviation likelihood, so the features used for the estimation are the generation rules themselves. Because these rules must be prepared manually in advance, the features used for abbreviation estimation cannot themselves be learned.

Meanwhile, abbreviation estimation methods that require a basic word database have difficulty handling coined words and new words. In a method that generates abbreviations using a basic word database, an abbreviation is generated if the partial words of the official name are registered in the database as basic words, but the abbreviation likelihood of the generated abbreviation itself cannot be judged.

The present invention has been made to solve the above problems, and its object is to provide an abbreviation estimation apparatus and method that can estimate abbreviation likelihood from the features of an abbreviation candidate and its split positions into element words, without depending on abbreviation generation rules.

To solve the above problems, an abbreviation estimation apparatus according to the present invention comprises: first acquisition means for acquiring a character string that can be an abbreviation candidate and splittable position information for splitting the character string; calculation means for calculating a plurality of first features indicating characteristics of the character string, derived from the character string and its splittable position information; storage means for storing second features belonging to desired categories, a category being an index indicating a concept of features, together with the likelihoods corresponding to the second features; second acquisition means for comparing the plurality of first features with the second features of each category and acquiring, from the storage means, the likelihood corresponding to each identical feature; and calculation means for calculating, from the likelihoods acquired by the second acquisition means, an abbreviation candidate rank, which is a rank indicating the correctness of the character string as an abbreviation of an official name.

Another abbreviation estimation apparatus according to the present invention comprises: first acquisition means for acquiring a character string that can be an abbreviation candidate; splitting means for splitting the character string between its characters to obtain splittable position information for splitting the character string; calculation means for calculating a plurality of first features indicating characteristics of the character string, derived from the character string and the splittable position information; storage means for storing second features belonging to desired categories, a category being an index indicating a concept of features, together with the likelihoods corresponding to the second features; second acquisition means for comparing the plurality of first features with the second features of each category and acquiring, from the storage means, the likelihood corresponding to each identical feature; and calculation means for calculating, from the likelihoods acquired by the second acquisition means, an abbreviation candidate rank, which is a rank indicating the correctness of the character string as an abbreviation of an official name.

Another abbreviation estimation apparatus according to the present invention comprises: first acquisition means for acquiring a character string as an official name; generation means for generating, from the character string, at least one abbreviation candidate and at least one piece of splittable position information for splitting the abbreviation candidate; calculation means for calculating a plurality of first features indicating characteristics of the character string, derived from the abbreviation candidate and its splittable position information; storage means for storing second features belonging to desired categories, a category being an index indicating a concept of features, together with the likelihoods corresponding to the second features; second acquisition means for comparing the plurality of first features with the second features of each category and acquiring, from the storage means, the likelihood corresponding to each identical feature; and calculation means for calculating, from the likelihoods acquired by the second acquisition means, an abbreviation candidate rank, which is a rank indicating the correctness of the abbreviation candidate as an abbreviation of the official name.

According to the abbreviation estimation apparatus and method of the present invention, abbreviation likelihood can be estimated from the features of an abbreviation candidate and its split positions into element words, without depending on abbreviation generation rules. Consequently, an abbreviation whose generation requires a notation conversion rule (for example, katakana to or from the Latin alphabet) can be compared in the same way as an abbreviation that uses no notation conversion rule. In addition, even when the abbreviation generation rules are complex, or when a candidate was produced by a low-frequency rule, the estimate of abbreviation likelihood is hardly affected, and abbreviations can be estimated accurately even from a relatively small amount of training data. Furthermore, even when abbreviation generation rules are fixed in advance, abbreviations can be estimated from features that differ from those rules, and the features used for abbreviation estimation can be learned.

Block diagram showing the abbreviation estimation apparatus according to the first embodiment.
Flowchart showing the operation of the abbreviation estimation apparatus according to the first embodiment.
Diagram showing an example of the data stored in the feature storage unit.
Flowchart showing the operation of the feature extraction unit.
Flowchart showing the operation of the weight derivation unit.
Flowchart showing the operation of the abbreviation estimation calculation unit.
Block diagram showing a first modification of the abbreviation estimation apparatus according to the first embodiment.
Block diagram showing a second modification of the abbreviation estimation apparatus according to the first embodiment.
Block diagram showing a third modification of the abbreviation estimation apparatus according to the first embodiment.
Block diagram showing a fourth modification of the abbreviation estimation apparatus according to the first embodiment.

Hereinafter, an abbreviation estimation apparatus and method according to embodiments of the present invention will be described in detail with reference to the drawings. In the following embodiments, parts given the same reference numeral perform the same operation, and repeated description is omitted. In what follows, a basic word is a word that stands as a word on its own. An element word is a word obtained by dividing, at the splittable positions, the character string that is the official name corresponding to an abbreviation. Element words also include character strings obtained by applying one or more conversion rules to such a string. The original word is defined as the official name, that is, the original word before it is shortened into an abbreviation. A splittable position is a position at which a character string is judged to be divisible by a string-splitting method. The split need not fall on a word boundary and may fall between any characters of the string. A conversion rule is, for example, a rule that converts a character string into kanji, katakana, hiragana, or Latin letters.
(First embodiment)
The configuration of the abbreviation estimation apparatus according to this embodiment will be described in detail with reference to FIG. 1.
The abbreviation estimation apparatus 100 according to the present embodiment includes an input unit 101, a feature extraction unit 102, a weight derivation unit 103, an abbreviation estimation calculation unit 104, an output unit 105, and a feature storage unit 106.
The feature extraction unit 102, the weight derivation unit 103, and the abbreviation estimation calculation unit 104 are collectively referred to as an abbreviation estimation unit 107.
The input unit 101 accepts, as externally input character strings, one or more character strings that can be abbreviation candidates. An abbreviation candidate is a candidate for an abbreviation of the official name of a character string. Splittable position information for splitting the character string is attached to each character string. A character string may consist of a single character. Here, splittable position information indicates the positions before and after each string unit, where a string unit is a semantic chunk from which a word used in the abbreviation is extracted when the abbreviation is generated from the official name. For example, suppose the abbreviation "Mスマ" is generated from the official name "ミュージックスマッシュ". The string unit from which "M", a word used in the abbreviation, is extracted is "ミュージック", and likewise the string unit from which "スマ" is extracted is "スマッシュ". The splittable position information is the information indicating the position between the string units "ミュージック" and "スマッシュ", the position before "ミュージック", and the position after "スマッシュ". Because the position after "ミュージック" and the position before "スマッシュ" are the same, either one is used to represent both. An element word is also defined as a string unit delimited by the splittable position information.
The splittable position information may additionally record whether each element word in the abbreviation candidate was at the beginning or the end of a word in the official name. The abbreviation candidate and the splittable position information are sent to the feature extraction unit 102.
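As an illustration of the splittable position information described above, the following minimal sketch represents the positions as character offsets into the official name. The offset representation and the function name `split_at` are assumptions made for this example; the text only says that the positions before, between, and after the string units are recorded.

```python
from typing import List


def split_at(text: str, boundaries: List[int]) -> List[str]:
    """Cut `text` into string units at the given character offsets.

    `boundaries` is a hypothetical encoding of the splittable position
    information: the offset before the first unit, the offsets between
    adjacent units, and the offset after the last unit.
    """
    return [text[start:end] for start, end in zip(boundaries, boundaries[1:])]


# "ミュージック" has 6 characters and "スマッシュ" has 5, so the three
# positions (before, between, after) are offsets 0, 6 and 11.
units = split_at("ミュージックスマッシュ", [0, 6, 11])
print(units)
```

Storing a single offset for the shared boundary mirrors the remark that the position after "ミュージック" and the position before "スマッシュ" are represented by one position.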
The feature extraction unit 102 extracts the features of the abbreviation candidate from the abbreviation candidate and the splittable position information received from the input unit 101, and sends them to the weight derivation unit 103. Here, a feature is information specific to a character string, derived from the character string and its splittable position information. An example is the information that the string consists of one Latin letter and two katakana characters (corresponding to "1 Latin letter + 2 katakana" in FIG. 3). Concrete examples are shown in FIG. 3. Multiple features are generally grouped into a category. In other words, a category is an index indicating a concept common to multiple features, for example the combination of character counts and character types ("C1: combination of character count + character type" in FIG. 3). A category may, however, indicate only a single feature. Specifically, "C5: number of elements" in FIG. 3 is an index indicating only one feature. The character type is the kind of characters in an element word, for example Latin letters, hiragana, or katakana. The operation of the feature extraction unit 102, including the feature extraction process, is described in detail later with reference to FIG. 4.

The feature storage unit 106 stores the features that a character string can have in association with their weights. In response to a search from the weight derivation unit 103, it sends the feature and the corresponding weight as needed. The data held in the feature storage unit 106 is described in detail later with reference to FIG. 3.
The weight derivation unit 103 first searches whether each feature extracted from the abbreviation candidate, received from the feature extraction unit 102, is in the feature storage unit 106. If a matching feature is stored in the feature storage unit 106, the feature and its corresponding weight are sent to the abbreviation estimation calculation unit 104. If no stored feature matches, the feature may be treated as one not among the stored features and given a predetermined constant weight. Alternatively, a weight may be given on the basis that, within a particular category (for example, character count), the candidate has a feature not among the stored features. The operation of the weight derivation unit 103 is described in detail later with reference to FIG. 5.
The abbreviation estimation calculation unit 104 calculates the abbreviation likelihood of the abbreviation candidate from the features and weights received from the weight derivation unit 103, and sends an index of abbreviation likelihood to the output unit 105. Abbreviation likelihood is a rank indicating correctness as an abbreviation candidate, and is also called the abbreviation candidate rank. In the calculation, abbreviation likelihood is estimated, for example, by taking the product of the weights. The calculation is not limited to the product, however, and may follow any function for which a larger weight-based value yields a larger abbreviation likelihood. For example, the feature weights may be positive or negative, and abbreviation likelihood may be calculated as their sum.
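The two scoring options mentioned for the abbreviation estimation calculation unit 104, the total product (総乗) of the weights and a sum over signed weights, can be sketched as follows. This is only an illustration under stated assumptions: the function names, the candidate names, and the example weights are hypothetical, and the ranking step simply sorts candidates by score.

```python
import math
from typing import Dict, Iterable, List, Tuple


def likelihood_product(weights: Iterable[float]) -> float:
    """Abbreviation likelihood as the product of the feature weights."""
    return math.prod(weights)


def likelihood_sum(weights: Iterable[float]) -> float:
    """Alternative: sum of weights, usable when weights may be negative."""
    return sum(weights)


def rank_candidates(scored: Dict[str, float]) -> List[Tuple[str, float]]:
    """Order candidates so that a larger score means a better rank."""
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


# Hypothetical weights for two hypothetical candidates.
scores = {
    "candidate_a": likelihood_product([0.5, 0.5, 0.2]),
    "candidate_b": likelihood_product([0.5, 0.1, 0.2]),
}
print(rank_candidates(scores))
```

Both functions are monotone in each weight when the weights are positive, which is the only property the description requires of the scoring function.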
The output unit 105 outputs each abbreviation candidate in association with the abbreviation likelihood calculated by the abbreviation estimation calculation unit 104. When there are multiple abbreviation candidates, the abbreviation estimation unit 107 processes every candidate, and the output unit 105 outputs each candidate in association with the abbreviation likelihood calculated for it.

Next, the operation of the abbreviation estimation apparatus 100 according to this embodiment will be described in detail with reference to the flowchart of FIG. 2. Here, assume that "Mスマ" is input as an example of an abbreviation candidate.
In S201, the character string "Mスマ" and its split positions "/M/スマ/" are input to the input unit 101. Here "/" marks a splittable position, and the element words are "M" and "スマ". Similarly, if the abbreviation candidate is "朝ピタ", its split positions "/朝/ピタ/" are input, and the element words are "朝" and "ピタ".
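The "/" notation used for the S201 input can be read mechanically. The sketch below is an assumption about how such an input might be parsed, not the patent's implementation; the function name `parse_candidate` is hypothetical, and full-width slashes are normalized because the Japanese text writes the marker as "／".

```python
from typing import List, Tuple


def parse_candidate(marked: str) -> Tuple[str, List[str]]:
    """Turn a string such as "/M/スマ/" into (candidate, element words)."""
    # Accept both the ASCII slash and the full-width slash as markers.
    normalized = marked.replace("／", "/")
    # Empty pieces come from the leading and trailing markers; drop them.
    elements = [piece for piece in normalized.split("/") if piece]
    return "".join(elements), elements


for marked in ["/M/スマ/", "／朝／ピタ／"]:
    print(parse_candidate(marked))
```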
In S202, the feature extraction unit 102 extracts the features obtained from the abbreviation candidate and its splittable position information. These features are, for example, information such as "1 Latin letter + 2 katakana", "3 characters", and "2 elements".

An example of the data stored in the feature storage unit 106 is shown in FIG. 3. Table 1 in the feature storage unit 106 holds, prepared in advance, features belonging to desired categories and the weight corresponding to each feature. The categories listed in FIG. 3 are the combination of character counts and character types of the element words ("C1: combination of character count + character type"), the combination of character types of the element words ("C2: combination of character types"), the combination of character counts of the element words ("C3: combination of character counts"), the number of characters in the character string ("C4: character count"), and the number of element words ("C5: number of elements").

Weights may also be held for features other than those listed in Table 1. These features, which are not among those listed in Table 1 of FIG. 3, correspond to Table 2. For example, in category C4 (character count), if Table 1 in the feature storage unit 106 contains only "3 characters" and "4 characters", then for any other character count a feature such as "belongs to category C4, but a character count not in Table 1" may be held. Likewise, in category C2 (combination of character types), if Table 1 in the feature storage unit 106 contains "katakana" and "Latin letters" and the input abbreviation candidate is "hiragana", Table 1 has no "hiragana", so a feature such as "belongs to category C2, but a character type not in Table 1" may be held. A predetermined weight is held for every feature in Table 1 and in Table 2.
The features of the abbreviation candidate "/M/スマ/" are "1 Latin letter + 2 katakana", "Latin letter + katakana", "3 characters", and "2 elements". When the abbreviation candidate is "/朝/ピタ/", the features are "1 kanji + 2 katakana", "kanji + katakana", "3 characters", and "2 elements".

The feature extraction process in the feature extraction unit 102 will now be described in detail with reference to the flowchart of FIG. 4.
In S401, a character string S that is an abbreviation candidate is first input. Here, the character string S is assumed to be "Mスマ". Next, in S402, the number of characters in S is calculated. For "Mスマ", the number of characters is three. In S403, one of the element words E_i (i = 1 to N) that make up S is taken out. Since the element words here are "M" and "スマ", the first element word "M" is taken out. The element words need not be taken out from the beginning and may be taken out in random order, as long as every element word is eventually taken out. In S404, the number of characters in the element word taken out in S403 is calculated. For "M", the number of characters is one. In S405, the character type of the element word is obtained. For "M", the character type is a Latin letter. In S406, the processing of S403 to S405 is repeated until every element word E_i of S has been processed. In this example, the last element word "スマ" is taken out next, yielding the properties that its character count is two and its character type is katakana.
Finally, in S407, M features F_j (j = 1 to M, where M is the number of features) are generated from the results of the above processing for the categories of the character string S, namely the character count, the combination of character types of the element words E_i that make up S, the combination of character counts, and the combination of character counts and character types. The feature extraction process in the feature extraction unit 102 then ends. For "/M/スマ/", the features obtained are, for example, "1 Latin letter + 2 katakana", "Latin letter + katakana", "1 character + 2 characters", "3 characters", and "2 elements". If the character string S is "/朝/ピタ/", the features obtained are "1 kanji + 2 katakana", "kanji + katakana", "1 character + 2 characters", "3 characters", and "2 elements". When the features contained in each category of Tables 1 and 2 are all distinct, at most one feature is generated per category, so the number of features generated for a character string is at most the number of categories.
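The S401 to S407 flow can be sketched as below. This is a minimal illustration under assumptions, not the patented implementation: the function names, the dictionary layout keyed by C1 to C5, the Unicode ranges used to classify character types, and the "mixed" label for an element word containing more than one character type are all choices made for this example.

```python
import unicodedata
from typing import Dict, List


def char_type(ch: str) -> str:
    """Classify one character (assumed ranges; full-width letters folded by NFKC)."""
    folded = unicodedata.normalize("NFKC", ch)
    code = ord(folded[0])
    if folded.isascii() and folded.isalpha():
        return "latin"
    if 0x3041 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    return "other"


def element_type(word: str) -> str:
    """Character type of an element word; 'mixed' is an assumption of this sketch."""
    kinds = {char_type(ch) for ch in word}
    return kinds.pop() if len(kinds) == 1 else "mixed"


def extract_features(elements: List[str]) -> Dict[str, object]:
    counts = [len(word) for word in elements]          # S404 for each E_i
    kinds = [element_type(word) for word in elements]  # S405 for each E_i
    return {                                           # S407: one feature per category
        "C1": tuple(zip(counts, kinds)),  # character count + character type
        "C2": tuple(kinds),               # combination of character types
        "C3": tuple(counts),              # combination of character counts
        "C4": sum(counts),                # character count of S (S402)
        "C5": len(elements),              # number of element words
    }


print(extract_features(["M", "スマ"]))
print(extract_features(["朝", "ピタ"]))
```

Because each category contributes exactly one entry, the result has at most as many features as there are categories, matching the remark at the end of the S407 description.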

Next, in S203, the weight derivation unit 103 refers to the feature storage unit 106 to search for features identical to those of the character string S, and in S204 the weight derivation unit 103 calculates the weight corresponding to each feature.
The weight derivation process performed by the weight derivation unit 103 in S203 and S204 is described in detail below with reference to the flowchart of FIG. 5.
First, in S501, the M features F_j of the character string S obtained by the feature extraction process of the feature extraction unit 102 in FIG. 4 are received as input. Next, in S503, Table 1 of the feature storage unit 106 is searched for each input feature F_j. In the condition determination of S504, if the feature F_j exists in Table 1, the process proceeds to S505 and the weight of that feature is acquired. If the feature F_j does not exist in Table 1, the process proceeds to S506, where the feature is replaced with the Table 2 feature of the same category, and its weight is then acquired from Table 2 in S505. In S507, the weight derivation process from S502 to S506 is repeated until all features F_j have been processed.
For example, "/M/スマ/" (M-Suma) has the five features F_j "1 alphabetic character + 2 katakana characters", "alphabetic + katakana", "1 character + 2 characters", "3 characters", and "2 elements", and Table 1 of the feature storage unit 106 is searched for each of them. Since all five features exist in Table 1, the corresponding weights are acquired from Table 1: 0.30 for "1 alphabetic character + 2 katakana characters", 0.20 for "alphabetic + katakana", 0.21 for "1 character + 2 characters", 0.25 for "3 characters", and 0.40 for "2 elements".
When the input character string S is "/朝/ピタ/" (Asa-Pita), the five features are "1 kanji character + 2 katakana characters", "kanji + katakana", "1 character + 2 characters", "3 characters", and "2 elements", and Table 1 of the feature storage unit 106 is searched in the same way. In this case, "1 character + 2 characters", "3 characters", and "2 elements" exist in Table 1, but "1 kanji character + 2 katakana characters" and "kanji + katakana" do not. In S506 these two features are therefore replaced with the Table 2 features of the same category, and the weights for "a combination of character counts and character types that belongs to category C1 but is not in Table 1" and "a combination of character types that belongs to category C2 but is not in Table 1" are acquired from Table 2. The weights acquired are thus 0.1 for the category C1 fallback feature, 0.0005 for the category C2 fallback feature, 0.21 for "1 character + 2 characters", 0.25 for "3 characters", and 0.40 for "2 elements".
Finally, in S508, the weights of the M features F_j of the character string S acquired by the above processing are output. This completes the weight derivation process in the weight derivation unit 103.
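The Table 1 lookup with Table 2 fallback (S503 to S506) can be sketched as follows. This is a minimal illustration, assuming that features are keyed by category and that Table 2 holds one fallback weight per category. The weights are the ones quoted in the examples above; the dictionary layout, the feature spellings, and the labels C3 to C5 are hypothetical.

```python
# Table 1: weights of known features, keyed by (category, feature).
TABLE1 = {
    ("C1", "1 alphabet+2 katakana"): 0.30,
    ("C2", "alphabet+katakana"): 0.20,
    ("C3", "1+2"): 0.21,
    ("C4", "3 chars"): 0.25,
    ("C5", "2 elements"): 0.40,
}

# Table 2: fallback weight for "belongs to this category but is not in Table 1".
# Only C1 and C2 fallback weights are given in the text.
TABLE2 = {"C1": 0.1, "C2": 0.0005}


def derive_weights(features):
    """Return one weight per feature, falling back to Table 2 by category."""
    weights = []
    for category, feature in features.items():
        key = (category, feature)
        if key in TABLE1:                      # S504: feature found in Table 1
            weights.append(TABLE1[key])        # S505: take its weight
        else:                                  # S506: switch to Table 2 feature
            weights.append(TABLE2[category])   # KeyError if no fallback is known
    return weights


asa_pita = {
    "C1": "1 kanji+2 katakana",
    "C2": "kanji+katakana",
    "C3": "1+2",
    "C4": "3 chars",
    "C5": "2 elements",
}
print(derive_weights(asa_pita))  # [0.1, 0.0005, 0.21, 0.25, 0.4]
```

The sketch deliberately raises a `KeyError` for categories without a stated fallback weight rather than inventing one.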

After the processing of S203 and S204 described above, in S205 the abbreviation estimation process is performed using the weights acquired in S204.
An example of the abbreviation estimation process in the abbreviation estimation calculation unit 104 is described in detail below with reference to the flowchart of FIG. 6.
First, in S601, the weights for the M features F_j of the character string S obtained by the weight derivation process of FIG. 5 are received as input. Next, in S602, the abbreviation-likeness of the character string S is calculated from the weights. One way to calculate it is, for example, to take the product of the weights. Finally, in S603, the abbreviation-likeness of the character string S obtained in S602 is output.
For example, for "Mスマ" (M-Suma), the product of the weights of the five features F_j, namely 0.30, 0.20, 0.21, 0.25, and 0.40, is taken, and 0.00126 is output as the abbreviation-likeness of "Mスマ". For "朝ピタ" (Asa-Pita), the product of the weights of the five features F_j, namely 0.1, 0.0005, 0.21, 0.25, and 0.40, is 0.00000105, which is output as the abbreviation-likeness of "朝ピタ".
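The arithmetic of S602 can be checked with a few lines of Python. The function name is hypothetical; the weights are the ones quoted in the two examples, and the product is only one possible scoring method, as the text itself notes.

```python
import math


def abbreviation_likeness(weights):
    """Score a candidate as the product of its feature weights (S602)."""
    return math.prod(weights)


m_suma = abbreviation_likeness([0.30, 0.20, 0.21, 0.25, 0.40])
asa_pita = abbreviation_likeness([0.1, 0.0005, 0.21, 0.25, 0.40])

print(f"{m_suma:.5f}")    # 0.00126
print(f"{asa_pita:.8f}")  # 0.00000105
```

Because one very small fallback weight (0.0005) enters the product, the second candidate scores roughly three orders of magnitude lower than the first.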

In S206, the output unit 105 outputs each abbreviation candidate to the outside in association with its abbreviation-likeness. This completes the operation of the abbreviation estimation apparatus 100.

According to the first embodiment described above, the features of an input character string are extracted and its abbreviation-likeness is estimated by referring to predetermined features and their corresponding weights, without using abbreviation generation rules or information about the original word. In addition, abbreviations whose generation requires notation conversion rules (for example, katakana ⇔ alphabet) can be compared in the same way as abbreviations that do not. Furthermore, even when the abbreviation generation rules are complex, or an abbreviation candidate results from a low-frequency rule, the estimate of abbreviation-likeness is not readily affected, and abbreviations can be estimated accurately even with a relatively small amount of learning data.

(First modification)
In the first embodiment, when splittable position information is not attached, the number of element words is unknown, so the element-count feature cannot be extracted. In this modification, when an abbreviation candidate given to the input unit 101 as a character string has no splittable position information attached, splittable position information is added so that abbreviation estimation can be performed.
The configuration of the abbreviation estimation apparatus according to the first modification is described in detail below with reference to FIG. 7.
The abbreviation estimation apparatus 700 according to the first modification includes, in addition to the components of the abbreviation estimation apparatus 100 according to the first embodiment, a word division unit 701 after the input unit 101. The word division unit 701 performs word division on the abbreviation candidate received from the input unit 101 and obtains at least one word division result. It then attaches the word division result to the abbreviation candidate as splittable position information and sends the abbreviation candidate, associated with the splittable position information, to the feature extraction unit 102. The output unit 105 in this modification outputs the abbreviation candidate, its abbreviation-likeness, and the division position information in association with one another.
For example, when the abbreviation candidate input to the input unit 101 is "Mスマ" (M-Suma), the word division unit 701 divides it at arbitrary division positions to obtain the division results "/Mスマ/", "/M/スマ/", and "/Mス/マ/". An abbreviation candidate consisting of a single character is not divided. The abbreviation estimation unit 107 then processes each of "/Mスマ/", "/M/スマ/", and "/Mス/マ/". When "/Mスマ/" is input to the feature extraction unit 102, features F_j such as "3 characters" and "1 element" are extracted; for "/Mス/マ/", features F_j such as "2 characters + 1 character" and "2 elements" are extracted. By applying the same processing as in the first embodiment to each division result, the output unit 105 outputs the abbreviation-likeness of "Mスマ" for each of the three division positions. When there are multiple division results for the input abbreviation candidate, all of them may be sent to the feature extraction unit 102, or one division result may be selected from among them and sent to the feature extraction unit 102.
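A word division step of this kind can be sketched as an enumeration of cut positions. The sketch below is an assumption about how such a unit might work, not the patent's method: the function name and the `max_cuts` parameter are hypothetical, and `max_cuts=1` is chosen only because it reproduces the three division results listed in the example.

```python
from itertools import combinations


def segmentations(candidate, max_cuts=1):
    """Enumerate segmentations of `candidate` that use at most `max_cuts` cuts."""
    n = len(candidate)
    results = []
    for k in range(max_cuts + 1):
        # Choose k cut positions among the n - 1 gaps between characters.
        for cuts in combinations(range(1, n), k):
            bounds = [0, *cuts, n]
            parts = [candidate[a:b] for a, b in zip(bounds, bounds[1:])]
            results.append("/" + "/".join(parts) + "/")
    return results


print(segmentations("Mスマ"))  # ['/Mスマ/', '/M/スマ/', '/Mス/マ/']
print(segmentations("M"))      # ['/M/']
```

A single-character candidate has no gaps to cut, so it comes back undivided, in line with the text. Each returned string could then be passed through feature extraction and scoring independently.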

According to the first modification described above, even when splittable position information is not attached to the input abbreviation candidate, splittable position information is generated by dividing the abbreviation candidate into words, and abbreviation estimation can then be performed on that basis in the same way as in the first embodiment.

(Second modification)
The first embodiment and the first modification described above assume that an abbreviation candidate is input as the character string. In this modification, when a formal name (original word) is input, a plurality of abbreviation candidates are generated and the abbreviation-likeness of each is output.
The configuration of the abbreviation estimation apparatus according to the second modification is described in detail below with reference to FIG. 8.
The abbreviation estimation apparatus 800 according to the second modification includes, in addition to the components of the abbreviation estimation apparatus 100 according to the first embodiment, an abbreviation generation unit 801 after the input unit 101. The abbreviation generation unit 801 generates at least one abbreviation candidate for the formal name received from the input unit 101 and sends the generated candidates to the feature extraction unit 102. The output unit 105 in this modification outputs the at least one abbreviation candidate together with the abbreviation-likeness corresponding to each candidate.

For example, assume that the formal name input to the input unit 101 is "ミュージックスマッシュ" (Music Smash) and that its splittable positions are "/ミュージック/スマッシュ/". When the abbreviation generation unit 801 generates, for this formal name, the abbreviation candidates with splittable positions "/ミュ/スマ/", "/M/スマ/", "/ミュ/S/", and "/M/S/", the abbreviation estimation unit 107 performs the abbreviation estimation process on each candidate, and the output unit 105 outputs the abbreviation candidates together with the abbreviation-likeness corresponding to each. When a formal name (original word) is input, the division positions of the abbreviation candidates may be determined by morphological analysis.
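One way to picture a candidate generator that yields these four candidates is to give each base word two options, a leading fragment in the original notation and the initial of its English spelling, and then take every combination. This is a hypothetical sketch only: the text does not specify how the abbreviation generation unit 801 works, and the `ENGLISH` table, the two-character prefix length, and the function names are assumptions.

```python
from itertools import product

# Hypothetical notation-conversion table (katakana word -> English spelling).
ENGLISH = {"ミュージック": "Music", "スマッシュ": "Smash"}


def element_options(base_word, prefix_len=2):
    """Possible element words for one base word."""
    options = [base_word[:prefix_len]]      # truncate in the original notation
    english = ENGLISH.get(base_word)
    if english:
        options.append(english[0].upper())  # initial after notation conversion
    return options


def generate_candidates(segmented_name):
    """Combine one option per base word into segmented abbreviation candidates."""
    base_words = [w for w in segmented_name.split("/") if w]
    combos = product(*(element_options(w) for w in base_words))
    return ["/" + "/".join(c) + "/" for c in combos]


print(generate_candidates("/ミュージック/スマッシュ/"))
# ['/ミュ/スマ/', '/ミュ/S/', '/M/スマ/', '/M/S/']
```

Since the splittable positions are carried along with each candidate, the output can be fed directly into the feature extraction step without a separate word division pass.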

According to the second modification described above, even when a formal name (original word) is input as the character string, a plurality of abbreviation candidates can be generated and the abbreviation-likeness of each can be output, so abbreviation estimation can be performed in the same way as in the first embodiment.

(Third modification)
In the third modification, when a plurality of abbreviation candidates are input, only those with high abbreviation-likeness can be selected and output. In addition, the candidates can be narrowed down to frequently used ones by referring to external document data or the like.
The configuration of the abbreviation estimation apparatus according to the third modification is described in detail below with reference to FIG. 9.
The abbreviation estimation apparatus 900 according to the third modification includes, in addition to the components of the abbreviation estimation apparatus 100 according to the first embodiment, an abbreviation candidate selection unit 901 after the abbreviation estimation calculation unit 104. The abbreviation candidate selection unit 901 is also connected to an external frequency calculation unit 902. The abbreviation candidate selection unit 901 selects, from the plurality of abbreviation candidates obtained from the abbreviation estimation calculation unit 104, candidates in descending order of abbreviation-likeness, that is, in descending order of abbreviation candidate rank, and sends them to the output unit 105. Alternatively, it acquires frequency information in document data from the frequency calculation unit 902 described below, selects abbreviation candidates in descending order of frequency, and sends them to the output unit 105. The frequency information indicates the number of times an abbreviation candidate appears in the document data. Finally, the output unit 105 in this modification outputs the abbreviation candidates together with the abbreviation-likeness corresponding to each.
The frequency calculation unit 902 measures how frequently an abbreviation candidate is used. The frequency information may be the frequency of appearance in external document data, or the number of search hits for the character string in an external database such as the web. The calculated frequency information is sent to the abbreviation candidate selection unit 901.

As an example, assume that the abbreviation estimation calculation unit 104 has calculated, for the abbreviation candidates "/ミュ/スマ/", "/M/スマ/", "/ミュ/S/", and "/M/S/", abbreviation-likeness values of 0.1, 0.2, 0.01, and 0.15, respectively. When the abbreviation candidate selection unit 901 selects the top two, "Mスマ" and "MS" are output. Alternatively, for "Mスマ" and "MS", the frequency calculation unit 902 counts the occurrences of the strings "Mスマ" and "MS" themselves in documents, and the candidates with the highest counts (for example, the top 30) are selected. Furthermore, a character pattern such as "が(*)出演" ("... appears on (*)") may be placed around a candidate so that the number of occurrences of "がMスマ出演" or "がMS出演" in documents is used, or patterns such as "(*)は、" and "(*)が" may be used to search for strings in which the candidate can be a subject, such as "Mスマは、" or "MSが、". The abbreviation candidates that satisfy a condition (for example, appearing at least once) are then selected.
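The two selection strategies can be sketched in Python as a top-k selection by score followed by an optional frequency filter. The scores are the ones from the example; the function names are hypothetical, and the occurrence counts are invented placeholders, not measured data.

```python
def select_top(scores, k=2):
    """Return the k candidates with the highest abbreviation-likeness."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def filter_by_frequency(candidates, counts, min_count=1):
    """Keep candidates whose occurrence count meets the threshold."""
    return [c for c in candidates if counts.get(c, 0) >= min_count]


scores = {"/ミュ/スマ/": 0.1, "/M/スマ/": 0.2, "/ミュ/S/": 0.01, "/M/S/": 0.15}
top = select_top(scores)
print(top)  # ['/M/スマ/', '/M/S/']

# Placeholder counts standing in for what a frequency calculation unit
# might report; the numbers are made up for illustration.
counts = {"/M/スマ/": 12, "/M/S/": 0}
print(filter_by_frequency(top, counts))  # ['/M/スマ/']
```

In a real setting the counts would come from a document collection or a search backend, possibly queried with context patterns like the ones quoted above.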

According to the third modification described above, candidates with high abbreviation-likeness can be selected from a plurality of abbreviation candidates and output. In addition, by referring to frequency information such as the frequency in external document data or the number of web hits, the candidates can be narrowed down to those used more frequently, which improves the accuracy of the abbreviation-likeness estimate.

(Fourth modification)
The embodiment and modifications described above assume that predetermined values are used as the weights of the features stored in the feature storage unit 106. In the fourth modification, the weights of the features stored in the feature storage unit 106 are updated from learning data, which improves the accuracy of the weights.
The configuration of the abbreviation estimation apparatus according to the fourth modification is described in detail below with reference to FIG. 10.
The abbreviation estimation apparatus 1000 according to the fourth modification includes, in addition to the components of the abbreviation estimation apparatus 100 according to the first embodiment, a learning data input unit 1001, an abbreviation generation unit 801, a learning data feature extraction unit 1002, and a weight learning unit 1003. The units from the learning data input unit 1001 through the weight learning unit 1003 are collectively called a learning unit 1004, which is connected to the feature storage unit 106. The abbreviation generation unit 801 operates in the same way as the abbreviation generation unit 801 of the second modification, so its description is omitted here.
The learning data input unit 1001 accepts a known correct abbreviation and its formal name as input and sends them to the abbreviation generation unit 801.
The learning data feature extraction unit 1002 extracts features from each abbreviation candidate generated by the abbreviation generation unit 801 and, when a candidate is the correct abbreviation corresponding to the formal name input through the learning data input unit 1001, stores the features of that abbreviation in the feature storage unit 106. After storage in the feature storage unit 106 has been completed for all abbreviation candidates, it sends the features of all abbreviation candidates, together with information on which of them is the correct abbreviation, to the weight learning unit 1003.

The weight learning unit 1003 uses as learning data the features of each abbreviation candidate received from the learning data feature extraction unit 1002, associated with the known correct abbreviation. When a feature in the feature storage unit 106 matches a feature of the learning data, the weight of that feature in the feature storage unit 106 is learned (updated). For example, the weight of each feature is learned with a log-linear model.
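The text names a log-linear model but gives no training details, so the following is only a generic sketch of log-linear learning by gradient ascent on the conditional log-likelihood of the correct candidate. The function name, the hyperparameters, and the toy feature strings are assumptions. One observation that may help connect it to the scoring step: the unnormalized log-linear score exp(Σ λ_j) equals the product of per-feature factors exp(λ_j), which has the same multiplicative form as a product of weights.

```python
import math
from collections import defaultdict


def train_log_linear(examples, epochs=200, lr=0.5):
    """Learn one parameter per feature.

    examples: list of (candidates, gold_index); each candidate is a set of
    feature strings, and gold_index marks the known correct abbreviation.
    """
    lam = defaultdict(float)
    for _ in range(epochs):
        grad = defaultdict(float)
        for candidates, gold in examples:
            scores = [sum(lam[f] for f in feats) for feats in candidates]
            top = max(scores)                        # subtract max for stability
            exps = [math.exp(s - top) for s in scores]
            z = sum(exps)
            for i, feats in enumerate(candidates):
                p = exps[i] / z                      # model probability
                for f in feats:
                    # gradient = observed count minus expected count
                    grad[f] += (1.0 if i == gold else 0.0) - p
        for f, g in grad.items():
            lam[f] += lr * g
    return dict(lam)


# Toy learning data: four candidates for one formal name, index 1 is correct.
candidates = [
    {"katakana+katakana", "2+2"},   # /ミュ/スマ/
    {"alphabet+katakana", "1+2"},   # /M/スマ/  (known correct abbreviation)
    {"katakana+alphabet", "2+1"},   # /ミュ/S/
    {"alphabet+alphabet", "1+1"},   # /M/S/
]
lam = train_log_linear([(candidates, 1)])
print(sorted(lam, key=lam.get, reverse=True)[:2])
```

With realistic learning data, many formal names would contribute candidate sets, and shared features such as "2 elements" would receive weights reflecting how often they occur in correct abbreviations.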

According to the fourth modification described above, the weights of the features stored in the feature storage unit 106 are learned (updated) from the features of abbreviation candidates using known abbreviations and their formal names, which improves the accuracy of the weights. Moreover, even when abbreviation generation rules are fixed in advance, abbreviations can be estimated from features that differ from those rules, and the features used for abbreviation estimation can be learned.

The present invention is not limited to the embodiments described above as they are; at the implementation stage, the constituent elements can be modified and embodied without departing from the gist of the invention. Various inventions can also be formed by appropriately combining the constituent elements disclosed in the embodiments. For example, some constituent elements may be removed from the full set shown in an embodiment, and constituent elements from different embodiments may be combined as appropriate.

100, 700, 800, 900, 1000, 1100, 1200 ... abbreviation estimation apparatus, 101 ... input unit, 102 ... feature extraction unit, 103 ... weight derivation unit, 104 ... abbreviation estimation calculation unit, 105 ... output unit, 106 ... feature storage unit, 107 ... abbreviation estimation unit, 701 ... word division unit, 801 ... abbreviation generation unit, 901 ... abbreviation candidate selection unit, 902 ... frequency calculation unit, 1001 ... learning data input unit, 1002 ... learning data feature extraction unit, 1003 ... weight learning unit, 1004 ... learning unit.

Claims

An abbreviation estimation apparatus comprising:
first acquisition means for acquiring a character string that can be a candidate for an abbreviation and splittable position information for dividing the character string;
first calculation means for calculating a plurality of first features, which indicate features of the character string, derived from the character string and the splittable position information of the character string;
storage means for storing a second feature belonging to a desired category among categories that are indices indicating the concept of a feature, and a likelihood corresponding to the second feature;
second acquisition means for comparing the plurality of first features with the second feature for each category and acquiring the likelihood corresponding to an identical feature from the storage means; and
second calculation means for calculating, from the likelihood acquired by the second acquisition means, an abbreviation candidate rank, which is a rank indicating the correctness of the character string as an abbreviation of a formal name.

An abbreviation estimation apparatus comprising:
first acquisition means for acquiring a character string that can be a candidate for an abbreviation;
division means for dividing the character string between characters and obtaining splittable position information for dividing the character string;
first calculation means for calculating a plurality of first features, which indicate features of the character string, derived from the character string and the splittable position information;
storage means for storing a second feature belonging to a desired category among categories that are indices indicating the concept of a feature, and a likelihood corresponding to the second feature;
second acquisition means for comparing the plurality of first features with the second feature for each category and acquiring the likelihood corresponding to an identical feature from the storage means; and
second calculation means for calculating, from the likelihood acquired by the second acquisition means, an abbreviation candidate rank, which is a rank indicating the correctness of the character string as an abbreviation of a formal name.

An abbreviation estimation apparatus comprising:
first acquisition means for acquiring a character string as a formal name;
generation means for generating, from the character string, at least one abbreviation candidate, and generating at least one piece of splittable position information for dividing the abbreviation candidate;
first calculation means for calculating a plurality of first features, which indicate features of the character string, derived from the abbreviation candidate and the splittable position information of the abbreviation candidate;
storage means for storing a second feature belonging to a desired category among categories that are indices indicating the concept of a feature, and a likelihood corresponding to the second feature;
second acquisition means for comparing the plurality of first features with the second feature for each category and acquiring the likelihood corresponding to an identical feature from the storage means; and
second calculation means for calculating, from the likelihood acquired by the second acquisition means, an abbreviation candidate rank, which is a rank indicating the correctness of the abbreviation candidate as an abbreviation.

The abbreviation estimation apparatus according to any one of claims 1 to 3, wherein the category is at least one of: the number of characters of the abbreviation candidate; the number of element words, an element word being a unit of the character string divided according to the splittable position information; a combination of the character types and the numbers of characters of the element words; a combination of the character types of the element words; and a combination of the numbers of characters of the element words.

The abbreviation estimation apparatus according to any one of claims 1 to 4, further comprising a selection unit that selects a plurality of abbreviation candidates in descending order of the abbreviation candidate rank.

The abbreviation estimation apparatus according to claim 5, further comprising frequency calculation means for calculating frequency information, which indicates how often the abbreviation candidate occurs, from document data in an external document or an external database,
wherein the selection means selects the abbreviation candidates in descending order of the frequency information.

The abbreviation estimation apparatus according to any one of claims 1 to 6, wherein the likelihood is a weight indicating a degree of whether the feature is a correct element as an abbreviation of a formal name.

The abbreviation estimation apparatus according to any one of the above claims, wherein the second acquisition means compares the first feature with the second feature and acquires a predetermined likelihood if no identical feature exists.

The abbreviation estimation apparatus according to any one of claims 1 to 8, further comprising:
acquisition means for acquiring a formal name and a correct abbreviation corresponding to the formal name;
abbreviation candidate generation means for generating a plurality of abbreviation candidates from the formal name;
extraction means for extracting features of the plurality of abbreviation candidates, comparing the features of the correct abbreviation corresponding to the formal name with the features of the abbreviation candidates, and, when an identical feature exists, extracting that feature and storing it in the storage means; and
learning means for comparing the features of the abbreviation candidates with the features included in the storage means, rewriting the features of the abbreviation candidates to the features included in the storage means, and, using learning data that associates the features of the plurality of abbreviation candidates with information on which candidate is the correct abbreviation, updating the likelihood of a feature included in the storage means when that feature matches a feature of the learning data.

An abbreviation estimation method, wherein:
first acquisition means acquires a character string that can be a candidate for an abbreviation and splittable position information for dividing the character string;
first calculation means calculates a plurality of first features, which indicate features of the character string, derived from the character string and the splittable position information of the character string;
storage means stores a second feature belonging to a desired category among categories that are indices indicating the concept of a feature, and a likelihood corresponding to the second feature;
second acquisition means compares the plurality of first features with the second feature for each category and acquires the likelihood corresponding to an identical feature from the storage means; and
second calculation means calculates, from the acquired likelihood, an abbreviation candidate rank, which is a rank indicating the correctness of the character string as an abbreviation of a formal name.

An abbreviation estimation method, wherein:
first acquisition means acquires a character string that can be a candidate for an abbreviation;
division means divides the character string between characters to obtain splittable position information for dividing the character string;
first calculation means calculates a plurality of first features, which indicate features of the character string, derived from the character string and the splittable position information;
storage means stores a second feature belonging to a desired category among categories that are indices indicating the concept of a feature, and a likelihood corresponding to the second feature;
second acquisition means compares the plurality of first features with the second feature for each category and acquires the likelihood corresponding to an identical feature from the storage means; and
second calculation means calculates, from the acquired likelihood, an abbreviation candidate rank, which is a rank indicating the correctness of the character string as an abbreviation of a formal name.

An abbreviation estimation method, wherein:
first acquisition means acquires a character string as a formal name;
generation means generates, from the character string, at least one abbreviation candidate, and generates at least one piece of splittable position information for dividing the abbreviation candidate;
first calculation means calculates a plurality of first features, which indicate features of the character string, derived from the abbreviation candidate and the splittable position information of the abbreviation candidate;
storage means stores a second feature belonging to a desired category among categories that are indices indicating the concept of a feature, and a likelihood corresponding to the second feature;
second acquisition means compares the plurality of first features with the second feature for each category and acquires the likelihood corresponding to an identical feature from the storage means; and
second calculation means calculates an abbreviation candidate rank, which is a rank indicating the correctness of the abbreviation candidate as an abbreviation of the formal name.