WO2013121988A1 - Abbreviation generating system - Google Patents

Abbreviation generating system

Info

Publication number
WO2013121988A1
Authority
WO
WIPO (PCT)
Prior art keywords
abbreviation
word
words
abbreviations
generation
Application number
PCT/JP2013/052968
Other languages
French (fr)
Japanese (ja)
Inventor
石川 開
正明 土田
貴士 大西
早人 山名
孝徳 及川
Original Assignee
日本電気株式会社
学校法人早稲田大学
Application filed by 日本電気株式会社, 学校法人早稲田大学
Priority to JP2013558668A (patent JP6135867B2)
Publication of WO2013121988A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/30: Semantic analysis

Abstract

A system for generating abbreviations from the character string of an original word includes at least: an importance rule storage unit that stores each given word in association with an index indicating the degree to which that word is used in generating abbreviations, the index being derived so as to reflect the body of information used within a community; an important word selection unit that, for a received original word composed of multiple words, compares the indices of those words against one another to rank them and thereby selects, in order of priority, the words to use for abbreviation generation; and an abbreviation candidate generation unit that generates abbreviation candidates from the selected words and outputs them. As a result, abbreviations for the names of products and services that conform to the usage of the community are obtained.

Description

Abbreviation generation system
The present invention relates to an abbreviation generation system, an abbreviation generation method, and an abbreviation generation program that generate an abbreviation from an input original word by information processing.
In everyday use, names, functions, products, and the like are referred to by abbreviations formed from one part of the formal name or from a combination of its parts. Abbreviations of personal names and organizations are also called short names and are treated in the same way.
Associating an abbreviation with the original word it derives from by means of an information processing system is useful in many natural-language applications, such as name identification, information retrieval, and information extraction.
It is possible to collect correspondences between abbreviations and original words by hand and build machine-readable dictionary data, and this is done in practice. However, as new products, services, works, and organizations appear, communities spontaneously coin abbreviations one after another, so there is a limit to how accurately these can be collected by hand. Building correct dictionary data for names in particular is labor-intensive and difficult. For this reason, correspondence dictionary data is today created and updated as needed automatically by information processing systems from corpora, the Web, and similar sources.
Methods have also been proposed that accept a formal name as the original word and automatically generate abbreviation candidates by information processing. One example of such an abbreviation generation method is described in Non-Patent Document 1.
The automatic abbreviation estimation method described in Non-Patent Document 1 proposes extracting plausible abbreviation candidates produced by a probabilistic model and then narrowing them down using information on the Web. The narrowing is done by verifying, for each candidate, whether the original word and the candidate are synonymous. The probabilistic model adopted is the noisy-channel model. Non-Patent Document 1 also describes existing techniques such as conversion rules and desirable ways of presenting candidates.
Abbreviations used within a community often favor expressions that preserve the important content, sentiment, and characteristics of the original word while being short enough to remain distinguishable from other abbreviations and carrying little redundancy.
In a method such as that of Non-Patent Document 1, however, the morphemes used for the abbreviation are selected from the character-string type and the positions of morae relative to the original word, so the semantic content of each morpheme and the relationships between morphemes are not taken into account. As a result, depending on the pattern of the original word, a candidate different from the abbreviation the community would actually produce may be preferentially selected. In other words, such a method cannot fully generate abbreviation candidates that fit the community the user has in mind.
Furthermore, because the candidates obtained by the method of Non-Patent Document 1 are verified against existing Web information, they must already be in use on the Internet. The method therefore cannot be applied to new abbreviations or to fields that are not represented on the Internet.
The present invention provides an abbreviation generation system that accurately generates, from an original word, abbreviations that are likely to arise in a community.
An abbreviation generation system according to the present invention, which generates an abbreviation from the character string of an original word, comprises: an importance rule storage unit that stores a given word in association with an index indicating the degree to which that word is used in generating abbreviations, the association being made so as to reflect the body of information used within a community; an important word selection unit that, for a received original word composed of a plurality of words, compares the indices of those words against one another to order them and thereby selects, in order of priority, the words to be used for generating an abbreviation; and an abbreviation candidate generation unit that generates abbreviation candidates using the selected words and outputs those candidates.
According to the present invention, it is possible to provide an abbreviation generation system that accurately generates, from an original word, abbreviations that are likely to arise in a community.
FIG. 1 is a block diagram showing the system configuration of an embodiment of the present invention.
FIG. 2 is a flowchart showing an example of the processing operation in the embodiment.
FIG. 3 is a schematic diagram showing an example presentation of abbreviation candidates.
FIG. 4 is a configuration diagram showing an example implementation of the present invention.
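An embodiment of the present invention will be described with reference to FIGS. 1 to 4. The embodiment describes processing in which morphological analysis is performed on a received original word composed of a plurality of words, and the analysis result is used to generate abbreviations.
Referring to FIG. 1, the abbreviation generation system of the embodiment comprises an input device 1, a data processing device 2, a storage device 3, and an output device 4. The input device 1 receives from the user the original word, the desired number of characters in the abbreviation, the number of abbreviation candidates to display, and so on. The output device 4 presents the generated abbreviations to the user.
The data processing device 2 includes an important morpheme selection unit 20 and an abbreviation candidate generation unit 21.
The important morpheme selection unit 20 performs morphological analysis on the original word entered through the input device 1 and selects the morphemes to use in the abbreviation on the basis of the indices, stored in the morpheme importance rule storage unit 30, that express importance according to the content of each morpheme.
The abbreviation candidate generation unit 21 converts the character string of each selected morpheme according to the conversion rules stored in the morpheme conversion rule storage unit 31 and generates the abbreviation candidates to present to the user.
The storage device 3 includes the morpheme importance rule storage unit 30 and the morpheme conversion rule storage unit 31, which hold the rules used in the processing of the data processing device 2.
The morpheme importance rule storage unit 30 stores rules for quantifying the importance of morphemes for morpheme selection; these are created as indices from the body of information used within the community. In other words, for each morpheme, the storage unit 30 holds an index indicating the degree to which that morpheme is used in generating abbreviations within the community.
These morpheme importance rules form a set of indices built by collecting and analyzing abbreviations and original words drawn from information that has been used within the community. The rules can be built from manually created data, or from data obtained by extracting the original words of abbreviations used in the community, and their synonyms, from a corpus or from an abbreviation database that records at least pairs of original words and abbreviations.
The information used within the community may be text or audio used in the community, for example a text corpus or speech corpus based on documents created by the community or on the vocabulary system used within it.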
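As the index recorded in the morpheme importance rule storage unit 30, one can use, for each combination of morphemes, an index indicating which of the morphemes in the combination is relatively more likely to be used in generating an abbreviation.
Alternatively, for a combination of several morphemes, one can use an index indicating which morpheme or group of morphemes is relatively more likely to be used in generating an abbreviation.
A per-morpheme value reflecting how often the morpheme has been adopted into abbreviations can also be used as the index.
These indices can also be used in combination.
The morpheme conversion rule storage unit 31 stores rules for converting each morpheme into a character string for abbreviation generation. These conversion rules are preferably determined by collecting and analyzing the conversion rules that have actually been used, based on information used within the community.
Examples of conversion rules include "take the first character of the morpheme", "take the first character of the first morpheme and the first two characters of the second morpheme", "reduce voiced sounds", "drop long vowels", "take the initials of the English translation", and "leave specific morphemes unconverted". Various existing conversion rules may be used. When there are multiple conversion rules, a candidate is generated for each combination of rule applications.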
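Next, the operation of the embodiment is described with a concrete processing example, referring to the flowchart in FIG. 2. The input original word is 「防災科学技術研究所」 (literally, "disaster prevention science and technology research institute").
The abbreviation generation system receives, through the input device 1, the original word for which an abbreviation is requested (step S1). Conditions desired by the user may also be accepted at this point.
Next, the important morpheme selection unit 20 performs morphological analysis on the received original word in order to select the morphemes used for abbreviation generation (step S2).
For example, 「防災科学技術研究所」 is divided into the morphemes 「防災/科学/技術/研究/所」 (disaster prevention / science / technology / research / institute). This step can be omitted if the input is received from the user already divided into words (for example, 「防災/科学/技術/研究所」). Several alternative segmentations may also be selected and the subsequent processing carried out on them in parallel.
Next, the important morpheme selection unit 20 refers to the morpheme importance rule storage unit 30, calculates an importance for each morpheme according to its content, and selects the morphemes to use in the abbreviation on the basis of that importance (step S3).
In this example, morphemes are handled in pairs: the two morphemes in each pair are compared, a score is computed from the probability that one is adopted into abbreviations in preference to the other, and the resulting score is taken as the importance by which the morphemes to prioritize are selected.
Next, the abbreviation candidate generation unit 21 refers to the morpheme conversion rule storage unit 31 and applies the morpheme conversion rules (rules for converting and combining character strings) to the selected morphemes to generate abbreviation candidates (step S4).
For example, applying the rule "take the first character of the morpheme" to 「防災」, 「科学」, and 「研究」, which had high importance, and combining the results yields 「防科研」. When there are multiple conversion rules, one or more candidates may be generated for each combination of rule applications. The conversion rules may be selected directly by the user, or determined by the system from, for example, the number of characters the user entered. They may also be selected automatically by lexical analysis of the original word. It is better still if the system makes the selection so as to reflect the information that has been used in the community. All conversion rules may be applied, or the user may be asked for the number of rules to apply, with the system adjusting accordingly at presentation time.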
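The candidate generation of step S4 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names (`first_char`, `keep_whole`, `generate_candidates`) are hypothetical, and it assumes each rule is applied to one morpheme at a time, with one candidate produced per combination of rules.

```python
from itertools import product

# Hypothetical rule functions; the names are illustrative, not from the source.
def first_char(morpheme: str) -> str:
    """Rule: take the first character of the morpheme."""
    return morpheme[:1]

def keep_whole(morpheme: str) -> str:
    """Rule: leave the morpheme unconverted."""
    return morpheme

def generate_candidates(morphemes, rules):
    """Apply every combination of rules (one rule per morpheme) and join the results."""
    candidates = []
    for combo in product(rules, repeat=len(morphemes)):
        candidate = "".join(rule(m) for rule, m in zip(combo, morphemes))
        if candidate not in candidates:  # drop duplicates, keep first-seen order
            candidates.append(candidate)
    return candidates

selected = ["防災", "科学", "研究"]
print(generate_candidates(selected, [first_char]))              # ['防科研']
print(generate_candidates(selected, [first_char, keep_whole]))  # 8 candidates
```

With a single rule the sketch reproduces the 「防科研」 example. With two rules and three morphemes it enumerates all 2^3 = 8 combinations, which is why a real system would likely need some way to prune candidates.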
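Next, the abbreviation candidate generation unit 21 presents the one or more generated abbreviation candidates to the user through the output device 4 (step S6). FIG. 3 shows an example of the screen presented to the user at this point.
The candidates may be presented as required, drawing together factors such as the number of candidates specified earlier by the user, the scores obtained in the processing above, scores based on the co-occurrence probability with the original word, and the degree of match with a character string received from the user describing their sentiments about the abbreviation or the original word.
It is also desirable to show the user the correspondence between the characters of each abbreviation candidate and the characters of the original word. In FIG. 3, the relationship between the original word and the abbreviation as character strings is shown visually only for abbreviation candidate 1, the one judged most suited to the community. The display may instead show this relationship for whichever abbreviation candidate the user selects.
A free-text field may also be provided on the presentation screen, and the order in which the generated candidates are presented may be changed by adjusting their scores using the text entered in that field. The field may, for example, accept "sentiments" and "priorities" separately and assign different processing to each. The text may also be accepted at the same time as the initial original-word input.
This adjustment may be done by identifying the words in the entered text, or similar words, quantifying how well they match the words used in each generated abbreviation, and adding points to the candidates that score highly. In this way the "sentiments" and "priorities" can be reflected in the presentation order.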
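One way to picture this score adjustment is the sketch below. It rests on assumptions not in the source: the function name `adjust_scores`, the fixed `bonus` per matching word, and the example scores are all made up, and only exact word matches are counted (similar-word matching is omitted).

```python
def adjust_scores(candidates, free_text_words, bonus=0.5):
    """Re-rank candidates by adding a bonus for each source morpheme the user mentioned.

    candidates: list of (abbreviation, base_score, source_morphemes).
    free_text_words: set of words identified in the free-text field.
    """
    adjusted = []
    for abbreviation, score, morphemes in candidates:
        matches = sum(1 for m in morphemes if m in free_text_words)
        adjusted.append((abbreviation, score + bonus * matches))
    # Highest adjusted score first.
    return sorted(adjusted, key=lambda pair: pair[1], reverse=True)

# Illustrative scores only.
candidates = [
    ("防科研", 1.0, ["防災", "科学", "研究"]),
    ("防技研", 0.8, ["防災", "技術", "研究"]),
]
ranked = adjust_scores(candidates, {"技術"})
print([abbreviation for abbreviation, _ in ranked])  # ['防技研', '防科研']
```

Here a user who mentions 「技術」 in the free-text field moves 「防技研」 ahead of 「防科研」, even though its base score was lower.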
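The conversion rules may also be selected on the basis of the text entered in the free-text field.
When a reliability is assigned to each rule, abbreviation candidates generated by combinations of low-reliability rules may be withheld from output.
One possible method is to take the product of the rules' reliabilities and suppress the candidate if the product is at or below a threshold.
Each generated abbreviation may also be scored using the reliability of the rules and the importance of the morphemes, and the candidates output together with their scores.
Such rules and reliabilities may be created by hand, or values collected by existing techniques may be used.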
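The reliability-product filter can be sketched as below. The function name `filter_by_reliability`, the reliability values, and the threshold are assumptions chosen for illustration; the source specifies only "product of reliabilities" and "threshold".

```python
from math import prod

def filter_by_reliability(candidates, threshold):
    """Keep candidates whose combined rule reliability exceeds the threshold.

    candidates: list of (abbreviation, [reliability of each applied rule]).
    Returns a list of (abbreviation, combined_reliability).
    """
    kept = []
    for abbreviation, reliabilities in candidates:
        combined = prod(reliabilities)  # product of the rules' reliabilities
        if combined > threshold:        # suppressed when at or below the threshold
            kept.append((abbreviation, combined))
    return kept

# Illustrative reliabilities only.
candidates = [
    ("防科研", [0.9, 0.9, 0.9]),    # product 0.729
    ("防災科研", [0.4, 0.9, 0.9]),  # product 0.324
]
print(filter_by_reliability(candidates, threshold=0.5))
```

Because the product shrinks with every rule applied, a fixed threshold tends to penalize candidates built from many rule applications; whether that is desirable depends on how the reliabilities are calibrated.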
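The importance rules used to select the words to prioritize are now described. The morpheme importance rules described below are indices, determined from information used within the community, that express the probability that a particular morpheme should be given priority relative to another morpheme. In other words, they express the relative probability, estimated from information obtained in the community, that one morpheme rather than another ends up retained within the community.
The morpheme importance rules are defined, for example, as follows.
・「防災 > 科学: 0.7」 (防災 is retained over 科学 with a probability of 70%)
・「防災 > 技術: 0.7」
・「防災 > 研究: 0.5」
・「防災 > 所: 0.9」
・「科学 > 技術: 0.6」
・「科学 > 所: 0.9」
・「技術 > 所: 0.9」
・「研究 > 科学: 0.7」
・「研究 > 技術: 0.6」
・「研究 > 所: 0.9」
The probability of the rule in the reverse direction may be taken as 1 minus the probability of that rule. For example, the reverse of the first rule is 「科学 > 防災: 0.3 (= 1.0 − 0.7)」. If the order in which the words appear is to be taken into account, the probabilities of the reverse-direction rules can be indexed separately as well.
In this example, the importance of a morpheme is calculated with these rules as the sum of the values obtained by comparing it with each of the remaining morphemes.
For example, the importance of the morpheme 「防災」, compared according to the content of its original word (防災科学技術研究所), is 2.8. This is the sum of 0.7 [防災 vs. 科学], 0.7 [防災 vs. 技術], 0.5 [防災 vs. 研究], and 0.9 [防災 vs. 所].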
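The sum described above can be written compactly. The notation is introduced here for illustration and does not appear in the source: $M$ is the set of morphemes in the original word, $P(m \succ m')$ is the stored probability that $m$ is retained in preference to $m'$, and $I(m)$ is the importance of $m$.

```latex
I(m) = \sum_{m' \in M,\; m' \neq m} P(m \succ m'),
\qquad
P(m' \succ m) = 1 - P(m \succ m') \quad \text{(when only one direction is stored)}

I(\text{防災}) = 0.7 + 0.7 + 0.5 + 0.9 = 2.8
```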
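The important morpheme selection unit 20 performs the same calculation for every morpheme in the original word to obtain the importance of each. The values are as follows.
・「防災」 2.8 (= 0.7 + 0.7 + 0.5 + 0.9)
・「科学」 2.1 (= 0.3 + 0.6 + 0.3 + 0.9)
・「技術」 1.9 (= 0.3 + 0.4 + 0.3 + 0.9)
・「研究」 2.7 (= 0.5 + 0.7 + 0.6 + 0.9)
・「所」 0.4 (= 0.1 + 0.1 + 0.1 + 0.1)
For example, if three words (morphemes) are to be kept, the three with the highest values are selected, namely 「防災」, 「科学」, and 「研究」; if two words are kept, they are 「防災」 and 「研究」. The number of words to select is thus arbitrary, but the selection may be based on a threshold on importance or on rank. Alternatively, all words may be kept as candidates and the adjustment left to the abbreviation candidate generation unit 21.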
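The importance calculation and selection of step S3 can be sketched as follows, using the pairwise rules listed earlier. The names `RULES`, `preference`, `importance`, and `select_top` are hypothetical, and the sketch assumes the reverse direction of a rule is always 1 minus the stored value. Note that with the rules exactly as listed, the sketch yields 2.0 for 「技術」 rather than the 1.9 given in the text, because 「研究 > 技術: 0.6」 implies 0.4 for the reverse comparison; the resulting ranking is the same either way.

```python
# Pairwise rules from the example: (a, b) -> probability that a is retained over b.
RULES = {
    ("防災", "科学"): 0.7, ("防災", "技術"): 0.7, ("防災", "研究"): 0.5,
    ("防災", "所"): 0.9, ("科学", "技術"): 0.6, ("科学", "所"): 0.9,
    ("技術", "所"): 0.9, ("研究", "科学"): 0.7, ("研究", "技術"): 0.6,
    ("研究", "所"): 0.9,
}

def preference(a, b, rules=RULES):
    """Probability that a is retained over b; the reverse direction is 1 - p."""
    if (a, b) in rules:
        return rules[(a, b)]
    return 1.0 - rules[(b, a)]

def importance(morphemes, rules=RULES):
    """Importance of each morpheme: sum of its preferences over all the others."""
    return {
        m: sum(preference(m, other, rules) for other in morphemes if other != m)
        for m in morphemes
    }

def select_top(morphemes, k, rules=RULES):
    """Pick the k most important morphemes, returned in original word order."""
    scores = importance(morphemes, rules)
    chosen = set(sorted(morphemes, key=lambda m: scores[m], reverse=True)[:k])
    return [m for m in morphemes if m in chosen]

morphemes = ["防災", "科学", "技術", "研究", "所"]
print({m: round(s, 1) for m, s in importance(morphemes).items()})
print(select_top(morphemes, 3))  # ['防災', '科学', '研究']
print(select_top(morphemes, 2))  # ['防災', '研究']
```

Returning the selected morphemes in original word order keeps the later conversion step simple, since the abbreviation characters are then concatenated in the order they appear in the original word.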
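In this way, this example calculates the score for using each morpheme in an abbreviation candidate from the probability that one member of a pair of morphemes is retained in preference to the other. The semantic content of each morpheme and the relationships between morphemes are thereby taken into account from a statistical standpoint derived from examples of abbreviations actually used in the community, which leads to good candidates.
It is then desirable to convert the character strings of the selected morphemes according to conversion rules based on previously collected examples of abbreviations used in the community. This makes it possible to automatically generate, with still greater accuracy, abbreviations that are likely to arise in the community.
This example showed importance calculated from comparisons between pairs of morphemes, but the invention is not limited to this. For example, the importance of a single morpheme may be used, or comparisons among three or more morphemes.
For single-morpheme importance, any measure that quantifies the importance of a word, such as TF-IDF, can be used, as can a per-word value reflecting how often the word is adopted into abbreviations.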
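As a sketch of the single-morpheme alternative, the following computes a basic TF-IDF score. The source names TF-IDF only as one possible measure; the particular variant (term frequency normalized by document length, IDF as log(N/df)), the function name `tf_idf`, and the toy corpus are assumptions.

```python
import math
from collections import Counter

def tf_idf(term, document, corpus):
    """Basic TF-IDF of a term in one tokenized document.

    document: list of tokens; corpus: list of token lists (including document).
    """
    tf = Counter(document)[term] / len(document)
    df = sum(1 for doc in corpus if term in doc)  # documents containing the term
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

# Toy corpus of already-segmented names, made up for illustration.
corpus = [
    ["防災", "科学", "技術", "研究", "所"],
    ["科学", "研究", "所"],
    ["技術", "研究", "所"],
]
for term in corpus[0]:
    print(term, round(tf_idf(term, corpus[0], corpus), 3))
```

In this toy corpus 「所」 appears in every document, so its IDF and therefore its score is zero, while 「防災」, which appears only once, scores highest. This mirrors the intuition that generic morphemes contribute little to an abbreviation.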
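With three or more morphemes, comparisons among multiple morphemes can be handled in the same way as comparisons between two, for example 「研究 > 技術, 所: 0.8」 or 「技術 > 研究, 所: 0.5」. An index may also be used that indicates, for a word or combination of words, how readily it is used in generating abbreviations relative to another word or combination of words, as in 「技術, 研究 > 所: 0.9」 or 「防災, 技術 > 研究, 所: 0.4」.
Each unit of the abbreviation generation system may be realized by a combination of hardware and software. In such a form, the abbreviation generation program is loaded into RAM, and hardware such as a control unit (CPU) operates according to the program, thereby realizing each unit as its respective means. The program may also be distributed fixed on a storage medium. A program recorded on such a medium is read into memory via a wired or wireless connection or from the medium itself, and operates the control unit and other components. Examples of recording media include optical disks, magnetic disks, semiconductor memory devices, and hard disks.
Put another way, the embodiment can be realized by causing the control unit of an information processing system to operate, according to an abbreviation generation program loaded into RAM, as important word selection means, abbreviation candidate generation means, importance rule storage means, and conversion rule storage means.
The abbreviation generation system may be built as a standalone computer, as illustrated in FIG. 4, or as a server-client system.
Although an embodiment and processing examples have been illustrated and described above, modifications such as splitting or merging blocks and reordering steps may be made freely as long as they satisfy the gist of the invention and the functions described; the description of the embodiment above does not limit the invention.
For example, the abbreviation generation system can be built on the Internet using a server.
As described above, the present invention can provide an abbreviation generation system that accurately generates, from an original word, abbreviations that are likely to arise in a community.
That is, abbreviations likely to arise in a community can be generated automatically and accurately.
By collecting the generated abbreviations, the invention can also be used for name identification, information retrieval, information extraction, and the like in computer devices, Internet systems, and so on.
This application claims priority based on Japanese Patent Application No. 2012-031826, filed on February 16, 2012, the entire disclosure of which is incorporated herein.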
An embodiment of the present invention will be described with reference to FIGS. In the present embodiment, a process of performing morphological analysis on an original word composed of a plurality of accepted words and generating abbreviations using the analysis result will be described.
Referring to FIG. 1, the abbreviation generation system according to the embodiment includes an input device 1, a data processing device 2, a storage device 3, and an output device 4. The input device 1 is a device that receives an original word, a desired number of abbreviations, a display number of abbreviation candidates, and the like from a user. The input device 4 is a device that presents the generated abbreviation to the user.
The data processing device 2 includes an important morpheme selection unit 20 and an abbreviation candidate generation unit 21.
The important morpheme selection unit 20 performs morphological analysis on the original word input from the input device 1 and, based on an index that indicates importance according to the content of each morpheme and is stored in the morpheme importance rule storage unit 30, selects the morphemes to be used for the abbreviation.
The abbreviation candidate generation unit 21 converts the character string of each selected morpheme based on the conversion rules stored in the morpheme conversion rule storage unit 31 and generates abbreviation candidates to be presented to the user.
The storage device 3 includes a morpheme importance rule storage unit 30 and a morpheme conversion rule storage unit 31 that hold rules used in each process of the data processing device 2.
The morpheme importance rule storage unit 30 stores rules for quantifying the importance of morphemes for morpheme selection, created as indices based on the group of information used in the community. In other words, it stores, for each morpheme, an index indicating the degree to which that morpheme is used in generating abbreviations within the community.
These morpheme importance rules for calculating importance are a set of indices constructed by collecting and analyzing abbreviations and their original words from various information that has been used in the community. The rules can draw on manually created data, on data obtained from corpora and abbreviation databases in which at least pairs of abbreviations and original words are recorded, and on abbreviations used in the community together with their synonyms.
As the various information used in the community, text and audio sources used in the community can be used. For example, a text corpus or a speech corpus based on a plurality of documents created by the community, or on the source language system used in the community, may be used.
As an index recorded in the morpheme importance rule storage unit 30, an index can be used that indicates, for each combination of morphemes, which of the morphemes treated as a combination is relatively more likely to be used in generating an abbreviation.
An index can also be used that indicates, within a combination of a plurality of morphemes, which morpheme or combination of morphemes is relatively more likely to be used in generating an abbreviation.
Further, a value representing each morpheme's adoption into abbreviations can be used as the index.
Moreover, these indicators can be used in combination.
The morpheme conversion rule storage unit 31 stores a rule for converting each morpheme into a character string for abbreviation generation. This conversion rule is preferably determined by collecting and analyzing conversion rules that have been used based on various information used in the community.
The conversion rules are, for example, rules such as "adopt the first character of the morpheme", "adopt the first character of the first morpheme and the first two characters of the second morpheme", "reduce voiced sounds", "eliminate long vowels", "take the initials after translating into English", and "do not convert specific morphemes". Various existing conversion rules may be used. When there are a plurality of conversion rules, a candidate is generated for each combination in which those rules are applied.
Next, the operation of the embodiment will be described using a specific processing example with reference to the accompanying flowchart. The original word to be entered is "Disaster Prevention Science and Technology Research Institute" (防災科学技術研究所).
The abbreviation generation system receives, from the input device 1, the original word for which abbreviation generation is requested (step S1). At this time, conditions desired by the user may also be accepted.
Next, the important morpheme selection unit 20 performs a morpheme analysis on the received original word, and selects a morpheme used for abbreviation generation (step S2).
For example, the original word is divided into the morphemes "disaster prevention / science / technology / research / place" (防災 / 科学 / 技術 / 研究 / 所). Note that this processing can be omitted if the original word is received from the user already divided into words (for example, "disaster prevention / science / technology / laboratory"). Further, a plurality of division patterns may be selected and the subsequent processing performed for each in parallel.
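The division step can be illustrated with a toy longest-match segmenter. This is only a stand-in for a real morphological analyzer, which the text does not specify. The vocabulary `VOCAB`, the function name `segment`, and the Japanese form of the example term (防災科学技術研究所) are assumptions made for illustration.

```python
# Toy stand-in for morphological analysis: greedy longest-match against a
# small, hypothetical vocabulary. A real system would use a proper analyzer.
VOCAB = {"防災", "科学", "技術", "研究", "所"}


def segment(term: str, vocab: set) -> list:
    """Split `term` into vocabulary entries, preferring the longest match."""
    max_len = max(len(word) for word in vocab)
    morphemes = []
    i = 0
    while i < len(term):
        for size in range(min(max_len, len(term) - i), 0, -1):
            piece = term[i:i + size]
            if piece in vocab:
                morphemes.append(piece)
                i += size
                break
        else:
            # Unknown character: keep it as a single-character morpheme.
            morphemes.append(term[i])
            i += 1
    return morphemes


morphemes = segment("防災科学技術研究所", VOCAB)
```

With this vocabulary the term splits into five pieces corresponding to "disaster prevention / science / technology / research / place". A production system would need to handle ambiguous divisions, which is why the text allows several division patterns to be processed in parallel.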
Next, the important morpheme selection unit 20 refers to the morpheme importance rule storage unit 30, calculates the importance according to the content of each morpheme, and selects the morphemes to be used for the abbreviation based on the importance (step S3).
In this example, two morphemes are treated as a pair, the two morphemes in the pair are compared, and a score is calculated from the probability that one is adopted into the abbreviation in preference to the other. Taking the result as the importance, the morphemes to be prioritized are selected according to their level of importance.
Next, the abbreviation candidate generation unit 21 refers to the morpheme conversion rule storage unit 31 and applies the morpheme conversion rules (rules for character string conversion and combination) to the selected morphemes to generate abbreviation candidates (step S4).
For example, applying the rule "adopt the first character of the morpheme" to "disaster prevention", "science", and "research", which have high importance, and combining the results yields an abbreviation made of the first characters of the three morphemes (防科研 for the Japanese original word). When there are a plurality of conversion rules, one or more candidates may be generated for each combination in which those rules are applied. The conversion rules may be selected directly by the user, or the system may decide them based on the number of characters entered by the user. Alternatively, they may be selected automatically through lexical analysis of the original word. It is better still if the system makes the selection so as to reflect the various information used in the community. Also, all conversion rules may be applied, or the number of conversion rules to apply may be obtained from the user and adjusted when the system presents the candidates.
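Step S4 can be sketched as follows: each selected morpheme is passed through one conversion rule, and a candidate is produced for every combination of rules. The two rules and the names `RULES` and `generate_candidates` are hypothetical, and the Japanese morphemes 防災, 科学, and 研究 are assumed to correspond to "disaster prevention", "science", and "research".

```python
from itertools import product


# Hypothetical conversion rules; each maps a morpheme to a shortened string.
def first_char(morpheme: str) -> str:
    return morpheme[:1]


def first_two_chars(morpheme: str) -> str:
    return morpheme[:2]


RULES = {"first_char": first_char, "first_two_chars": first_two_chars}


def generate_candidates(morphemes: list) -> dict:
    """Apply every combination of rules, one rule per morpheme.

    Returns a mapping from candidate string to the tuple of rule names that
    produced it (the first combination found is kept for duplicate strings).
    """
    candidates = {}
    for combo in product(RULES, repeat=len(morphemes)):
        text = "".join(RULES[name](m) for name, m in zip(combo, morphemes))
        candidates.setdefault(text, combo)
    return candidates


candidates = generate_candidates(["防災", "科学", "研究"])
```

Keeping the rule names alongside each candidate makes it possible to score or filter candidates later by which rules produced them.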
Next, the abbreviation candidate generation unit 21 presents the one or more generated abbreviation candidates to the user via the output device 4 (step S6). An example of the screen presented to the user via the output device 4 at this time is shown in the drawings.
The abbreviation candidates are presented in the number previously specified by the user. For ordering them, the score obtained in the above process, a score based on the co-occurrence probability with the original word, and the degree of match with a character string received from the user describing their thoughts on the abbreviation or the original word may be used in an integrated manner.
It is also desirable to present to the user the correspondence between the characters of each abbreviation candidate and the characters of the original word. In FIG. 3, the relationship between the original word and the abbreviation as character strings is presented visually only for abbreviation candidate 1, the candidate best suited to the community. The display may instead visually relate the original word to an abbreviation candidate selected by the user.
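One way to compute such a character correspondence is to align the abbreviation against the original word as a left-to-right subsequence. The sketch below is an assumption about how this could be done, not the display logic of FIG. 3. The function names and the bracket notation are hypothetical, and the strings 防科研 and 防災科学技術研究所 are assumed Japanese forms of the example.

```python
from typing import List, Optional


def align(abbreviation: str, original: str) -> Optional[List[int]]:
    """Return, for each abbreviation character, its index in `original`.

    Characters are matched left to right as a subsequence. None is returned
    if the abbreviation cannot be aligned this way (e.g. after a sound change).
    """
    positions = []
    start = 0
    for ch in abbreviation:
        idx = original.find(ch, start)
        if idx == -1:
            return None
        positions.append(idx)
        start = idx + 1
    return positions


def render(abbreviation: str, original: str) -> str:
    """Bracket the original characters that survive in the abbreviation."""
    kept = set(align(abbreviation, original) or [])
    return "".join(f"[{c}]" if i in kept else c for i, c in enumerate(original))


positions = align("防科研", "防災科学技術研究所")
marked = render("防科研", "防災科学技術研究所")
```

Abbreviations produced by rules that change characters, such as translating into English and taking initials, cannot be aligned this way and would need rule-specific bookkeeping.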
In addition, a free-description field may be provided on the presentation screen, and the presentation order of the generated abbreviation candidates may be changed by adjusting their scores using the character string entered in that field. In this field, for example, "thoughts" and "priorities" may be received separately and different processing assigned to each. Alternatively, this input may be accepted together with the initial input of the original word.
The adjustment here may consist of, for example, identifying words or similar words in the entered character string, quantifying their degree of match with the words used for each generated abbreviation, and adding points to the abbreviation candidates that obtain high results. In this way, "thoughts" and "priorities" can be reflected in the order of presentation.
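A minimal sketch of this adjustment is given below. It assumes exact substring matching only (similar-word matching is left out) and a fixed bonus per match. The function name, the bonus value, and the candidate labels are hypothetical.

```python
def adjust_scores(candidates: dict, description: str, bonus: float = 0.5) -> list:
    """Add `bonus` for each source word of a candidate found in the free text.

    `candidates` maps an abbreviation label to (base_score, words_used).
    Returns (label, adjusted_score) pairs sorted by score, highest first.
    """
    adjusted = []
    for label, (score, words) in candidates.items():
        matches = sum(1 for word in words if word in description)
        adjusted.append((label, score + bonus * matches))
    return sorted(adjusted, key=lambda item: item[1], reverse=True)


# Illustrative data: two candidates built from different morphemes.
example = {
    "cand_a": (1.0, ["disaster prevention", "research"]),
    "cand_b": (1.2, ["science", "technology"]),
}
ranked = adjust_scores(example, "We want to emphasise disaster prevention research")
```

Here the candidate built from the words mentioned in the free text overtakes the one with the higher base score, which is the reordering effect the paragraph describes.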
Further, the conversion rule may be selected based on the character string input in the free description field.
When reliability is assigned to each rule, abbreviation candidates generated by a combination of rules with low reliability may not be output.
For example, one conceivable method is to take the product of the reliabilities of the applied rules and, using a threshold, not output a candidate whose product is below a certain value.
Further, the generated abbreviations may be scored using the reliability of the rules and the importance of the morphemes, and abbreviation candidates may be output together with the scores.
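The reliability-product filter described above can be sketched as follows. The reliability values, the rule names, and the threshold are illustrative assumptions, not values from the source.

```python
from math import prod

# Hypothetical reliabilities for conversion rules (values are illustrative).
RELIABILITY = {"first_char": 0.9, "first_two_chars": 0.6, "drop_long_vowel": 0.4}


def candidate_reliability(rules_applied: list) -> float:
    """Product of the reliabilities of every rule used for a candidate."""
    return prod(RELIABILITY[name] for name in rules_applied)


def filter_candidates(candidates: dict, threshold: float = 0.3) -> list:
    """Keep (candidate, reliability) pairs whose reliability meets the threshold."""
    scored = [(c, candidate_reliability(rules)) for c, rules in candidates.items()]
    return [(c, score) for c, score in scored if score >= threshold]


example = {
    "cand_1": ["first_char", "first_char"],
    "cand_2": ["first_two_chars", "drop_long_vowel"],
}
kept = filter_candidates(example)
```

Because the surviving candidates are returned together with their products, the same values can be shown to the user as scores or combined with the morpheme importance.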
Such rules and reliabilities may be created manually, or various values collected using existing technology may be used.
Here, the importance rule for selecting the words to be prioritized will be described. The morpheme importance rule described below is an index, determined from various types of information used in the community, that indicates the probability that a specific morpheme should be given priority over another morpheme. In other words, the index indicates, between morphemes, the relative probability of remaining in an abbreviation within the community, as derived from the information obtained in the community.
For example, the morpheme importance rule is determined as follows.
・ "Disaster prevention > Science: 0.7" (disaster prevention is preferred over science with a probability of 70%)
・ "Disaster prevention > Technology: 0.7"
・ "Disaster prevention > Research: 0.5"
・ "Disaster prevention > Place: 0.9"
・ "Science > Technology: 0.6"
・ "Science > Place: 0.9"
・ "Technology > Place: 0.9"
・ "Research > Science: 0.7"
・ "Research > Technology: 0.6"
・ "Research > Place: 0.9"
The probability of the rule in the reverse direction may be obtained by subtracting the probability of the rule from 1. For example, the reverse of the first rule is "Science > Disaster prevention: 0.3 (= 1.0 − 0.7)". If the order of appearance of the words is taken into account, the probability of the reverse rule may be stored as a separate index.
In this example, using these morpheme importance rules, the importance of a morpheme is calculated as the sum of the values obtained by comparing it with each of the other morphemes.
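The calculation can be written compactly. The notation here is introduced only for illustration and does not appear in the original: $W$ is the set of morphemes in the original word, $P(a \succ b)$ is the stored probability that morpheme $a$ is preferred over morpheme $b$, and $I(m)$ is the importance of morpheme $m$.

```latex
I(m) = \sum_{m' \in W,\; m' \neq m} P(m \succ m'),
\qquad
P(m' \succ m) = 1 - P(m \succ m')
```

The second relation is the complement convention for reverse rules. It would not hold if reverse probabilities were stored as separate indices to account for word order.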
For example, the importance of the morpheme "disaster prevention", compared according to the content of the original word, is 2.8. This value is the sum of 0.7 (disaster prevention versus science), 0.7 (disaster prevention versus technology), 0.5 (disaster prevention versus research), and 0.9 (disaster prevention versus place).
As described above, the important morpheme selection unit 20 performs the same calculation on all morphemes included in the original word and calculates the importance of each morpheme. The values are as follows.
・ "Disaster prevention" 2.8 (= 0.7 + 0.7 + 0.5 + 0.9)
・ "Science" 2.1 (= 0.3 + 0.6 + 0.3 + 0.9)
・ "Technology" 1.9 (= 0.3 + 0.4 + 0.3 + 0.9)
・ "Research" 2.7 (= 0.5 + 0.7 + 0.6 + 0.9)
・ "Place" 0.4 (= 0.1 + 0.1 + 0.1 + 0.1)
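The summation above can be reproduced with a short sketch. The dictionary layout and the function names are assumptions; the probabilities are the pairwise rules listed in the example, with reverse rules taken as one minus the stored value.

```python
# Pairwise rules from the example: (preferred, other) -> probability.
RULES = {
    ("disaster prevention", "science"): 0.7,
    ("disaster prevention", "technology"): 0.7,
    ("disaster prevention", "research"): 0.5,
    ("disaster prevention", "place"): 0.9,
    ("science", "technology"): 0.6,
    ("science", "place"): 0.9,
    ("technology", "place"): 0.9,
    ("research", "science"): 0.7,
    ("research", "technology"): 0.6,
    ("research", "place"): 0.9,
}


def preference(a: str, b: str) -> float:
    """Probability that `a` is preferred over `b`, using 1 - p for reverse rules."""
    if (a, b) in RULES:
        return RULES[(a, b)]
    return 1.0 - RULES[(b, a)]


def importance(morphemes: list) -> dict:
    """Sum each morpheme's preference over every other morpheme."""
    return {
        m: round(sum(preference(m, other) for other in morphemes if other != m), 2)
        for m in morphemes
    }


scores = importance(["disaster prevention", "science", "technology", "research", "place"])
```

With the rules exactly as listed, the sum for "technology" evaluates to 2.0 (0.3 + 0.4 + 0.4 + 0.9), which differs slightly from the 1.9 printed above; the ranking of the morphemes is unaffected.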
For example, if three words (morphemes) are selected to remain, they are "disaster prevention", "science", and "research"; if two words are selected, they are "disaster prevention" and "research". The number of words to select is arbitrary and may be determined by a threshold on importance or by rank. Alternatively, all words may be left as candidates and the adjustment made on the abbreviation candidate generation unit 21 side.
In this example, the score for using each morpheme in an abbreviation candidate is thus calculated from the probability that one morpheme of a pair remains in the abbreviation candidate in preference to the other. As a result, the semantic content of each morpheme and the relationships between morphemes are taken into account from a statistical viewpoint derived from abbreviation examples actually used in the community, which leads to good candidates.
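Selection by rank or by threshold can be sketched as below. The function name and parameters are hypothetical; the importance values are the ones printed in the example, and the selected morphemes are returned in their original word order.

```python
def select_morphemes(importance: dict, morphemes: list,
                     top_k: int = None, threshold: float = None) -> list:
    """Pick morphemes by rank and/or threshold, keeping original word order."""
    ranked = sorted(morphemes, key=lambda m: importance[m], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
    chosen = {m for m in ranked if threshold is None or importance[m] >= threshold}
    return [m for m in morphemes if m in chosen]


MORPHEMES = ["disaster prevention", "science", "technology", "research", "place"]
IMPORTANCE = {
    "disaster prevention": 2.8,
    "science": 2.1,
    "technology": 1.9,
    "research": 2.7,
    "place": 0.4,
}

top3 = select_morphemes(IMPORTANCE, MORPHEMES, top_k=3)
top2 = select_morphemes(IMPORTANCE, MORPHEMES, top_k=2)
```

Returning the morphemes in original order rather than score order keeps the later character-string combination aligned with the original word.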
At this time, it is desirable to convert the character string of each selected morpheme according to the conversion rule based on the abbreviation examples used in the community collected in advance. As a result, abbreviations that are more likely to be generated in the community can be automatically generated with high accuracy.
In this example, calculation of importance by comparison between morphemes is shown, but the present invention is not limited to this. For example, the importance of one morpheme may be used, or a comparison of three or more morphemes may be used.
For the importance of a single morpheme, any measure that quantifies the importance of a word, such as TF-IDF, can be used, and a value representing each word's adoption into abbreviations may be used.
In the case of three or more morphemes, comparisons among multiple morphemes can be handled in the same way as comparisons between two morphemes, for example "Research > Technology, place: 0.8" or "Technology > Research, place: 0.5". Also, as in "Technology, Research > Place: 0.9" or "Disaster prevention, Technology > Research, place: 0.4", an index may be used that indicates whether a word or combination of words is more readily used in generating an abbreviation relative to another word or combination of words.
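As an illustration of a single-morpheme measure, the sketch below computes one common smoothed TF-IDF variant. The source names TF-IDF only as an example and does not specify a formula, so the smoothing, the toy corpus, and the function name are assumptions.

```python
import math
from collections import Counter


def tf_idf(term: str, document: list, corpus: list) -> float:
    """TF-IDF of `term` in `document` (a token list) relative to `corpus`."""
    tf = Counter(document)[term] / len(document)
    doc_freq = sum(1 for doc in corpus if term in doc)
    # Add-one smoothing keeps the logarithm finite for unseen terms.
    idf = math.log((1 + len(corpus)) / (1 + doc_freq)) + 1.0
    return tf * idf


# Toy corpus of tokenized original words (illustrative only).
CORPUS = [
    ["disaster prevention", "science", "technology", "research", "place"],
    ["science", "research", "place"],
    ["technology", "research", "place"],
]
rare = tf_idf("disaster prevention", CORPUS[0], CORPUS)
common = tf_idf("place", CORPUS[0], CORPUS)
```

A morpheme that appears in few original words scores higher than one that appears in almost all of them, which matches the intuition that a generic suffix is less likely to be kept in an abbreviation.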
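Rules over groups of morphemes can be stored much like the two-morpheme rules. The representation below, with unordered groups as keys, is a hypothetical sketch. Applying the complement convention to reverse lookups is an assumption carried over from the two-morpheme case.

```python
# (left group, right group) -> probability that the left group is preferred.
# The probabilities are the example values given in the text.
GROUP_RULES = {
    (frozenset({"research"}), frozenset({"technology", "place"})): 0.8,
    (frozenset({"technology"}), frozenset({"research", "place"})): 0.5,
    (frozenset({"technology", "research"}), frozenset({"place"})): 0.9,
    (frozenset({"disaster prevention", "technology"}),
     frozenset({"research", "place"})): 0.4,
}


def group_preference(left, right):
    """Probability that `left` is preferred over `right`, or None if unknown."""
    key = (frozenset(left), frozenset(right))
    if key in GROUP_RULES:
        return GROUP_RULES[key]
    reverse = (key[1], key[0])
    if reverse in GROUP_RULES:
        # Assumes the complement convention used for two-morpheme rules.
        return 1.0 - GROUP_RULES[reverse]
    return None


forward = group_preference({"research"}, {"technology", "place"})
backward = group_preference({"place"}, {"technology", "research"})
```

Using `frozenset` keys makes a lookup independent of the order in which the words of a group are listed.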
Each part of the abbreviation generation system may be realized using a combination of hardware and software. In a form in which hardware and software are combined, the abbreviation generation program is loaded into RAM, and each part is realized as the various means by operating hardware such as a control unit (CPU) based on the program. The program may also be recorded in a fixed manner on a storage medium and distributed. The program recorded on the recording medium is read into memory via a wired or wireless connection, or via the recording medium itself, and operates the control unit and the like. Examples of the recording medium include optical disks, magnetic disks, semiconductor memory devices, and hard disks.
To describe the above embodiment in other terms, an information processing system operated as an abbreviation generation system can be realized by operating the control unit, based on the abbreviation generation program loaded into RAM, as the important word selection means, the abbreviation candidate generation means, the importance rule storage means, and the conversion rule storage means.
The abbreviation generation system may be constructed as a single computer as illustrated in FIG. 4 or may be constructed as a server-client system.
Although the embodiments and processing examples have been illustrated and described above, changes such as separating or merging block configurations and reordering procedures may be made freely as long as the gist of the present invention and the functions described are satisfied; the above description of the embodiments does not limit the present invention.
For example, an abbreviation generation system can be constructed on the Internet using a server.
As described above, according to the present invention, it is possible to provide an abbreviation generation system that accurately generates abbreviations that are likely to be generated in a community from original words.
That is, it becomes possible to automatically generate abbreviations that are highly likely to be generated in the community with high accuracy.
In addition, by collecting the generated abbreviations, the present invention can be used for name identification, information retrieval, information extraction, and the like in computer devices, Internet systems, and so on.
This application claims priority based on Japanese Patent Application No. 2012-031826, filed on February 16, 2012, the entire disclosure of which is incorporated herein.
1 Input device
2 Data processing device
3 Storage device
4 Output device
20 Important morpheme selection unit (important word selection unit, important word selection means)
21 Abbreviation candidate generation unit (abbreviation candidate generation means)
30 Morpheme importance rule storage unit (importance rule storage unit, importance rule storage means)
31 Morpheme conversion rule storage unit (conversion rule storage unit, conversion rule storage means)

Claims (18)

  1.  An abbreviation generation system for generating an abbreviation from the character string of an original word, comprising:
    an importance rule storage unit that stores predetermined words in association with indices indicating the degree to which each word is used in generating abbreviations, such that the association resembles a group of information used within a community;
    an important word selection unit that, for a received original word composed of a plurality of words, compares the indices of the plurality of words with one another and orders them, thereby selecting, from the plurality of words, the words to be used in generating an abbreviation in order of priority; and
    an abbreviation candidate generation unit that generates abbreviation candidates using the selected words and outputs the abbreviation candidates.
  2.  The abbreviation generation system according to claim 1, wherein
    the importance rule storage unit stores, as the index and in association with each word, for each combination of words, an index indicating which of the words treated as a combination is relatively more likely to be used in generating an abbreviation,
    the important word selection unit, for each combination of words in the original word, compares with one another and orders the indices indicating how readily each word is used in generating an abbreviation, thereby selecting, from the plurality of words, the words to be used in generating an abbreviation in order of priority, and
    the abbreviation candidate generation unit generates and outputs one or more abbreviations using the selected words.
  3.  The abbreviation generation system according to claim 1, wherein
    the importance rule storage unit stores, as the index and in association with each word, an index indicating, within a combination of a plurality of words, which word or combination of words is relatively more likely to be used in generating an abbreviation,
    the important word selection unit selects, from the plurality of words, the words to be used in generating an abbreviation by extracting, based on the index, the words likely to be used in generating an abbreviation from the words or combinations of words in the original word, and
    the abbreviation candidate generation unit generates and outputs one or more abbreviations using the selected words.
  4.  The abbreviation generation system according to claim 1, wherein
    the importance rule storage unit stores, as the index and in association with each word, a value representing each word's adoption into abbreviations,
    the important word selection unit, based on the index, compares the index values of the plurality of words with one another and selects words with high values, in order of priority, as the words to be used in generating an abbreviation, and
    the abbreviation candidate generation unit generates and outputs one or more abbreviations by combining the selected words.
  5.  The abbreviation generation system according to any one of claims 1 to 4, further comprising a conversion rule storage unit that stores conversion rules, determined based on the group of information used within the community, for converting each selected word into a character string for abbreviation generation, wherein
    the abbreviation candidate generation unit generates and outputs one or more abbreviations in accordance with the conversion rules using the selected words.
  6.  The abbreviation generation system according to any one of claims 1 to 5, wherein the important word selection unit extracts the plurality of words constituting the original word by performing morphological analysis on the original word, and selects, from the plurality of words, the words to be used in generating an abbreviation.
  7.  The abbreviation generation system according to any one of claims 1 to 5, wherein the important word selection unit receives the original word from a user divided into its constituent words, and selects, from the plurality of words thus received, the words to be used in generating an abbreviation.
  8.  The abbreviation generation system according to any one of claims 1 to 7, wherein the abbreviation candidate generation unit, when presenting a generated abbreviation as an abbreviation candidate, performs processing that visually associates the original word and the abbreviation as character strings, and presents the result.
  9.  The abbreviation generation system according to any one of claims 1 to 8, wherein
    the index is constructed by receiving text used in the community and collecting and analyzing the abbreviations used within the community in which abbreviations are to be used, together with the original words of those abbreviations, and
    abbreviations are generated using the index constructed from the abbreviations used in the community.
  10.  The abbreviation generation system according to claim 9, wherein
    the conversion rules are constructed by receiving text used in the community and collecting and analyzing the abbreviations used within the community in which abbreviations are to be used, together with the original words of those abbreviations, and
    abbreviations are generated using the conversion rules constructed from the abbreviations used in the community.
  11.  An abbreviation generation method for generating an abbreviation from the character string of an original word, comprising:
    storing in advance predetermined words in association with indices indicating the degree to which each word is used in generating abbreviations, such that the association resembles a group of information used within a community; and,
    when generating an abbreviation,
    receiving an original word composed of a plurality of words,
    comparing the indices of the plurality of words with one another and ordering them, thereby selecting, from the plurality of words, the words to be used in generating an abbreviation in order of priority, and
    generating abbreviation candidates using the selected words and outputting the abbreviation candidates.
  12.  The abbreviation generation method according to claim 11, wherein
    the indices stored in advance include, in association with each word and for each combination of words, an index indicating which of the words treated as a combination is relatively more likely to be used in generating an abbreviation,
    in the important word selection processing, for each combination of words in the original word, the indices indicating how readily each word is used in generating an abbreviation are compared with one another and ordered, thereby selecting, from the plurality of words, the words to be used in generating an abbreviation in order of priority, and
    in the abbreviation generation processing, one or more abbreviations are generated using the selected words.
  13.  The abbreviation generation method according to claim 11, wherein
    the indices stored in advance include, in association with each word, an index indicating, within a combination of a plurality of words, which word or combination of words is relatively more likely to be used in generating an abbreviation,
    in the important word selection processing, the words to be used in generating an abbreviation are selected from the plurality of words by extracting, based on the index, the words likely to be used in generating an abbreviation from the words or combinations of words in the original word, and
    in the abbreviation generation processing, one or more abbreviations are generated using the selected words.
  14.  The abbreviation generation method according to claim 11, wherein
    the indices stored in advance include, in association with each word, a value representing each word's adoption into abbreviations,
    in the important word selection processing, based on the index, the index values of the plurality of words are compared with one another and words with high values are selected, in order of priority, as the words to be used in generating an abbreviation, and
    in the abbreviation generation processing, one or more abbreviations are generated by combining the selected words.
  15.  A recording medium on which is recorded an abbreviation generation program, used for generating an abbreviation from the character string of an original word, that causes an information processing system to operate as:
    an importance rule storage unit that stores predetermined words in association with indices indicating the degree to which each word is used in generating abbreviations, such that the association resembles a group of information used within a community;
    an important word selection unit that, for a received original word composed of a plurality of words, compares the indices of the plurality of words with one another and orders them, thereby selecting, from the plurality of words, the words to be used in generating an abbreviation in order of priority; and
    an abbreviation candidate generation unit that generates abbreviation candidates using the selected words and outputs the abbreviation candidates.
  16.  The recording medium on which the abbreviation generation program is recorded according to claim 15, wherein the program causes the information processing system to operate such that
    the importance rule storage unit stores, as the index and in association with each word, for each combination of words, an index indicating which of the words treated as a combination is relatively more likely to be used in generating an abbreviation,
    the important word selection unit, for each combination of words in the original word, compares with one another and orders the indices indicating how readily each word is used in generating an abbreviation, thereby selecting, from the plurality of words, the words to be used in generating an abbreviation in order of priority, and
    the abbreviation candidate generation unit generates and outputs one or more abbreviations using the selected words.
  17.  The recording medium on which the abbreviation generation program is recorded according to claim 15, wherein the program causes the information processing system to operate such that
    the importance rule storage unit stores, as the index and in association with each word, an index indicating, within a combination of a plurality of words, which word or combination of words is relatively more likely to be used in generating an abbreviation,
    the important word selection unit selects, from the plurality of words, the words to be used in generating an abbreviation by extracting, based on the index, the words likely to be used in generating an abbreviation from the words or combinations of words in the original word, and
    the abbreviation candidate generation unit generates and outputs one or more abbreviations using the selected words.
  18.  The recording medium on which the abbreviation generation program is recorded according to claim 15, wherein the program causes the information processing system to operate such that
    the importance rule storage unit stores, as the index and in association with each word, a value representing each word's adoption into abbreviations,
    the important word selection unit, based on the index, compares the index values of the plurality of words with one another and selects words with high values, in order of priority, as the words to be used in generating an abbreviation, and
    the abbreviation candidate generation unit generates and outputs one or more abbreviations by combining the selected words.
PCT/JP2013/052968 2012-02-16 2013-02-04 Abbreviation generating system WO2013121988A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2013558668A JP6135867B2 (en) 2012-02-16 2013-02-04 Abbreviation generation system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2012-031826 2012-02-16
JP2012031826 2012-02-16

Publications (1)

Publication Number Publication Date
WO2013121988A1 (en) 2013-08-22

Family

ID=48984100

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2013/052968 WO2013121988A1 (en) 2012-02-16 2013-02-04 Abbreviation generating system

Country Status (2)

Country Link
JP (1) JP6135867B2 (en)
WO (1) WO2013121988A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009041220A1 (en) * 2007-09-26 2009-04-02 Nec Corporation Abbreviation generation device and program, and abbreviation generation method
JP2010191804A (en) * 2009-02-19 2010-09-02 Toshiba Corp Abbreviation estimation device and method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7475343B1 (en) * 1999-05-11 2009-01-06 Mielenhausen Thomas C Data processing apparatus and method for converting words to abbreviations, converting abbreviations to words, and selecting abbreviations for insertion into text

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
KOKI MIYAZAWA ET AL.: "Automatic estimation of abbreviations using phonological generation rules", DAI 53 KAI REPORTS OF THE MEETING OF SPECIAL INTERNET GROUP ON SPOKEN LANGUAGE UNDERSTANDING AND DIALOGUE PROCESSING (SIG-SLUD-A801), 9 July 2008 (2008-07-09), pages 1 - 6 *
TAKAHIRO MIWA ET AL.: "Kyoki Hindo to Ryakugo Keisei Pattern o Mochiita Ryakugo no Jido Suitei", FIT2011 DAI 10 KAI FORUM ON INFORMATION TECHNOLOGY KOEN RONBUNSHU, SEPARATE VOL.2, SADOKU TSUKI RONBUN - IPPAN RONBUN DATABASE SHIZEN GENGO - ONSEI - ONGAKU JINKO CHINO - GAME SEITAI JOHO KAGAKU, INFORMATION PROCESSING SOCIETY OF JAPAN, vol. 2, 22 August 2011 (2011-08-22), pages 335 - 338 *

Also Published As

Publication number Publication date
JPWO2013121988A1 (en) 2015-05-11
JP6135867B2 (en) 2017-05-31

Similar Documents

Publication Publication Date Title
Şeker et al. Initial explorations on using CRFs for Turkish named entity recognition
JP4701292B2 (en) Computer system, method and computer program for creating term dictionary from specific expressions or technical terms contained in text data
Oufaida et al. Minimum redundancy and maximum relevance for single and multi-document Arabic text summarization
JP5106636B2 (en) System for extracting terms from documents with text segments
JP6466952B2 (en) Sentence generation system
US20140351228A1 (en) Dialog system, redundant message removal method and redundant message removal program
JP6505421B2 (en) Information extraction support device, method and program
US20150066474A1 (en) Method and Apparatus for Matching Misspellings Caused by Phonetic Variations
US10055408B2 (en) Method of extracting an important keyword and server performing the same
Cabot et al. Cimind: A phonetic-based tool for multilingual named entity recognition in biomedical texts
JP2012022599A (en) Sentence structure analyzing apparatus, sentence structure analyzing method and sentence structure analyzing program
CN107870900B (en) Method, apparatus and recording medium for providing translated text
JP6409071B2 (en) Sentence sorting method and calculator
KR102351745B1 (en) User Review Based Rating Re-calculation Apparatus and Method
JP2012113459A (en) Example translation system, example translation method and example translation program
JP5642037B2 (en) SEARCH DEVICE, SEARCH METHOD, AND PROGRAM
JP5151412B2 (en) Notation fluctuation analyzer
JP4945015B2 (en) Document search system, document search program, and document search method
Lin et al. Evaluating Cross-lingual Semantic Annotation for Medical Forms.
de Mendonça Almeida et al. Evaluating phonetic spellers for user-generated content in Brazilian Portuguese
JP6135867B2 (en) Abbreviation generation system
Alam et al. Comparing named entity recognition on transcriptions and written texts
Bernik et al. DIAGÑOZA: a Natural Language Processing Tool for Automatic Annotation of Clinical Free Text with SNOMED-CT.
JP5289261B2 (en) Text conversion device, method and program
JP5506482B2 (en) Named entity extraction apparatus, string-named expression class pair database creation apparatus, numbered entity extraction method, string-named expression class pair database creation method, program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13749285

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2013558668

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 13749285

Country of ref document: EP

Kind code of ref document: A1