WO2013121988A1

WO2013121988A1 - Abbreviation generating system

Info

Publication number: WO2013121988A1
Application number: PCT/JP2013/052968
Authority: WO
Inventors: 石川　開; 正明土田; 貴士大西; 早人山名; 孝徳及川
Original assignee: 日本電気株式会社; 学校法人早稲田大学
Priority date: 2012-02-16
Filing date: 2013-02-04
Publication date: 2013-08-22
Also published as: JPWO2013121988A1; JP6135867B2

Abstract

A system for generating abbreviations from the character strings of original words is constructed to include at least: an importance rule storage unit that stores a given word in association with an index indicating the extent to which the word is used in generating abbreviations, the index being defined so as to reflect the information group used in a community; an important word selecting unit that compares the indices of the words of a received original word, ranks them, and selects the words to be used in generating abbreviations in order of priority; and an abbreviation candidate generating unit that generates abbreviation candidates using the selected words and outputs them. As a result, abbreviations for the names of products and services that conform to the words used in the community are obtained.

Description

Abbreviation generation system

The present invention relates to an abbreviation generation system, an abbreviation generation method, and an abbreviation generation program for generating abbreviations from original words input by information processing.

In society, abbreviated names formed by combining parts of an official name are used for names, functions, products, and the like. Abbreviated names of persons and organizations are treated here as abbreviations in the same way.
The function of associating an abbreviation with its original word using an information processing system is a technique useful in many applications targeting natural languages, such as name identification, information retrieval, and information extraction.
Machine-readable dictionary data can be created by manually collecting correspondences between abbreviations and original words, and this is done in practice. However, as new products, services, works, organizations, and the like appear, abbreviations arise spontaneously in the community one after another, and there is a limit to how accurately they can be collected by hand. Generating correct dictionary data for names is particularly laborious and difficult. For this reason, correspondence dictionary data is today created automatically from a corpus or the Web by an information processing system and updated as appropriate.
Meanwhile, a method has been proposed in which an original word that is a formal name is accepted as input and abbreviation candidates are generated automatically by information processing. An example of such an abbreviation generation method is described in Non-Patent Document 1.
The automatic abbreviation estimation method described in Non-Patent Document 1 extracts likely abbreviation candidates using a probability model and then narrows down the extracted candidates using information on the Web. The narrowing is done by verifying, for each abbreviation candidate, whether the original word and the candidate are synonymous. The probability model adopted is a noisy-channel model. Non-Patent Document 1 also describes existing technologies such as conversion rules and desirable presentation methods.

Abbreviations used in a community tend to retain the expressions that convey what is important about the original word, such as the thoughts behind it and its characteristics, while being short, distinguishable from other abbreviations, and low in redundancy.
In the method described in Non-Patent Document 1, however, the morphemes used for the abbreviation are selected from the character type and the mora position information of the original word, so neither the semantic content of each morpheme nor the relationship between morphemes is considered. As a result, depending on the pattern of the original word, a candidate different from the abbreviation that would arise in the community is preferentially selected. In other words, abbreviation candidates that suit the community, as desired by the user, are not generated.
In addition, because verification relies on existing Web information, the abbreviation candidates obtained by the method of Non-Patent Document 1 must already be in use on the Internet. The method therefore cannot be applied to new abbreviations or to fields whose terms are not used on the Internet.
The present invention provides an abbreviation generation system that accurately generates, from original words, abbreviations that are likely to arise in a community.

An abbreviation generation system according to the present invention, which generates an abbreviation from a character string of an original word, includes: an importance rule storage unit that stores, in association with each other, a predetermined word and an index indicating the degree to which the word is used in generating abbreviations, the index being defined so as to resemble the information group used in a community; an important word selection unit that, for a received original word composed of a plurality of words, compares and orders the indices of the plurality of words and thereby selects, in order of priority, the words to be used in generating an abbreviation; and an abbreviation candidate generation unit that generates abbreviation candidates using the selected words and outputs the abbreviation candidates.

According to the present invention, it is possible to provide an abbreviation generation system that accurately generates abbreviations that are likely to be generated in the community from original words.

FIG. 1 is a block diagram showing a system configuration of an embodiment of the present invention.
FIG. 2 is a flowchart illustrating an example of a processing operation in the embodiment.
FIG. 3 is a schematic diagram illustrating a presentation example of abbreviation candidates.
FIG. 4 is a block diagram showing an example of realization relating to the present invention.

An embodiment of the present invention will be described with reference to the drawings. In the present embodiment, a process of performing morphological analysis on an accepted original word composed of a plurality of words and generating abbreviations using the analysis result will be described.
Referring to FIG. 1, the abbreviation generation system according to the embodiment includes an input device 1, a data processing device 2, a storage device 3, and an output device 4. The input device 1 is a device that receives an original word, a desired number of abbreviations, the number of abbreviation candidates to display, and the like from a user. The output device 4 is a device that presents the generated abbreviations to the user.
The data processing device 2 includes an important morpheme selection unit 20 and an abbreviation candidate generation unit 21.
The important morpheme selection unit 20 is configured to perform morphological analysis on the original word input from the input device 1 and to select the morphemes to be used for abbreviations, based on an index, stored in the morpheme importance rule storage unit 30, that indicates importance according to the content of each morpheme.
The abbreviation candidate generation unit 21 is configured to convert the character string of each selected morpheme based on the conversion rules stored in the morpheme conversion rule storage unit 31 and to generate abbreviation candidates to be presented to the user.
The storage device 3 includes a morpheme importance rule storage unit 30 and a morpheme conversion rule storage unit 31 that hold rules used in each process of the data processing device 2.
In the morpheme importance rule storage unit 30, rules for quantifying the importance of morphemes for morpheme selection are created and stored as indices based on information groups used in the community. In other words, the morpheme importance rule storage unit 30 stores an index indicating the degree used for generating abbreviations within the community for each morpheme.
Such a morpheme importance rule for calculating importance is a set of indices constructed by collecting and analyzing abbreviations and their original words from various information that has been used in the community. Sources that can be used for these rules include manually created data, data obtained from corpora or abbreviation databases in which at least pairs of abbreviations and original words are recorded, and abbreviations used in the community together with their synonyms.
As various information used in the community, sentences and sound sources used in the community can be used. For example, a text corpus or a speech corpus based on a plurality of documents created by the community or a source language system used in the community may be used.
As an index to be recorded in the morpheme importance rule storage unit 30, an index can be used that indicates, for each combination of morphemes, which of the morphemes treated as a combination is relatively more readily used in generating an abbreviation.
In addition, for a combination of a plurality of morphemes, an index can be used that indicates which morpheme or combination of morphemes is relatively more readily used in generating an abbreviation than another morpheme or combination of morphemes.
Further, a value indicating the degree to which each morpheme is adopted in abbreviations can be used as the index.
Moreover, these indicators can be used in combination.
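As an illustration only, the following Python sketch shows one way a pairwise index of the kind described above might be estimated from collected abbreviation examples. The function name `estimate_pairwise_rules`, the input format (pairs of original morphemes and the morphemes that survived in the abbreviation), and the simple win-ratio estimate are all assumptions made for this sketch; the description does not prescribe a particular estimation procedure.

```python
from collections import Counter
from itertools import permutations


def estimate_pairwise_rules(examples):
    """Estimate P(a is kept in preference to b) from abbreviation examples.

    examples: iterable of (original_morphemes, kept_morphemes) pairs
    (hypothetical input format). A "win" for a over b is counted whenever
    a survives in the abbreviation and b does not.
    """
    wins = Counter()
    for original, kept in examples:
        kept = set(kept)
        for a, b in permutations(original, 2):
            if a in kept and b not in kept:
                wins[(a, b)] += 1

    rules = {}
    for (a, b), n in wins.items():
        total = n + wins.get((b, a), 0)
        rules[(a, b)] = n / total  # simple relative frequency, no smoothing
    return rules
```

For instance, if "research" survives over "place" in two of three collected examples, the estimated value for "research > place" is 2/3. A real system would likely need smoothing for rarely observed pairs.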
The morpheme conversion rule storage unit 31 stores a rule for converting each morpheme into a character string for abbreviation generation. This conversion rule is preferably determined by collecting and analyzing conversion rules that have been used based on various information used in the community.
Examples of conversion rules include “adopt the first character of the morpheme”, “adopt the first character of the first morpheme and the first two characters of the second morpheme”, “remove voiced sound marks”, “remove long vowels”, “take the initials of the English translation”, and “do not convert specific morphemes”. Various existing conversion rules may be used. When there are a plurality of conversion rules, a candidate is generated for each combination in which those rules are applied.
Next, the operation of the embodiment will be described using a specific processing example with reference to the flowchart shown in FIG. 2. The original word to be entered is “Disaster Prevention Science and Technology Research Institute”.
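The generation of one candidate per combination of rule applications can be sketched as follows. This is a minimal illustration, not the implementation of the morpheme conversion rule storage unit 31: the three rule functions, the `RULES` table, and the use of English glosses as stand-ins for morphemes are assumptions made for the example.

```python
from itertools import product


# Hypothetical conversion rules: each maps one morpheme string to a
# (possibly shortened) string.
def first_char(morpheme):
    return morpheme[:1]


def first_two_chars(morpheme):
    return morpheme[:2]


def keep_as_is(morpheme):
    return morpheme


RULES = {
    "first_char": first_char,
    "first_two_chars": first_two_chars,
    "keep": keep_as_is,
}


def generate_candidates(morphemes, rules=RULES):
    """Apply every combination of rules (one rule per morpheme).

    Returns a dict mapping each candidate string to the first rule
    combination that produced it.
    """
    candidates = {}
    for combo in product(rules, repeat=len(morphemes)):
        abbrev = "".join(rules[name](m) for name, m in zip(combo, morphemes))
        candidates.setdefault(abbrev, combo)
    return candidates
```

Calling `generate_candidates(["disaster", "science"])` yields nine candidates, one for each pairing of the three rules, including `"ds"` from applying `first_char` to both morphemes. The number of combinations grows exponentially with the number of morphemes, which is one reason to select important morphemes before conversion.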
The abbreviation generation system receives the original word requested to generate the abbreviation from the input device 1 (step S1). At this time, input of conditions desired by the user may be accepted.
Next, the important morpheme selection unit 20 performs a morpheme analysis on the received original word, and selects a morpheme used for abbreviation generation (step S2).
For example, “Disaster Prevention Science and Technology Research Institute” is divided into the morphemes “disaster prevention / science / technology / research / place”. This processing can be omitted if the original word is received from the user already divided into words (for example, “disaster prevention / science / technology / laboratory”). A plurality of division methods may also be selected and the subsequent processing performed in parallel for each.
Next, the important morpheme selection unit 20 refers to the morpheme importance rule storage unit 30, calculates the importance according to the content of each morpheme, and selects the morphemes to be used for the abbreviation based on the importance (step S3).
In this example, two morphemes are treated as a pair, the two morphemes in each pair are compared, and a score is calculated from the probability that one is adopted in the abbreviation in preference to the other. The result is taken as the importance, and the morphemes to be prioritized are selected according to their importance.
Next, the abbreviation candidate generation unit 21 refers to the morpheme conversion rule storage unit 31 and applies the morpheme conversion rules (rules for character string conversion and combination) to the selected morphemes to generate abbreviation candidates (step S4).
For example, the rule “adopt the first character of the morpheme” is applied to “disaster prevention”, “science”, and “research”, which have high importance, and the resulting characters are combined into an abbreviation candidate. When there are a plurality of conversion rules, one or more candidates may be generated for each combination in which those rules are applied. The conversion rule may be selected directly by the user, or the system may select it based on the number of characters entered by the user. Alternatively, it may be selected automatically by lexical analysis of the original word. It is even better if the system makes the selection so as to reflect the various information used in the community. All conversion rules may be applied, or the number of conversion rules to apply may be obtained from the user and used to adjust what the system presents.
Next, the abbreviation candidate generation unit 21 presents the one or more generated abbreviation candidates to the user via the output device 4 (step S6). An example of a screen presented to the user via the output device 4 at this time is shown in FIG. 3.
As many abbreviation candidates are presented as the user specified in advance. To choose them, the score obtained in the above process, a score based on the co-occurrence probability with the original word, and the degree of match between the abbreviation and a character string, received from the user, expressing the user's thoughts about the original word may be used in an integrated manner.
It is also desirable to present to the user the correspondence between the characters of each abbreviation candidate and the characters of the original word. In FIG. 3, the relationship between the original word and the abbreviation as character strings is presented visually only for abbreviation candidate 1, which is the most suitable for the community. The display may instead visually associate the original word with whichever abbreviation candidate the user selects.
In addition, a free description field may be provided on the presentation screen, and the presentation order of the generated abbreviation candidates may be changed by adjusting the scores using the character string entered in that field. The field may, for example, accept “thoughts” and “priorities” separately, with different processing assigned to each. Alternatively, this input may be accepted together with the initial input of the original word.
The adjustment here consists, for example, of identifying the words (or similar words) in the entered character string, counting how many of them match the words used for each generated abbreviation, and adding points to the abbreviation candidates with a high count. In this way, “thoughts” and “priorities” can be reflected in the order of presentation.
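The score adjustment described above can be sketched as follows. This is a simplified illustration under stated assumptions: the function `adjust_scores`, the tuple layout of a candidate, the whitespace tokenization, and the fixed `bonus` per match are all hypothetical, and the handling of similar words mentioned in the text is omitted.

```python
def adjust_scores(candidates, free_text, bonus=0.5):
    """Re-rank abbreviation candidates using a free-description string.

    candidates: list of (abbreviation, score, source_words) tuples
    (hypothetical layout). Each source word that also appears in the
    free text adds `bonus` to the candidate's score.
    """
    described = set(free_text.lower().split())
    adjusted = []
    for abbrev, score, words in candidates:
        matches = sum(1 for w in words if w.lower() in described)
        adjusted.append((abbrev, score + bonus * matches, words))
    # Highest adjusted score first; ties keep their original order.
    return sorted(adjusted, key=lambda c: c[1], reverse=True)
```

With two candidates scored 1.0 (built from "disaster" and "science") and 1.2 (built from "research" and "science"), entering a description that mentions "disaster" raises the first candidate to 1.5 and moves it to the top of the presentation order.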
Further, the conversion rule may be selected based on the character string input in the free description field.
When reliability is assigned to each rule, abbreviation candidates generated by a combination of rules with low reliability may not be output.
For example, a method of taking the product of the reliability of the rule and not outputting if it is below a certain value using a threshold value can be considered.
Further, the generated abbreviations may be scored using the reliability of the rules and the importance of the morphemes, and abbreviation candidates may be output together with the scores.
Such rules and reliability values can be created manually, or values collected by existing techniques can be used.
Here, the importance rule for selecting the words to be prioritized will be described. The morpheme importance rule described below is an index, determined based on various types of information used in the community, indicating the probability that a specific morpheme should be given priority over another morpheme. In other words, the index indicates the relative probability, derived from information obtained in the community, that a morpheme remains in an abbreviation in preference to another morpheme.
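The reliability-based filtering described above (taking the product of the reliabilities of the applied rules and suppressing candidates below a threshold) can be sketched as follows. The function name, the candidate layout, and the example threshold are assumptions made for illustration.

```python
from math import prod


def filter_by_reliability(candidates, reliability, threshold=0.2):
    """Drop candidates whose combined rule reliability is below a threshold.

    candidates: list of (abbreviation, rule_names) pairs, where rule_names
    lists the conversion rules applied to produce the abbreviation.
    reliability: dict mapping each rule name to a value in [0, 1].
    Returns (abbreviation, combined_reliability) pairs, best first.
    """
    kept = []
    for abbrev, rule_names in candidates:
        combined = prod(reliability[name] for name in rule_names)
        if combined >= threshold:
            kept.append((abbrev, combined))
    return sorted(kept, key=lambda c: c[1], reverse=True)
```

The combined reliability returned here could also be multiplied by, or otherwise merged with, the morpheme importance to give the score that is output together with each candidate; how the two are combined is left open in the description.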
For example, the morpheme importance rule is determined as follows.
・ Disaster prevention > Science: 0.7 (that is, “disaster prevention” is adopted in preference to “science” with a probability of 70%)
・ Disaster prevention > Technology: 0.7
・ Disaster prevention > Research: 0.5
・ Disaster prevention > Place: 0.9
・ Science > Technology: 0.6
・ Science > Place: 0.9
・ Technology > Place: 0.9
・ Research > Science: 0.7
・ Research > Technology: 0.6
・ Research > Place: 0.9
The probability of a rule in the reverse direction may be obtained by subtracting the probability of the rule from 1. For example, the reverse of the first rule is “Science > Disaster prevention: 0.3 (= 1.0 − 0.7)”. If the order of appearance of the words is taken into account, the probability of the rule in the reverse direction may instead be stored as a separate index.
In this example, using this morpheme importance rule, the importance of a certain morpheme is calculated as the sum of the values obtained from the comparison results with other remaining morphemes.
For example, the importance of the morpheme “disaster prevention” is 2.8 as a result of the comparisons within the original word (Disaster Prevention Science and Technology Research Institute). This value is the sum of 0.7 [disaster prevention versus science], 0.7 [disaster prevention versus technology], 0.5 [disaster prevention versus research], and 0.9 [disaster prevention versus place].
As described above, the important morpheme selection unit 20 performs the same calculation on all morphemes included in the original word and calculates the importance of each morpheme. The values are as follows.
・ Disaster prevention: 2.8 (= 0.7 + 0.7 + 0.5 + 0.9)
・ Science: 2.1 (= 0.3 + 0.6 + 0.3 + 0.9)
・ Technology: 2.0 (= 0.3 + 0.4 + 0.4 + 0.9)
・ Research: 2.7 (= 0.5 + 0.7 + 0.6 + 0.9)
・ Place: 0.4 (= 0.1 + 0.1 + 0.1 + 0.1)
For example, if three words (morphemes) are to be retained, “disaster prevention”, “research”, and “science” are selected in descending order of importance; if two words are to be retained, “disaster prevention” and “research” are selected. The number of words to select is arbitrary, and the selection may instead be based on a threshold on the importance or on rank. Alternatively, all words may be kept as candidates and the adjustment made on the abbreviation candidate generation unit 21 side.
In this example, the score for using each morpheme in an abbreviation candidate is thus calculated from the probability that one morpheme of a pair remains in the abbreviation candidate in preference to the other. The semantic content of each morpheme and the relationship between morphemes are therefore taken into account through statistics obtained from abbreviation examples actually used in the community, which leads to good candidates.
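The importance calculation and selection described above can be sketched in Python as follows. This is a minimal illustration, not the implementation of the important morpheme selection unit 20: the names `PAIR_RULES`, `preference`, `importance`, and `select_top` are hypothetical, and reverse-direction probabilities are assumed to be 1 minus the stored value, as the text permits. All values are computed directly from the pairwise rules listed earlier.

```python
# Pairwise rules from the example: (a, b) -> probability that a is
# adopted in the abbreviation in preference to b.
PAIR_RULES = {
    ("disaster prevention", "science"): 0.7,
    ("disaster prevention", "technology"): 0.7,
    ("disaster prevention", "research"): 0.5,
    ("disaster prevention", "place"): 0.9,
    ("science", "technology"): 0.6,
    ("science", "place"): 0.9,
    ("technology", "place"): 0.9,
    ("research", "science"): 0.7,
    ("research", "technology"): 0.6,
    ("research", "place"): 0.9,
}


def preference(a, b, rules=PAIR_RULES):
    """Probability that a is preferred over b (reverse rule = 1 - p)."""
    if (a, b) in rules:
        return rules[(a, b)]
    return 1.0 - rules[(b, a)]


def importance(morphemes, rules=PAIR_RULES):
    """Importance of each morpheme: sum of its preferences over the others."""
    return {
        m: sum(preference(m, other, rules) for other in morphemes if other != m)
        for m in morphemes
    }


def select_top(morphemes, k, rules=PAIR_RULES):
    """Return the k morphemes with the highest importance, best first."""
    scores = importance(morphemes, rules)
    return sorted(morphemes, key=lambda m: scores[m], reverse=True)[:k]
```

With the five morphemes of the example, `select_top(morphemes, 3)` returns "disaster prevention", "research", and "science", and `select_top(morphemes, 2)` returns "disaster prevention" and "research". A threshold on the importance could be used in place of `k` with a one-line change.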
At this time, it is desirable to convert the character string of each selected morpheme according to the conversion rule based on the abbreviation examples used in the community collected in advance. As a result, abbreviations that are more likely to be generated in the community can be automatically generated with high accuracy.
In this example, calculation of importance by comparison between morphemes is shown, but the present invention is not limited to this. For example, the importance of one morpheme may be used, or a comparison of three or more morphemes may be used.
For the importance of a single morpheme, any measure that quantifies the importance of a word, such as TF-IDF, can be used, as can a value indicating the degree to which each word is adopted in abbreviations.
In the case of three or more morphemes, comparisons among multiple morphemes can be handled in the same way as comparisons between two morphemes, for example “Research > Technology, Place: 0.8” or “Technology > Research, Place: 0.5”. An index indicating how readily a word or combination of words is used in generating an abbreviation, relative to another word or combination of words, may also be used, for example “Technology, Research > Place: 0.9” or “Disaster prevention, Technology > Research, Place: 0.4”.
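As one illustration of a single-morpheme importance measure, the sketch below computes a common variant of TF-IDF (relative term frequency multiplied by the logarithm of the inverse document frequency). The function name `tf_idf` and the choice of this particular variant are assumptions; the description only names TF-IDF as one possible measure.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Compute TF-IDF for every word of every document.

    documents: list of non-empty word lists (for example, segmented names
    collected from a community corpus). Returns one {word: score} dict per
    document, using tf = count / len(doc) and idf = log(N / doc_freq).
    """
    n_docs = len(documents)
    doc_freq = Counter(word for doc in documents for word in set(doc))
    results = []
    for doc in documents:
        counts = Counter(doc)
        results.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return results
```

A word such as "place" that appears in every collected name receives a score of 0, which matches the intuition in the example above that it carries little weight when choosing what to keep in an abbreviation.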
Each part of the abbreviation generation system may be realized using a combination of hardware and software. In such a form, an abbreviation generation program is loaded into RAM, and each unit is realized as the corresponding means by operating hardware such as a control unit (CPU) based on the program. The program may be recorded non-transitorily on a storage medium and distributed. The program recorded on the recording medium is read into memory over a wired or wireless connection or from the recording medium itself, and operates the control unit and the like. Examples of the recording medium include an optical disk, a magnetic disk, a semiconductor memory device, and a hard disk.
In other words, an information processing system that operates as the abbreviation generation system can be realized by causing the control unit to operate, based on the abbreviation generation program loaded into RAM, as the important word selection means, the abbreviation candidate generation means, the importance rule storage means, and the conversion rule storage means.
The abbreviation generation system may be constructed as a single computer as illustrated in FIG. 4 or may be constructed as a server-client system.
Although the embodiments and processing examples have been illustrated and described above, changes such as separation / merging of block configurations and replacement of procedures are free as long as the gist of the present invention and the functions described are satisfied. The description is not intended to limit the invention.
For example, an abbreviation generation system can be constructed on the Internet using a server.
As described above, according to the present invention, it is possible to provide an abbreviation generation system that accurately generates abbreviations that are likely to be generated in a community from original words.
That is, it becomes possible to automatically generate abbreviations that are highly likely to be generated in the community with high accuracy.
In addition, the present invention can be used for name identification, information retrieval, information extraction, etc. in a computer device, the Internet system, etc. by collecting the generated abbreviations.
This application claims priority based on Japanese Patent Application No. 2012-031826, filed on February 16, 2012, the entire disclosure of which is incorporated herein.

1 Input device
2 Data processing device
3 Storage device
4 Output device
20 Important morpheme selection unit (important word selection unit, important word selection means)
21 Abbreviation candidate generation unit (abbreviation candidate generation means)
30 Morpheme importance rule storage unit (importance rule storage unit, importance rule storage means)
31 Morpheme conversion rule storage unit (conversion rule storage unit, conversion rule storage means)

Claims

1. An abbreviation generation system for generating an abbreviation from a character string of an original word, comprising:
an importance rule storage unit that stores, in association with each other, a predetermined word and an index indicating the degree to which the word is used in generating abbreviations, the index being defined so as to resemble the information group used in a community;
an important word selection unit that, for a received original word composed of a plurality of words, compares and orders the indices of the plurality of words and thereby selects, in order of priority, the words to be used in generating an abbreviation from the plurality of words; and
an abbreviation candidate generation unit that generates abbreviation candidates using the selected words and outputs the abbreviation candidates.
2. The abbreviation generation system according to claim 1, wherein
the importance rule storage unit stores, as the index and in association with each word for each combination of words, an index indicating which of the words treated as a combination is relatively more readily used in generating an abbreviation,
the important word selection unit, for each combination of words in the original word, compares and orders the indices indicating how readily each word is used in generating an abbreviation, and thereby selects, in order of priority, the words to be used in generating an abbreviation from the plurality of words, and
the abbreviation candidate generation unit generates and outputs one or more abbreviations using the selected words.
3. The abbreviation generation system according to claim 1, wherein
the importance rule storage unit stores, as the index and in association with each word, an index indicating, for a combination of a plurality of words, which word or combination of words is relatively more readily used in generating an abbreviation than another word or combination of words,
the important word selection unit selects the words to be used in generating an abbreviation from the plurality of words by extracting, based on the index, the words that are readily used in generating an abbreviation from among the words or combinations of words in the original word, and
the abbreviation candidate generation unit generates and outputs one or more abbreviations using the selected words.
4. The abbreviation generation system according to claim 1, wherein
the importance rule storage unit stores, as the index and in association with each word, a value indicating the degree to which the word is adopted in abbreviations,
the important word selection unit compares the values of the index among the plurality of words and selects, in order of priority, the words with high values as the words to be used in the abbreviation, and
the abbreviation candidate generation unit generates and outputs one or more abbreviations by combining the selected words.
5. The abbreviation generation system according to claim 1, further comprising a conversion rule storage unit that stores, for each selected word, a conversion rule relating to character string conversion for abbreviation generation, the conversion rule being determined based on the information group used in the community,
wherein the abbreviation candidate generation unit generates and outputs one or more abbreviations from the selected words in accordance with the conversion rule.
6. The abbreviation generation system according to any one of claims 1 to 5, wherein the important word selection unit extracts the plurality of words constituting the original word by performing morphological analysis on the original word, and selects the words to be used in generating an abbreviation from the plurality of words.
7. The abbreviation generation system according to any one of claims 1 to 5, wherein the important word selection unit receives, from a user, the original word divided into its constituent words, and selects the words to be used in generating an abbreviation from the plurality of words thus received.
8. The abbreviation generation system according to any one of the above claims, wherein the abbreviation candidate generation unit, when presenting a generated abbreviation as an abbreviation candidate, presents the relationship between the original word and the abbreviation as character strings by visually associating them.
9. The abbreviation generation system according to any one of claims 1 to 8, wherein
the index is constructed by accepting sentences used in the community and collecting and analyzing the abbreviations used in the target community and the original words of those abbreviations, and
an abbreviation is generated using the index constructed from the abbreviations used in the community.
10. The abbreviation generation system according to claim 9, wherein
the conversion rule is constructed by accepting sentences used in the community and collecting and analyzing the abbreviations used in the target community and the original words of those abbreviations, and
an abbreviation is generated using the conversion rule constructed from the abbreviations used in the community.
11. An abbreviation generation method for generating, by information processing, an abbreviation from a character string of an original word, comprising:
storing in advance, in association with each other, a predetermined word and an index indicating the degree to which the word is used in generating abbreviations, the index being defined so as to resemble the information group used in a community; and
when generating an abbreviation,
accepting an original word composed of a plurality of words,
comparing and ordering the indices of the plurality of words and thereby selecting, in order of priority, the words to be used in generating an abbreviation from the plurality of words, and
generating abbreviation candidates using the selected words and outputting the abbreviation candidates.
12. The abbreviation generation method according to claim 11, wherein
the index stored in advance is stored in association with each word and indicates, for each combination of words, which of the words treated as a combination is relatively more readily used in generating an abbreviation,
in the important word selection process, for each combination of words in the original word, the indices indicating how readily each word is used in generating an abbreviation are compared and ordered, and the words to be used in generating an abbreviation are thereby selected from the plurality of words in order of priority, and
in the abbreviation generation process, one or more abbreviations are generated using the selected words.
13. The abbreviation generation method according to claim 11, wherein
the index stored in advance is stored in association with each word and indicates, for a combination of words, which word or combination of words is relatively more readily used in generating an abbreviation than another word or combination of words,
in the important word selection process, the words to be used in generating an abbreviation are selected from the plurality of words by extracting, based on the index, the words that are readily used in generating an abbreviation from among the words or combinations of words in the original word, and
in the abbreviation generation process, one or more abbreviations are generated using the selected words.
14. The abbreviation generation method according to claim 11, wherein
the index stored in advance is a value, stored in association with each word, indicating the degree to which the word is adopted in abbreviations,
in the important word selection process, the values of the index are compared among the plurality of words, and the words with high values are selected in order of priority as the words to be used in generating an abbreviation, and
in the abbreviation generation process, one or more abbreviations are generated by combining the selected words.
15. A recording medium on which is recorded an abbreviation generation program used for generating an abbreviation from a character string of an original word, the program causing an information processing system to operate as:
an importance rule storage unit that stores, in association with each other, a predetermined word and an index indicating the degree to which the word is used in generating abbreviations, the index being defined so as to resemble the information group used in a community;
an important word selection unit that, for a received original word composed of a plurality of words, compares and orders the indices of the plurality of words and thereby selects, in order of priority, the words to be used in generating an abbreviation from the plurality of words; and
an abbreviation candidate generation unit that generates abbreviation candidates using the selected words and outputs the abbreviation candidates.
16. The recording medium on which the abbreviation generation program is recorded according to claim 15, wherein the program causes
the importance rule storage unit to store, as the index and in association with each word for each combination of words, an index indicating which of the words treated as a combination is relatively more readily used in generating an abbreviation,
the important word selection unit, for each combination of words in the original word, to compare and order the indices indicating how readily each word is used in generating an abbreviation, and thereby to select, in order of priority, the words to be used in generating an abbreviation from the plurality of words, and
the abbreviation candidate generation unit to generate and output one or more abbreviations using the selected words.
17. The recording medium on which the abbreviation generation program is recorded according to claim 15, wherein the program causes
the importance rule storage unit to store, as the index and in association with each word, an index indicating, for a combination of a plurality of words, which word or combination of words is relatively more readily used in generating an abbreviation than another word or combination of words,
the important word selection unit to select the words to be used in generating an abbreviation from the plurality of words by extracting, based on the index, the words that are readily used in generating an abbreviation from among the words or combinations of words in the original word, and
the abbreviation candidate generation unit to generate and output one or more abbreviations using the selected words.
18. The recording medium on which the abbreviation generation program is recorded according to claim 15, wherein the program causes
the importance rule storage unit to store, as the index and in association with each word, a value indicating the degree to which the word is adopted in abbreviations,
the important word selection unit to compare the values of the index among the plurality of words and to select, in order of priority, the words with high values as the words to be used in the abbreviation, and
the abbreviation candidate generation unit to generate and output one or more abbreviations by combining the selected words.