WO2013146736A1 - Synonym relation determination device, synonym relation determination method, and program thereof - Google Patents
- Publication number
- WO2013146736A1 (PCT/JP2013/058696)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- synonym
- candidate
- expression
- time interval
- source
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Definitions
- The present invention relates to a synonym relation period determination device, a synonym relation period determination method, and a synonym relation period determination program, and in particular to a device, method, and program for determining the period in which a synonym relation holds.
- Non-Patent Document 1 proposes a method for automatically acquiring, from among synonym expression candidates, expressions that appear in similar contexts. That is, based on the commonality of the words appearing in a sentence, two expressions are automatically judged to be synonymous when the words used together with them are common.
- Patent Document 1 describes a method for defining the degree of association between words based on the time-series correlation coefficient of the usage frequency of each search word, with the aim of automatically creating a synonym dictionary that follows changes in synonym relations over time.
- To extract the synonymous relationship between slang words such as “ ⁇ Ryo Denki” and “ ⁇ A Agency” and original expressions such as “Mitsubishi Electric” and “Defense Agency”, a method is disclosed that generates a collation index from a list of slang expressions such as “ ⁇ ” and collates it with the original expressions.
- Patent Document 3 discloses a technique for determining synonyms that, for the purpose of extracting synonymous relationships such as those between program names and their abbreviations and nicknames, uses information on broadcasting stations and broadcast times to exclude series names and titles from the synonym candidates.
- For example, a single abbreviation can be read as short for “Tokyo Electric Power”, but it can equally be short for “Tohoku Electric Power”. What the abbreviation denotes is therefore ambiguous between the two.
- The referent of the abbreviation may also change between “Tokyo Electric Power” and “Tohoku Electric Power” depending on the time.
- For instance, it may denote “Tokyo Electric Power” at time A and time C and “Tohoku Electric Power” at time B, so that the synonymous relationship changes with time.
- Non-Patent Document 1 determines synonymity from context but does not use time information, so it cannot capture synonymity that changes with time. With a method such as that of Patent Document 1, if one synonym candidate is synonymous with different synonym sources at different times, the calculated time-series correlation is not high, and as a result the synonymous relationship cannot be extracted.
- The slang-expression collation method can generate synonym candidates for a synonym source using forms found in code words and masked characters (a combination of “no” and “le” “no” for “le”), but it cannot capture changes in the meaning of a synonym candidate over time.
- Patent Document 3 uses time information to determine synonyms, but it targets information from the same kind of information source (broadcast stations) and is not applicable to a text set collected from an unspecified large number of sources.
- Thus, with Patent Documents 1 to 3, when the meaning of a synonym candidate changes with time, the synonymity between the synonym candidate and the synonym source cannot be determined accurately.
- An object of the present invention is to provide a synonym relation determination device, a synonym relation determination method, and a program thereof that can effectively extract and specify the synonym relations of synonym candidates whose meaning changes with time, from natural-language text issued by an unspecified large number of people.
- A synonym relation determination device according to the present invention includes a synonym expression candidate recording unit, in which one predetermined synonym source expression and a plurality of synonym expression candidates that are targets of the synonym relation are recorded in association with each other,
- and synonym relation determination specifying means for determining and specifying, based on a certain standard, the synonym relation between the synonym expression candidate and the synonym source expression in externally input text.
- The synonym relation determination specifying means includes a text collection unit that collects the externally input text and generates a text set whose issue times can be identified;
- synonym candidate detecting means that, from the text set collected by the text collection unit, identifies and outputs a time interval in which many synonym expression candidates are detected and a time interval in which many synonym source expressions are detected;
- and synonym period specifying means that specifies the time interval in which the synonym expression candidate and the synonym source expression are synonymous, based on the detection frequency and on the positional relationship between the time interval in which the synonym expression candidate is detected in the text set and the time interval in which the synonym source expression is detected in the text set.
- A synonym relation determination method according to the present invention is used in a synonym relation determination device that includes a synonym expression candidate recording unit, in which one predetermined synonym source expression and a plurality of synonym expression candidates that are targets of the synonym relation are recorded in association with each other,
- and a synonym relation determination specifying unit that determines and specifies the synonym relation between the synonym expression candidate and the synonym source expression in externally input text. In this method, the text collection unit of the synonym relation determination specifying unit collects the externally input text and generates a text set whose issue times can be specified (text set generation step).
- The synonym relation determination specifying unit then determines and specifies, based on a certain standard, the synonym relation between the synonym expression candidate and the synonym source expression included in the generated text set (synonym relation specifying step). In the synonym relation specifying step, the synonym candidate detection unit of the synonym relation determination specifying unit searches for and specifies, from the text set, a time interval in which the synonym expression candidate is frequently detected and a time interval in which the synonym source expression is frequently detected,
- and the synonym period specifying means determines and specifies, as the synonym period, the time interval in which the synonym expression candidate and the synonym source expression are in a synonymous relationship (synonym period specifying step).
- A synonym relation determination program according to the present invention is used in a synonym relation determination device comprising a synonym expression candidate recording unit, in which one predetermined synonym source expression and a plurality of synonym expression candidates that are targets of the synonym relation are recorded in association with each other,
- and synonym relation determination specifying means for determining and specifying the synonym relation between the synonym expression candidate and the synonym source expression in externally input text.
- The program provides a text set generation processing function for collecting externally input text and generating a text set whose issue times can be identified, and a synonym relation specifying processing function for determining and specifying, based on a certain standard, the synonym relation between the synonym expression candidate and the synonym source expression included in the generated text set.
- The synonym relation specifying processing function includes a function that searches the text set collected by the text collection unit for a time interval in which many synonym expression candidates are detected and a time interval in which many synonym source expressions are detected,
- and a synonym period specifying processing function that determines and identifies, as the synonym period, the time period in which the synonym expression candidate and the synonym source expression are synonymous.
- Because the present invention determines synonymity by capturing the time at which many synonym expression candidates appear, it can output the start time at which a synonym relation is established.
- It can thus provide a synonym relation determination device, a synonym relation determination method, and a program thereof, not found in the related art described above, that can determine the time interval in which a synonym relation holds even when the synonym relation changes with time.
- FIG. 1 is a block diagram showing a first embodiment of the synonym relation determination device according to the present invention. FIG. 2 is an explanatory diagram showing an example of the synonym candidate list.
- FIG. 7 is a flowchart illustrating an example of processing for calculating the number of appearances of a synonym source in another configuration example of the synonym period start determination unit illustrated in FIG. 6. The remaining figures are a flowchart showing an example of processing that determines the start of a synonym period with a synonym candidate using the number of appearances of the synonym source, a block diagram showing a second embodiment of the synonym relation determination device according to the present invention, and a flowchart showing operation.
- A first embodiment of the present invention is described below with reference to the drawings. First the concept of the synonymous relationship is clarified, then the basic configuration of the first embodiment is described, and then the first embodiment is described in further detail.
- In this embodiment, the synonymous relationship between two words is determined together with the period in which it holds.
- The synonymous relationship is a relationship between a synonym source, which is a seed expression, and a synonym candidate, which is an expression that may be synonymous with the synonym source.
- For example, natural-language phrases such as “NEC”, “NEC”, and “Nichiden” are synonymous as words. If “NEC” is the synonym source serving as the seed expression, then “NEC” and “Nichiden” are synonym candidates.
- In this embodiment, synonym relations that change with time are extracted.
- A synonym relation changes over time either when a synonym candidate is ambiguous and is a candidate for multiple synonym sources, or when the interest in and relationships around a synonym source change over time.
- For example, in texts (sentences) sent mainly by an unspecified large number of people via the Internet, “Asahi” tends to mean a newspaper in the morning and may often mean a drink at night.
- As an example of a change in interest and relationships over time, an actor whose character grows up within a program may become synonymous in turn with words such as “boy”, “young man”, “lover”, “husband”, “father”, and “grandfather”, each synonymous relationship being established and then lost as time passes. In this example, for a program that runs continuously for one year, the synonymous relationship changes every several months.
- The synonym relation determination apparatus 101 includes a synonym expression candidate recording unit 10, which records one predetermined synonym source expression and a plurality of synonym expression candidates that are targets of the synonym relation in association with each other,
- and synonym relation determination specifying means 12, which determines and specifies, based on a certain standard, the synonym relation between the synonym expression candidate and the synonym source expression in externally input text.
- The synonym expression candidate recording unit 10 is also provided with a synonym candidate generation unit 10A, which receives an expression serving as the seed for generating synonym expression candidates and generates synonym candidates from that seed expression.
- The synonym relation determination specifying means 12 includes a text collection unit 14 that collects the externally input text and generates a text set whose issue times can be specified;
- a synonym candidate detection unit 12A that, from the text set collected by the text collection unit 14, identifies and outputs a time interval in which many synonym expression candidates are detected and a time interval in which many synonym source expressions are detected;
- and a synonym period specifying unit 12B that determines and specifies the period in which the two are synonymous.
- The synonym candidate detection unit 12A includes a synonym candidate search unit 16 that detects and counts the synonym expression candidates in the text set collected by the text collection unit, whose issue times can be identified, and identifies a time interval in which the number of appearances per unit time is large as a time interval in which the synonym expression candidates exist,
- and a synonym source search unit 18 that likewise detects and counts the synonym source expressions in that text set
- and identifies a time interval in which the number of appearances per unit time is large as a time interval in which the synonym source expression exists.
- The synonym period specifying unit 12B includes a synonym relation extraction unit 20 that, in the text set of the time interval in which the synonym candidate detection unit detected the synonym candidate, extracts the synonym source expression with the highest number of appearances as the synonym source in the synonym relation,
- and a synonym period start determination unit 22 that determines that the extracted synonym source expression is synonymous with the synonym expression candidate, takes this as the start point of the synonym period,
- and registers it, together with the period, in a synonym dictionary 32 provided in advance.
- The synonym period start determination unit 22 also has a function of determining, for the time interval in which the synonym expression candidates are detected in the text set, that the synonym source expression with the largest ratio of the number of appearances per unit time in the time interval to the number of appearances per unit time before the time interval
- is synonymous with the synonym expression candidate.
- The synonym period specifying unit 12B is further provided with a synonym period end determination unit 24. For a synonym expression candidate that the synonym period start determination unit 22 has determined to be synonymous,
- the synonym period end determination unit 24 determines that the synonym relation has ended when the number of appearances per unit period falls to or below a preset threshold value.
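The end determination described above can be sketched as a simple threshold check. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `find_period_end`, the list-of-counts input, and the choice to start scanning from the unit period after the start are all hypothetical.

```python
from typing import Optional, Sequence


def find_period_end(counts_per_unit: Sequence[int], start_index: int,
                    threshold: int) -> Optional[int]:
    """Return the index of the first unit period after start_index whose
    appearance count is at or below threshold, or None if the synonym
    period has not ended within the data (assumed convention)."""
    for i in range(start_index + 1, len(counts_per_unit)):
        if counts_per_unit[i] <= threshold:
            return i
    return None


# Appearance counts of one synonym candidate per unit period (made-up data).
counts = [0, 1, 9, 12, 7, 2, 0]
end = find_period_end(counts, start_index=2, threshold=2)
```

Returning `None` for a period that is still open matches the later remark that a synonym relation may not have ended by the time the text is collected.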
- As described above, the synonym candidate generation unit 10A receives an expression that serves as the seed for generating synonym expression candidates (hereinafter referred to as the “seed expression”) and generates synonym expression candidates from it.
- Synonym candidates are generated by extracting a partial character string from the seed expression to produce an abbreviation, by replacing a part of the seed expression with a specific character to produce a masked form,
- or by translating the seed expression to produce a translated expression; a plurality of candidates are generated by applying such operations a plurality of times.
- The synonym expression candidate recording unit 10 records the synonym expression candidates generated by the synonym candidate generation unit 10A.
- With each synonym expression candidate as a heading, one or more corresponding seed expressions (hereinafter referred to as “synonym source expressions”) are recorded.
- A synonym expression candidate for which a plurality of synonym source expressions are recorded is an ambiguous synonym expression candidate.
- The synonym candidate detection unit 12A has a function of reading the text set whose issue times can be specified and counting how many times the synonym candidates and synonym source expressions recorded in the synonym candidate recording unit 10 appear at each time point. It then detects a time interval in which the number of appearances of a synonym expression candidate per unit period increases greatly.
- The synonym period determination unit 12B identifies the synonym source expression that is synonymous with the synonym expression candidate, using the text set in the time interval detected by the synonym candidate detection unit 12A, and registers the period in which they are synonymous in the synonym dictionary 32.
- The synonym period start determination unit 22, which forms part of the synonym period determination unit 12B, determines which of the corresponding synonym source expressions the synonym expression candidate detected by the synonym candidate detection unit 12A is synonymous with,
- and registers the start point of the time interval detected by the synonym candidate detection unit 12A in the synonym dictionary 32 as the start point of the synonym relation.
- This determination is made by judging as synonymous with the synonym expression candidate either the synonym source expression with the highest number of appearances in the text set of the time interval in which the synonym expression candidate is detected, or the synonym source expression with the largest ratio of the number of appearances per unit time in the time interval to the number of appearances per unit time before the time interval.
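The two criteria just described (highest count in the interval, or largest ratio of the in-interval rate to the prior rate) can give different answers, as the following sketch shows. The function names, the dictionary inputs, and the `floor` parameter that guards against division by zero are assumptions for illustration; the source does not say how a prior rate of zero is handled.

```python
from typing import Dict


def pick_source_by_count(rate_in: Dict[str, float]) -> str:
    """Criterion 1: the synonym source with the most appearances
    per unit time inside the detected time interval."""
    return max(rate_in, key=lambda s: rate_in[s])


def pick_source_by_ratio(rate_in: Dict[str, float],
                         rate_before: Dict[str, float],
                         floor: float = 1.0) -> str:
    """Criterion 2: the synonym source whose per-unit-time rate grew the
    most relative to the period before the interval. `floor` is an
    assumed lower bound on the denominator to avoid dividing by zero."""
    def ratio(source: str) -> float:
        return rate_in[source] / max(rate_before.get(source, 0.0), floor)
    return max(rate_in, key=ratio)


# Made-up appearance rates for two synonym sources of one ambiguous candidate.
rate_in = {"Tokyo Electric Power": 120.0, "Tohoku Electric Power": 40.0}
rate_before = {"Tokyo Electric Power": 100.0, "Tohoku Electric Power": 4.0}

by_count = pick_source_by_count(rate_in)
by_ratio = pick_source_by_ratio(rate_in, rate_before)
```

With these numbers the count criterion favors the source that is always frequent, while the ratio criterion favors the source whose appearances jumped tenfold, which is why the text offers the ratio as a second option.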
- The synonym dictionary 32 is a dictionary that records expressions in synonym relations, and can also register the start and end times of each synonym relation.
- The synonym relation determination device 101 includes the synonym expression candidate recording unit 10 and the synonym relation determination specifying unit 12.
- The synonym relation determination specifying unit 12 includes a text collection unit 14, a synonym candidate search unit 16, a synonym source search unit 18, a synonym relation extraction unit 20, and a synonym period start determination unit 22.
- The synonym relation determination device 101 also includes a synonym candidate generation unit 30 and a synonym dictionary 32. With this configuration, it determines the time interval PD in which a synonym relation holds.
- The synonym candidate search unit 16 and the synonym source search unit 18 constitute the synonym candidate detection unit 12A,
- and the synonym relation extraction unit 20 and the synonym period start determination unit 22 constitute the synonym period determination unit 12B.
- The synonym expression candidate recording unit 10 stores in advance synonym candidates EW, which are candidate synonyms for a word serving as a synonym source OW, in association with that synonym source OW.
- The synonym expression candidate recording unit 10 uses the synonym candidate EW as a heading and stores one or more corresponding seed expressions (synonym sources OW) in association with it.
- The synonym candidate list 10A shown in FIG. 2 is data that associates the synonym candidates EW with the synonym sources OW in this way.
- The synonym candidate list 10A need only be created by immediately before data collection.
- The synonym candidates EW may be generated automatically using text collected in the past, or may be entered as input.
- The synonym expression candidate recording unit 10 stores the synonym candidates EW automatically generated by the synonym candidate generation unit 30 in the synonym candidate list 10A.
- A synonym candidate EW for which a plurality of synonym sources OW are stored is an ambiguous synonym candidate EW.
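One way to picture the synonym candidate list is as a mapping from each synonym candidate EW (the heading) to its synonym sources OW. The sketch below is only an assumed representation; the function names and the placeholder strings `"EW1"`, `"OW1"`, and so on are hypothetical and not taken from the source.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_candidate_list(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (synonym candidate, synonym source) pairs under the candidate
    as heading, keeping each source once in first-seen order."""
    table: Dict[str, List[str]] = defaultdict(list)
    for candidate, source in pairs:
        if source not in table[candidate]:
            table[candidate].append(source)
    return dict(table)


def ambiguous_candidates(table: Dict[str, List[str]]) -> List[str]:
    """A candidate stored with more than one source is ambiguous."""
    return sorted(c for c, sources in table.items() if len(sources) > 1)


candidate_list = build_candidate_list([
    ("EW1", "OW1"), ("EW1", "OW5"),  # EW1 may stand for two different sources
    ("EW2", "OW1"),                  # one source may have several candidates
])
```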
- The synonym relation determination specifying unit 12 collects text containing machine-processable natural language data, for example via the network 96, and performs data processing on the resulting text set (see FIG. 1).
- The network 96 is, for example, the Internet, and may be a local network connected to the Internet.
- Natural language data is writing in a language such as Japanese or English, made up of words, sentences, paragraphs, and so on, and is information that humans can read.
- The text is data containing this natural language data, and any file format may be used as long as natural language is expressed. Its length and degree of editing may be anything, from a single-line comment to a document, paper, or book.
- The text preferably has, as attribute information, a logical location related to the author of the text and an issue time.
- Examples of the logical location include an IP address, a file location within a server group (Web site) that can be specified by an IP address, and a URL, including one that indicates a database search result.
- The text collection unit 14 generates a text set TX by collecting texts in association with their issue times.
- The text collection unit 14 treats a text that has an issue time (for example, a writing time) as attribute information as having been issued at that time. When a newly collected text has no known issue time, the collection time (for example, the crawl time) can be used as its issue time.
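The issue-time rule for collected text (use the writing time when it is known, otherwise fall back to the collection time) can be sketched as follows. The `CollectedText` record, its field names, and the `collect` helper are assumptions introduced for this example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class CollectedText:
    body: str
    location: str         # logical location, e.g. a URL (assumed field)
    issue_time: datetime   # time the text is treated as having been issued


def collect(body: str, location: str, written_at: Optional[datetime],
            crawled_at: datetime) -> CollectedText:
    """Use the writing time when the text carries one; otherwise treat the
    crawl time as the issue time."""
    issue_time = written_at if written_at is not None else crawled_at
    return CollectedText(body, location, issue_time)


crawl = datetime(2013, 3, 26, 12, 0)
dated = collect("text with a timestamp", "http://example.com/a",
                datetime(2013, 3, 25, 9, 30), crawl)
undated = collect("text without a timestamp", "http://example.com/b", None, crawl)
```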
- The text may be collected by a robot (crawler) search of an unspecified number of server devices 70 connected to the Internet, or by accessing addresses whose locations the user has designated in advance.
- Instead of collecting only character data, the entire file including images and links may be received, or only the difference from already collected data may be received.
- The text set TX is a set of text data containing a large amount of writing, and is preferably stored in a storage medium such as the synonym expression candidate recording unit 10, keyed by the issue time or by the period to which the issue time belongs.
- An index to the synonym sources OW and synonym candidates EW in the synonym candidate list 10A may also be generated and stored together with it.
- The synonym candidate search unit 16 calculates, from the issue times of the texts, the time interval PD in which the synonym candidate EW is found in the text set TX.
- The time interval PD is an interval delimited by its start time; in the first embodiment it is the period during which the synonym candidate EW is found.
- The text set TX within the time interval PD contains the synonym candidate EW more than a certain number of times, whereas the text set TX before the time interval PD does not.
- This certain number may be 0, or may be the number found at ordinary times.
- The synonym candidate search unit 16 reads the text set TX whose issue times can be specified, and counts how many times the synonym candidate EW stored in the synonym expression candidate recording unit 10 appears at each time point.
- A period during which the synonym candidate EW is found in the text set TX is defined as a time interval PD.
- A period in which the number of appearances of the synonym candidate EW per unit period increases greatly is set as the time interval PD of that synonym candidate EW.
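A minimal sketch of how time intervals PD might be derived from per-unit-period counts is given below. It assumes that an interval is a run of unit periods whose count exceeds the constant mentioned above; the function name, the strict comparison, and the half-open `(start, end)` convention with `None` for an interval that is still open are all assumptions, not the source's definition.

```python
from typing import List, Optional, Sequence, Tuple


def detect_intervals(counts: Sequence[int],
                     constant: int = 0) -> List[Tuple[int, Optional[int]]]:
    """Return (start, end) index pairs for runs where counts[i] > constant.
    `end` is exclusive; it is None when the run reaches the end of the data,
    i.e. the interval has not ended yet."""
    intervals: List[Tuple[int, Optional[int]]] = []
    start: Optional[int] = None
    for i, count in enumerate(counts):
        if count > constant and start is None:
            start = i                       # interval begins
        elif count <= constant and start is not None:
            intervals.append((start, i))    # interval ends
            start = None
    if start is not None:
        intervals.append((start, None))     # still ongoing
    return intervals


# Appearances of one synonym candidate EW per unit period (made-up data).
ew_counts = [0, 0, 5, 7, 0, 0, 3, 4]
found = detect_intervals(ew_counts)
bursts = detect_intervals(ew_counts, constant=4)
```

Raising `constant` from 0 to an ordinary-time level keeps only the periods in which the candidate appears unusually often.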
- The synonym source search unit 18 searches the text set TX, in a period overlapping the time interval PD in which the synonym candidate EW is found, for the synonym sources OW stored in the synonym expression candidate recording unit 10, and thereby calculates the appearance of each synonym source OW. “Appearance” is a data item obtained as a result of the search, for example the number of appearances or the appearance ratio.
- The period overlapping the time interval PD may be the same period as the time interval PD, or may begin a certain time before it. It may also, literally, only partially overlap the time interval PD.
- The synonym source search unit 18 searches the text set TX in the period overlapping the time interval PD for synonym sources OW that may be synonymous with the synonym candidate EW having that time interval PD. This yields data on how the synonym source OW appeared in a period overlapping (or identical to) the time interval PD in which the synonym candidate EW appeared.
- A time interval PD may also be calculated for the synonym source OW.
- The synonym source search unit 18 may search for the synonym source OW at each predetermined search time or unit time, independently of the time interval PD, and then determine the time interval PD from the search result,
- or it may calculate the number of appearances in the text set TX in the overlapping period. In either case, the synonym source search unit 18 counts how many times the synonym source OW appears at each time point.
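Counting how many times an expression appears at each time point could look like the sketch below. The one-hour unit period, the `(issue_time, body)` tuple layout, and plain substring matching are assumptions chosen to keep the example short; the source does not fix a unit period or a matching method.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable, Tuple


def count_per_hour(texts: Iterable[Tuple[datetime, str]], expression: str) -> Counter:
    """Count appearances of `expression` per hour of issue time.
    Uses non-overlapping substring matching (an assumed simplification)."""
    counts: Counter = Counter()
    for issued_at, body in texts:
        n = body.count(expression)
        if n:
            hour = issued_at.replace(minute=0, second=0, microsecond=0)
            counts[hour] += n
    return counts


texts = [
    (datetime(2013, 3, 26, 9, 5), "OW is here, OW again"),
    (datetime(2013, 3, 26, 9, 50), "OW"),
    (datetime(2013, 3, 26, 21, 0), "nothing relevant"),
]
ow_counts = count_per_hour(texts, "OW")
```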
- The synonym relation extraction unit 20 extracts a synonym relation between the synonym candidate EW and the synonym source OW when the synonym source OW appears in the time interval PD in which the synonym candidate EW is found. For example, it extracts the synonym relation between the synonym candidate EW and a synonym source OW that appears in the same period as the time interval PD.
- When the time interval PD of the synonym source OW overlaps the time interval PD of the synonym candidate EW, the synonym relation extraction unit 20 can determine that the synonym source OW has appeared in the time interval PD of the synonym candidate EW.
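The overlap test between the time interval of a synonym candidate and that of a synonym source can be sketched as below. Intervals are assumed to be half-open `[start, end)` with `end=None` meaning the interval has not ended; this representation and the function name `overlaps` are illustrative assumptions.

```python
from typing import Optional


def overlaps(a_start: float, a_end: Optional[float],
             b_start: float, b_end: Optional[float]) -> bool:
    """True when half-open intervals [a_start, a_end) and [b_start, b_end)
    share at least one point in time. An end of None means 'still ongoing'."""
    a_starts_before_b_ends = b_end is None or a_start < b_end
    b_starts_before_a_ends = a_end is None or b_start < a_end
    return a_starts_before_b_ends and b_starts_before_a_ends


# Candidate interval [2, 4) against several source intervals (made-up values).
same_time = overlaps(2, 4, 3, 6)         # partial overlap
back_to_back = overlaps(2, 4, 4, 6)      # source starts exactly as candidate ends
open_ended = overlaps(2, None, 10, 12)   # candidate interval still ongoing
```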
- The synonym relation extraction unit 20 specifies the synonym source OW that is in a synonym relation with the synonym candidate EW, using the text set TX in the time interval PD detected by the synonym candidate search unit 16 and the synonym source search unit 18.
- When the synonym dictionary 32 shown in FIG. 1 is connected, the synonym relation, delimited by this time interval PD, is registered in the synonym dictionary 32.
- In this way, by searching for the synonym sources that appear in the text set TX in a period overlapping the time interval PD in which the synonym candidate EW is found, the synonym relation between the synonym candidate EW and the synonym source OW can be extracted automatically by information processing, delimited by the time interval PD.
- A synonym relation that holds in the time interval PD in which the synonym candidate EW is found is referred to as a “time interval synonym”.
- The period in which a time interval synonym holds varies: it may end within several hours when it is triggered by news or the like, while for a buzzword or the emergence of a new concept it may last for several decades. Depending on the synonym relation, a time interval synonym that has started may not yet have ended at the time the text set TX is collected.
- The synonym candidate generation unit 30 receives a seed expression, which serves as the seed for generating synonym candidates EW, and automatically generates synonym candidates EW from it.
- The synonym candidate generation unit 30 automatically generates a plurality of synonym candidates EW by applying the following operations to the expression of the synonym source OW a plurality of times. (1) A partial character string is extracted from the seed expression to generate an abbreviation. (2) A part of the seed expression is replaced with a specific character to generate a masked form. (3) The seed expression is translated into another language to generate a translated expression.
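Operations (1) and (2) above can be sketched in a few lines. This is an illustration under assumptions: the function names, the `min_len` and `rounds` parameters, and the use of `*` as the masking character are hypothetical, and operation (3) is left out because it would need an external translation resource.

```python
from typing import Set


def abbreviations(seed: str, min_len: int = 2) -> Set[str]:
    """Operation (1): contiguous substrings of the seed, excluding the seed."""
    subs = {seed[i:j]
            for i in range(len(seed))
            for j in range(i + min_len, len(seed) + 1)}
    return subs - {seed}


def masked_forms(seed: str, mask: str = "*") -> Set[str]:
    """Operation (2): replace one character at a time with a mask character."""
    return {seed[:i] + mask + seed[i + 1:] for i in range(len(seed))}


def generate_candidates(seed: str, rounds: int = 1, mask: str = "*") -> Set[str]:
    """Apply the operations `rounds` times, feeding each round's new
    expressions into the next round."""
    found: Set[str] = set()
    frontier = {seed}
    for _ in range(rounds):
        produced: Set[str] = set()
        for expr in frontier:
            produced |= abbreviations(expr) | masked_forms(expr, mask)
        produced -= found | {seed}
        found |= produced
        frontier = produced
    return found


once = generate_candidates("abcd")
twice = generate_candidates("abcd", rounds=2)
```

Applying the operations a second time yields combined forms, such as a masked abbreviation, that a single pass cannot produce.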
- The synonym dictionary 32 is a dictionary that stores expressions in synonym relations, and can also register the start and end times of each synonym relation.
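A dictionary that stores synonym relations together with their start and end times might be modeled as below. The class and method names, the numeric time values, and the half-open period convention are assumptions made for this sketch; the source only states that start and end times can be registered.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SynonymEntry:
    candidate: str
    source: str
    start: float
    end: Optional[float] = None  # None: the relation has not ended yet


class SynonymDictionary:
    def __init__(self) -> None:
        self.entries: List[SynonymEntry] = []

    def register_start(self, candidate: str, source: str, start: float) -> None:
        self.entries.append(SynonymEntry(candidate, source, start))

    def register_end(self, candidate: str, source: str, end: float) -> None:
        # Close the most recent open entry for this pair, if there is one.
        for entry in reversed(self.entries):
            if (entry.candidate, entry.source) == (candidate, source) and entry.end is None:
                entry.end = end
                return

    def lookup(self, candidate: str, at: float) -> List[str]:
        """Sources that the candidate is synonymous with at time `at`."""
        return [e.source for e in self.entries
                if e.candidate == candidate
                and e.start <= at and (e.end is None or at < e.end)]


dictionary = SynonymDictionary()
dictionary.register_start("EW", "OW[1]", 0)
dictionary.register_end("EW", "OW[1]", 10)
dictionary.register_start("EW", "OW[2]", 12)
```

Looking up the same candidate at different times then returns different sources, which is the behavior a time-dependent synonym relation calls for.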
- The synonym relations stored in the synonym dictionary 32 can be used for various purposes, such as thesaurus-based search, text data classification, grouping, natural language analysis, data mining, trend analysis, and reputation surveys.
- The synonym candidate list 10A illustrated in FIG. 2 includes an example in which a synonym candidate EW is ambiguous.
- For the synonym candidate EW [1], synonym sources OW [1] to [4] are registered as candidates for the synonym relation. If the synonym candidate EW [1] is, for example, a single word for one of the four compass directions (for example, “east”), there may be many synonym sources OW, such as company names and country names.
- The synonym candidate list 10A includes synonym candidates EW [1] to [n] and synonym sources OW [1] to [n].
- The synonym source OW [1], being the same seed expression [1], may be associated with a plurality of synonym candidates EW [1], [2], and [3].
- When the synonym expression candidate recording unit 10 stores synonym candidates EW that are candidates for a plurality of synonym sources OW,
- the synonym source search unit 18 may perform a multiple appearance process 18a,
- and the synonym relation extraction unit 20 may include a selection process 20a (see FIG. 1).
- The multiple appearance process 18a calculates the appearance of each synonym source OW that may be in a synonym relation with the ambiguous synonym candidate EW. The selection process 20a then compares the appearances of these synonym sources OW in a period overlapping the time interval PD of the ambiguous synonym candidate EW, and thereby selects the synonym source OW that is in a synonym relation with the ambiguous synonym candidate EW.
- For example, the multiple appearance process 18a calculates the appearance of the synonym source OW [1] and the appearance of the synonym source OW [5] in the time interval PD. The selection process 20a then compares the two and selects the synonym source OW [5] as the one in the synonym relation.
- Selection by comparison includes selecting a synonym source OW with a high number of appearances or appearance ratio, and picking out from the candidates a synonym source OW with a low number of appearances or appearance ratio.
- The selection process 20a may select only one synonym source OW, or a plurality of synonym sources OW.
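The interplay of the multiple appearance process 18a and the selection process 20a can be sketched as two small functions. The function names, the numeric issue times, the substring counting, and the `top_k` parameter for selecting one or several sources are assumptions for illustration only.

```python
from typing import Dict, List, Sequence, Tuple


def multiple_appearance(texts: Sequence[Tuple[float, str]], sources: Sequence[str],
                        start: float, end: float) -> Dict[str, int]:
    """Sketch of process 18a: count each synonym source in the texts
    issued within [start, end)."""
    counts = {source: 0 for source in sources}
    for issued_at, body in texts:
        if start <= issued_at < end:
            for source in sources:
                counts[source] += body.count(source)
    return counts


def select_sources(appearances: Dict[str, int], top_k: int = 1) -> List[str]:
    """Sketch of process 20a: keep the top_k sources by appearance count
    (ties broken alphabetically, an assumed rule)."""
    ranked = sorted(appearances, key=lambda s: (-appearances[s], s))
    return ranked[:top_k]


# Made-up texts as (issue time, body); the candidate's interval is [10, 20).
texts = [
    (5, "OW1 OW1"), (11, "OW5 and OW5"), (12, "OW1"),
    (15, "OW5"), (25, "OW1 OW1 OW1"),
]
appearances = multiple_appearance(texts, ["OW1", "OW5"], start=10, end=20)
selected = select_sources(appearances)
```

Restricting the count to the candidate's interval matters here: over the whole data OW1 appears more often, but inside the interval OW5 dominates.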
- By comparing the appearances in this way, the synonym relation extraction unit 20 can extract the synonym relation
- and specify the synonym source OW.
- The synonym relation extraction unit 20 may also establish the synonym relation with a synonym source OW that satisfies a predetermined condition, without comparing appearances.
- FIG. 3 shows time transitions between the appearance of the synonym candidate EW and the appearance of the synonym source OW [1] to [8] that may have the same synonym relation with the synonym candidate EW.
- the number of occurrences of the synonym source OW increases upward in the figure, and the number of occurrences of the synonym candidate EW increases downward in the figure.
- the time interval PD is a period during which the synonym candidate EW is searched. In the example shown in FIG. 3, the time intervals PD [1] to [6] have the end points of the synonymous relationship, and the time interval PD [7] is not completed.
- the synonym relation extraction unit 20 can extract a synonym relation when the synonym candidate EW and the synonym source OW appear in common in time.
- when the time interval PD of the synonym source OW overlaps the time interval PD of the synonym candidate EW, the two appearances are regarded as common in time.
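- the overlap test described above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the `Interval` class, the numeric time axis, and the convention that an interval without an end point (such as PD [7]) is still open are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Interval:
    """Hypothetical time interval PD; end=None means the interval is still open."""
    start: float
    end: Optional[float] = None


def overlaps(a: Interval, b: Interval) -> bool:
    """Return True when the two intervals share at least one point in time."""
    a_end = float("inf") if a.end is None else a.end
    b_end = float("inf") if b.end is None else b.end
    return a.start <= b_end and b.start <= a_end
```

Treating an open interval as extending to infinity lets an unfinished interval be compared with finished ones without a special case.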
- the synonym relation extraction unit 20 determines that a synonym relation between the synonym candidate EW and the synonym source OW [1] is established in the time interval PD [1]. Similarly, the synonym relation extraction unit 20 can determine that a synonym relation is established between the synonym candidate EW and the synonym source OW [2] in the time interval PD [2], and between the synonym candidate EW and the synonym source OW [3] in the time interval PD [4].
- no synonym source OW that appears in common within the time interval PD [3] itself can be specified, but if the period preceding the time interval PD [3] is treated as a period overlapping the time interval PD [3], the synonym relation with the synonym source OW [2] can be extracted.
- alternatively, because the appearance of the synonym source OW [3] increases rapidly in the time interval PD [3], the synonym relation extraction unit 20 can also extract a synonym relation with the synonym source OW [3].
- the synonym relation extraction unit 20 can determine that a synonym relation is established with the synonym source OW having the largest number of appearances when there are a plurality of synonym sources OW in the time interval PD.
- for the time intervals PD [5] and PD [6], the synonym source search unit 18 obtains the number of occurrences of each synonym source OW in each time interval PD from the text set TX whose issue time can be specified.
- the number of appearances used here is the total number of appearances within the time interval PD.
- the synonym relation extraction unit 20 can select the synonym source OW [5], which appears most frequently as shown by the bar graph, and can establish a synonym relation between it and the synonym candidate EW. Further, when a plurality of synonym relations are allowed in the same time interval PD, in the example shown in the time interval PD [6], the synonym source OW [4], whose number of appearances is smaller than a predetermined condition, may be removed, and synonym relations with the synonym sources OW [5] and [6] may be established.
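- the selection by number of appearances can be sketched as follows. The function name, the `min_count` parameter standing in for the "predetermined condition", and every count other than the 90 given for OW [5] in FIG. 5 are assumptions made for illustration.

```python
def select_sources(counts, min_count=1, allow_multiple=False):
    """Pick synonym sources for one time interval from {source: appearance count}.

    min_count is a hypothetical stand-in for the predetermined condition;
    sources appearing fewer times than this are removed before selection.
    """
    eligible = {s: c for s, c in counts.items() if c >= min_count}
    if not eligible:
        return []
    if allow_multiple:
        # Keep every remaining source, most frequent first.
        return sorted(eligible, key=eligible.get, reverse=True)
    # Keep only the most frequent source.
    return [max(eligible, key=eligible.get)]


# Hypothetical counts for one interval (only the 90 for OW[5] comes from the text).
counts = {"OW[4]": 10, "OW[5]": 90, "OW[6]": 60}
single = select_sources(counts)
several = select_sources(counts, min_count=20, allow_multiple=True)
```

With these assumed counts, `single` contains only OW [5], while `several` keeps OW [5] and OW [6] and drops OW [4].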
- the synonym relation extraction unit 20 can also extract a synonym relation based on a calculated value using the number of appearances per unit time.
- by using the number of appearances per unit time, the rate of change of the number of appearances of a single synonym source OW, the ratio of the numbers of appearances between different synonym sources OW, and the like can be used.
- FIG. 3 shows temporal changes in the number of appearances per unit time of the synonym source OW [7] and the synonym source OW [8] in relation to the time interval PD [7].
- the time between vertical lines with short intervals parallel to the vertical line indicating the time section PD [7] in the figure is a unit time.
- the synonym source search unit 18 searches for the synonym source OW in the text set TX every unit time and calculates the number of appearances. In the time interval PD [7], the number of occurrences of the synonym source OW [7] is large; when the number of appearances (total) in the time interval PD [7] or the number of appearances per unit time is compared with that of the synonym source OW [8], the synonym source OW [7] is selected.
- the change rate OW [8d] of the synonym source OW [8] rises in the period overlapping with the time interval PD [7], whereas the change rate of the synonym source OW [7] is small.
- in this case, the synonym relation extraction unit 20 can extract the synonym relation with the synonym source OW [8], whose use is rapidly increasing, rather than with the most frequent synonym source OW [7].
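- the contrast between selecting by total count and selecting by change rate can be sketched as follows. The text does not define how the change rate is computed, so the average increase per unit time used here is an assumption, as are the function names and all sample counts.

```python
def change_rate(series):
    """Average change in appearances per unit time over the window (assumed definition)."""
    if len(series) < 2:
        return 0.0
    return (series[-1] - series[0]) / (len(series) - 1)


def select_by_total(per_unit_counts):
    """Source with the largest total number of appearances."""
    return max(per_unit_counts, key=lambda s: sum(per_unit_counts[s]))


def select_by_change_rate(per_unit_counts):
    """Source whose appearances are increasing fastest."""
    return max(per_unit_counts, key=lambda s: change_rate(per_unit_counts[s]))


# Hypothetical per-unit-time counts within one time interval.
per_unit = {
    "OW[7]": [300, 310, 305, 320],  # frequent but nearly flat
    "OW[8]": [10, 40, 90, 160],     # rare at first, rising quickly
}
```

Under these assumed numbers the two criteria disagree: the total favors OW [7], while the change rate favors OW [8].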
- the synonym relation extraction unit 20 may determine the synonym relation between the synonym candidate EW and the synonym source OW from the commonality (overlap etc.) between the time interval PD of the synonym candidate EW and the time interval PD of the synonym source OW. First, in the synonym expression candidate recording unit 10, a synonym candidate EW that is a candidate for an expression having a synonym relation with the synonym source OW is recorded together with the synonym source OW.
- the synonym candidate search unit 16 calculates a time interval PD in which the synonym candidate EW is detected in the text set TX with reference to the text set TX whose issue time can be specified. Further, the synonym source search unit 18 calculates a time interval PD in which the synonym source OW is detected in the text set TX.
- the synonym relation extraction unit 20 determines the time interval PD in which the synonym candidate EW and the synonym source OW have a synonym relation, from the relationship between the time interval PD in which the synonym candidate EW is detected in the text set TX and the time interval PD in which the synonym source OW is detected in the text set TX.
- the synonym candidate generation unit 30 generates the synonym candidate EW from the seed expression and stores it in the synonym expression candidate recording unit 10 (FIG. 4: step S101 / synonym candidate generation registration step).
- alternatively, the synonym candidate generation unit 30 may accept a synonym candidate EW as input and store it in the synonym expression candidate recording unit 10.
- the text collection unit 14 collects text input from the outside and, based on that text, generates a text set whose issue time can be specified (FIG. 4: step S102 / text set generation step). Then, the synonym relation determination specifying unit 12 determines and specifies the synonym relation between the synonym expression candidate and the synonym source expression included in the generated text set (FIG. 4: steps S103 and S104 / synonym relation specifying step).
- in the step of identifying the synonym relation (FIG. 4: steps S103 and S104), first, the synonym candidate detection unit 12A of the synonym relation determination specifying unit 12 searches the generated text set for, and specifies, the time intervals in which the synonym expression candidates and the synonym source expressions are frequently detected (FIG. 4: step S103 / synonym candidate detection step).
- the synonym period specifying unit 12B of the synonym relation determination specifying unit 12 then determines and identifies, as a synonym period, a time interval in which the synonym expression candidate and the synonym source expression have a synonym relation (FIG. 4: step S104 / synonym period specifying step).
- the synonym candidate search unit 16 of the synonym candidate detection unit 12A detects the synonym expression candidates in the text set TX collected by the text collection unit 14, counts their appearances, and extracts and specifies the time intervals PD having a large number of appearances per unit time (FIG. 4: step S103A / synonym candidate correspondence / time interval specification step).
- the synonym candidate search unit 16 reads the text set TX and detects, for example, a time interval PD in which the number of occurrences per unit period of the synonym expression candidate EW stored in the synonym expression candidate recording unit 10 greatly increases.
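- one simple way to detect such time intervals is sketched below. The text does not say how a "greatly increased" number of occurrences is recognized, so the fixed threshold, the function name, and the sample counts are assumptions; a run of consecutive unit periods at or above the threshold is reported as one interval.

```python
def find_burst_intervals(counts, threshold):
    """Return (start_index, end_index) pairs of runs where counts >= threshold.

    counts holds the candidate's appearances per unit period, in time order.
    An interval still running at the end of the data has end_index None.
    """
    intervals = []
    start = None
    for i, c in enumerate(counts):
        if c >= threshold and start is None:
            start = i  # a new interval begins at this unit period
        elif c < threshold and start is not None:
            intervals.append((start, i - 1))  # the interval ended one unit earlier
            start = None
    if start is not None:
        intervals.append((start, None))  # still open, no end point yet
    return intervals


# Hypothetical appearances per unit period of one synonym candidate.
bursts = find_burst_intervals([1, 2, 50, 60, 3, 2, 70, 80], threshold=20)
```

Here `bursts` describes one finished interval covering unit periods 2 to 3 and one interval that starts at unit period 6 and has not yet ended.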
- the synonym source search unit 18 of the synonym candidate detection unit 12A detects the synonym source expression OW in the text set collected by the text collection unit, counts its appearances, and extracts and specifies the time intervals having a large number of appearances per unit time (FIG. 4: step S103B / synonym source correspondence / time interval specifying step).
- the synonym relation extraction unit 20 of the synonym period specifying unit 12B extracts, as the synonym source having a synonym relation, the synonym source expression having the highest number of appearances in the text set within the time interval in which the synonym expression candidate was detected (in step S103 of FIG. 4) (FIG. 4: step S104A / synonym relation extraction step). That is, the synonym relation extraction unit 20 determines which synonym source OW the synonym candidate EW detected by the synonym candidate search unit 16 is synonymous with, and extracts the synonym relation in the time interval PD.
- the synonym period start determining unit 22 of the synonym period specifying unit 12B determines that the extracted synonym source expression is synonymous with the synonym expression candidate, sets the start point of the time interval in which the two are synonymous as the start point of the synonym period, and registers the result, together with the synonym period, in the synonym dictionary provided in advance (FIG. 4: step S104B / synonym period registration step).
- the synonym period start determining unit 22 of the synonym section specifying unit 12B functions to store the above-described determination result indicating that there is a synonym relationship in the synonym dictionary 32 (FIG. 4: step S105 / synonym relationship registration process). This is the end of the process.
- synonymity is determined using the appearance (number of appearances, appearance change rate, appearance ratio, etc.) in the time interval PD (or at the time point) at which the synonym candidate EW appears, and the time interval PD in which the synonym relation is established is calculated.
- because a “time-interval synonym” is determined for ambiguous words, synonym relations delimited by the time interval PD can be handled even when the synonym relation changes with time.
- because synonymity is determined using the number of appearances per unit period at the time when a synonym expression candidate appears, the start time at which the synonym relation is established can be output. Therefore, when synonymity changes with time, the time interval in which the synonym relation is established can be determined.
- the synonym relation determination specifying unit 12 includes a synonym period start determination unit 22.
- the synonym period start determination unit 22 determines that a synonym period with the synonym source OW has started at the time point when, within the time interval PD in which the synonym candidate EW is found in the text set TX in order of issue time, the occurrence of the synonym source OW in the text set TX satisfies a predetermined condition.
- various data, comparison processing, and determination processing can be adopted as disclosed with reference to FIG.
- as the data about the appearance, the number of appearances, the change rate of appearance, the appearance ratio, and the like can be used.
- for comparison and determination, a comparison with a predetermined threshold value, a comparison with a threshold value derived from the normal occurrence of the synonym source, and a comparison with a value derived from the appearance values of other synonym sources are available.
- exceptions may be defined depending on the specific implementation. For example, even if the threshold is exceeded, it may be determined that the condition is not satisfied when an exception applies.
- the synonym period start determination unit 22 determines which synonym source OW the synonym candidate EW detected by the synonym candidate search unit 16 is synonymous with.
- the start point of the time interval PD detected by the synonym candidate search unit 16 may be registered in the synonym dictionary 32 as the start point of the synonym relation. In this example, it can be grasped that the meaning of the synonym candidate EW differs before and after the start time.
- the synonym relation extraction unit 20 and the synonym period start determination unit 22 are separate units, but the synonym relation extraction unit 20 may include the synonym period start determination unit 22.
- to make a determination based on the number of appearances, the synonym period start determination unit 22 first determines that the synonym source OW having the highest number of occurrences in the text set TX, within the time interval PD in which the synonym candidate EW is detected in the text set TX, is synonymous with the synonym candidate EW. The synonym period start determination unit 22 then determines the start point of the time interval PD to be the start point of the synonym relation between the synonym candidate EW and the synonym source OW.
- FIG. 5 shows an example of the configuration of the determination process with the maximum number of appearances.
- the synonym period start determination unit 22 includes an appearance number process 22a and a maximum number determination process 22b in order to perform determination based on the maximum number of appearances.
- the appearance number process 22a calculates the number of appearances of a plurality of synonym sources OW related to the synonym candidate EW in the time interval PD in which the synonym candidate EW is searched in the text set TX. Then, the most frequent determination process 22b determines that the synonym period between the synonym source OW and the synonym candidate EW having the largest number of occurrences has started at the start point of the time interval PD of the synonym candidate EW.
- the synonym source search unit 18 records the number of appearances that is the search result of the synonym source OW in the synonym source table 10B.
- the synonym period start determination unit 22 refers to the synonym source table 10B and performs the most frequent determination process 22b.
- the appearance number processing 22a calculates the number of appearances of a plurality of synonym sources OW in the time interval PD [5].
- the synonym sources OW to be searched are the synonym sources OW [4], [5], and [6], which are stored in advance in the storage unit 10 as being related to the synonym candidate EW for which the time interval PD [5] was obtained.
- the appearance number process 22a calculates the number of appearances of the synonym source OW [4], [5], [6] in the time interval PD [5] and records it in the synonym source table 10B. It is not necessary to calculate the synonym source OW whose appearance number is 0.
- the most frequent determination process 22b selects the synonym source OW [5], which has the largest number of occurrences (90, as shown in FIG. 5), and determines that the synonym relation between the synonym source OW [5] and the synonym candidate EW started at the start point of the current time interval PD [5].
- the synonym period start determination based on the number of appearances is effective for determining the synonym relationship with the synonym source OW that has been attracting attention from the normal time.
- to make a determination based on the appearance ratio, the synonym period start determination unit 22 determines that the synonym source OW having the highest ratio of the number of appearances per unit time in the time interval PD to the number of appearances per unit time before the time interval PD is synonymous with the synonym candidate EW. The synonym period start determination unit 22 then determines the start point of the time interval PD to be the start point of the synonym relation between the synonym candidate EW and the synonym source OW.
- FIG. 6 shows an example of determination processing based on the appearance ratio.
- the synonym period start determination unit 22 includes a time interval process 22c, a time interval pre-process 22d, and a ratio determination process 22e in order to make a determination based on the appearance ratio.
- the time interval process 22c calculates the number of appearances per unit time of one or more synonym sources OW related to the synonym candidate EW in the time interval PD in which the synonym candidate EW is searched in the text set TX.
- the time interval pre-processing 22d calculates the number of appearances per unit time before the time interval PD of each synonym source OW.
- the ratio determination process 22e determines that the synonym period has started at the start point of the time interval PD of the synonym candidate EW when the number of occurrences in the time interval PD is larger than the number of occurrences before the start of the time interval PD.
- the synonym source search unit 18 also calculates the number of occurrences of the synonym candidate EW before the time interval PD, and stores the number of occurrences of the synonym source OW in the synonym source table 10B. Furthermore, the ratio determination process 22e stores the calculated appearance ratio in the synonym candidate table 10C.
- the text collection unit 14 searches for text via the network 96 at a predetermined cycle or time (search time) to generate a text set TX (FIG. 7: step S201).
- the text collection unit 14 further specifies the text issuance time (FIG. 7: step S202).
- the synonym source search unit 18 sequentially searches all the synonym sources OW registered in the storage unit 10 (FIG. 7: steps S203 and S207).
- the synonym source search unit 18 calculates the number of appearances per unit time in the text set TX (FIG. 7: step S204) and records it in the synonym source table 10B (FIG. 7: step S205).
- for example, as shown in the synonym source table 10B in the figure, the synonym source search unit 18 records the number of occurrences of the synonym source OW [8] in the period before the time interval PD [7] (100) and the number of occurrences of the synonym source OW [8] in the time interval PD [7] (400).
- the synonym candidate search unit 16 first sequentially searches for synonym candidates EW registered in the synonym expression candidate recording unit 10 (FIG. 8: steps S211 and S213).
- when a synonym candidate EW is found, the synonym period start determination unit 22 starts the time interval PD of the synonym candidate EW, taking the issue time of the text in which it was found as the start point. In the example shown in FIG. 3, the time interval PD [7] is started.
- the time interval process 22c of the synonym period start determination unit 22 calculates the number of occurrences of the synonym source OW in the time interval PD [7] in which the synonym candidate EW is found (FIG. 8: step S216) and records it in the synonym source table 10B.
- the time interval pre-process 22d then calculates, over a fixed length of time rather than the shortest unit time shown in the figure, the number of appearances (100) of the same synonym source OW before the time interval PD (FIG. 8: step S217) and records it in the synonym source table 10B.
- the ratio determination process 22e calculates the ratio (400%) of the number of appearances (400) in the time interval PD to the number of appearances (100) before the time interval PD (step S218).
- the ratio determination process 22e further selects the synonym source OW [8] having the highest appearance ratio, and determines that the synonym period with the synonym source OW [8] has started at the start point of the time interval PD [7] of the synonym candidate EW (FIG. 8: step S221).
- the synonym period start determination unit 22 records the synonym relation determined at this point in the synonym dictionary 32 (FIG. 8: step S222).
- the synonym period start determination process based on the appearance ratio shown in FIG. 8 is effective for extracting a synonym relationship with a synonym source OW that has a low degree of attention in normal times.
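- the appearance-ratio determination can be sketched as follows, using the counts given above for the synonym source OW [8] (100 before the time interval, 400 within it, a ratio of 400%). The function names, the handling of a zero count before the interval, and the counts for OW [7] are assumptions made for illustration.

```python
def appearance_ratio(count_in, count_before):
    """Ratio of appearances in the interval to appearances before it."""
    if count_before == 0:
        # Assumed convention: any appearance after none counts as an unbounded increase.
        return float("inf") if count_in > 0 else 0.0
    return count_in / count_before


def select_by_ratio(table):
    """table maps source -> (count_before, count_in); return the highest-ratio source."""
    return max(table, key=lambda s: appearance_ratio(table[s][1], table[s][0]))


table_10b = {
    "OW[7]": (900, 1000),  # hypothetical: frequent before and during the interval
    "OW[8]": (100, 400),   # counts from the example in the text
}
```

With these values OW [7] has the larger count but a ratio of only about 1.1, so the ratio criterion picks OW [8], which matches the stated purpose of finding sources that attract little attention in normal times.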
- the operations of the components described above (particularly the operation steps in FIGS. 4, 7, and 8) may be implemented as a computer-executable program, and the computer provided in the synonym relation determination specifying unit 12 may be made to execute these steps. The same applies to other embodiments.
- the program may be recorded on a non-transitory recording medium such as a DVD, a CD, or a flash memory.
- the program is read from the recording medium by a computer and executed.
- time-interval synonyms can be determined by information processing using the appearance of the synonym source OW, and in particular, time-interval synonyms with a clear start time can be determined.
- the synonym relation determination device 102 is characterized by having, in addition to the components of the first embodiment disclosed in FIG. 1 described above, a synonym period end determination unit 24 that determines the end of the synonym period.
- the synonym relation determination specifying unit 12 determines that the synonym period has ended when the occurrence of the synonym candidate EW decreases below a predetermined condition in the text set TX in the order of issue time.
- for this purpose, the synonym period end determination unit 24 is provided. Because the synonym period end determination unit 24 determines that the synonym relation has ended when the occurrence of the synonym candidate EW decreases, the period in which a given meaning of the ambiguous synonym candidate EW holds can be limited to a definite period.
- FIG. 10 is a flowchart illustrating an example of information processing by the synonym relation determination apparatus 102 in the second embodiment.
- the synonym candidate generation unit 30 generates a synonym candidate EW from the seed expression and records it in the synonym expression candidate recording unit 10 (FIG. 10: step S221).
- the synonym candidate search unit 16 reads the text set TX and detects a time interval PD in which the number of occurrences per unit period of the synonym candidate EW recorded in the synonym expression candidate recording unit 10 greatly increases (FIG. 10: step S222). Then, the synonym period start determination unit 22 determines which synonym source OW the synonym candidate EW detected by the synonym candidate search unit 16 is synonymous with, and determines the start point of the synonym relation (FIG. 10: step S223).
- when the number of appearances per unit period of the synonym candidate EW that the synonym period start determination unit 22 has determined to be synonymous becomes equal to or less than the end threshold value, the synonym period end determination unit 24 determines that the synonym relation has been resolved (FIG. 10: step S224) and registers the end time of the synonym relation in the synonym dictionary 32 (FIG. 10: step S225).
- This third embodiment is characterized in that one synonym source in the time interval PD is used and the start, replacement, and end of the synonym relation are determined.
- the synonym relation extraction unit 20 of the synonym relation determination device 103 includes an appearance calculation process 20b, a start determination process 20c, a replacement process 20d, and an end determination process 20e.
- the appearance calculation process 20b calculates the appearance of a plurality of synonym sources OW related to the synonym candidate EW when the synonym candidate EW is searched in the text set TX.
- the start determination process 20c determines, for a synonym source OW whose appearance exceeds a predetermined start threshold, that the synonym relation with the synonym candidate EW started at the time the start threshold was exceeded.
- the replacement process 20d determines that the synonym relation has ended at the time when the appearance of the synonym source OW falls below that of another synonym source OW, and determines that a synonym relation has newly started for the synonym source OW with the largest number of appearances.
- the end determination process 20e determines that the synonym relation is ended when the appearance falls below a predetermined end threshold after the synonym relation starts. With this configuration, the meaning of the ambiguous synonym candidate EW can be specified with a higher probability.
- Other configurations are the same as those of the first embodiment described above.
- the synonym candidate generation unit 30 generates a plurality of synonym candidates EW, either by generating an abbreviation that keeps the first character of each morpheme of the synonym source OW or by generating a masked form that replaces one character of the synonym source OW with “O”.
- for example, when the seed expression (synonym source OW) is “Tokyo Electric Power”, an abbreviation such as “Toden” or “Tokyo Electric Power” and a masked form such as “Too Electric Power” are generated.
- “Toden” (synonym candidate EW) can be regarded as an abbreviation of “Tokyo Electric Power” (synonym source OW [10]), but it can also be regarded as an abbreviation of “Tohoku Electric Power” (synonym source OW [11]). The content referred to by “Toden” (synonym candidate EW) is therefore ambiguous between “Tokyo Electric Power” (synonym source OW [10]) and “Tohoku Electric Power” (synonym source OW [11]).
- similarly, when the seed expression is “Tohoku Electric Power”, an abbreviation such as “Toden” or “Tohoku Electric” and a masked form such as “Too Electric Power” are generated.
- the synonym candidate list 10A shown in FIG. 12 or such data is stored in the storage unit 10.
- “Toden” and “Too Electric Power” are generated from both “Tokyo Electric Power” and “Tohoku Electric Power”, and are therefore ambiguous synonym candidates EW as described above. In practice, what “Toden” refers to can change between “Tokyo Electric Power” and “Tohoku Electric Power” depending on the time.
- for example, “Toden” (synonym candidate EW) refers to “Tokyo Electric Power” (synonym source OW [10]) in the time intervals PD [A] and PD [C], but refers to “Tohoku Electric Power” (synonym source OW [11]) in the time interval PD [B]; the synonym relation thus changes with time.
- the synonym candidate generation unit 30 generates a plurality of synonym candidates EW, either by generating an abbreviation that keeps the first character of each morpheme in the seed expression or by generating a masked form that replaces one character in the seed expression with “O”.
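- the two generation rules can be sketched as follows. The sketch assumes that the seed expression has already been split into morphemes by some morphological analyzer, which is outside the scope of the example; the function names, the Japanese surface forms 東京電力 and 東北電力 assumed for “Tokyo Electric Power” and “Tohoku Electric Power”, and the mask character “○” are likewise assumptions.

```python
def abbreviate(morphemes):
    """Build an abbreviation from the first character of each morpheme."""
    return "".join(m[0] for m in morphemes if m)


def mask_variants(expression, mask="○"):
    """Return every form of the expression with exactly one character masked."""
    return [
        expression[:i] + mask + expression[i + 1:]
        for i in range(len(expression))
    ]


# Assumed pre-segmented seed expressions (synonym sources OW[10] and OW[11]).
tokyo = ["東京", "電力"]
tohoku = ["東北", "電力"]

candidates_tokyo = {abbreviate(tokyo), *mask_variants("".join(tokyo))}
candidates_tohoku = {abbreviate(tohoku), *mask_variants("".join(tohoku))}

# Candidates produced from both seeds are the ambiguous ones.
ambiguous = candidates_tokyo & candidates_tohoku
```

Both seeds yield the same abbreviation and share the masked form in which the second character is hidden, which is exactly the ambiguity the time-interval determination is meant to resolve.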
- the synonym candidate search unit 16 and the synonym source search unit 18 detect when the synonym candidate EW and the synonym source OW appear (suddenly) in the text set TX.
- Each text in the text set TX is given an issuance time such as a crawl time, a writing time, and the like, and a time point when the synonym candidate EW and the synonym source OW appear is detected based on the issue time.
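- counting appearances per unit time from texts that carry an issue time can be sketched as follows. The representation of the text set TX as (issue time, body) pairs, the choice of one day as the unit time, plain substring matching, and the sample texts are all assumptions made for illustration.

```python
from collections import Counter
from datetime import date, datetime


def count_per_day(texts, expression):
    """Count appearances of expression per issue date.

    texts is an iterable of (issued_at, body) pairs; days without any
    appearance are simply absent from the returned Counter.
    """
    counts = Counter()
    for issued_at, body in texts:
        n = body.count(expression)  # non-overlapping substring matches
        if n:
            counts[issued_at.date()] += n
    return counts


# Hypothetical text set TX with issue times.
tx = [
    (datetime(2020, 1, 1, 9, 0), "Toden announced ... Toden said ..."),
    (datetime(2020, 1, 1, 18, 30), "a report about Toden"),
    (datetime(2020, 1, 2, 8, 0), "unrelated text"),
]
daily = count_per_day(tx, "Toden")
```

The resulting per-day counts are the kind of series from which the time points of sudden appearance can then be detected.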
- the synonym relation extraction unit 20 counts the appearance frequencies of “Tokyo Electric Power” (synonym source OW [10]), “Tohoku Electric Power” (synonym source OW [11]), and “Toden” (synonym candidate EW) in the text set TX (appearance calculation process 20b). When the appearance frequencies are as shown in FIG. 13, the time intervals PD [A], PD [B], and PD [C] shown in the figure are calculated.
- the synonym relation extraction unit 20 then determines synonymity. For example, in the time interval PD [A] in FIG. 13, “Toden” has two synonym sources OW, “Tokyo Electric Power” and “Tohoku Electric Power”. If the numbers of occurrences in the time interval PD [A] are 800 per day for “Tokyo Electric Power” and 150 per day for “Tohoku Electric Power”, “Tokyo Electric Power” appears more frequently, so “Toden” and “Tokyo Electric Power” are determined to be synonymous (start determination process 20c), and the fact that “Toden” and “Tokyo Electric Power” are synonymous from the start of the time interval PD [A] is registered in the synonym dictionary 32.
- similarly, the synonym relation extraction unit 20 determines that “Toden” and “Tohoku Electric Power” are synonymous in the time interval PD [B] (replacement process 20d), and that “Toden” and “Tokyo Electric Power” are synonymous in the time interval PD [C] (replacement process 20d).
- the synonym relation of “Toden” changes depending on the time, but the synonym dictionary 32 is updated accordingly, so the time interval PD in which a time-dependent synonym relation holds can be determined correctly.
- the synonym relation extraction unit 20 continues to monitor the number of appearances after the synonym relation is established. When the occurrence of “Toden” decreases below the threshold, as in the time interval PD [D] in the figure, or when the number of occurrences per unit period of “Tokyo Electric Power”, which is synonymous with “Toden”, decreases to its usual level, it determines that the synonym relation between “Toden” and “Tokyo Electric Power” has ended (end determination process 20e) and registers in the synonym dictionary 32 the fact that the synonym relation has ended, together with the end time. Thus, when the occurrence of the synonym candidate EW has decreased and the synonym relation can no longer be said to hold, it can be determined, together with the end time, that the synonym relation has ended.
- FIG. 14 is a flowchart illustrating another information processing example according to the third embodiment.
- the processing shown in FIG. 14 differs from the above-described information processing in terms of threshold values and termination, but the outline of information processing is the same.
- the appearance of the synonym source OW for each unit time is calculated in the flowchart shown in FIG. 7 and stored in the synonym source table 10B and the like.
- the synonym relation extraction unit 20 searches for the synonym candidate EW stored in the storage unit 10 (FIG. 14: step S211), and when the synonym candidate EW is not found (FIG. 14: step S212), specifies the next synonym candidate EW (FIG. 14: step S213) and searches again (FIG. 14: step S211).
- when the synonym candidate EW is found, the synonym relation extraction unit 20 refers to the synonym candidate list 10A shown in the figure and specifies “Tokyo Electric Power” (synonym source OW [10]) and “Tohoku Electric Power” (synonym source OW [11]), which are the synonym sources OW of that candidate (FIG. 14: step S215).
- after “Toden” (synonym candidate EW) is found in the text set TX (step S212), the appearance calculation process 20b calculates the appearance of each of the plurality of related synonym sources OW [10] and [11] (FIG. 14: step S301).
- the start determination process 20c checks whether the appearance of the plurality of synonym sources OW [10] and [11] exceeds a predetermined start threshold value (FIG. 14: step S302). When the start threshold value is exceeded, the start determination process 20c selects “Tokyo Electric Power” (synonym source OW [10]), which has the highest number of occurrences in the synonym source OW group (FIG. 14: step S303). The start determination process 20c then determines that the synonym relation with the synonym candidate EW started when the start threshold value was exceeded (FIG. 14: step S304).
- the replacement process 20d is performed when the occurrence (current appearance) of the synonym source OW [10] (Tokyo Electric Power) falls below the appearance (other appearances) of the other synonym source OW [11] (Tohoku Electric Power) after the synonym relation starts. Then (FIG. 14: step S305), it is determined that the synonym relation has ended at the time when it falls below, and it is determined that the synonym relation is newly started for the largest number of synonym sources OW (FIG. 14: step S306). If the state where there are more current appearances than other appearances continues (FIG. 14: step S305), the replacement process 20d is not executed, and the process proceeds to the end process determination.
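The start determination (steps S302 to S304) and the replacement (steps S305 and S306) can be read as a small state machine over per-unit-time counts. The sketch below is one possible reading under stated assumptions: `track_relation`, the `Event` tuple, and the sample numbers are hypothetical, and the end determination that follows in FIG. 14 is deliberately left out.

```python
from typing import Dict, List, Optional, Tuple

# (unit-time index, "start" or "end", synonym source)
Event = Tuple[int, str, str]


def track_relation(series: Dict[str, List[int]],
                   start_threshold: int) -> List[Event]:
    """Emit start/end events for the synonym source tied to one candidate.

    series maps each synonym source to its appearances per unit time.
    """
    events: List[Event] = []
    current: Optional[str] = None
    n_units = min(len(v) for v in series.values())
    for t in range(n_units):
        counts = {ow: v[t] for ow, v in series.items()}
        leader = max(counts, key=counts.get)  # source with most appearances
        if current is None:
            # Start determination: the leading source exceeds the threshold.
            if counts[leader] > start_threshold:
                current = leader
                events.append((t, "start", leader))
        elif any(counts[ow] > counts[current] for ow in counts if ow != current):
            # Replacement: the current source fell below another source.
            events.append((t, "end", current))
            current = leader
            events.append((t, "start", leader))
    return events


# Hypothetical counts for the two synonym sources of the candidate 東○電力.
series = {"東京電力": [10, 120, 90, 40], "東北電力": [5, 20, 30, 80]}
events = track_relation(series, start_threshold=100)
```

In this reading the replacement does not re-check the start threshold for the new source, because the text only says that the relation newly starts for the source with the most appearances.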
- The synonym relation determination device 103 has a display device 95 for displaying data attached to the synonym relation determination specifying means 12.
- The synonym relation determination specifying means 12 includes a display control unit 26.
- The display control unit 26 controls the display of the synonym candidate EW extracted by the synonym relation extraction unit 20, the start time of the synonym relation, the synonym source OW with which the synonym relation started, and the end time of the synonym relation, as synonym data TD by time division. As a result, the text set TX and the like can be presented to the user together with information on how the synonym relation changes over time.
- the present invention can be applied to a reputation monitoring system, a reputation extraction system, etc. for the Internet.
- Information processing by the synonym relation determination devices 101, 102, and 103 in the present embodiment is concrete means in which software and hardware resources cooperate to compute or process information according to the intended use.
- As hardware resources, a computer 80 that performs the information processing is provided.
- The computer 80 includes arithmetic means 82, which is a central processing unit (CPU), and main storage means 86, which provides a storage area for the arithmetic means 82.
- The computer 80 generally has peripheral devices connected through a data bus and an input/output interface.
- The peripheral devices are typically communication means 88, external storage means 90, input means 92, and output means 94.
- The whole, including the peripheral devices, is sometimes also referred to as the computer 80.
- The communication means 88 controls communication with the server device 70 via a wired or wireless network.
- The external storage means 90 is a fixed or portable storage medium that stores the program file 100 and data.
- The input means 92 is a keyboard, a touch panel, a pointing device, a scanner, or the like, and inputs data readable by the computer 80 in response to user operations.
- The output means 94, such as a display or a printer, displays or outputs data calculated by the computer 80.
- The storage means 10 of the synonym relation determination devices 101, 102, and 103 stores data such as the synonym candidate list 10A, using the external storage means 90 as a hardware resource.
- The synonym relation determination specifying means 12 executes data processing on the text set TX, using the arithmetic means 82 as a hardware resource. That is, the synonym relation determination specifying means 12 can be realized by the computer 80 executing the program.
- A synonym relation determination device comprising a synonym expression candidate recording unit 10, in which one predetermined synonym source expression and a plurality of synonym expression candidates subject to the synonym relation are recorded in association with each other, and synonym relation determination specifying means 12 that determines and specifies, based on a certain criterion, the synonym relation between the synonym expression candidate and the synonym source expression in externally input text,
- wherein the synonym relation determination specifying means 12 comprises: a text collection unit 14 that collects the externally input text and, based on it, generates a text set whose issue times can be specified;
- a synonym candidate detection unit 12A that specifies and outputs, from the text set collected by the text collection unit 14, a time interval in which many of the synonym expression candidates are detected and a time interval in which many of the synonym source expressions are detected; and a synonym period specifying unit 12B that, based on the positional relationship between the time interval in which the synonym expression candidate is detected in the text set and the time interval in which the synonym source expression is detected in the text set, and on the detection frequency, determines and specifies as a synonym period the time interval in which the synonym expression candidate and the synonym source expression are in a synonym relation.
- The synonym relation determination device according to Supplementary Note 1, wherein the synonym candidate detection unit 12A includes a synonym candidate search unit 16 that detects and counts the synonym expression candidates in the text set collected by the text collection unit, whose issue times can be specified, and specifies a time interval with a large number of appearances per unit time (as a time interval in which the synonym expression candidate exists),
- and a synonym source search unit 18 that likewise detects and counts the synonym source expressions in that text set and specifies a time interval with a large number of appearances per unit time (as a time interval in which the synonym source expression exists).
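- The synonym relation determination device according to Supplementary Note 1 or 2, wherein the synonym period specifying unit 12B includes a synonym relation extraction unit 20 that extracts, as the synonym source in the synonym relation, at least the synonym source expression with the most appearances in the text set during the time interval in which the synonym expression candidate was detected,
- and a synonym period start determination unit 22 that determines that the extracted synonym source expression is in a synonym relation with the synonym expression candidate and registers the start point of the time interval in which the two are in the synonym relation, as the start point of the synonym period, together with the synonym period in a synonym dictionary provided in advance.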
- The synonym relation determination device according to Supplementary Note 3, wherein the synonym period start determination unit 22 also has a function of determining that, within the text set in the time interval in which the synonym expression candidate is detected, the synonym source expression with the largest ratio of the number of appearances per unit time in the time interval to the number of appearances per unit time before the time interval is in a synonym relation with the synonym expression candidate.
- A synonym relation determination method for a synonym relation determination device comprising a synonym expression candidate recording unit 10, in which one predetermined synonym source expression and a plurality of synonym expression candidates subject to the synonym relation are recorded in association with each other, and synonym relation determination specifying means 12 that determines and specifies the synonym relation between the synonym expression candidate and the synonym source expression in externally input text, wherein:
- the text collection unit of the synonym relation determination specifying means 12 collects the externally input text and, based on it, generates a text set whose issue times can be specified (text set generation step);
- the synonym relation determination specifying means 12 determines and specifies, based on a certain criterion, the synonym relation between the synonym expression candidate and the synonym source expression contained in the generated text set (synonym relation specifying step); in the step of specifying the synonym relation, the synonym candidate detection unit 12A of the synonym relation determination specifying means 12 searches for and specifies, from the text set, a time interval in which many of the synonym expression candidates are detected and a time interval in which many of the synonym source expressions are detected (synonym candidate detection step);
- and subsequently, based on the positional relationship between the time interval in which the synonym expression candidate is detected in the text set and the time interval in which the synonym source expression is detected in the text set, and on the detection frequency, the synonym period specifying unit 12B of the synonym relation determination specifying means 12 determines and specifies as a synonym period the time interval in which the synonym expression candidate and the synonym source expression are in a synonym relation (synonym period specifying step).
- A synonym relation determination program for a synonym relation determination device comprising a synonym expression candidate recording unit 10, in which one predetermined synonym source expression and a plurality of synonym expression candidates subject to the synonym relation are recorded in association with each other, and synonym relation determination specifying means 12 that determines and specifies the synonym relation between the synonym expression candidate and the synonym source expression in externally input text,
- the program providing a text set generation processing function that collects externally input text and generates a text set whose issue times can be specified, and a synonym relation specifying processing function that determines and specifies, based on a certain criterion, the synonym relation between the synonym expression candidate and the synonym source expression contained in the generated text set,
- wherein the synonym relation specifying processing function includes a synonym candidate detection processing function that searches for and specifies, from the text set collected by the text collection unit, a time interval in which many of the synonym expression candidates are detected and a time interval in which many of the synonym source expressions are detected,
- and a synonym period specifying processing function that, based on the positional relationship between the time interval in which the synonym expression candidate is detected in the text set and the time interval in which the synonym source expression is detected in the text set, and on the detection frequency, determines and specifies as a synonym period the time interval in which the synonym expression candidate and the synonym source expression are in a synonym relation, the program causing a computer provided in the synonym relation determination specifying means 12 to realize each of these processing functions.
- The synonym relation determination program according to Supplementary Note 10, wherein the synonym candidate detection processing function comprises a synonym-candidate time interval specifying processing function that detects and counts the synonym expression candidates in the text set collected by the text collection unit, whose issue times can be specified, and extracts and specifies a time interval with a large number of appearances per unit time, and a synonym-source time interval specifying processing function that detects and counts the synonym source expressions in that text set and extracts and specifies a time interval with a large number of appearances per unit time,
- the program causing a computer provided in the synonym relation determination specifying means 12 to realize each of these time interval specifying processing functions.
- The synonym relation determination program according to Supplementary Note 10, wherein the synonym period specifying processing function comprises a synonym relation extraction processing function that extracts, as the synonym source in the synonym relation, at least the synonym source expression with the most appearances in the text set during the time interval in which the synonym candidate detection processing function detected the synonym expression candidate,
- and a synonym period registration processing function that, before or after this, determines that the extracted synonym source expression is in a synonym relation with the synonym expression candidate and registers the start point of the time interval in which the two are in the synonym relation, as the start point of the synonym period, together with the synonym period in a synonym dictionary provided in advance, the program causing a computer provided in the synonym relation determination specifying means 12 to realize each of these processing functions.
- The synonym relation determination program according to Supplementary Note 10, 11, or 12, wherein the synonym relation specifying processing function comprises a synonym period end determination function that continuously counts the number of appearances per unit period of the synonym expression candidate determined to be in the synonym relation and, when that number falls to or below a preset threshold, determines that the synonym relation was dissolved at that point, the program causing a computer provided in the synonym relation determination specifying means 12 to realize this function.
- the present invention is applicable to all natural language data processing using synonymous relationships.
- 12 Synonym relation determination specifying means
- 12A Synonym candidate detection unit
- 12B Synonym period specifying unit
- 14 Text collection unit
- 16 Synonym candidate search unit
- 18 Synonym source search unit
- 18a Multiple appearance process
- 20 Synonym relation extraction unit
- 22 Synonym period start determination unit
- 24 Synonym period end determination unit
- 26 Display control unit
- 30 Synonym candidate generation unit
- 32 Synonym dictionary
- EW Synonym candidate
- OW Synonym source
- PD Time interval
- D Synonym data by time division
- TX Text set
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
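For this reason, various methods for automatically acquiring synonymous expressions have been proposed.
Patent Document 2 discloses a method aimed at extracting synonym relations between concealed terms such as “○菱電気” and “某A庁” and their original expressions such as “三菱電機” (Mitsubishi Electric) and “防衛庁” (Defense Agency): a matching index is generated from a list of concealment notations such as “○” and matched against the original expressions to extract the synonym relations.
That is, when synonymy changes over time and a single synonym candidate becomes synonymous with different synonym sources at different times, the time-series correlation calculated by a method such as that of Patent Document 1 does not become high, and as a result the synonym relation cannot be extracted.
The method described in Patent Document 3 uses time information to determine synonyms, but it targets information from a single information source (a broadcasting station) and cannot be applied to a text set collected from an unspecified large number of sources.
An object of the present invention is to provide a synonym relation determination device, a synonym relation determination method, and a program therefor that make it possible to effectively extract and specify, from the natural-language words used in texts from an unspecified large number of sources, the synonym relations of synonym candidates whose meaning changes over time.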
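The synonym relation determination specifying unit is characterized by comprising: a text collection unit that collects the externally input text and, based on it, generates a text set whose issue times can be specified; synonym candidate detection means that specifies and outputs, from the text set collected by this text collection unit 14, a time interval in which many of the synonym expression candidates are detected and a time interval in which many of the synonym source expressions are detected; and synonym period specifying means that, based on the positional relationship between the time interval in which the synonym expression candidate is detected in the text set and the time interval in which the synonym source expression is detected in the text set, and on the detection frequency, determines and specifies as a synonym period the time interval in which the synonym expression candidate and the synonym source expression are in a synonym relation.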
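The text collection unit of the synonym relation determination specifying unit collects the externally input text and, based on it, generates a text set whose issue times can be specified (text set generation step);
the synonym relation determination specifying unit determines and specifies, based on a certain criterion, the synonym relation between the synonym expression candidate and the synonym source expression contained in the generated text set (synonym relation specifying step);
in the step of specifying the synonym relation,
the synonym candidate detection means of the synonym relation determination specifying unit searches for and specifies, from the text set, a time interval in which many of the synonym expression candidates are detected and a time interval in which many of the synonym source expressions are detected (synonym candidate detection step);
and subsequently, based on the positional relationship between the time interval in which the synonym expression candidate is detected in the text set and the time interval in which the synonym source expression is detected in the text set, and on the detection frequency, the synonym period specifying means of the synonym relation determination specifying unit determines and specifies as a synonym period the time interval in which the synonym expression candidate and the synonym source expression are in a synonym relation (synonym period specifying step).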
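A text set generation processing function that collects externally input text and generates a text set whose issue times can be specified, and a synonym relation specifying processing function that determines and specifies, based on a certain criterion, the synonym relation between the synonym expression candidate and the synonym source expression contained in the generated text set, are provided;
the synonym relation specifying processing function includes a synonym candidate detection processing function that searches for and specifies, from the text set collected by the text collection unit, a time interval in which many of the synonym expression candidates are detected and a time interval in which many of the synonym source expressions are detected, and a synonym period specifying processing function that, based on the positional relationship between the time interval in which the synonym expression candidate is detected in the text set and the time interval in which the synonym source expression is detected in the text set, and on the detection frequency, determines and specifies as a synonym period the time interval in which the synonym expression candidate and the synonym source expression are in a synonym relation;
and a computer provided in the synonym relation determination specifying means is caused to realize each of these processing functions.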
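First, the concept of a synonym relation is clarified and the basic configuration of the first embodiment is described; the first embodiment is then described in further detail.
In the first embodiment, the synonym relation between two words (natural-language words or phrases) is determined in association with a period.
Here, a synonym relation holds between a synonym source, which serves as the seed expression, and a synonym candidate, which is an expression that may be synonymous with that synonym source. For example, the natural-language terms “日本電気”, “NEC”, and “日電” are synonymous as words. If “日本電気” is taken as the synonym source, that is, the seed expression, then “NEC” and “日電” are synonym candidates.
When a company name is written in alphabetic characters, electronic conversations may refer to it with masked characters in connection with some news item. In such a case, the masked form appears during the period in which appearances of the synonym source increase because of the news, and the two become synonymous. Synonymy arising from news of this kind may end within a few hours.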
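In the first embodiment, as shown in FIG. 1, the synonym relation determination device 101 comprises a synonym expression candidate recording unit 10, in which one predetermined synonym source expression and a plurality of synonym expression candidates subject to the synonym relation are recorded in association with each other, and synonym relation determination specifying means 12, which determines and specifies, based on a certain criterion, the synonym relation between the synonym expression candidate and the synonym source expression in externally input text.
The synonym expression candidate recording unit 10 is also provided with a synonym candidate generation unit 10A, which receives a seed expression for generating synonym expression candidates and generates synonym candidates from that seed expression.
The synonym dictionary 5 is a dictionary that records expressions in a synonym relation; the start and end times of the synonym relation can be registered along with them.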
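A dictionary that stores start and end times alongside each synonym pair could be modeled as follows. This is only a sketch under stated assumptions: the class name `SynonymEntry`, the `lookup` helper, the half-open interval convention, and the dates are all hypothetical and do not come from the patent.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class SynonymEntry:
    candidate: str                   # synonym candidate EW, e.g. a masked form
    source: str                      # synonym source OW it currently refers to
    start: datetime                  # start time of the synonym relation
    end: Optional[datetime] = None   # None while the relation is ongoing

    def holds_at(self, when: datetime) -> bool:
        """True if the relation is in force at the given time (end exclusive)."""
        return self.start <= when and (self.end is None or when < self.end)


def lookup(entries: List[SynonymEntry], candidate: str, when: datetime) -> List[str]:
    """Return the synonym sources registered for candidate at the given time."""
    return [e.source for e in entries
            if e.candidate == candidate and e.holds_at(when)]


# Made-up dates, purely for illustration.
entries = [
    SynonymEntry("東○電力", "東京電力", datetime(2020, 1, 1), datetime(2020, 1, 10)),
    SynonymEntry("東○電力", "東北電力", datetime(2020, 1, 10)),
]
```

Because each entry carries its own period, the same candidate can resolve to different synonym sources depending on the time of the text being read.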
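With this configuration, the device determines the time interval PD in which a synonym relation holds.
A time interval PD is an interval delimited by a start time; in the first embodiment it is the period during which the synonym candidate EW is found by the search. When the synonym candidate EW first begins to be found in the collected text set TX, the time interval PD has a start time but is still ongoing and has no end time.
The period overlapping the time interval PD may be exactly the same period as the time interval PD, or may begin a fixed time before the time interval PD. This overlapping period literally needs only to overlap part of the time interval PD.
In any case, the synonym source search unit 18 counts how many times the synonym source OW appears at each point in time.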
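(1) Extract substrings of the seed expression to generate abbreviations.
(2) Replace part of the seed expression with a specific character to generate masked forms.
(3) Translate the seed expression into another language to generate translated expressions.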
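The three generation rules above can be sketched as follows. This is an illustrative sketch, not the generation unit itself: the function names, the minimum substring length, and the `translations` lookup table (standing in for whatever translation resource an implementation would use) are assumptions.

```python
from typing import Dict, List, Optional, Set


def substrings(seed: str, min_len: int = 2) -> List[str]:
    """Rule (1): proper substrings of the seed with at least min_len characters."""
    out = []
    for i in range(len(seed)):
        for j in range(i + min_len, len(seed) + 1):
            if j - i < len(seed):  # skip the seed itself
                out.append(seed[i:j])
    return out


def masked_forms(seed: str, mask: str = "○") -> List[str]:
    """Rule (2): replace each single character of the seed with the mask."""
    return [seed[:i] + mask + seed[i + 1:] for i in range(len(seed))]


def generate_candidates(seed: str,
                        translations: Optional[Dict[str, List[str]]] = None) -> Set[str]:
    """Union of rules (1) to (3); translations is a hypothetical lookup table."""
    candidates = set(substrings(seed)) | set(masked_forms(seed))
    if translations:
        candidates |= set(translations.get(seed, []))  # rule (3)
    return candidates


candidates = generate_candidates("東京電力", {"東京電力": ["TEPCO"]})
```

Note that masking the second character of both “東京電力” and “東北電力” yields the same candidate “東○電力”, which is exactly the ambiguity the time-based determination is meant to resolve.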
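The synonym candidate list 10A shown in FIG. 2 includes an example in which a synonym candidate EW is polysemous. For the synonym candidate EW [1], the synonym sources OW [1] to [4] are registered as candidates for the synonym relation. If the synonym candidate EW [1] is, for example, a single word for one of the four compass directions (for example “東”, east), there can be many synonym sources OW, such as company names and country names.
This synonym candidate list 10A contains synonym candidates EW [1] to [n] and synonym sources OW [1] to [n]. The synonym source OW [1], which is the same seed expression [1], may be related to a plurality of synonym candidates EW [1], [2], and [3].
The synonym relation extraction unit 20 can extract a synonym relation when the appearances of the synonym candidate EW and the synonym source OW coincide in time. Here, the appearances coincide in time when the time interval PD of the synonym source OW overlaps the time interval PD of the synonym candidate EW.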
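The overlap test between the candidate's interval and a source's interval can be sketched as below. The tuple representation, the `None` marker for an ongoing interval, the half-open convention, and the optional `lead` parameter (which lets the candidate interval begin a fixed time earlier, as the description allows) are assumptions made for this example.

```python
from typing import Optional, Tuple

# (start, end); end is None while the interval is still ongoing
Interval = Tuple[float, Optional[float]]


def overlaps(candidate: Interval, source: Interval, lead: float = 0.0) -> bool:
    """True if the two intervals share any time, treating them as half-open.

    lead widens the candidate interval so that it begins `lead` time units
    before its recorded start.
    """
    c_start, c_end = candidate[0] - lead, candidate[1]
    s_start, s_end = source
    candidate_starts_before_source_ends = s_end is None or c_start < s_end
    source_starts_before_candidate_ends = c_end is None or s_start < c_end
    return candidate_starts_before_source_ends and source_starts_before_candidate_ends


still_open = overlaps((10, 20), (15, None))  # source interval has no end yet
```

A partial overlap is enough here, which matches the statement that the overlapping period only needs to coincide with part of the time interval PD.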
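When there are a plurality of synonym sources OW in the time interval PD, the synonym relation extraction unit 20 can determine that a synonym relation has been established with the synonym source OW that has the largest number of appearances.
Here, the number of appearances of a synonym source OW in the time interval PD [5] or the time interval PD [6] is the value obtained when the synonym source search unit 18 searches for the synonym source OW in the text set TX that can be specified for each time interval PD and totals the appearances within that time interval PD.
The synonym relation extraction unit 20 can also extract a synonym relation based on values calculated from the number of appearances per unit time. Using appearances per unit time makes it possible to use, for example, the rate of change in the appearances of a single synonym source or the ratio of appearances between different synonym sources OW.
For example, FIG. 3 shows the change over time in the number of appearances per unit time of the synonym source OW [7] and the synonym source OW [8] in relation to the time interval PD [7]. In the figure, the time between the closely spaced vertical lines parallel to the vertical line marking the time interval PD [7] is the unit time.
When the number of appearances (the total) or the number of appearances per unit time is compared with that of the synonym source OW [8], the synonym source OW [7] is selected.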
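First, the synonym expression candidate recording unit 10 records, together with each synonym source OW, the synonym candidates EW, which are candidate expressions that may be in a synonym relation with that synonym source OW.
Next, the operation of the first embodiment is described with reference to FIG. 4.
The synonym relation determination specifying means 12 then determines and specifies, based on a certain criterion, the synonym relation between the synonym expression candidate and the synonym source expression contained in the generated text set (FIG. 4: steps S103 and S104 / synonym relation specifying step).
Specifically, the synonym candidate search unit 16 reads through the text set TX and detects, for example, a time interval PD in which the number of appearances per unit period of a synonym expression candidate EW stored in the synonym expression candidate recording unit 10 increases sharply.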
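One simple way to operationalize "a time interval in which appearances per unit period increase sharply" is to look for maximal runs of unit periods whose count exceeds a fixed threshold. The text does not specify the detection rule, so the threshold criterion, the function name `burst_intervals`, and the sample counts below are assumptions, not the patented method.

```python
from typing import List, Optional, Sequence, Tuple


def burst_intervals(counts: Sequence[int],
                    threshold: int) -> List[Tuple[int, Optional[int]]]:
    """Return (start, end) index pairs of maximal runs where count > threshold.

    end is exclusive; it is None when the run reaches the end of the data,
    i.e. the interval is still ongoing.
    """
    intervals: List[Tuple[int, Optional[int]]] = []
    start: Optional[int] = None
    for i, c in enumerate(counts):
        if c > threshold and start is None:
            start = i                      # a burst begins
        elif c <= threshold and start is not None:
            intervals.append((start, i))   # the burst ended before index i
            start = None
    if start is not None:
        intervals.append((start, None))    # burst still running at the end
    return intervals


# Hypothetical per-unit-period counts of one synonym candidate.
bursts = burst_intervals([1, 2, 30, 45, 3, 2, 50, 60], threshold=10)
```

A ratio against a moving baseline would be an equally plausible criterion; the fixed threshold is used here only because it keeps the example short.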
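That is, the synonym relation extraction unit 20 determines which synonym source OW the synonym candidate EW detected by the synonym candidate search unit 16 is synonymous with, and extracts the synonym relation in the time interval PD.
Next, a technique for determining the start point of the synonym relation between the synonym candidate EW and the synonym source OW is disclosed.
In this example, as shown in FIG. 1, the synonym relation determination specifying means 12 includes a synonym period start determination unit 22.
In a time interval PD in which the synonym candidate EW is found in the text set TX ordered by issue time, the synonym period start determination unit 22 determines that the synonym period with a synonym source OW has started at the point when the appearances of that synonym source OW in the text set TX satisfy a predetermined condition.
The condition may be judged to be satisfied as a result of comparing data; alternatively, depending on the specific implementation, exceptions may be defined so that, for example, the condition is judged not to be satisfied when an exception applies even if a threshold is exceeded.
In this example, the meaning of the synonym candidate EW can be understood to differ before and after the start point.
This determination of the start of the synonym period based on the number of appearances is effective for determining a synonym relation with a synonym source OW that already attracts attention in normal times.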
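To make the determination by appearance ratio, the synonym period start determination unit 22 determines that the synonym source OW with the largest ratio of the number of appearances per unit time within the time interval PD to the number of appearances per unit time before the time interval PD is in a synonym relation with the synonym candidate EW. The synonym period start determination unit 22 then takes the start point of the time interval PD as the start point of the synonym relation between the synonym candidate EW and the synonym source OW.
The in-interval process 22c calculates the number of appearances per unit time of the one or more synonym sources OW related to the synonym candidate EW during the time interval PD in which that synonym candidate EW is found in the text set TX. The pre-interval process 22d calculates the number of appearances per unit time of each of those synonym sources OW before the time interval PD.
That is, when the number of appearances during the time interval PD is large compared with the number before the start of the time interval PD, the ratio determination process 22e determines that the synonym period started at the start point of the time interval PD of the synonym candidate EW.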
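First, the text collection unit 14 searches for text via the network 96 at a predetermined interval or time (search time) and generates the text set TX (FIG. 7: step S201). The text collection unit 14 further specifies the issue time of each text (FIG. 7: step S202).
When the synonym candidate EW and the time interval PD [7] have been specified, the synonym source search unit 18 records, as shown in the synonym source table 10B of FIG. 6, the number of appearances (100) of the synonym source OW [8] in the interval before the time interval PD [7] and the number of appearances (400) of the synonym source OW [8] during the time interval PD [7].
The pre-interval process 22d then calculates the number of appearances (100) of each of those synonym sources OW aggregated over a comparable fixed time before the time interval PD (FIG. 8: step S217) and records it in the synonym source table 10B.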
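The ratio determination can be illustrated with the figures recorded for the synonym source OW [8]: 100 appearances before the time interval and 400 during it give a ratio of 4.0. In the sketch below, the function names, the handling of a zero count before the interval, and the values for OW [7] are assumptions added for the example; only the 100 and 400 come from the text. Counts aggregated over windows of equal length are used, so their ratio equals the ratio of per-unit-time rates.

```python
from typing import Dict, Tuple


def appearance_ratio(before: float, during: float) -> float:
    """Ratio of appearances during the interval to appearances before it."""
    if before == 0:
        # Assumption: a source unseen before the interval counts as an
        # unbounded increase if it appears at all.
        return float("inf") if during > 0 else 0.0
    return during / before


def pick_by_ratio(table: Dict[str, Tuple[float, float]]) -> str:
    """Return the synonym source whose (before, during) ratio is largest."""
    return max(table, key=lambda ow: appearance_ratio(*table[ow]))


# (before, during) counts; the OW[7] figures are hypothetical.
table = {"OW[7]": (300, 450), "OW[8]": (100, 400)}
chosen = pick_by_ratio(table)
```

With these numbers OW [7] has more appearances in absolute terms but a smaller ratio, which shows how the ratio criterion can select a different synonym source than the raw count does.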
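[Second Embodiment]
First, in the second embodiment shown in FIG. 9, the synonym relation determination device 102 is characterized in that, in addition to the components of the first embodiment disclosed in FIG. 1, it includes a synonym period end determination unit 24 that determines the end of the synonym period.
First, the synonym expression generation unit 30 generates synonym candidates EW from the seed expression and records them in the synonym expression candidate recording unit 10 (FIG. 10: step S221).
The synonym period start determination unit 22 then determines which synonym source OW the synonym candidate EW detected by the synonym candidate search unit 16 is synonymous with, and determines the start point of the synonym relation (FIG. 10: step S223).
In this embodiment, the end time of the synonym relation can be output, so the time interval PD in which the synonym relation holds can be determined correctly.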
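[Third Embodiment]
The third embodiment is characterized in that only one synonym source is assumed within a time interval PD and that the start, replacement, and end of the synonym relation are determined.
Among the synonym sources OW whose appearances exceed a predetermined start threshold, the start determination process 20c determines, for the synonym source OW with the most appearances, that its synonym relation with the synonym candidate EW started at the point when the start threshold was exceeded.
With this configuration, the meaning of a polysemous synonym candidate EW can be narrowed down to the most probable meaning.
The rest of the configuration is the same as in the first embodiment described above.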
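Assume now that “東京電力” (Tokyo Electric Power) and “東北電力” (Tohoku Electric Power) are given as seed expressions (synonym sources), as shown in FIG. 12.
In practice, what “東○電力” refers to can change between “東京電力” and “東北電力” depending on the time.
First, the synonym candidate generation unit 30 generates a plurality of synonym candidates EW by, for example, keeping the first character of each morpheme in the seed expression to generate an abbreviation, or replacing one character in the seed expression with “○” to generate a masked form.
Next, the synonym candidate search unit 16 and the synonym source search unit 18 detect the points in time at which the synonym candidate EW and the synonym source OW appear (in bursts) in the text set TX. Each text in the text set TX is given an issue time, such as a crawl time or a posting time, and the points at which the synonym candidate EW and the synonym source OW appear are detected on that basis.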
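Counting appearances against the issue time attached to each text can be sketched as a simple bucketing step. The one-hour unit time, the `(issued_at, body)` tuple layout, the function name `count_per_hour`, and the sample texts are assumptions for illustration; plain substring counting stands in for whatever matching the search units actually perform.

```python
from collections import Counter
from datetime import datetime
from typing import Dict, Iterable, Tuple


def count_per_hour(texts: Iterable[Tuple[datetime, str]],
                   expression: str) -> Dict[datetime, int]:
    """Count non-overlapping appearances of expression per one-hour bucket.

    texts yields (issue time, body) pairs; buckets are keyed by the start
    of the hour in which the text was issued.
    """
    counts: Counter = Counter()
    for issued_at, body in texts:
        n = body.count(expression)
        if n:
            bucket = issued_at.replace(minute=0, second=0, microsecond=0)
            counts[bucket] += n
    return dict(counts)


# Made-up texts and issue times.
texts = [
    (datetime(2020, 1, 1, 9, 5), "東○電力の発表、東○電力が..."),
    (datetime(2020, 1, 1, 9, 40), "東京電力について"),
    (datetime(2020, 1, 1, 10, 15), "今日の東○電力"),
]
per_hour = count_per_hour(texts, "東○電力")
```

Running the same function once for the synonym candidate and once for each synonym source gives the per-unit-time series that the start, replacement, and end determinations operate on.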
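Thus, when appearances of the synonym candidate EW decrease and the synonym relation can no longer be said to hold, the end of the synonym relation can be determined together with the end time.
The processing shown in FIG. 14 differs from the information processing described above in how thresholds and termination are handled, but the outline of the information processing is the same.
Here, the appearances of each synonym source OW per unit time are assumed to have been calculated by the flowchart shown in FIG. 7 and stored in the synonym source table 10B or the like.
After the synonym relation starts, when the appearances of the synonym source OW [10] (Tokyo Electric Power), called the current appearances, fall below the appearances of another synonym source OW [11] (Tohoku Electric Power), called the other appearances (FIG. 14: step S305), the replacement process 20d determines that the synonym relation ended at that point and that a synonym relation has newly started for the synonym source OW with the most appearances (FIG. 14: step S306). If the current appearances remain greater than the other appearances (FIG. 14: step S305), the replacement process 20d is not executed and processing moves on to the end determination.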
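In FIG. 11 described above, the synonym relation determination device 103 of Example 3 has a display device 95 for displaying data attached to the synonym relation determination specifying means 12. The synonym relation determination specifying means 12 includes a display control unit 26.
Information processing common to the synonym relation determination devices 101, 102, and 103 of Examples 1 to 3 is now described with reference to hardware resources.
Information processing by the synonym relation determination devices 101, 102, and 103 in this embodiment is concrete means in which software and hardware resources cooperate to compute or process information according to the intended use.
As hardware resources, a computer 80 that performs the information processing is provided, as shown in FIG. 15. The computer 80 has arithmetic means 82, which is a central processing unit (CPU), and main storage means 86, which provides a storage area for the arithmetic means 82. The computer 80 generally has peripheral devices connected through a data bus and an input/output interface. The peripheral devices are typically communication means 88, external storage means 90, input means 92, and output means 94. The whole, including the peripheral devices, is sometimes also called the computer 80.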
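A synonym relation determination device comprising a synonym expression candidate recording unit 10, in which one predetermined synonym source expression and a plurality of synonym expression candidates subject to the synonym relation are recorded in association with each other, and synonym relation determination specifying means 12 that determines and specifies, based on a certain criterion, the synonym relation between the synonym expression candidate and the synonym source expression in externally input text,
wherein the synonym relation determination specifying means 12 comprises:
a text collection unit 14 that collects the externally input text and, based on it, generates a text set whose issue times can be specified;
a synonym candidate detection unit 12A that specifies and outputs, from the text set collected by this text collection unit 14, a time interval in which many of the synonym expression candidates are detected and a time interval in which many of the synonym source expressions are detected; and
a synonym period specifying unit 12B that, based on the positional relationship between the time interval in which the synonym expression candidate is detected in the text set and the time interval in which the synonym source expression is detected in the text set, and on the detection frequency, determines and specifies as a synonym period the time interval in which the synonym expression candidate and the synonym source expression are in a synonym relation.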
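The synonym relation determination device according to Supplementary Note 1,
wherein the synonym candidate detection unit 12A includes a synonym candidate search unit 16 that detects and counts the synonym expression candidates in the text set collected by the text collection unit, whose issue times can be specified, and specifies a time interval with a large number of appearances per unit time (as a time interval in which the synonym expression candidate exists), and a synonym source search unit 18 that likewise detects and counts the synonym source expressions in that text set and specifies a time interval with a large number of appearances per unit time (as a time interval in which the synonym source expression exists).
The synonym relation determination device according to Supplementary Note 1 or 2,
wherein the synonym period specifying unit 12B includes a synonym relation extraction unit 20 that extracts, as the synonym source in the synonym relation, at least the synonym source expression with the most appearances in the text set during the time interval in which the synonym candidate detection means detected the synonym expression candidate, and a synonym period start determination unit 22 that determines that the extracted synonym source expression is in a synonym relation with the synonym expression candidate and registers the start point of the time interval in which the two are in the synonym relation, as the start point of the synonym period, together with the synonym period in a synonym dictionary provided in advance.
The synonym relation determination device according to Supplementary Note 3,
wherein the synonym period start determination unit 22 also has a function of determining that, within the text set in the time interval in which the synonym expression candidate is detected, the synonym source expression with the largest ratio of the number of appearances per unit time in the time interval to the number of appearances per unit time before the time interval is in a synonym relation with the synonym expression candidate.
The synonym relation determination device according to Supplementary Note 1, 2, 3, or 4,
wherein the synonym period specifying unit 12B
includes a synonym period end determination unit 24 that determines that the synonym relation has been dissolved at the point when the number of appearances per unit period of the synonym expression candidate determined and specified as being in the synonym relation by the synonym period start determination unit 22 of the synonym period specifying unit 12B falls to or below a preset threshold.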
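A synonym relation determination method for a synonym relation determination device comprising a synonym expression candidate recording unit 10, in which one predetermined synonym source expression and a plurality of synonym expression candidates subject to the synonym relation are recorded in association with each other, and synonym relation determination specifying means 12 that determines and specifies the synonym relation between the synonym expression candidate and the synonym source expression in externally input text, wherein:
the text collection unit of the synonym relation determination specifying means 12 collects the externally input text and, based on it, generates a text set whose issue times can be specified (text set generation step);
the synonym relation determination specifying means 12 determines and specifies, based on a certain criterion, the synonym relation between the synonym expression candidate and the synonym source expression contained in the generated text set (synonym relation specifying step);
in the step of specifying the synonym relation,
the synonym candidate detection unit 12A of the synonym relation determination specifying means 12 searches for and specifies, from the text set, a time interval in which many of the synonym expression candidates are detected and a time interval in which many of the synonym source expressions are detected (synonym candidate detection step);
and subsequently, based on the positional relationship between the time interval in which the synonym expression candidate is detected in the text set and the time interval in which the synonym source expression is detected in the text set, and on the detection frequency, the synonym period specifying unit 12B of the synonym relation determination specifying means 12 determines and specifies as a synonym period the time interval in which the synonym expression candidate and the synonym source expression are in a synonym relation (synonym period specifying step).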
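The synonym relation determination method according to Supplementary Note 6,
wherein, in the step of detecting the synonym candidate,
the synonym expression candidates are detected and counted in the text set collected by the text collection unit, and a time interval with a large number of appearances per unit time is extracted and specified (synonym-candidate time interval specifying step);
before or after this, the synonym source expressions are likewise detected and counted in the text set collected by the text collection unit, and a time interval with a large number of appearances per unit time is extracted and specified (synonym-source time interval specifying step);
and the synonym candidate detection unit 12A performs the operations of these time interval specifying steps.
The synonym relation determination method according to Supplementary Note 6,
wherein, in the step of specifying the synonym period,
at least the synonym source expression with the most appearances in the text set during the time interval in which the synonym expression candidate was detected in the synonym candidate detection step is extracted as the synonym source in the synonym relation (synonym relation extraction step);
before or after this, the extracted synonym source expression is determined to be in a synonym relation with the synonym expression candidate, and the start point of the time interval in which the two are in the synonym relation is registered, as the start point of the synonym period, together with the synonym period in a synonym dictionary provided in advance (synonym period registration step);
and the synonym interval specifying means 12B performs the operations of these extraction and registration steps.
The synonym relation determination method according to Supplementary Note 6, 7, or 8,
wherein the number of appearances per unit period of the synonym expression candidate determined by the synonym period specifying unit 12B to be in the synonym relation is counted continuously, and when that number falls to or below a preset threshold, the synonym period end determination unit 24 of the synonym period specifying unit 12B determines that the synonym relation has been dissolved.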
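A synonym relation determination program for a synonym relation determination device comprising a synonym expression candidate recording unit 10, in which one predetermined synonym source expression and a plurality of synonym expression candidates subject to the synonym relation are recorded in association with each other, and synonym relation determination specifying means 12 that determines and specifies the synonym relation between the synonym expression candidate and the synonym source expression in externally input text,
the program providing a text set generation processing function that collects externally input text and generates a text set whose issue times can be specified,
and a synonym relation specifying processing function that determines and specifies, based on a certain criterion, the synonym relation between the synonym expression candidate and the synonym source expression contained in the generated text set,
wherein the synonym relation specifying processing function includes
a synonym candidate detection processing function that searches for and specifies, from the text set collected by the text collection unit, a time interval in which many of the synonym expression candidates are detected and a time interval in which many of the synonym source expressions are detected,
and a synonym period specifying processing function that, based on the positional relationship between the time interval in which the synonym expression candidate is detected in the text set and the time interval in which the synonym source expression is detected in the text set, and on the detection frequency, determines and specifies as a synonym period the time interval in which the synonym expression candidate and the synonym source expression are in a synonym relation,
the program causing a computer provided in the synonym relation determination specifying means 12 to realize each of these processing functions.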
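The synonym relation determination program according to Supplementary Note 10,
wherein the synonym candidate detection processing function comprises
a synonym-candidate time interval specifying processing function that detects and counts the synonym expression candidates in the text set collected by the text collection unit, whose issue times can be specified, and extracts and specifies a time interval with a large number of appearances per unit time,
and a synonym-source time interval specifying processing function that detects and counts the synonym source expressions in that text set and extracts and specifies a time interval with a large number of appearances per unit time,
the program causing a computer provided in the synonym relation determination specifying means 12 to realize each of these time interval specifying processing functions.
The synonym relation determination program according to Supplementary Note 10,
wherein the synonym period specifying processing function comprises
a synonym relation extraction processing function that extracts, as the synonym source in the synonym relation, at least the synonym source expression with the most appearances in the text set during the time interval in which the synonym candidate detection processing function detected the synonym expression candidate,
and a synonym period registration processing function that, before or after this, determines that the extracted synonym source expression is in a synonym relation with the synonym expression candidate and registers the start point of the time interval in which the two are in the synonym relation, as the start point of the synonym period, together with the synonym period in a synonym dictionary provided in advance,
the program causing a computer provided in the synonym relation determination specifying means 12 to realize each of these processing functions.
The synonym relation determination program according to Supplementary Note 10, 11, or 12,
wherein the synonym relation specifying processing function
comprises a synonym period end determination function that continuously counts the number of appearances per unit period of the synonym expression candidate determined to be in the synonym relation and, when that number falls to or below a preset threshold, determines that the synonym relation was dissolved at that point,
the program causing a computer provided in the synonym relation determination specifying means 12 to realize this function.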
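12 Synonym relation determination specifying means
12A Synonym candidate detection unit
12B Synonym period specifying unit
14 Text collection unit
16 Synonym candidate search unit
18 Synonym source search unit
18a Multiple appearance process
20 Synonym relation extraction unit
22 Synonym period start determination unit
24 Synonym period end determination unit
26 Display control unit
30 Synonym candidate generation unit
32 Synonym dictionary
EW Synonym candidate
OW Synonym source
PD Time interval
D Synonym data by time division
TX Text set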
Claims (10)
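- A synonym relation determination device comprising a synonym expression candidate recording unit in which one predetermined synonym source expression and a plurality of synonym expression candidates subject to the synonym relation are recorded in association with each other, and synonym relation determination specifying means that determines and specifies, based on a certain criterion, the synonym relation between the synonym expression candidate and the synonym source expression in externally input text,
wherein the synonym relation determination specifying means comprises:
a text collection unit that collects the externally input text and, based on it, generates a text set whose issue times can be specified;
a synonym candidate detection unit that specifies and outputs, from the text set collected by this text collection unit, a time interval in which many of the synonym expression candidates are detected and a time interval in which many of the synonym source expressions are detected; and
a synonym period specifying unit that, based on the positional relationship between the time interval in which the synonym expression candidate is detected in the text set and the time interval in which the synonym source expression is detected in the text set, and on the detection frequency, determines and specifies as a synonym period the time interval in which the synonym expression candidate and the synonym source expression are in a synonym relation.
- The synonym relation determination device according to claim 1,
wherein the synonym candidate detection unit includes a synonym candidate search unit that detects and counts the synonym expression candidates in the text set collected by the text collection unit, whose issue times can be specified, and specifies a time interval with a large number of appearances per unit time as a time interval in which the synonym expression candidate exists, and a synonym source search unit that likewise detects and counts the synonym source expressions in that text set and specifies a time interval with a large number of appearances per unit time as a time interval in which the synonym source expression exists.
- The synonym relation determination device according to claim 1 or 2,
wherein the synonym period specifying unit includes a synonym relation extraction unit that extracts, as the synonym source in the synonym relation, at least the synonym source expression with the most appearances in the text set during the time interval in which the synonym candidate detection unit detected the synonym expression candidate, and a synonym period start determination unit that determines that the extracted synonym source expression is in a synonym relation with the synonym expression candidate and registers the start point of the time interval in which the two are in the synonym relation, as the start point of the synonym period, together with the synonym period in a synonym dictionary provided in advance.
- The synonym relation determination device according to claim 3,
wherein the synonym period start determination unit also has a function of determining that, within the text set in the time interval in which the synonym expression candidate is detected, the synonym source expression with the largest ratio of the number of appearances per unit time in the time interval to the number of appearances per unit time before the time interval is in a synonym relation with the synonym expression candidate.
- The synonym relation determination device according to claim 1, 2, 3, or 4,
wherein the synonym period specifying unit
includes a synonym period end determination unit that determines that the synonym relation has been dissolved at the point when the number of appearances per unit period of the synonym expression candidate determined and specified as being in the synonym relation by the synonym period start determination unit of the synonym period specifying unit falls to or below a preset threshold.
- A synonym relation determination method for a synonym relation determination device comprising a synonym expression candidate recording unit in which one predetermined synonym source expression and a plurality of synonym expression candidates subject to the synonym relation are recorded in association with each other, and synonym relation determination specifying means that determines and specifies the synonym relation between the synonym expression candidate and the synonym source expression in externally input text, wherein:
the text collection unit of the synonym relation determination specifying means collects the externally input text and, based on it, generates a text set whose issue times can be specified;
the synonym relation determination specifying means determines and specifies, based on a certain criterion, the synonym relation between the synonym expression candidate and the synonym source expression contained in the generated text set;
in the step of specifying the synonym relation,
the synonym candidate detection unit of the synonym relation determination specifying means searches for and specifies, from the text set, a time interval in which many of the synonym expression candidates are detected and a time interval in which many of the synonym source expressions are detected;
and next, based on the positional relationship between the time interval in which the synonym expression candidate is detected in the text set and the time interval in which the synonym source expression is detected in the text set, and on the detection frequency, the synonym period specifying unit of the synonym relation determination specifying means determines and specifies as a synonym period the time interval in which the synonym expression candidate and the synonym source expression are in a synonym relation.
- The synonym relation determination method according to claim 6,
wherein, in the step of detecting the synonym candidate,
the synonym expression candidates are detected and counted in the text set collected by the text collection unit, and a time interval with a large number of appearances per unit time is extracted and specified;
before or after this, the synonym source expressions are likewise detected and counted in the text set collected by the text collection unit, and a time interval with a large number of appearances per unit time is extracted and specified;
and the synonym candidate detection unit performs the operations of these time interval specifying steps.
- The synonym relation determination method according to claim 6,
wherein, in the step of specifying the synonym period,
at least the synonym source expression with the most appearances in the text set during the time interval in which the synonym expression candidate was detected in the synonym candidate detection step is extracted as the synonym source in the synonym relation;
before or after this, the extracted synonym source expression is determined to be in a synonym relation with the synonym expression candidate, and the start point of the time interval in which the two are in the synonym relation is registered, as the start point of the synonym period, together with the synonym period in a synonym dictionary provided in advance;
and the synonym interval specifying means performs the operations of these extraction and registration steps.
- A synonym relation determination program for a synonym relation determination device comprising a synonym expression candidate recording unit in which one predetermined synonym source expression and a plurality of synonym expression candidates subject to the synonym relation are recorded in association with each other, and synonym relation determination specifying means that determines and specifies the synonym relation between the synonym expression candidate and the synonym source expression in externally input text,
the program providing a text set generation processing function that collects externally input text and generates a text set whose issue times can be specified,
and a synonym relation specifying processing function that determines and specifies, based on a certain criterion, the synonym relation between the synonym expression candidate and the synonym source expression contained in the generated text set,
wherein the synonym relation specifying processing function includes
a synonym candidate detection processing function that searches for and specifies, from the text set collected by the text collection unit, a time interval in which many of the synonym expression candidates are detected and a time interval in which many of the synonym source expressions are detected,
and a synonym period specifying processing function that, based on the positional relationship between the time interval in which the synonym expression candidate is detected in the text set and the time interval in which the synonym source expression is detected in the text set, and on the detection frequency, determines and specifies as a synonym period the time interval in which the synonym expression candidate and the synonym source expression are in a synonym relation,
the program causing a computer provided in the synonym relation determination specifying means to realize each of these processing functions.
- The synonym relation determination program according to claim 9,
wherein the synonym candidate detection processing function comprises
a synonym-candidate time interval specifying processing function that detects and counts the synonym expression candidates in the text set collected by the text collection unit, whose issue times can be specified, and extracts and specifies a time interval with a large number of appearances per unit time,
and a synonym-source time interval specifying processing function that detects and counts the synonym source expressions in that text set and extracts and specifies a time interval with a large number of appearances per unit time,
the program causing a computer provided in the synonym relation determination specifying means to realize each of these time interval specifying processing functions.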
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2014507891A JP6394388B2 (ja) | 2012-03-30 | 2013-03-26 | 同義関係判定装置、同義関係判定方法、及びそのプログラム |
SG11201406240WA SG11201406240WA (en) | 2012-03-30 | 2013-03-26 | Synonym relation determination device, synonym relation determination method, and program thereof |
US14/389,462 US9489370B2 (en) | 2012-03-30 | 2013-03-26 | Synonym relation determination device, synonym relation determination method, and program thereof |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2012082722 | 2012-03-30 | ||
JP2012-082722 | 2012-03-30 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2013146736A1 true WO2013146736A1 (ja) | 2013-10-03 |
Family
ID=49259987
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2013/058696 WO2013146736A1 (ja) | 2012-03-30 | 2013-03-26 | 同義関係判定装置、同義関係判定方法、及びそのプログラム |
Country Status (4)
Country | Link |
---|---|
US (1) | US9489370B2 (ja) |
JP (1) | JP6394388B2 (ja) |
SG (1) | SG11201406240WA (ja) |
WO (1) | WO2013146736A1 (ja) |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9009197B2 (en) | 2012-11-05 | 2015-04-14 | Unified Compliance Framework (Network Frontiers) | Methods and systems for a compliance framework database schema |
US10152532B2 (en) * | 2014-08-07 | 2018-12-11 | AT&T Interwise Ltd. | Method and system to associate meaningful expressions with abbreviated names |
JP6481643B2 (ja) * | 2016-03-08 | 2019-03-13 | トヨタ自動車株式会社 | 音声処理システムおよび音声処理方法 |
JP2017167851A (ja) * | 2016-03-16 | 2017-09-21 | 株式会社東芝 | 概念辞書作成装置、方法およびプログラム |
US10943075B2 (en) * | 2018-02-22 | 2021-03-09 | Entigenlogic Llc | Translating a first language phrase into a second language phrase |
US11182416B2 (en) | 2018-10-24 | 2021-11-23 | International Business Machines Corporation | Augmentation of a text representation model |
US10769379B1 (en) | 2019-07-01 | 2020-09-08 | Unified Compliance Framework (Network Frontiers) | Automatic compliance tools |
US11120227B1 (en) | 2019-07-01 | 2021-09-14 | Unified Compliance Framework (Network Frontiers) | Automatic compliance tools |
US10824817B1 (en) * | 2019-07-01 | 2020-11-03 | Unified Compliance Framework (Network Frontiers) | Automatic compliance tools for substituting authority document synonyms |
EP4205018A1 (en) | 2020-08-27 | 2023-07-05 | Unified Compliance Framework (Network Frontiers) | Automatically identifying multi-word expressions |
US20230031040A1 (en) | 2021-07-20 | 2023-02-02 | Unified Compliance Framework (Network Frontiers) | Retrieval interface for content, such as compliance-related content |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH0877178A (ja) * | 1994-09-01 | 1996-03-22 | Ibm Japan Ltd | 情報検索システム及び方法 |
JPH11312168A (ja) * | 1998-04-28 | 1999-11-09 | Nippon Telegr & Teleph Corp <Ntt> | 同義語計算装置及びその方法並びに同義語計算プログラムを記録した媒体 |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP3547069B2 (ja) * | 1997-05-22 | 2004-07-28 | 日本電信電話株式会社 | 情報関連づけ装置およびその方法 |
JP2003296354A (ja) | 2002-03-29 | 2003-10-17 | Mitsubishi Electric Corp | 辞書作成装置 |
US7636714B1 (en) | 2005-03-31 | 2009-12-22 | Google Inc. | Determining query term synonyms within query context |
US7925498B1 (en) | 2006-12-29 | 2011-04-12 | Google Inc. | Identifying a synonym with N-gram agreement for a query phrase |
US8037086B1 (en) | 2007-07-10 | 2011-10-11 | Google Inc. | Identifying common co-occurring elements in lists |
US8001136B1 (en) | 2007-07-10 | 2011-08-16 | Google Inc. | Longest-common-subsequence detection for common synonyms |
US9092517B2 (en) | 2008-09-23 | 2015-07-28 | Microsoft Technology Licensing, Llc | Generating synonyms based on query log data |
US8612202B2 (en) | 2008-09-25 | 2013-12-17 | Nec Corporation | Correlation of linguistic expressions in electronic documents with time information |
Date | Office | Application number | Publication | Status |
---|---|---|---|---|
2013-03-26 | WO | PCT/JP2013/058696 | WO2013146736A1 (ja) | Active, Application Filing |
2013-03-26 | SG | SG11201406240WA | SG11201406240WA (en) | Unknown |
2013-03-26 | US | US14/389,462 | US9489370B2 (en) | Active |
2013-03-26 | JP | JP2014507891A | JP6394388B2 (ja) | Active |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH0877178A (ja) * | 1994-09-01 | 1996-03-22 | Ibm Japan Ltd | Information retrieval system and method |
JPH11312168A (ja) * | 1998-04-28 | 1999-11-09 | Nippon Telegr & Teleph Corp <Ntt> | Synonym calculation device and method, and medium recording a synonym calculation program |
Non-Patent Citations (1)
Title |
---|
MASAAKI OKUBO ET AL.: "Extracting Information Demand by Analyzing a WWW Search Log", TRANSACTIONS OF INFORMATION PROCESSING SOCIETY OF JAPAN, vol. 39, no. 7, 15 July 1998 (1998-07-15), pages 2250 - 2258, XP008171097 * |
Also Published As
Publication number | Publication date |
---|---|
JP6394388B2 (ja) | 2018-09-26 |
US9489370B2 (en) | 2016-11-08 |
SG11201406240WA (en) | 2014-11-27 |
US20150066478A1 (en) | 2015-03-05 |
JPWO2013146736A1 (ja) | 2015-12-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP6394388B2 (ja) | Synonym relation determination device, synonym relation determination method, and program therefor | |
US9317498B2 (en) | Systems and methods for generating summaries of documents | |
US9600466B2 (en) | Named entity extraction from a block of text | |
JP4241934B2 (ja) | Text processing and retrieval system and method | |
US10558754B2 (en) | Method and system for automating training of named entity recognition in natural language processing | |
KR101713831B1 (ko) | Document recommendation device and method | |
WO2010014082A1 (en) | Method and apparatus for relating datasets by using semantic vectors and keyword analyses | |
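CN103365924A (zh) | Method, device, and terminal for searching information | |
CN102737021B (zh) | Search engine and implementation method thereof | |
CN104978332B (zh) | Method and device for generating user-generated content tag data, and related methods and devices | |
JP2016529619A (ja) | Browsing images via mined hyperlinked text snippets | |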
EP2870549A1 (en) | Weight-based stemming for improving search quality | |
CN102722501A (zh) | Search engine and implementation method thereof | |
US11887011B2 (en) | Schema augmentation system for exploratory research | |
CN112989208A (zh) | Information recommendation method, device, electronic equipment, and storage medium | |
US8037403B2 (en) | Apparatus, method, and computer program product for extracting structured document | |
Shah et al. | DOM-based keyword extraction from web pages | |
KR101375221B1 (ko) | Medical process modeling and verification method | |
WO2019231635A1 (en) | Method and apparatus for generating digest for broadcasting | |
Tsapatsoulis | Web image indexing using WICE and a learning-free language model | |
Al-Hamami et al. | Development of an opinion blog mining system | |
Shah et al. | WebRank: Language-Independent Extraction of Keywords from Webpages | |
Zuo et al. | Cross-Genre Retrieval for Information Integrity: A COVID-19 Case Study | |
Luo et al. | Improving keyphrase extraction from web news by exploiting comments information | |
Jin et al. | Labelling topics in weibo using word embedding and graph-based method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 13768820; Country of ref document: EP; Kind code of ref document: A1 |
 | ENP | Entry into the national phase | Ref document number: 2014507891; Country of ref document: JP; Kind code of ref document: A |
 | NENP | Non-entry into the national phase | Ref country code: DE |
 | WWE | WIPO information: entry into national phase | Ref document number: 14389462; Country of ref document: US |
 | 122 | EP: PCT application non-entry in European phase | Ref document number: 13768820; Country of ref document: EP; Kind code of ref document: A1 |