WO2010035455A1 - Information analysis apparatus, information analysis method, and program - Google Patents
Information analysis apparatus, information analysis method, and program
- Publication number
- WO2010035455A1 (application PCT/JP2009/004752)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- series data
- time
- section
- document
- sections
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3347—Query execution using vector based model
Definitions
- The present invention relates to an information analysis apparatus, an information analysis method, and a program for analyzing a document set.
- This application claims priority based on Japanese Patent Application No. 2008-244753, filed in Japan on September 24, 2008, the contents of which are incorporated herein.
- Non-Patent Document 1 discloses a technique for obtaining a similarity between two documents in order to group similar documents and organize texts.
- The similarity between two documents is defined by an expression using the number of index words (a kind of language expression) that appear in common in both documents.
- As the similarity between two document sets (clusters), the maximum value among the similarities between documents belonging to each document set is used, and the pair of document sets (cluster pair) with the highest similarity is merged into one group.
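- The merging step described above can be sketched as follows. This is only an illustration: the source does not give the actual similarity expression, so the Dice coefficient over shared index words is an assumption, as are the names `doc_similarity`, `cluster_similarity`, and `merge_most_similar`.

```python
from itertools import combinations


def doc_similarity(a, b):
    """Similarity of two documents given as sets of index words.

    The Dice coefficient is an assumed stand-in for the expression
    used in Non-Patent Document 1, which is not reproduced here.
    """
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def cluster_similarity(c1, c2):
    """Maximum document-to-document similarity across two clusters."""
    return max(doc_similarity(a, b) for a in c1 for b in c2)


def merge_most_similar(clusters):
    """Merge the most similar cluster pair; needs at least two clusters."""
    i, j = max(
        combinations(range(len(clusters)), 2),
        key=lambda p: cluster_similarity(clusters[p[0]], clusters[p[1]]),
    )
    merged = clusters[i] + clusters[j]
    return [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]


clusters = [
    [{"engine", "diesel", "fuel"}],
    [{"diesel", "fuel", "price"}],
    [{"school", "student"}],
]
result = merge_most_similar(clusters)
```

- Repeating `merge_most_similar` until one cluster remains, or until the best similarity falls below a chosen cutoff, would give a grouping of the whole document collection.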
- “Language expression” means a description representing a specific noun, topic, opinion, or thing included in a document (text).
- Examples of the “language expression” include a noun expression, expressed by a so-called noun such as an event name or a product name, and a combination of a noun expression with a predicate or a modifier.
- Specific examples of noun expressions include “race games”, “food disguise”, and “earthquake resistant gel”.
- Specific examples of combined expressions include “earthquake resistant gel is effective” and “diesel engine is good for the environment”.
- The “language expression” may be a character string itself appearing in a document, or it may be an analysis result obtained by applying an existing natural language processing technology, such as morphological analysis, syntax analysis, dependency analysis, or synonym processing, to the document.
- For example, “school” and “student” are language expressions each consisting of one word.
- A relationship between words such as “school → go”, obtained by performing dependency analysis on a text such as “go to school” or one of its variant phrasings, is a result of dependency analysis and is also a language expression representing a single meaning.
- Document data analysis is also performed by examining the temporal transition of the number of documents that include a specific language expression. This is described below.
- Non-Patent Document 2 discloses a technique called “Blog Watcher”.
- In this technique, time-series changes, such as the number of times a specific topic word has appeared in all of the collected blogs, the number of times the topic word has been described positively, and the number of times it has been described negatively, are plotted as line graphs.
- The user can thus examine how the number of appearances of a topic word of interest changes over time in blogs, and analyze how popular the topic word was at each point in time.
- Regression analysis is a basic method of statistical analysis. When there are multiple sets of time-series data, such as the number of occurrences or the price of each event at each point in time, it detects highly related events by examining the correlation between the time changes of those time-series data. For example, when the time change of one stock price correlates with the time change of another stock price, regarding the price of each stock as time-series data and performing regression analysis makes it possible to calculate how strongly the two prices were related.
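- As a hedged illustration of the stock price example, the sketch below fits an ordinary least-squares line between two price series and reports the coefficient of determination. The function name `linear_regression` and the price values are hypothetical, and the code assumes neither series is constant.

```python
def linear_regression(x, y):
    """Least-squares fit y = slope * x + intercept, plus R squared.

    Assumes x and y have equal length and that neither is constant.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Hypothetical daily closing prices of two stocks.
stock_a = [100, 102, 105, 103, 108]
stock_b = [50, 51, 53, 52, 55]
slope, intercept, r_squared = linear_regression(stock_a, stock_b)
```

- An R squared close to 1 indicates that the two series moved together over the period, which is exactly the kind of result that may still be a coincidence.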
- Here, suppose that the event of interest is an event expressed by a specific language expression.
- Even when a document set of documents with time information is given as the analysis target instead of direct time-series data such as stock prices, time-series data for each language expression can be obtained by using the technique disclosed in Non-Patent Document 2.
- That is, if the document set that is the analysis population is divided by a specific period using the time information, the number of documents including each language expression, or the number of appearances of the language expression, in each period becomes the time-series data of that language expression.
- With the technique of Non-Patent Document 2, if two document sets with time information are converted into two time-series data and the correlation between the two is then examined by statistical analysis such as regression analysis, the relevance between the two document sets can be obtained. In this case, whether the same or similar language expressions exist in the two document sets is irrelevant.
- The two document sets with time information are regarded as time-series data, and the degree of relevance between them is obtained from the similarity and correlation of their change patterns.
- FIG. 2 is a diagram illustrating an example of time-series data, as will be described later.
- In this example, two peaks exist at the same times in time-series data (1) and time-series data (2). Judging only from the time-series data shown in the figure, high relevance is therefore recognized.
- In some cases there really is a causal relationship between time-series data (1) and time-series data (2), in which one causes the change of the other, and judging the relevance to be high is appropriate.
- However, it is also possible that the two peaks of time-series data (1) are independent peaks due to two different causes, while the two peaks of time-series data (2) are periodic peaks due to yet another cause. That is, the peak sections of time-series data (1) and time-series data (2) may coincide by chance.
- Thus, with the technique of Non-Patent Document 2, in which two document sets with time information are converted into two time-series data and the correlation between them is examined by statistical analysis such as regression analysis, it is in some cases difficult to determine whether a correlation is due to coincidence or reflects genuine relevance.
- Using the technique of Non-Patent Document 1, a method is also conceivable in which the similarity between the document set from which one time-series data originates and the document set from which the other originates is obtained, and the degree of association between the time-series data is determined from that similarity.
- In this case, the similarity between the two document sets is calculated based on the degree to which the same or similar language expressions appear in both document sets.
- An object of the present invention is to solve the above problem and to provide an information analysis device, an information analysis method, and a program that, when determining the relevance of a plurality of document sets with time information, can suppress the influence of the change patterns of the time-series data obtained from each document set coinciding by chance.
- An information analysis apparatus according to an aspect of the present invention performs information analysis on a document set including documents to which time information is added, and includes: a corresponding section selection unit that compares with each other a plurality of time-series data generated, based on the time information, for each of a plurality of the document sets, and selects from each time-series data two or more sections that change in correspondence with two or more sections of the other time-series data; a feature extraction unit that, for each of the plurality of time-series data, identifies the documents belonging to each of the selected two or more sections and extracts the features of the identified documents for each section; a comparison unit that, for each time-series data, obtains the inter-feature distance between the feature extracted from one of the selected sections and the feature extracted from another, and compares the inter-feature distances obtained for the respective time-series data; and a relevance calculation unit that calculates the relevance between the document sets based on the result of the comparison by the comparison unit.
- An information analysis method according to an aspect of the present invention is a method for performing information analysis on a document set including documents to which time information is given, and includes the step of (a) comparing with each other a plurality of time-series data generated, based on the time information, for each of a plurality of the document sets, and selecting from each time-series data two or more sections that change in correspondence with two or more sections of the other time-series data.
- A program according to an aspect of the present invention causes a computer to perform information analysis on a document set including documents to which time information is added, and causes the computer to execute the same step (a).
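- According to the present invention, when the relevance of a plurality of document sets with time information is determined, the influence of the change patterns of the time-series data obtained from each document set coinciding by chance can be suppressed.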
- FIG. 1 is a block diagram showing a schematic configuration of the information analysis apparatus according to Embodiment 1 of the present invention.
- FIG. 2 is a diagram illustrating an example of time-series data.
- FIG. 3 is a diagram illustrating an example of time-series data.
- FIG. 4 is a diagram illustrating an example of time-series data.
- FIG. 5 is a diagram illustrating an example of time-series data.
- FIG. 6 is a diagram illustrating an example of time-series data that varies due to a common cause.
- FIG. 7 is a diagram illustrating another example of time-series data that varies due to a common cause.
- FIG. 8 is a diagram illustrating another example of time-series data that varies due to different causes.
- FIG. 9 is a flowchart showing the flow of processing in the information analysis method according to Embodiment 1 of the present invention.
- FIG. 10 is a block diagram showing a schematic configuration of the information analysis apparatus according to Embodiment 2 of the present invention.
- FIG. 11 is a flowchart showing the flow of processing in the information analysis method according to Embodiment 2 of the present invention.
- FIGS. 2 to 5 are diagrams showing examples of time-series data.
- The information analysis apparatus 1 shown in FIG. 1 performs information analysis on a document set including documents to which time information is assigned. As illustrated in FIG. 1, the information analysis apparatus 1 includes a corresponding section selection unit 30, a feature extraction unit 40, a comparison unit 50, and a relevance calculation unit 70.
- The document set to be analyzed is composed of a plurality of text data to which time information is added, and is input to the information analysis apparatus 1 from the outside.
- The information analysis apparatus 1 further includes an input unit 10, a time-series data generation unit 20, and an output unit 80.
- A database 60 is connected to the information analysis apparatus 1.
- The database 60 is used for processing by the comparison unit 50, as described later. In the following, a case will be described in which two document sets are input and two time-series data that change correspondingly are generated.
- The input unit 10 accepts input of a plurality of document sets to be analyzed.
- Specifically, the document data constituting each document set is input to the input unit 10.
- The document data may be input directly to the input unit 10 from an external computer device via a network, or may be provided stored on a recording medium.
- In the former case, an interface for connecting the information analysis apparatus 1 to the outside is used as the input unit 10.
- In the latter case, a reading device is used as the input unit 10.
- “Time information” in the present invention means time-related information, such as a date and time, assigned to each document belonging to the input document set.
- As the time information, information directly related to each document, such as its creation date and time, transmission date and time, or publication date and time, can be used.
- Alternatively, time information related to the matters and cases handled in the document can be used. Specific examples include the incoming call date and time recorded in a response record created at a call center, and the accident occurrence date and time recorded in a police accident record.
- A plurality of pieces of time information may be given to one document.
- In that case, the time-series data generation unit 20 described later uses one of them as the unique time information for the document.
- For this purpose, the time-series data generation unit 20 extracts only time information of a preset type.
- The time information may be in any format that allows the documents in the input document set to be ordered over time, such as year, month, and day; year, month, and day combined with a time of day; or year and month only.
- Examples of document sets to be input include a set of blog articles including the language expression “I bought candy A” (or a synonymous expression) and a set of blog articles including the language expression “dance of Idol B” (or a synonymous expression). In this case, the date of each blog article serves as the time information.
- The time-series data generation unit 20 generates a plurality of time-series data, one for each document set, from the plurality of document sets received by the input unit 10, based on the time information.
- A document set may be directly input to the information analysis apparatus 1.
- Here, two document sets are input, and the time-series data generation unit 20 generates two time-series data.
- Hereinafter, the time-series data generated from the input document set (1) is referred to as “time-series data (1)”, and the time-series data generated from the input document set (2) is referred to as “time-series data (2)”.
- “Time-series data” refers to data obtained by dividing time into fixed periods and arranging, in order of time, an arbitrary counting result for each divided section or for a specific point in each section, such as its start or midpoint.
- Although it is not time-series data generated from a document set, the stock price of a certain company for each date is a typical example of time-series data.
- In this case, the fixed period is one day.
- The time change of temperature, the time change of traffic on a specific road, and the like are likewise examples of time-series data that are not generated from a document set.
- The time-series data generation unit 20 generates time-series data from a document set as follows.
- First, based on the time information, the document set is divided by a certain fixed period to create multiple subsets.
- The length of the fixed period is not particularly limited, and is set appropriately according to the use and purpose of the information analysis apparatus 1 and the nature of the time information given to the documents constituting the document set.
- For example, the time-series data generation unit 20 divides one document set into a plurality of document sets, such as a set of documents having time information of January 2005, a set of documents having time information of February 2005, and a set of documents having time information of March 2005. Then, for each document set (subset) obtained by the division, the time-series data generation unit 20 obtains a value (an arbitrary counting result) defined by the properties of the documents constituting that subset. Arranging these values in time order yields the time-series data.
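- The monthly division and counting described above can be sketched as follows. This is a minimal sketch, assuming each document is a `(date, text)` tuple and that the counted value is simply the number of documents; the function name `monthly_counts` and the sample articles are hypothetical.

```python
from collections import Counter
from datetime import date


def monthly_counts(docs):
    """Return [((year, month), document_count), ...] in time order.

    Months with no documents between the first and last month are
    included with a count of 0 so that the series has no gaps.
    """
    counts = Counter((d.year, d.month) for d, _ in docs)
    if not counts:
        return []
    (y, m), last = min(counts), max(counts)
    series = []
    while (y, m) <= last:
        series.append(((y, m), counts.get((y, m), 0)))
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)
    return series


# Hypothetical blog articles with their dates as time information.
docs = [
    (date(2005, 1, 5), "I bought candy A today"),
    (date(2005, 1, 20), "Candy A again"),
    (date(2005, 3, 2), "Still buying candy A"),
]
series = monthly_counts(docs)
```

- A different period, such as a day or a week, only changes the key computed from each date.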
- The “value defined by the properties of the document” may be any value that can be uniquely and mechanically calculated from the properties of the documents constituting each subset, and is set as appropriate according to the use and purpose of the information analysis apparatus 1 and the type of meta information assigned to each document.
- Examples of the “value defined by the properties of the document” include the number and size of the documents constituting each subset, and the number of unique senders of the documents constituting each subset.
- The “number of unique senders” is the number of distinct persons who sent the documents, not a total in which the same person is counted multiple times.
- To use such a value, information specifying it for each document (for example, information specifying the sender, such as a sender ID) must be added as meta information of the document separately from the time information.
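- Counting unique senders per period can be sketched as follows, assuming each document is a dictionary carrying a `date` and a `sender_id` as meta information. Both field names and the function name `unique_senders_per_month` are hypothetical.

```python
from collections import defaultdict
from datetime import date


def unique_senders_per_month(docs):
    """Map (year, month) to the number of distinct sender IDs."""
    senders = defaultdict(set)
    for doc in docs:
        period = (doc["date"].year, doc["date"].month)
        # A set keeps each sender once, however many documents they sent.
        senders[period].add(doc["sender_id"])
    return {period: len(ids) for period, ids in sorted(senders.items())}


docs = [
    {"date": date(2005, 1, 3), "sender_id": "u1"},
    {"date": date(2005, 1, 9), "sender_id": "u1"},
    {"date": date(2005, 1, 15), "sender_id": "u2"},
    {"date": date(2005, 2, 1), "sender_id": "u1"},
]
unique = unique_senders_per_month(docs)
```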
- The figures illustrate the time-series data (1) generated from the input document set (1) and the time-series data (2) generated from the input document set (2).
- Both time-series data (1) and (2) can be represented by graphs in which the horizontal axis represents time and the vertical axis represents the counting result. In the figures, the counting results for a period up to 2008 are plotted.
- The counting result used as the vertical axis of the time-series data may be a measured value itself, such as the number of appearances, or a value obtained by correcting or converting the original value. Examples of the latter include values obtained by normalizing the measured values by the total number of documents, and values obtained by differentiating the changes in the measured values. Which correction or conversion to apply, or whether to use the measured value itself, is selected appropriately according to the use and purpose of the information analysis apparatus 1 and the nature of the input document set.
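- The two conversions mentioned above can be sketched as follows. The sketch assumes that normalization divides each period's count by the total number of documents in that period, and that differentiation is approximated by the first difference between adjacent periods; the names `normalize` and `difference` are hypothetical.

```python
def normalize(counts, totals):
    """Divide each count by the total for the same period (0.0 if empty)."""
    return [c / t if t else 0.0 for c, t in zip(counts, totals)]


def difference(values):
    """First difference: change from each period to the next."""
    return [b - a for a, b in zip(values, values[1:])]


counts = [2, 5, 0]      # documents containing the expression, per period
totals = [10, 20, 0]    # all documents collected, per period
ratios = normalize(counts, totals)
changes = difference([1, 4, 2])
```

- Note that differencing shortens the series by one element, so two series should be converted in the same way before they are compared.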
- The corresponding section selection unit 30 compares the plurality of time-series data obtained from the plurality of document sets with each other, and selects from each time-series data two or more sections (corresponding sections) that change in correspondence with two or more sections of the other time-series data.
- Here, the corresponding section selection unit 30 compares time-series data (1) and time-series data (2) with each other, and selects from each of them two or more sections (corresponding sections) that change in correspondence with each other.
- The corresponding section selection unit 30 outputs the two or more corresponding sections selected for each time-series data to the feature extraction unit 40.
- The corresponding section selection unit 30 includes a corresponding section pair selection unit 31 and a similar corresponding section pair selection unit 32, which together select the corresponding sections. This is described below.
- The corresponding section pair selection unit 31 examines the correlation between the two time-series data, and selects sections (corresponding sections) that change in correspondence with each other between the two.
- Specifically, the corresponding section pair selection unit 31 receives time-series data (1) and time-series data (2) from the time-series data generation unit 20, and detects each pair consisting of a partial section of one time-series data and the correspondingly changing partial section of the other (hereinafter referred to as a “corresponding section pair”).
- The corresponding section pair selection unit 31 selects two or more such corresponding section pairs from time-series data (1) and time-series data (2).
- “Correspondingly changing sections” are a partial section of time-series data (1) and a partial section of time-series data (2) for which a high correlation is recognized between the graph plotting the values of the one and the graph plotting the values of the other. In Embodiment 1, whether the correlation is high can be determined using a correlation coefficient.
- The corresponding section pair selection unit 31 first obtains the correlation coefficient between time-series data (1) and time-series data (2). It can then select, as corresponding sections, two or more sections in each of the two time-series data in which the absolute value of the correlation coefficient exceeds (or is equal to or greater than) a set threshold. The threshold is assumed to be set in advance to an appropriate value such that two or more corresponding section pairs are selected in the time-series data expected as input, taking into account the nature of the document sets from which the time-series data originate and how the time-series data fluctuate.
- Note that the obtained correlation coefficient may be a negative value.
- As the correlation coefficient, a general Pearson product-moment correlation coefficient, Spearman rank correlation coefficient, Kendall rank correlation coefficient, or the like can be used.
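- A threshold-based selection of corresponding section pairs can be sketched as follows. This is a minimal sketch under several assumptions not stated in the source: sections are fixed-length, non-overlapping windows with the same start in both series, the Pearson coefficient is used, and a constant window is treated as having zero correlation. The names `pearson` and `corresponding_section_pairs` are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation; 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)


def corresponding_section_pairs(series1, series2, window, threshold):
    """Return (start, end, r) for windows where |r| reaches the threshold."""
    pairs = []
    limit = min(len(series1), len(series2)) - window + 1
    for start in range(0, limit, window):
        end = start + window
        r = pearson(series1[start:end], series2[start:end])
        # The absolute value is used so that strongly negative
        # correlations are selected as well.
        if abs(r) >= threshold:
            pairs.append((start, end, r))
    return pairs


series1 = [1, 5, 1, 2, 2, 2, 1, 6, 1]
series2 = [2, 9, 2, 3, 4, 3, 8, 1, 8]
pairs = corresponding_section_pairs(series1, series2, window=3, threshold=0.9)
```

- Here the first window is selected with a positive coefficient and the last with a negative one, while the flat middle window of `series1` is skipped.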
- When two or more corresponding section pairs cannot be selected, the corresponding section pair selection unit 31 may lower the preset threshold and set it again, or may give an instruction to cancel the calculation of the degree of association.
- Instead of the correlation coefficient, the corresponding section pair selection unit 31 may use an existing statistical analysis technique or time-series analysis technique to determine the correlation between a partial section of one time-series data and a partial section of the other.
- The corresponding section pair selection unit 31 also need not use only the height of the correlation between partial sections of the two time-series data as the selection criterion. It may detect sections in which one or both of the time-series data fluctuate characteristically, and use the degree of that fluctuation as a selection criterion. For example, it can detect a section in which the graph of one or both time-series data changes greatly, and select the corresponding section pair in consideration of the degree of change in that section.
- The graph in FIG. 2 is an example of selecting corresponding section pairs.
- Both time-series data (1) and (2) have two upward-convex peaks.
- At each peak, the correlation coefficient between the time-series data takes a high positive value, so time-series data (1) and (2) are highly correlated there. These two peaks can therefore be selected as corresponding section pairs.
- As another example, in one section the number of appearances in time-series data (1) decreases rapidly while the number of appearances in time-series data (2) increases rapidly.
- In another section, the number of appearances in time-series data (1) increases rapidly while the number of appearances in time-series data (2) decreases rapidly.
- In these sections the correlation coefficient is negative, but its absolute value is high, and the correlation between the rapidly increasing portion and the rapidly decreasing portion is considered high. Both sections can therefore be selected as corresponding section pairs.
- For convenience of explanation, the corresponding sections of the time-series data in FIGS. 2 to 8 are denoted corresponding section 1-1, corresponding section 2-1, corresponding section 1-2, and corresponding section 2-2.
- Corresponding section 1-1 means the first corresponding section of time-series data (1).
- Corresponding section 1-2 means the second corresponding section of time-series data (1).
- Corresponding section 1-n means the nth corresponding section of time-series data (1).
- Corresponding section 2-1 means the first corresponding section of time-series data (2).
- Corresponding section 2-2 means the second corresponding section of time-series data (2).
- Corresponding section 2-n means the nth corresponding section of time-series data (2).
- When corresponding section 1-n and corresponding section 2-n have the same value of “n”, they form a corresponding section pair.
- For example, corresponding section 1-1 and corresponding section 2-1 form a corresponding section pair.
- In each corresponding section pair shown in FIG. 2 and FIG. 3, the two corresponding sections have the same length, start time, and end time.
- However, Embodiment 1 is not limited to this; the corresponding sections of a pair do not necessarily need to have the same length, start time, and end time.
- For example, in a corresponding section pair such as the pair of corresponding section 1-1 and corresponding section 2-1, or the pair of corresponding section 1-2 and corresponding section 2-2, the start time and end time of one section may differ from those of the other.
- Likewise, the lengths of corresponding section 1-2 and corresponding section 2-2 shown in FIG. 4 may be different.
- The similar corresponding section pair selection unit 32 examines the correlation among the plurality of partial sections existing within one time-series data, and makes a further selection from among the sections already selected as corresponding sections.
- Specifically, from the plurality of corresponding section pairs previously selected by the corresponding section pair selection unit 31, the similar corresponding section pair selection unit 32 further selects those corresponding section pairs whose sections are similar within time-series data (1) and within time-series data (2).
- The similar corresponding section pair selection unit 32 first determines whether the changes in the two or more selected corresponding sections of time-series data (1) are similar to each other. It likewise determines whether the changes in the two or more selected corresponding sections of time-series data (2) are similar to each other.
- If two or more mutually similar corresponding sections exist in each of time-series data (1) and (2), the similar corresponding section pair selection unit 32 determines whether the similar corresponding sections in time-series data (1) and those in time-series data (2) change correspondingly, that is, whether they form corresponding section pairs. If two or more corresponding section pairs satisfy these conditions, the similar corresponding section pair selection unit 32 selects those corresponding sections (corresponding section pairs).
- The similar corresponding section pair selection unit 32 outputs information specifying the corresponding sections that form the selected corresponding section pairs to the feature extraction unit 40.
- Hereinafter, corresponding sections that are on the same time-series data and are similar to each other are referred to as “similar corresponding sections”.
- A group of similar corresponding sections belonging to the same time-series data is hereinafter referred to as a “similar corresponding section set”.
- For example, suppose that corresponding section 1-m and corresponding section 2-m, and corresponding section 1-n and corresponding section 2-n, have already been selected as corresponding section pairs.
- If the graph of corresponding section 1-m and the graph of corresponding section 1-n are similar, and the graph of corresponding section 2-m and the graph of corresponding section 2-n are also similar, then sections 1-m, 1-n, 2-m, and 2-n are selected again as similar corresponding sections.
- In this case, corresponding sections 1-m and 1-n form one similar corresponding section set, and corresponding sections 2-m and 2-n form another.
- Similarity determination by the similar corresponding section pair selection unit 32 can be performed using a correlation coefficient.
- In this case, a correlation coefficient is obtained between the corresponding sections subjected to similarity determination, for example, between the corresponding section 1-m and the corresponding section 1-n, or between the corresponding section 2-m and the corresponding section 2-n.
- The similar corresponding section pair selection unit 32 determines that the sections are similar when the calculated correlation coefficient is positive and exceeds a threshold (or is equal to or greater than the threshold).
- The threshold is assumed to be set in advance so that two or more similar corresponding sections are selected in the time-series data expected as input, taking into account the nature of the document set that is the source of the time-series data and how the time-series data fluctuates.
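The correlation-based determination can be illustrated with a minimal sketch. It assumes that the two corresponding sections have already been sampled to the same number of points, and the names `pearson`, `is_similar`, and the default threshold of 0.8 are hypothetical choices for illustration, not values taken from the embodiment.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("sections must have equal length of at least 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a flat section is treated as uncorrelated
    return cov / sqrt(var_x * var_y)


def is_similar(section_a: Sequence[float], section_b: Sequence[float],
               threshold: float = 0.8) -> bool:
    """Similar when the coefficient is positive and exceeds the threshold."""
    r = pearson(section_a, section_b)
    return r > 0 and r > threshold


# Hypothetical values for the corresponding sections 1-m and 1-n.
section_1_m = [1, 3, 5, 3, 1]
section_1_n = [2, 4, 7, 4, 2]
print(is_similar(section_1_m, section_1_n))
```

Replacing `r > threshold` with `r >= threshold` gives the "equal to or greater than" variant mentioned in the text.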
- The similarity determination by the similar corresponding section pair selection unit 32 in the first embodiment can also be performed without using a correlation coefficient.
- For example, the similar corresponding section pair selection unit 32 can determine similarity by a method using an existing time series analysis technique.
- One such method uses, as determination factors, the number of inflection points in each corresponding section, the relative positions of the inflection points within the corresponding section, and the values of the differential coefficients between the inflection points.
- In this case as well, the determination is made based on a preset threshold value. The threshold value can be set in the same manner as when the correlation coefficient is used.
- Suppose that the similar corresponding section pair selection unit 32 determines similarity by the time series analysis technique. For example, in FIG. 2, the corresponding section 1-1 and the corresponding section 1-2 both decrease after increasing. Therefore, it can be determined that these are similar. The corresponding section 2-1 and the corresponding section 2-2 corresponding to these are also similar. In this case, the similar corresponding section pair selection unit 32 selects the corresponding section pair of the corresponding section 1-1 and the corresponding section 2-1, and the corresponding section pair of the corresponding section 1-2 and the corresponding section 2-2.
- On the other hand, the corresponding section 1-2 and the corresponding section 1-3 are both monotonically increasing and similar, but the corresponding section 2-2 and the corresponding section 2-3 corresponding to them have derivatives of opposite sign and are not similar. Therefore, the corresponding section 1-2 and the corresponding section 1-3, and the corresponding section 2-2 and the corresponding section 2-3, do not constitute similar corresponding section sets.
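A shape-based determination of this kind can be sketched in a strongly simplified form. The sketch assumes that an "inflection point" is a point where the sign of the first difference changes, and it compares only the sequence of rising and falling trends, ignoring the relative positions and derivative magnitudes that the text also lists. The function names are hypothetical.

```python
from typing import List, Sequence


def trend_signature(values: Sequence[float]) -> List[int]:
    """Collapse a section into its trend directions (+1 rising, -1 falling)."""
    signature: List[int] = []
    for prev, cur in zip(values, values[1:]):
        if cur == prev:
            continue  # assumption: flat steps are ignored
        sign = 1 if cur > prev else -1
        if not signature or signature[-1] != sign:
            signature.append(sign)
    return signature


def same_shape(a: Sequence[float], b: Sequence[float]) -> bool:
    """Two sections are treated as similar when their trend signatures match."""
    return trend_signature(a) == trend_signature(b)


# Both sections rise and then fall, so they are judged similar.
print(same_shape([1, 4, 6, 3, 1], [2, 3, 8, 5]))
# One section rises and the other falls (opposite derivative sign).
print(same_shape([1, 2, 3], [3, 2, 1]))
```

A fuller version would also compare where the turning points fall within each section and apply the preset threshold to the differences.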
- When one or more similar corresponding section sets cannot be selected in each time-series data, the similar corresponding section pair selection unit 32 may reset the threshold used for the similarity determination described above to a smaller value. Further, in this case, the similar corresponding section pair selection unit 32 may instruct the relevance calculation unit 70 to stop calculating the relevance.
- In the similar corresponding section pair selection unit 32 of the first embodiment, the conditions for the corresponding sections to be selected can be extended.
- In the description above, the similar corresponding section pair selection unit 32 selects, from the plurality of corresponding section pairs previously selected by the corresponding section pair selection unit 31, those that are similar within each of the time series data (1) and the time series data (2), but this condition can be expanded. For example, corresponding section pairs having low similarity within each of the time-series data (1) and the time-series data (2) can also be selected from the plurality of corresponding section pairs previously selected by the corresponding section pair selection unit 31.
- For example, suppose that the corresponding section 1-1 and the corresponding section 1-2, and the corresponding section 2-1 and the corresponding section 2-2, have a similar relationship, while the corresponding section 1-1 and the corresponding section 1-3, and the corresponding section 2-1 and the corresponding section 2-3, have a dissimilar relationship.
- In this case, the corresponding section pair of the corresponding sections 1-1 and 2-1 is similar to the corresponding section pair of the corresponding sections 1-2 and 2-2, but has a dissimilar relationship with the corresponding section pair of the corresponding sections 1-3 and 2-3 on both the time series data (1) side and the time series data (2) side.
- In such a case, the similar corresponding section pair selection unit 32 can additionally select the corresponding section pair of the corresponding sections 1-3 and 2-3.
- When the similar corresponding section pair selection unit 32 also treats corresponding sections having a dissimilar relationship as selection targets, it is preferable to register, for each corresponding section pair, its relationship with each other corresponding section pair (whether the relationship is similar or dissimilar).
- In this registration, the relationship is the same on the time series data (1) side and on the time series data (2) side: either similar on both sides or dissimilar on both sides.
- The feature extraction unit 40 identifies, for each of the plurality of time series data, the documents (document data) belonging to each of the two or more selected corresponding sections, and extracts the features of the identified documents for each corresponding section.
- the “document feature” here includes “document set feature” specified for each corresponding section.
- In the first embodiment, the feature extraction unit 40 identifies the documents for each of the corresponding sections selected for the time-series data (1) and for each of the corresponding sections selected for the time-series data (2), and extracts the features of the identified documents. For example, suppose that the corresponding section 1-1, the corresponding section 2-1, the corresponding section 1-2, the corresponding section 2-2, the corresponding section 1-3, and the corresponding section 2-3 shown in FIG. 5 are selected. In this case, the feature extraction unit 40 identifies the documents belonging to each of the six corresponding sections, and further extracts features from each of the identified document sets.
- features extracted from a document include language expressions that characteristically appear in a set of documents belonging to a selected corresponding section.
- Characteristically appearing language expressions include expressions that appear frequently when the simple number of occurrences of each expression is counted in the document set belonging to the selected corresponding section. They also include expressions that appear relatively frequently, or relatively infrequently, compared with the number of appearances in the document sets belonging to sections other than the corresponding section or in the whole population of documents analyzed by the information analysis device 1.
- When meta information is attached to the documents, the feature extraction unit 40 can also extract such meta information as a “feature”.
- For example, suppose that sender information indicating whether the sender is “beginner”, “normal”, or “skilled” is given to each document in the input document set.
- This sender information can be used as a feature. For example, if the document set belonging to the corresponding section 1-2 includes a large number of documents transmitted from “beginner” senders, “beginner” is extracted as a “feature” of the corresponding section 1-2.
- The type of meta information is not particularly limited, and the feature extraction unit 40 can extract any meta information given to the documents in the input document set as “features”.
- feature extraction from a specific document set by the feature extraction unit 40 can be performed using, for example, an existing text mining technique.
- the text mining technique is one of general natural language processing techniques and is not the main focus of the first embodiment of the present invention. Therefore, the description about the text mining technique is omitted.
- the extraction of “features” is performed by, for example, setting the number of pieces of information (language expression, meta information, etc.) to be extracted as “features” in advance, and extracting the set number of information in order from the most frequently appearing information. Can be done. Further, the extraction of “feature” can be performed using a feature score if, for example, a text mining technique is used.
- the feature extraction unit 40 first selects a feature element (language expression, meta information, etc.) for each corresponding section to be extracted, and calculates a feature score for each feature element. Then, the feature extraction unit 40 determines whether or not the feature score exceeds a set threshold value, and extracts a feature element that exceeds the threshold value as a “feature”.
- the calculation of the “feature score” by the feature extraction unit 40 can be performed by various statistical analysis techniques using the appearance frequency of the feature elements.
- For example, the feature extraction unit 40 obtains a statistical measure such as the appearance frequency of each feature element, the log likelihood ratio, the χ² value, the Yates-corrected χ² value, the self-mutual information (pointwise mutual information), SE, or ESC, and uses the obtained value as the feature score.
- the feature extraction unit 40 can also extract the combination data of the feature element and its feature score as “feature”. For example, consider a case where n feature elements are extracted from the corresponding section 1-1. In this case, the feature 1-1 in the corresponding section 1-1 is expressed by a feature vector composed of 2n elements such as (T1, SC1, T2, SC2, T3, SC3,..., Tn, SCn). can do.
- Here, T1 to Tn indicate the n feature elements.
- Examples of the feature elements T1 to Tn include a language expression such as “effective for cancer” and meta information attached to a document, such as sender information (the sender is “beginner”).
- SC1 to SCn are numerical data indicating the feature score attached to each feature element.
- The feature elements need not be paired with feature scores; that is, only the feature elements may be extracted as “features”.
- In that case, the “feature” is expressed by a feature vector composed of n elements, such as feature 1-1 (T1, T2, T3,..., Tn).
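The threshold-based extraction and the (T1, SC1, ..., Tn, SCn) representation can be sketched as follows. The sketch assumes that each document has already been reduced to a set of feature elements, and it uses the simplest of the measures listed above (appearance frequency, here as the fraction of documents in the section containing the element) as the feature score. The function names, the threshold values, and the "sender: beginner" notation are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def feature_scores(section_docs: List[Set[str]]) -> Dict[str, float]:
    """Score each element by the fraction of section documents containing it."""
    total = len(section_docs)
    if total == 0:
        return {}
    counts = Counter(element for doc in section_docs for element in doc)
    return {element: n / total for element, n in counts.items()}


def extract_feature(section_docs: List[Set[str]],
                    threshold: float = 0.5) -> List[Tuple[str, float]]:
    """Keep elements whose score exceeds the threshold, highest score first."""
    scores = feature_scores(section_docs)
    kept = [(t, sc) for t, sc in scores.items() if sc > threshold]
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical documents belonging to one corresponding section.
docs = [
    {"effective for cancer", "no side effects"},
    {"effective for cancer", "sender: beginner"},
    {"effective for cancer", "no side effects"},
    {"effective immediately"},
]
# Each (T, SC) pair corresponds to two consecutive entries of the 2n-element vector.
print(extract_feature(docs, threshold=0.4))
```

Dropping the scores from the returned pairs gives the n-element form (T1, T2, ..., Tn).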
- the comparison unit 50 obtains the inter-feature distance between the feature extracted from the document belonging to one corresponding section and the feature extracted from the document belonging to another corresponding section for each time series data. Further, in the first embodiment, when there are a plurality of combinations of corresponding sections for obtaining the distance between features instead of one set in each time series data, the distance between features is obtained for each of the plurality of sets. The distance value is treated as vector data.
- The time series data (1) and (2) shown in FIG. 5 will be described as an example.
- In FIG. 5, there are three corresponding section pairs: the corresponding sections 1-1 and 2-1, the corresponding sections 1-2 and 2-2, and the corresponding sections 1-3 and 2-3.
- In the time series data (1), it is assumed that three corresponding sections 1-1, 1-2, and 1-3 are selected.
- In this case, three inter-feature distances are obtained: between the features of the corresponding sections 1-1 and 1-2, between the features of the corresponding sections 1-1 and 1-3, and between the features of the corresponding sections 1-2 and 1-3.
- The obtained inter-feature distances are represented by three-dimensional vector data.
- Similarly, in the time series data (2), it is assumed that three corresponding sections 2-1, 2-2, and 2-3 are selected.
- The inter-feature distances between the features of the corresponding sections 2-1 and 2-2, between the features of the corresponding sections 2-1 and 2-3, and between the features of the corresponding sections 2-2 and 2-3 are likewise obtained and represented by three-dimensional vector data.
- In the example above, the inter-feature distance is obtained for all combinations of the corresponding sections selected by the corresponding section selecting unit 30.
- Alternatively, the inter-feature distance may be obtained only for corresponding sections that are adjacent on the time-series data.
- When the inter-feature distance is obtained only for adjacent corresponding sections, in the time series data (1) it is obtained for the corresponding sections 1-1 and 1-2 and for the corresponding sections 1-2 and 1-3.
- In the time series data (2), the inter-feature distance is obtained for the corresponding sections 2-1 and 2-2 and for the corresponding sections 2-2 and 2-3.
- In this case as well, the inter-feature distances are represented by vector data.
- The combinations of corresponding sections for which the inter-feature distance is obtained may be set as appropriate according to the application and purpose of use of the information analysis apparatus 1 and the nature of the input document sets.
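The two ways of choosing section combinations can be sketched as below. The function `distance_vector` and its `adjacent_only` flag are hypothetical names, and the Jaccard distance on sets is only a stand-in for whatever distance function is actually configured.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Set


def distance_vector(features: Sequence[Set[str]],
                    distance: Callable[[Set[str], Set[str]], float],
                    adjacent_only: bool = False) -> List[float]:
    """Inter-feature distances within one time series, as vector data."""
    if adjacent_only:
        pairs = zip(features, features[1:])  # (1-1, 1-2), (1-2, 1-3), ...
    else:
        pairs = combinations(features, 2)    # (1-1, 1-2), (1-1, 1-3), (1-2, 1-3)
    return [distance(a, b) for a, b in pairs]


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """Stand-in distance: 1 minus the share of common feature elements."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


# Hypothetical features of the corresponding sections 1-1, 1-2, and 1-3.
features_1 = [{"a", "b"}, {"a", "b"}, {"c"}]
print(distance_vector(features_1, jaccard_distance))
print(distance_vector(features_1, jaccard_distance, adjacent_only=True))
```

With three selected sections, the all-combinations form yields a three-dimensional vector and the adjacent-only form a two-dimensional one, matching the FIG. 5 description.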
- the comparison unit 50 obtains the distance between features in an arbitrary corresponding section and another corresponding section using a function (distance function) for obtaining the distance between features.
- the distance function is defined in advance and stored in the database 60.
- The distance function is a function that, given a feature extracted from the documents belonging to an arbitrary corresponding section and a feature extracted from the documents belonging to another corresponding section, can calculate the distance between the features.
- the distance function is not limited. What function is used as the distance function can be set as appropriate according to the application and purpose of use of the information analysis apparatus 1 and the nature of the input document set. Specifically, a distance function that satisfies the following conditions can be used.
- Conditions 4 and 5 indicate that the more feature elements the two input features have in common, and the closer the feature scores indicating the degree of those features are to each other, the smaller the distance between the features becomes. Furthermore, conditions 4 and 5 also indicate that when a feature element is possessed by only one of the features, the greater its feature score, the greater the distance between the features.
- For example, suppose that the two input feature vectors are the following feature (1) and feature (2).
- Feature (1): (“effective for cancer”, 0.8, “no side effects”, 0.6, “Document category: advertisement”, 0.85)
- Feature (2): (“effective immediately”, 0.4, “no side effects”, 0.5, “Document category: advertisement”, 0.7)
- Here, “effective for cancer”, “no side effects”, and “effective immediately” are language expressions that appear characteristically in the documents belonging to each corresponding section.
- “Document category: advertisement” indicates a document category that appears characteristically in the document set belonging to the corresponding section.
- The numerical value written next to each feature element in the features (1) and (2) is the feature score of that feature element.
- In the above, the distance between features is calculated using the number of feature elements that appear in common in the two input features, but the first embodiment is not limited to this. In the first embodiment, even if feature elements are not exactly identical, similar feature elements can be regarded as common elements when the distance between features is obtained.
- a similarity criterion indicating which feature elements and which feature elements are treated as similar feature elements is defined in advance and stored in the database 60.
- a similar feature element can be defined by using a synonym dictionary or a thesaurus.
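One possible distance function with the behavior described for conditions 4 and 5 is sketched below, applied to the feature (1) and feature (2) example. The formula (score difference for common elements, full score for elements held by only one feature) is an assumption chosen for illustration, not the function defined in the embodiment, and it treats only exactly identical elements as common.

```python
from typing import Dict


def feature_distance(f1: Dict[str, float], f2: Dict[str, float]) -> float:
    """Sum score differences of common elements and scores of unshared ones."""
    total = 0.0
    for element in sorted(set(f1) | set(f2)):
        if element in f1 and element in f2:
            # Common element: closer scores contribute less distance.
            total += abs(f1[element] - f2[element])
        else:
            # Element held by one feature only: higher score, more distance.
            total += f1[element] if element in f1 else f2[element]
    return total


feature_1 = {
    "effective for cancer": 0.8,
    "no side effects": 0.6,
    "Document category: advertisement": 0.85,
}
feature_2 = {
    "effective immediately": 0.4,
    "no side effects": 0.5,
    "Document category: advertisement": 0.7,
}
# Common elements contribute about 0.1 and 0.15; unshared ones 0.8 and 0.4.
print(round(feature_distance(feature_1, feature_2), 2))
```

Support for similar feature elements could be added by mapping each element to a canonical form from a synonym dictionary or thesaurus before calling the function.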
- The comparison unit 50 then compares the inter-feature distance vector data obtained for one time series data with the inter-feature distance vector data obtained for the other time series data.
- An arbitrary inter-vector distance function may be used for the comparison.
- For example, a cosine distance can be used as the inter-vector distance function.
- the comparison unit 50 outputs the comparison result to the later-described relevance calculation unit 70 as a value for obtaining the relevance between the input document sets.
- the relevance calculation unit 70 calculates the relevance between the input document set (1) and the input document set (2) based on the comparison result output from the comparison unit 50.
- the output unit 80 outputs the relevance calculated by the relevance calculation unit 70 as the relevance between the input document set (1) and the input document set (2).
- It is preferable to define the relevance so that it becomes higher as the numerical value (cosine distance or the like) indicating the comparison result output from the comparison unit 50 becomes smaller, that is, as the distance between the two sets of inter-feature distance vector data calculated by the comparison unit 50 becomes smaller.
- The relevance can be calculated, for example, by obtaining the reciprocal of the comparison result between the inter-feature distance vector data in the time series data (1) and the inter-feature distance vector data in the time series data (2), and multiplying it by a preset constant.
- Alternatively, the relevance can be calculated by subtracting the comparison result of the inter-feature distance vector data from a preset constant.
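The comparison and the two relevance formulas can be sketched as follows. The function names and the default constant are hypothetical; scaling the reciprocal by multiplication and the small `eps` guard against division by zero are assumptions added for the sketch.

```python
from math import sqrt
from typing import Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 minus the cosine similarity of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # assumption: a zero vector is treated as maximally distant
    return 1.0 - dot / (norm_u * norm_v)


def relevance_reciprocal(comparison: float, constant: float = 1.0,
                         eps: float = 1e-9) -> float:
    """Reciprocal of the comparison result scaled by a preset constant."""
    return constant / (comparison + eps)  # eps is an added safeguard


def relevance_subtract(comparison: float, constant: float = 1.0) -> float:
    """Preset constant minus the comparison result."""
    return constant - comparison


# Hypothetical inter-feature distance vectors for time series data (1) and (2).
distances_1 = [0.2, 0.9, 0.8]
distances_2 = [0.3, 1.0, 0.9]
comparison = cosine_distance(distances_1, distances_2)
print(relevance_subtract(comparison), relevance_reciprocal(comparison))
```

Both formulas are monotonically decreasing in the comparison result, so a smaller distance between the two vectors always yields a higher relevance.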
- FIG. 6 is a diagram illustrating an example of time-series data that fluctuates due to a common cause (time-series data having high relevance).
- FIG. 7 is a diagram illustrating another example of time-series data that fluctuates due to a common cause (time-series data having high relevance).
- FIG. 8 is a diagram illustrating another example of time-series data that fluctuates due to different causes (such as when time-series data coincides by chance).
- Suppose that there are time series data (1) and time series data (2) as shown in FIG. 6, and that the time series data (1) and the time series data (2) are truly highly related.
- In the time series data (1), the corresponding section 1-1 and the corresponding section 1-2 have similar time-series shapes.
- The corresponding section 2-1 and the corresponding section 2-2 in the time series data (2), which form corresponding section pairs with them, also have similar time-series shapes, and these four corresponding sections satisfy the conditions for similar corresponding section sets. In such a case, the degree of association between the time series data (1) and the time series data (2) is obtained.
- In Non-Patent Document 1, the features of the document set belonging to the time series data (1) and the features of the document set belonging to the time series data (2) are compared directly, and the relevance between them is calculated from the presence or absence of common feature elements.
- In that approach, because the corresponding section 1-1, which is a partial section of the time series data (1), and the corresponding section 2-1, which is a partial section of the time series data (2), are highly correlated, attention is paid to those sections, the features of the sections are found, and the distance between them is obtained.
- However, the input document set (1) that is the basis of the time series data (1) and the input document set (2) that is the basis of the time series data (2) are generally document sets having different properties. Even if they change similarly due to the common cause a, the feature 1-1 found in the corresponding section 1-1 and the feature 2-1 found in the corresponding section 2-1 do not necessarily contain common elements.
- On the other hand, if the peaks of the corresponding section 1-1 and the corresponding section 1-2 are due to the common cause a in the same input document set (1), the features 1-1 and 1-2 are considered to have many common elements.
- Similarly, if the peaks of the corresponding section 2-1 and the corresponding section 2-2 are due to the common cause a in the same input document set (2), the features 2-1 and 2-2 are considered to have many common elements.
- Accordingly, the degree of association can be obtained by calculating the distance between the feature 1-1 and the feature 1-2, then calculating the distance between the feature 2-1 and the feature 2-2, and comparing the two calculated distances.
- In this example, the feature 1-1 and the feature 1-2 have many common elements, so the distance between them is small.
- Likewise, the feature 2-1 and the feature 2-2 have many common elements, so the distance between them is small.
- Next, suppose that the time-series data (1) and the time-series data (2) are truly related and fluctuate due to common causes (in the same periods), but that the corresponding section pair of the corresponding section 1-1 and the corresponding section 2-1 has a peak due to the cause a, while the corresponding section pair of the corresponding section 1-2 and the corresponding section 2-2 has a peak due to the cause b.
- In this case, the feature 1-1 and the feature 1-2 have different causes for their peaks, so they have few common feature elements and the distance between them is considered to be large.
- Similarly, the feature 2-1 and the feature 2-2 have different causes for their peaks, so they have few common feature elements and the distance between them is large. Therefore, the vector data of the distance between features in the time series data (1) (only one element in this example) and the vector data of the distance between features in the time series data (2) (only one element in this example) both become large. For this reason, the distance between them becomes small and the relevance is calculated to be high.
- That is, when the time-series data (1) and (2) are assumed to be truly related, the cause of the variation within each corresponding section pair is common under that assumption. Therefore, the corresponding section 1-1 and the corresponding section 2-1 have a common cause of variation, and the corresponding section 1-2 and the corresponding section 2-2 have a common cause.
- The corresponding section 1-1 and the corresponding section 1-2 do not necessarily have a common cause, but when they do (the case of FIG. 6), the corresponding section 2-1 and the corresponding section 2-2 logically also have a common cause. On the other hand, when the corresponding section 1-1 and the corresponding section 1-2 do not have a common cause, the corresponding section 2-1 and the corresponding section 2-2 also have no common cause.
- Next, suppose that the time series data (1) and (2) are not actually related, and that the corresponding section 1-1 and the corresponding section 1-2 in the time series data (1) are both caused by the same cause a. Then the feature 1-1 and the feature 1-2 have many common feature elements, and the distance between them becomes small.
- If the corresponding section 2-1 and the corresponding section 2-2 arise from causes that differ from each other, the features 2-1 and 2-2 have few common elements, and their distance grows. Therefore, of the vector data of the distance between features in the time series data (1) (only one element in this example) and the vector data of the distance between features in the time series data (2) (only one element in this example), one is small and the other is large, so the distance between them is large and the relevance is calculated to be low.
- In contrast, suppose that both the corresponding section 2-1 and the corresponding section 2-2 are caused by the same cause c, and that the corresponding section 2-1 and the corresponding section 1-1, and the corresponding section 2-2 and the corresponding section 1-2, happen to have the same timing.
- In this case, the vector data of the distance between features in the time series data (1) (only one element in this example) is small, and the vector data of the distance between features in the time series data (2) (only one element in this example) is also small. For this reason, the distance between them also becomes small, and the relevance is erroneously calculated to be high.
- However, this situation (the case of FIG. 8) requires that the two peak timings of the time-series data (1) and the time-series data (2) coincide even though the data are not related to each other, and that the peaks within the time-series data (1) share one common cause while the peaks within the time-series data (2) share another. Because these constraints are severe, such a coincidence is considered rare.
- Even when the change patterns of the two time-series data coincide by chance, the information analysis apparatus 1 makes this clear if the features of the documents in the corresponding sections are completely different. As a result, according to the information analysis apparatus 1, it is possible to suppress situations in which the data are erroneously determined to be related merely because their change patterns coincide by chance.
- The information analysis apparatus 1 is therefore effective when highly relevant document sets need to be found among a large number of document sets that fluctuate due to various causes, such as document sets composed of document data on the Internet.
- FIG. 9 is a flowchart showing the flow of processing in the information analysis method according to Embodiment 1 of the present invention.
- the information analysis method according to the first embodiment is implemented by operating the information analysis apparatus 1 according to the first embodiment shown in FIG. For this reason, the following description will be described together with the operation of the information analysis apparatus 1 with appropriate reference to FIG.
- the input unit 10 receives input of a plurality of document sets to be analyzed (step A1).
- two document sets are input, which are an input document set (1) and an input document set (2), respectively.
- Each input document set is composed of a plurality of documents with time information.
- the time-series data generation unit 20 generates time-series data based on time information for each document set from the plurality of document sets received by the input unit 10 (step A2).
- Specifically, the time-series data generation unit 20 generates time-series data (1) from the input document set (1), and generates time-series data (2) from the input document set (2).
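Step A2 can be illustrated with a minimal sketch. It assumes that the time-series value is the number of documents per day and that each document is a (date, text) pair; this section only states that the series is generated from the time information, so both the counting rule and the daily granularity are assumptions, and `generate_time_series` is a hypothetical name.

```python
from collections import Counter
from datetime import date, timedelta
from typing import Iterable, List, Tuple


def generate_time_series(
        documents: Iterable[Tuple[date, str]]) -> List[Tuple[date, int]]:
    """Count documents per day, filling days without documents with zero."""
    counts = Counter(day for day, _text in documents)
    if not counts:
        return []
    day, last = min(counts), max(counts)
    series: List[Tuple[date, int]] = []
    while day <= last:
        series.append((day, counts[day]))  # a missing day counts as 0
        day += timedelta(days=1)
    return series


# Hypothetical input document set (1): documents with time information.
document_set_1 = [
    (date(2009, 9, 1), "first document"),
    (date(2009, 9, 1), "second document"),
    (date(2009, 9, 3), "third document"),
]
print(generate_time_series(document_set_1))
```

Calling the function once per input document set yields the time-series data (1) and (2) that the later steps compare.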
- Next, the corresponding section selection unit 30 compares the plurality of time series data obtained from the plurality of document sets with each other, and selects, from each time series data, two or more sections (corresponding sections) whose changes correspond to those of two or more sections of the other time series data.
- Specifically, when step A2 is completed, the corresponding section pair selection unit 31 compares the time-series data (1) and the time-series data (2), and selects corresponding section pairs that change with high correlation with each other (step A3). Subsequently, the corresponding section pair selection unit 31 determines whether or not two or more corresponding section pairs that fluctuate with high correlation could be selected from the time series data (1) and (2) (step A4).
- As a result of step A4, if one corresponding section pair or fewer has been selected, the corresponding section pair selection unit 31 instructs the relevance calculation unit 70 to stop calculating the relevance, and the processing stops. On the other hand, if two or more corresponding section pairs have been selected as a result of step A4, the corresponding section pair selection unit 31 inputs information identifying the selected corresponding section pairs to the similar corresponding section pair selection unit 32.
- When this information is input, the similar corresponding section pair selection unit 32 selects, from the plurality of already selected corresponding section pairs, corresponding section pairs that are similar within each of the time series data (1) and the time series data (2) (step A5). Subsequently, the similar corresponding section pair selection unit 32 determines whether two or more corresponding section pairs have been selected (that is, whether the total number of corresponding sections is four or more) (step A6).
- As a result of step A6, when two or more corresponding section pairs have not been selected in the time series data (1) and (2), the similar corresponding section pair selection unit 32 instructs the relevance calculation unit 70 to stop calculating the relevance, and the processing stops. On the other hand, if two or more corresponding section pairs have been selected in the time series data (1) and (2) as a result of step A6, the similar corresponding section pair selection unit 32 inputs information specifying the selected corresponding section pairs to the feature extraction unit 40.
- When the feature extraction unit 40 receives the information from the similar corresponding section pair selection unit 32, it identifies the documents belonging to each corresponding section selected from each time-series data, and extracts the features of the identified documents for each corresponding section (step A7). Then, the feature extraction unit 40 inputs the extracted features to the comparison unit 50.
- Next, the comparison unit 50 obtains, for each time series data, the inter-feature distance between the feature extracted from one corresponding section and the feature extracted from another corresponding section, and compares the inter-feature distances obtained for the respective time series data with each other (step A8).
- the comparison unit 50 pays attention to each time series data, calculates the inter-feature distance between a plurality of corresponding sections within each time series data, and between the features in the time series data (1). The distance is compared with the distance between features in the time series data (2). Then, the comparison unit 50 inputs a comparison result between the inter-feature distance in the time series data (1) and the inter-feature distance in the time series data (2) to the relevance calculation unit 70.
- The relevance calculation unit 70 calculates the relevance between the input document sets based on the comparison result input by the comparison unit 50 (step A9). Thereafter, when the relevance calculation unit 70 outputs analysis data specifying the relevance to the outside, the processing in the information analysis apparatus 1 ends.
- the program in the first embodiment may be a program that causes a computer to execute steps A1 to A9 shown in FIG. Therefore, the information analysis apparatus 1 can be embodied by installing this program in a computer and further executing it.
- In this case, the CPU (central processing unit) of the computer functions as the time-series data generation unit 20, the corresponding section selection unit 30, the feature extraction unit 40, the comparison unit 50, and the relevance calculation unit 70, and performs the processing.
- the database 60 can be realized by storing a data file in a storage device such as a hard disk or by mounting a recording medium storing the data file on a reading device connected to a computer.
- the storage device constituting the database 60 may be provided in a computer in which the above-described program is installed, or may be provided in another computer connected via a network.
- the reading device may be connected to a computer in which the above-described program is installed, or may be connected to another computer connected via a network.
- FIG. 10 is a block diagram showing a schematic configuration of the information analysis apparatus according to Embodiment 2 of the present invention.
- The information analysis apparatus 2 in the second embodiment does not include a time-series data generation unit (see FIG. 1), and differs from the information analysis apparatus 1 in the first embodiment in this respect. Because no time-series data generation unit is provided, the information analysis apparatus 2 also differs from the information analysis apparatus 1 in the functions of its units. The differences from the information analysis apparatus 1 are described below.
- the time series data generated from the document set in advance is input to the information analysis apparatus 2.
- the input unit 10 receives time-series data input. Also in the second embodiment, two pieces of time-series data are input. In the second embodiment, one corresponding section of one time-series data and a corresponding section of the other time-series data corresponding to this corresponding section are set in advance. Information specifying a preset corresponding section (set corresponding section) is also input to the input unit 10.
- Suppose that the input time-series data (1) and (2) are as shown in FIG. 2, and that the pair consisting of the corresponding section 1-1 and the corresponding section 2-1, which changes with high correlation with the corresponding section 1-1, is set in advance. In this case, the input unit 10 receives the time-series data (1) and (2) together with the information specifying the set corresponding section 1-1 and the set corresponding section 2-1.
- The corresponding section selection unit 30 first selects, for one time-series data, a corresponding section whose change is similar to that of its set corresponding section. The corresponding section selection unit 30 then selects, for the other time-series data, a corresponding section whose change is similar to that of its set corresponding section and which corresponds to the corresponding section selected for the first time-series data.
- The corresponding section selection unit 30 selects, as the corresponding section 1-2, a partial section of the time-series data (1) that is similar to the set corresponding section 1-1. It further selects, as the corresponding section 2-2, a partial section of the time-series data (2) that is similar to the set corresponding section 2-1 and changes with high correlation with the corresponding section 1-2.
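The selection of a section that resembles a set corresponding section can be sketched as a sliding-window search. This is only an illustration under stated assumptions. The passage does not say how similarity of change is measured, so the sketch uses the Pearson correlation coefficient between the set section and each candidate window. The names `pearson` and `find_similar_section` and the default threshold of 0.9 are hypothetical.

```python
import math
from typing import List, Optional, Tuple


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 when either side is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def find_similar_section(series: List[float],
                         set_section: Tuple[int, int],
                         threshold: float = 0.9) -> Optional[Tuple[int, int]]:
    """Return the (start, end) window most similar to the set section, or None.

    Windows have the same width as the set section and must not overlap it.
    A window qualifies only if its correlation reaches the assumed threshold.
    """
    set_start, set_end = set_section
    width = set_end - set_start
    template = series[set_start:set_end]
    best, best_r = None, threshold
    for start in range(len(series) - width + 1):
        end = start + width
        if start < set_end and end > set_start:  # skip windows overlapping the set section
            continue
        r = pearson(template, series[start:end])
        if r >= best_r:
            best, best_r = (start, end), r
    return best


# Time-series data (1): the peak in positions 0-4 recurs in positions 7-11.
series_1 = [1, 5, 9, 5, 1, 2, 2, 1, 5, 9, 5, 1]
section_1_2 = find_similar_section(series_1, (0, 5))
```

For the time-series data (2), the same search would be followed by a second check that the candidate also changes with high correlation with the section chosen in the time-series data (1), for example by applying `pearson` to the two aligned sections.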
- The feature extraction unit 40 identifies the documents belonging to the set corresponding section of each time-series data and the documents belonging to the selected corresponding section of each time-series data, and extracts the features of the identified documents for each corresponding section.
- the comparison unit 50 obtains the inter-feature distance between the feature extracted from the set corresponding section and the feature extracted from the selected corresponding section. Also in the second embodiment, the comparison unit 50 calculates the inter-feature distance using the distance function stored in the database 60 as in the first embodiment. Further, as in the first embodiment, the comparison unit 50 compares the inter-feature distances for each obtained time-series data, and inputs the comparison result to the relevance calculation unit 70.
- The relevance calculation unit 70 calculates the degree of relevance based on the comparison result from the comparison unit 50. Here, the degree of relevance is calculated between one set corresponding section and the other set corresponding section.
- FIG. 11 is a flowchart showing the flow of processing in the information analysis method according to Embodiment 2 of the present invention.
- the information analysis method in the second embodiment is performed by operating the information analysis apparatus 2 in the second embodiment shown in FIG. For this reason, the following description will be described together with the operation of the information analysis apparatus 2 with appropriate reference to FIG.
- The input unit 10 receives the time-series data (1) and (2) to be analyzed and the information specifying their respective set corresponding sections (set corresponding section information) (step A11).
- The corresponding section selection unit 30 selects a corresponding section whose change is similar to that of the set corresponding section of the time-series data (1). It further selects a corresponding section whose change is similar to that of the set corresponding section of the time-series data (2) and which corresponds to the corresponding section selected for the time-series data (1) (step A12).
- The feature extraction unit 40 identifies the documents belonging to the set corresponding section of each time-series data and the documents belonging to the selected corresponding section of each time-series data, and extracts the features of the identified documents for each corresponding section (step A13).
- The comparison unit 50 obtains the inter-feature distance between the feature extracted from the set corresponding section and the feature extracted from the selected corresponding section, compares the inter-feature distances obtained for the respective time-series data, and inputs the comparison result to the relevance calculation unit 70 (step A14).
- The relevance calculation unit 70 calculates the degree of relevance between one set corresponding section and the other set corresponding section based on the result of the comparison by the comparison unit 50 (step A15). Thereafter, when the relevance calculation unit 70 outputs analysis data specifying the degree of relevance to the outside, the processing in the information analysis apparatus 2 ends.
- This approach is effective when document sets with a high degree of relevance must be found in a collection of many documents that fluctuate for various reasons, such as a collection of document data on the Internet.
- the program in the second embodiment is a program for causing a computer to execute steps A11 to A15 shown in FIG. Therefore, the information analysis apparatus 2 can be realized by installing this program in a computer and further executing it.
- The CPU (central processing unit) of the computer functions as the corresponding section selection unit 30, the feature extraction unit 40, the comparison unit 50, and the relevance calculation unit 70, and performs the processing.
- The database 60 can be realized by storing a data file in a storage device such as a hard disk, or by mounting a recording medium storing the data file in a reading device connected to the computer.
- the present invention can be used for analyzing document data on the Internet such as a blog and document data to which time information such as a call center response history is attached. It can also be used for the purpose of obtaining related document sets when analyzing the results of questionnaire surveys and market surveys that are performed regularly. Furthermore, according to the present invention, since the degree of association between document sets that changes with time can be calculated appropriately, it can also be applied to document search navigation, search result classification, and the like.
Description
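This application claims priority based on Japanese Patent Application No. 2008-244753, filed in Japan on September 24, 2008, the contents of which are incorporated herein by reference.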
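a corresponding section selection unit that compares with one another a plurality of time-series data generated from a plurality of the document sets, for each document set, based on the time information, and selects, from each time-series data, two or more sections that change in correspondence with respective ones of two or more sections of the other time-series data;
a feature extraction unit that, for each of the plurality of time-series data, identifies for each section the documents belonging to the selected two or more sections, and extracts the features of the identified documents for each section;
a comparison unit that, for each time-series data, obtains an inter-feature distance between a feature extracted from one section and a feature extracted from another section among the selected two or more sections, and compares the inter-feature distances obtained for the respective time-series data with one another; and
a relevance calculation unit that calculates a degree of relevance between the document sets based on the result of the comparison by the comparison unit; the invention is characterized by comprising these units.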
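(a) a step of comparing with one another a plurality of time-series data generated from a plurality of the document sets, for each document set, based on the time information, and selecting, from each time-series data, two or more sections that change in correspondence with respective ones of two or more sections of the other time-series data;
(b) a step of identifying, for each of the plurality of time-series data and for each section, the documents belonging to the selected two or more sections, and extracting the features of the identified documents for each section;
(c) a step of obtaining, for each time-series data, an inter-feature distance between a feature extracted from one section and a feature extracted from another section among the selected two or more sections, and comparing the inter-feature distances obtained for the respective time-series data with one another; and
(d) a step of calculating a degree of relevance between the document sets based on the result of the comparison in step (c); the invention is characterized by including these steps.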
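causing the computer to execute:
(a) a step of comparing with one another a plurality of time-series data generated from a plurality of the document sets, for each document set, based on the time information, and selecting, from each time-series data, two or more sections that change in correspondence with respective ones of two or more sections of the other time-series data;
(b) a step of identifying, for each of the plurality of time-series data and for each section, the documents belonging to the selected two or more sections, and extracting the features of the identified documents for each section;
(c) a step of obtaining, for each time-series data, an inter-feature distance between a feature extracted from one section and a feature extracted from another section among the selected two or more sections, and comparing the inter-feature distances obtained for the respective time-series data with one another; and
(d) a step of calculating a degree of relevance between the document sets based on the result of the comparison in step (c); the invention is characterized by causing the computer to execute these steps.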
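Hereinafter, the information analysis apparatus, information analysis method, and program according to Embodiment 1 of the present invention will be described with reference to FIGS. 1 to 9. First, the configuration of the information analysis apparatus according to Embodiment 1 of the present invention will be described with reference to FIGS. 1 to 5. FIG. 1 is a block diagram showing a schematic configuration of the information analysis apparatus according to Embodiment 1 of the present invention. FIGS. 2 to 5 are each a diagram showing an example of time-series data.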
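When the two features extracted from the two corresponding sections for which the distance function is evaluated are exactly identical, the inter-feature distance between them is 0 (zero).
When feature (1) is extracted from one corresponding section and feature (2) is extracted from another corresponding section, the distance between feature (1) and feature (2) is equal to the distance between feature (2) and feature (1), taken in the reverse order.
When three corresponding sections have feature (1), feature (2), and feature (3), the following relationship holds among the distances between them.
(inter-feature distance between feature (1) and feature (3)) ≤ (inter-feature distance between feature (1) and feature (2)) + (inter-feature distance between feature (2) and feature (3))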
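The three conditions above (zero distance for identical features, symmetry, and the triangle inequality) can be checked mechanically for a candidate distance function. The sketch below is illustrative only: the helper name `satisfies_distance_conditions`, the numeric tolerance, and the use of the Euclidean distance as the candidate are assumptions, and a check over a finite sample of features demonstrates the conditions rather than proving them.

```python
import math
from itertools import permutations
from typing import Callable, Sequence, Tuple

Vector = Tuple[float, ...]


def euclidean(u: Vector, v: Vector) -> float:
    """Candidate distance function (an assumed example)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def satisfies_distance_conditions(dist: Callable[[Vector, Vector], float],
                                  features: Sequence[Vector],
                                  tol: float = 1e-12) -> bool:
    """Check the three conditions on every combination of the given features."""
    # Identical features must be at distance 0.
    if any(abs(dist(f, f)) > tol for f in features):
        return False
    # Swapping the order of the two features must not change the distance.
    if any(abs(dist(f, g) - dist(g, f)) > tol for f, g in permutations(features, 2)):
        return False
    # d(1,3) <= d(1,2) + d(2,3) for every ordered triple.
    return all(dist(f, h) <= dist(f, g) + dist(g, h) + tol
               for f, g, h in permutations(features, 3))


sample = [(0.8, 0.6, 0.0), (0.0, 0.5, 0.4), (0.3, 0.3, 0.3)]
ok = satisfies_distance_conditions(euclidean, sample)
```

A squared difference, by contrast, fails the third condition: for the one-dimensional features 0, 1, and 2 it gives 4 for the outer pair but only 1 + 1 for the two inner pairs.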
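When two features are input to the comparison unit 50, suppose that one feature is represented by a vector of m feature elements, the other feature is represented by a vector of n feature elements, and both features have c feature elements in common. In this case, the number of feature elements that are not common is (m+n-c). The inter-feature distance increases monotonically with the number of feature elements that are not common.
When two features are input to the comparison unit 50, suppose that one feature is represented by a vector (feature vector) of m feature elements and the m corresponding feature scores, and the other feature is represented by a vector (feature vector) of n feature elements and the n corresponding feature scores. Suppose also that both features have c feature elements in common. In this case, the difference between the two feature vectors is obtained by procedures 5-1 to 5-3 below, and the magnitude of the difference is the inter-feature distance.
First, the two input feature vectors are normalized so that their numbers of dimensions match. As a result, each feature vector is given, for every feature element that exists only in the other vector, that feature element with a feature score of 0 (zero), so that the two feature vectors have all of their feature elements in common.
For each of the two input feature vectors, the order in which the feature scores appear in the vector is sorted by type of feature element. Feature elements of the same type (the same linguistic expression or the same meta-information) are sorted so that their feature scores appear at the same position in both vectors.
After the number of dimensions and the order of the feature scores have been normalized by procedures 5-1 and 5-2, a difference vector is calculated for the two normalized feature vectors. The values of this difference vector are the differences between the corresponding feature scores of the two feature vectors, and it has (m+n-c) dimensions. The absolute value of the magnitude of the resulting difference vector is then obtained and taken as the distance between the two input feature vectors (the inter-feature distance).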
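[Feature (1)]
("effective against cancer", 0.8, "no side effects", 0.6, "document category: advertisement", 0.85)
[Feature (2)]
("fast-acting", 0.4, "no side effects", 0.5, "document category: advertisement", 0.7)
[Normalized feature (1)]
("effective against cancer", 0.8, "no side effects", 0.6, "fast-acting", 0, "document category: advertisement", 0.85)
[Normalized feature (2)]
("effective against cancer", 0, "no side effects", 0.5, "fast-acting", 0.4, "document category: advertisement", 0.7)
Difference vector = ((0.8-0), (0.6-0.5), (0-0.4), (0.85-0.7))
Expanding the above expression gives the following.
Difference vector = (0.8, 0.1, -0.4, 0.15)
The absolute value of the magnitude of this difference vector is the inter-feature distance.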
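The worked example can be reproduced with the short sketch below, which follows the three procedures: padding missing feature elements with a score of 0, aligning the order of the elements, and taking the magnitude of the difference vector. Two points are assumptions rather than statements from the text: the magnitude is taken to be the Euclidean norm, and each feature is held in a dictionary keyed by feature element. The function name `inter_feature_distance` is hypothetical, and the keys are English renderings of the example's feature elements.

```python
import math
from typing import Dict


def inter_feature_distance(feature_a: Dict[str, float],
                           feature_b: Dict[str, float]) -> float:
    """Distance between two features given as {feature element: feature score}."""
    # Procedures 5-1 and 5-2: take the union of the feature elements in one
    # fixed order, so both vectors have the same dimensions and alignment.
    elements = sorted(set(feature_a) | set(feature_b))
    # Procedure 5-3: an element missing from one feature counts as score 0.
    diff = [feature_a.get(e, 0.0) - feature_b.get(e, 0.0) for e in elements]
    # Assumed reading of "magnitude": the Euclidean norm of the difference vector.
    return math.sqrt(sum(d * d for d in diff))


feature_1 = {
    "effective against cancer": 0.8,
    "no side effects": 0.6,
    "document category: advertisement": 0.85,
}
feature_2 = {
    "fast-acting": 0.4,
    "no side effects": 0.5,
    "document category: advertisement": 0.7,
}
distance = inter_feature_distance(feature_1, feature_2)
```

Under the Euclidean assumption, the difference vector (0.8, 0.1, -0.4, 0.15) has squared magnitude 0.64 + 0.01 + 0.16 + 0.0225 = 0.8325, so the inter-feature distance is about 0.912.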
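Next, the information analysis apparatus, information analysis method, and program according to Embodiment 2 of the present invention will be described with reference to FIGS. 10 and 11. First, the configuration of the information analysis apparatus according to Embodiment 2 of the present invention will be described with reference to FIG. 10. FIG. 10 is a block diagram showing a schematic configuration of the information analysis apparatus according to Embodiment 2 of the present invention.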
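2 Information analysis apparatus (Embodiment 2)
10 Input unit
20 Time-series data generation unit
30 Corresponding section selection unit
31 Corresponding section pair selection unit
32 Similar opposing section pair selection unit
40 Feature extraction unit
50 Comparison unit
60 Database
70 Degree of relevance
80 Output unit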
Claims (15)
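- An information analysis apparatus that performs information analysis on document sets including documents to which time information is attached, the apparatus comprising:
a corresponding section selection unit that compares with one another a plurality of time-series data generated from a plurality of the document sets, for each document set, based on the time information, and selects, from each time-series data, two or more sections that change in correspondence with respective ones of two or more sections of the other time-series data;
a feature extraction unit that, for each of the plurality of time-series data, identifies for each section the documents belonging to the selected two or more sections, and extracts the features of the identified documents for each section;
a comparison unit that, for each time-series data, obtains an inter-feature distance between a feature extracted from one section and a feature extracted from another section among the selected two or more sections, and compares the inter-feature distances obtained for the respective time-series data with one another; and
a relevance calculation unit that calculates a degree of relevance between the document sets based on the result of the comparison by the comparison unit.
- The information analysis apparatus according to claim 1, further comprising:
an input unit that receives input of the plurality of document sets; and
a time-series data generation unit that generates the plurality of time-series data from the input plurality of document sets, for each document set, based on the time information.
- The information analysis apparatus according to claim 2, wherein, when the input unit receives input of two document sets and the time-series data generation unit generates two time-series data,
the corresponding section selection unit obtains a correlation coefficient between one of the time-series data and the other time-series data, and selects, as the two or more correspondingly changing sections, two or more sections in each of the two time-series data in which the absolute value of the correlation coefficient exceeds, or is equal to or greater than, a set threshold.
- The information analysis apparatus according to claim 2 or 3, wherein, when the input unit receives input of two document sets and the time-series data generation unit generates two time-series data,
the corresponding section selection unit further determines, for each of the two time-series data, whether the changes in the selected two or more correspondingly changing sections are similar to one another; when both of the two time-series data contain two or more sections whose changes are similar to one another, determines whether each of the mutually similar two or more sections of one time-series data corresponds to each of the mutually similar two or more sections of the other time-series data; and, when there are two or more pairs of correspondingly changing sections, selects these sections again,
the feature extraction unit identifies, for each of the two time-series data and for each section, the documents belonging to the reselected two or more sections, and
the comparison unit obtains, for each time-series data, the inter-feature distance between one section and another section among the reselected two or more sections.
- The information analysis apparatus according to claim 1, further comprising an input unit that receives input of time-series data generated from the document sets based on the time information, wherein,
when the input unit receives input of two time-series data, and one section of one time-series data and one section of the other time-series data that changes in correspondence with that section are set in advance,
the corresponding section selection unit selects, for the one time-series data, a section whose change is similar to that of its preset section, and further selects, for the other time-series data, a section whose change is similar to that of its preset section and which changes in correspondence with the section selected for the one time-series data,
the feature extraction unit identifies the documents belonging to the preset section of each of the two time-series data and the documents belonging to the selected section of each of the two time-series data, and extracts, for each section, the features of the identified documents,
the comparison unit obtains, for each time-series data, the inter-feature distance between the feature extracted from the documents belonging to the preset section and the feature extracted from the documents belonging to the selected section, and compares the inter-feature distances obtained for the respective time-series data with one another, and
the relevance calculation unit calculates the degree of relevance between the preset sections based on the result of the comparison by the comparison unit.
- An information analysis method for performing information analysis on document sets including documents to which time information is attached, the method comprising: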
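(a) a step of comparing with one another a plurality of time-series data generated from a plurality of the document sets, for each document set, based on the time information, and selecting, from each time-series data, two or more sections that change in correspondence with respective ones of two or more sections of the other time-series data;
(b) a step of identifying, for each of the plurality of time-series data and for each section, the documents belonging to the selected two or more sections, and extracting the features of the identified documents for each section;
(c) a step of obtaining, for each time-series data, an inter-feature distance between a feature extracted from one section and a feature extracted from another section among the selected two or more sections, and comparing the inter-feature distances obtained for the respective time-series data with one another; and
(d) a step of calculating a degree of relevance between the document sets based on the result of the comparison in step (c).
- The information analysis method according to claim 6, further comprising:
(e) a step of receiving input of the plurality of document sets before step (a) is executed; and
(f) a step of generating the plurality of time-series data from the plurality of document sets input in step (e), for each document set, based on the time information.
- The information analysis method according to claim 7, wherein, when input of two document sets is received in step (e) and two time-series data are generated in step (f),
in step (a), a correlation coefficient between one of the time-series data and the other time-series data is obtained, and two or more sections in each of the two time-series data in which the absolute value of the correlation coefficient exceeds, or is equal to or greater than, a set threshold are selected as the two or more correspondingly changing sections.
- The information analysis method according to claim 7 or 8, wherein, when input of two document sets is received in step (e) and two time-series data are generated in step (f),
in step (a), after the two or more correspondingly changing sections are selected, it is further determined, for each of the two time-series data, whether the changes in the selected two or more sections are similar to one another; when both of the two time-series data contain two or more sections whose changes are similar to one another, it is determined whether each of the mutually similar two or more sections of one time-series data changes in correspondence with each of the mutually similar two or more sections of the other time-series data; and, when there are two or more pairs of correspondingly changing sections, these sections are selected again,
in step (b), the documents belonging to the reselected two or more sections are identified for each of the two time-series data and for each section, and
in step (c), the inter-feature distance between one section and another section among the reselected two or more sections is obtained for each time-series data.
- The information analysis method according to claim 6, further comprising (g) a step of receiving input of time-series data generated from the document sets based on the time information before step (a) is executed, wherein,
when input of two time-series data is received in step (g), and one section of one time-series data and one section of the other time-series data that changes in correspondence with that section are set in advance,
in step (a), a section whose change is similar to that of the preset section is selected for the one time-series data, and a section whose change is similar to that of the preset section and which changes in correspondence with the section selected in the one time-series data is further selected for the other time-series data,
in step (b), the documents belonging to the preset section of each of the two time-series data and the documents belonging to the selected section of each of the two time-series data are identified, and the features of the identified documents are extracted for each section,
in step (c), the inter-feature distance between the feature extracted from the documents belonging to the preset section and the feature extracted from the documents belonging to the selected section is obtained for each time-series data, and the inter-feature distances obtained for the respective time-series data are compared with one another, and
in step (d), the degree of relevance between the preset sections is calculated based on the result of the comparison in step (c).
- A program for causing a computer to execute information analysis on document sets including documents to which time information is attached,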
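the program causing the computer to execute:
(a) a step of comparing with one another a plurality of time-series data generated from a plurality of the document sets, for each document set, based on the time information, and selecting, from each time-series data, two or more sections that change in correspondence with respective ones of two or more sections of the other time-series data;
(b) a step of identifying, for each of the plurality of time-series data and for each section, the documents belonging to the selected two or more sections, and extracting the features of the identified documents for each section;
(c) a step of obtaining, for each time-series data, an inter-feature distance between a feature extracted from one section and a feature extracted from another section among the selected two or more sections, and comparing the inter-feature distances obtained for the respective time-series data with one another; and
(d) a step of calculating a degree of relevance between the document sets based on the result of the comparison in step (c).
- The program according to claim 11, further causing the computer to execute:
(e) a step of receiving input of the plurality of document sets before step (a) is executed; and
(f) a step of generating the plurality of time-series data from the plurality of document sets input in step (e), for each document set, based on the time information.
- The program according to claim 12, wherein, when input of two document sets is received in step (e) and two time-series data are generated in step (f),
in step (a), a correlation coefficient between one of the time-series data and the other time-series data is obtained, and two or more sections in each of the two time-series data in which the absolute value of the correlation coefficient exceeds, or is equal to or greater than, a set threshold are selected as the two or more correspondingly changing sections.
- The program according to claim 12 or 13, wherein, when input of two document sets is received in step (e) and two time-series data are generated in step (f),
in step (a), after the two or more correspondingly changing sections are selected, it is further determined, for each of the two time-series data, whether the changes in the selected two or more sections are similar to one another; when both of the two time-series data contain two or more sections whose changes are similar to one another, it is determined whether each of the mutually similar two or more sections of one time-series data changes in correspondence with each of the mutually similar two or more sections of the other time-series data; and, when there are two or more pairs of correspondingly changing sections, these sections are selected again,
in step (b), the documents belonging to the reselected two or more sections are identified for each of the two time-series data and for each section, and
in step (c), the inter-feature distance between one section and another section among the reselected two or more sections is obtained for each time-series data.
- The program according to claim 11, further causing the computer to execute (g) a step of receiving input of time-series data generated from the document sets based on the time information before step (a) is executed, wherein,
when input of two time-series data is received in step (g), and one section of one time-series data and one section of the other time-series data that changes in correspondence with that section are set in advance,
in step (a), a section whose change is similar to that of the preset section is selected for the one time-series data, and a section whose change is similar to that of the preset section and which changes in correspondence with the section selected in the one time-series data is further selected for the other time-series data,
in step (b), the documents belonging to the preset section of each of the two time-series data and the documents belonging to the selected section of each of the two time-series data are identified, and the features of the identified documents are extracted for each section,
in step (c), the inter-feature distance between the feature extracted from the documents belonging to the preset section and the feature extracted from the documents belonging to the selected section is obtained for each time-series data, and the inter-feature distances obtained for the respective time-series data are compared with one another, and
in step (d), the degree of relevance between the preset sections is calculated based on the result of the comparison in step (c).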
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2010530725A JP5387578B2 (ja) | 2008-09-24 | 2009-09-18 | Information analysis apparatus, information analysis method, and program |
US13/060,572 US20110153601A1 (en) | 2008-09-24 | 2009-09-18 | Information analysis apparatus, information analysis method, and program |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2008-244753 | 2008-09-24 | ||
JP2008244753 | 2008-09-24 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2010035455A1 (ja) | 2010-04-01 |
Family
ID=42059468
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2009/004752 WO2010035455A1 (ja) | Information analysis apparatus, information analysis method, and program | 2008-09-24 | 2009-09-18 |
Country Status (3)
Country | Link |
---|---|
US (1) | US20110153601A1 (ja) |
JP (1) | JP5387578B2 (ja) |
WO (1) | WO2010035455A1 (ja) |
-
2009
- 2009-09-18 JP JP2010530725A patent/JP5387578B2/ja active Active
- 2009-09-18 WO PCT/JP2009/004752 patent/WO2010035455A1/ja active Application Filing
- 2009-09-18 US US13/060,572 patent/US20110153601A1/en not_active Abandoned
Also Published As
Publication number | Publication date |
---|---|
US20110153601A1 (en) | 2011-06-23 |
JPWO2010035455A1 (ja) | 2012-02-16 |
JP5387578B2 (ja) | 2014-01-15 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 09815877; Country of ref document: EP; Kind code of ref document: A1 |
| WWE | Wipo information: entry into national phase | Ref document number: 13060572; Country of ref document: US |
| WWE | Wipo information: entry into national phase | Ref document number: 2010530725; Country of ref document: JP |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 09815877; Country of ref document: EP; Kind code of ref document: A1 |