WO2023012933A1

WO2023012933A1 - Feature extraction device, feature extraction method, and feature extraction program

Info

Publication number: WO2023012933A1
Application number: PCT/JP2021/028957
Authority: WO
Inventors: 太三山本; 愛角田; 高明森谷; 学西尾; 優三好
Original assignee: 日本電信電話株式会社
Priority date: 2021-08-04
Filing date: 2021-08-04
Publication date: 2023-02-09
Also published as: JPWO2023012933A1

Abstract

This feature extraction device comprises: a combining unit (11) that generates a plurality of data pairs that are combinations of two sets of time-series data; a data analysis unit (12) that analyzes the degree of similarity between the two sets of time-series data included in each data pair using a plurality of analysis methods; and an occurrence probability calculation unit (13) that calculates, for each analysis method, the probability of occurrence of the degree of similarity for each data pair. The feature extraction device further comprises: a deviation level calculation unit (14) that calculates a deviation level of the probability of occurrence for each data pair calculated for each analysis method; a visualization unit (15) that visualizes and presents each set of time-series data included in a data pair and the deviation level to a user; an input unit (16) that receives similarity or dissimilarity determination input from the user; and a feature extraction unit (17) that extracts a feature of each analysis method on the basis of the determination input.

Description

Feature extraction device, feature extraction method, and feature extraction program

The present invention relates to a feature extraction device, a feature extraction method, and a feature extraction program.

As a data analysis device for analyzing time-series data, the one disclosed in Patent Document 1 is known. Patent Document 1 describes calculating an index value indicating the amount of change over time in each piece of data to be analyzed, and displaying a plurality of graphs of time-series data arranged in order based on the index value. According to Patent Document 1, it is possible, for example, to focus on data that changes significantly at a specific time, which supports data analysis.

Patent Document 1: Japanese Patent No. 6592411

However, Patent Document 1 makes no mention of determining, when a plurality of analysis methods are available for analyzing time-series data, which analysis method is suitable for analyzing that time-series data.

For this reason, there was a problem that it was difficult to select an appropriate analysis method when analyzing the target time-series data.

The present invention has been made in view of the above circumstances, and its object is to provide a feature extraction device, a feature extraction method, and a feature extraction program capable of extracting features of an analysis method for analyzing time-series data.

A feature extraction device according to one aspect of the present invention includes: a combining unit that generates a plurality of data pairs, each combining two pieces of time-series data; an analysis unit that analyzes, using a plurality of analysis methods, the degree of similarity between the two pieces of time-series data included in each data pair; an occurrence probability calculation unit that calculates, for each analysis method, the occurrence probability of the similarity of each data pair based on the analysis result of the analysis unit; a divergence calculation unit that calculates, for each data pair, the degree of divergence of the occurrence probability calculated for each analysis method; a visualization unit that visualizes each piece of time-series data included in a data pair together with the degree of divergence and presents them to a user; an input unit that receives a similarity or dissimilarity determination input from the user; and a feature extraction unit that extracts features of the analysis method based on the determination input and the degree of divergence.

A feature extraction method according to one aspect of the present invention includes: a step of generating a plurality of data pairs, each combining two pieces of time-series data; a step of analyzing, using a plurality of analysis methods, the degree of similarity between the two pieces of time-series data included in each data pair; a step of calculating, for each analysis method, the occurrence probability of the similarity of each data pair based on the analyzed similarity; a step of calculating, for each data pair, the degree of divergence of the occurrence probability calculated for each analysis method; a step of visualizing each piece of time-series data included in a data pair together with the degree of divergence and presenting them to a user; a step of receiving a similarity or dissimilarity determination input from the user; and a step of extracting features of the analysis method based on the determination input and the degree of divergence.

One aspect of the present invention is a feature extraction program for causing a computer to function as the feature extraction device.

According to the present invention, it is possible to extract the characteristics of the analysis method for analyzing time-series data.

FIG. 1 is a block diagram showing the configuration of the feature extraction device according to the first embodiment.
FIG. 2 is an explanatory diagram showing an example of time-series data and data pairs obtained by combining two pieces of time-series data.
FIG. 3 is an explanatory diagram showing the first to fourth analysis methods.
FIG. 4A is an explanatory diagram showing analysis values calculated by the first to fourth analysis methods for a plurality of data pairs.
FIG. 4B is an explanatory diagram showing the distribution curves of the analysis values shown in FIG. 4A, where (a) is the distribution curve by the first analysis method, (b) is the distribution curve by the second analysis method, (c) is the distribution curve by the third analysis method, and (d) is the distribution curve by the fourth analysis method.
FIG. 5 is an explanatory diagram showing a normalized curve obtained by normalizing the distribution curve s1 shown in FIG. 4B.
FIG. 6 is a diagram showing a plurality of data pairs, the appearance probabilities of the similarities obtained by analyzing each data pair by the four analysis methods, and the degrees of divergence of the appearance probabilities.
FIG. 7A is a diagram showing two pieces of time-series data forming a data pair and the appearance probabilities calculated by the four analysis methods.
FIG. 7B is a diagram showing two pieces of time-series data forming a data pair and the appearance probabilities calculated by the four analysis methods.
FIG. 8 is a flowchart showing the processing procedure of the feature extraction device according to the first embodiment.
FIG. 9 is an explanatory diagram showing characteristic patterns of each piece of time-series data acquired by the recording unit 18.
FIG. 10 is a block diagram showing the configuration of the feature extraction device according to the second embodiment.
FIG. 11A is an explanatory diagram showing analysis results of a plurality of data pairs and a normalized curve of the analysis results.
FIG. 11B is an explanatory diagram showing the time-series data forming the data pairs and the degree of divergence of the appearance probability of each data pair.
FIG. 12 is a block diagram showing the hardware configuration of this embodiment.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.

[First Embodiment]
FIG. 1 is a block diagram showing the configuration of the feature extraction device according to the first embodiment.

As shown in FIG. 1, the feature extraction device 1 according to the first embodiment is connected to a database 2 (denoted as "DB" in the figure). The feature extraction device 1 includes a combination unit 11, a data analysis unit 12 (analysis unit), an occurrence probability calculation unit 13, a divergence calculation unit 14, a visualization unit 15, an input unit 16, a feature extraction unit 17, and a recording unit 18.

The database 2 stores a plurality (m) of time-series data qi (i=1 to m). The time-series data qi is, for example, the consumer price index provided by the Statistics Bureau, Ministry of Internal Affairs and Communications.

The combination unit 11 generates a data pair by combining two pieces of time-series data. Specifically, the combination unit 11 selects and combines two pieces of time-series data qi stored in the database 2 to set data pairs aj (j=1 to n). FIG. 2 is an explanatory diagram showing an example of setting a data pair aj by combining two pieces of time-series data. As shown in FIG. 2, time-series data q1 and q2 are combined to generate data pair a1. A data pair a2 is generated by combining the time-series data q2 and q3. A data pair a3 is generated by combining the time-series data q1 and q4. A data pair a4 is generated by combining the time-series data q3 and q4.

When there are m pieces of time-series data, "m*(m-1)/2" data pairs are set. That is, "n=m*(m-1)/2". For example, when there are 380 pieces of time-series data, (380*379)/2=72010 data pairs are set.
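The pair count above can be checked with a minimal Python sketch. The function name `make_pairs` and the identifiers `q1` to `q4` are hypothetical and used only for illustration; the sketch assumes every unordered pair of distinct time-series data is generated.

```python
from itertools import combinations


def make_pairs(series_ids):
    """Return every unordered pair of distinct time-series identifiers."""
    return list(combinations(series_ids, 2))


# Hypothetical identifiers standing in for the stored time-series data.
pairs = make_pairs(["q1", "q2", "q3", "q4"])
print(len(pairs))  # number of pairs for m = 4

# For m = 380, the number of generated pairs equals m*(m-1)/2.
m = 380
print(len(make_pairs(range(m))), m * (m - 1) // 2)
```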

The data analysis unit 12 analyzes the degree of similarity between the two pieces of time-series data included in each data pair using a plurality of analysis methods. Specifically, the data analysis unit 12 is provided with computation programs for a plurality of analysis methods for analyzing the data pairs aj set by the combination unit 11. The data analysis unit 12 includes a first analysis unit 21 that analyzes data pairs by a first analysis method, a second analysis unit 22 that analyzes data pairs by a second analysis method, a third analysis unit 23 that analyzes data pairs by a third analysis method, and a fourth analysis unit 24 that analyzes data pairs by a fourth analysis method.

The data analysis unit 12 analyzes the degree of similarity between the two pieces of time-series data forming the data pair aj by the first to fourth analysis methods, and outputs the analysis result as an analysis value. This embodiment shows an example using four analysis methods, but the number of analysis methods is not limited to four. Specific processing of the first to fourth analysis methods will be described below with reference to FIG. 3.

In the first analysis method, as shown in FIG. 3(a), the absolute value of the difference between two pieces of time-series data is integrated. Specifically, the absolute value of the difference between one piece of time-series data and the other, obtained at predetermined time intervals, is calculated, and the absolute values of the differences are integrated within a certain period. The integrated value is output as the analysis value. The higher the degree of similarity between the two pieces of time-series data, the smaller the analysis value.
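The first analysis method can be sketched in Python as follows. This is a minimal illustration, assuming both series are sampled at the same time points and that "integration" is a plain sum over those points; the function name `analysis_value_abs_diff` is hypothetical.

```python
def analysis_value_abs_diff(x, y):
    """Sum of |x[t] - y[t]| over the common sampling points."""
    if len(x) != len(y):
        raise ValueError("both series must have the same number of samples")
    return sum(abs(a - b) for a, b in zip(x, y))


# Identical series give 0; the value grows as the series move apart.
print(analysis_value_abs_diff([1, 2, 3], [1, 2, 3]))
print(analysis_value_abs_diff([1, 2, 3], [2, 0, 3]))
```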

In the second analysis method, as shown in FIG. 3(b), the amount of change over time in each piece of time-series data is calculated, and the difference between the calculated amounts of change is integrated. Specifically, the difference in the amount of change between one piece of time-series data and the other, obtained at predetermined time intervals, is calculated, and the differences are integrated within a certain period. For example, if one amount of change is "+1" and the other is "-1", the difference is "2". If one amount of change is "+1" and the other is also "+1", the difference is "0". If one amount of change is "-2" and the other is "+1", the difference is "3". The second analysis method integrates these differences and outputs the integrated numerical value as the analysis value. The higher the degree of similarity between the two pieces of time-series data, the smaller the analysis value.
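A minimal Python sketch of the second analysis method follows. It assumes that the amount of change is the difference between consecutive samples and that the differences are summed as absolute values, which matches the worked examples above; the function names are hypothetical.

```python
def changes(x):
    """Amount of change between consecutive samples."""
    return [b - a for a, b in zip(x, x[1:])]


def analysis_value_change_diff(x, y):
    """Sum of absolute differences between the amounts of change of x and y."""
    return sum(abs(dx - dy) for dx, dy in zip(changes(x), changes(y)))


# Changes of x are +1, +1, -2; changes of y are -1, +1, +1.
# The per-step differences are 2, 0 and 3, as in the examples in the text.
x = [0, 1, 2, 0]
y = [0, -1, 0, 1]
print(analysis_value_change_diff(x, y))
```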

In the third analysis method, as shown in FIG. 3(c), the rate of change over time is calculated for each piece of time-series data, and the difference between the calculated rates of change is integrated. Specifically, the difference in the rate of change between one piece of time-series data and the other, obtained at predetermined time intervals, is calculated, and the differences are integrated within a certain period. For example, if the rate of change of one piece of time-series data is "+3%" and that of the other is "-1%", the difference is "4". If one is "+1%" and the other is also "+1%", the difference is "0". The third analysis method integrates these differences within a certain period and outputs the integrated numerical value as the analysis value. The higher the degree of similarity between the two pieces of time-series data, the smaller the analysis value.
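The third analysis method can be sketched in the same way. The sketch below assumes that the rate of change is the percentage change relative to the previous sample (so samples must be nonzero) and that the differences are summed as absolute values; the function names are hypothetical.

```python
def rates_of_change(x):
    """Percentage change between consecutive samples (samples must be nonzero)."""
    return [100.0 * (b - a) / a for a, b in zip(x, x[1:])]


def analysis_value_rate_diff(x, y):
    """Sum of absolute differences between the percentage changes of x and y."""
    return sum(abs(rx - ry) for rx, ry in zip(rates_of_change(x), rates_of_change(y)))


# One series rises by 3% and the other falls by 1%, as in the example in the text.
print(analysis_value_rate_diff([100, 103], [100, 99]))
```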

In the fourth analysis method, as shown in FIG. 3(d), average values are calculated for each time series data at predetermined time intervals. Further, similar to the third analysis method described above, the differences in the average values are integrated within a certain period of time, and the integrated numerical value is output as the analysis value. The higher the degree of similarity between the two pieces of time-series data, the smaller the analysis value.
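A minimal Python sketch of the fourth analysis method is given below. The text does not specify how the averaging intervals are formed, so the sketch assumes non-overlapping windows of a fixed length `window` (a hypothetical parameter) and sums the absolute differences of the window averages.

```python
def block_means(x, window):
    """Average of each complete, non-overlapping window of length `window`."""
    return [
        sum(x[i:i + window]) / window
        for i in range(0, len(x) - window + 1, window)
    ]


def analysis_value_mean_diff(x, y, window):
    """Sum of absolute differences between the window averages of x and y."""
    return sum(
        abs(a - b)
        for a, b in zip(block_means(x, window), block_means(y, window))
    )


# The two series differ sample by sample but have the same window averages.
print(analysis_value_mean_diff([1, 3, 5, 7], [2, 2, 6, 6], window=2))
```

Because only the averages are compared, short-term fluctuations inside a window do not contribute to the analysis value in this sketch.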

The first analysis unit 21 to fourth analysis unit 24 create a distribution curve plotting the analysis values of each data pair aj based on the analysis values calculated by the first to fourth analysis methods. The processing of the data analysis unit 12 will be described in detail below with reference to FIGS. 4A and 4B.

FIG. 4A is an explanatory diagram showing analysis values calculated by the first to fourth analysis methods for a plurality of data pairs. For each data pair aj (j=1 to n) formed from the time-series data stored in the database 2 shown in FIG. 1, the data analysis unit 12 calculates an analysis value "bk-j" by each of the first to fourth analysis methods. Note that "k" indicates the analysis method number, and "j" indicates the data pair number. That is, "k" is an integer from 1 to 4, and "j" is an integer from 1 to n.

For example, the analysis value calculated using the first analysis method for the data pair a1 of the two time-series data "vegetables/seaweed" and "sushi (eating out)" is assumed to be the analysis value "b1-1". In FIG. 4A, the analytical value "b1-1" is "0.05". The analysis value calculated using the second analysis method for the data pair a1 is assumed to be the analysis value "b2-1". In FIG. 4A, the analytical value "b2-1" is "0.21".

Similarly, the analysis value calculated using the third analysis method for the data pair a2 is assumed to be the analysis value "b3-2". In FIG. 4A, the analysis value "b3-2" is "0.33". Let the analysis value calculated using the fourth analysis method for the data pair a3 be the analysis value "b4-3". In FIG. 4A, the analysis value "b4-3" is "0.64". Similarly to these, each analysis value "bk-j" is calculated.

The data analysis unit 12 generates distribution curves of the analysis values of the data pairs aj calculated by the first to fourth analysis methods. Specifically, it generates distribution curves s1 to s4 of the analysis values "bk-1 to bk-n" (k = 1 to 4) calculated using the first to fourth analysis methods for the data pairs aj (j = 1 to n). For example, the distribution curves s1 to s4 are generated as shown in FIGS. 4B(a) to (d). FIGS. 4B(a) to (d) are graphs plotting the analysis values obtained by the first to fourth analysis methods, where the horizontal axis indicates the analysis value and the vertical axis indicates the frequency.

In FIG. 4B(a), the analysis values b1-j obtained by analyzing n data pairs a1 to an by the first analysis method are plotted, and the curve along each analysis value is the distribution curve s1.

In FIG. 4B(b), the analysis values b2-j obtained by analyzing n data pairs a1 to an by the second analysis method are plotted, and the curve along each analysis value is the distribution curve s2.

In FIG. 4B(c), the analysis values b3-j obtained by analyzing n data pairs a1 to an by the third analysis method are plotted, and the curve along each analysis value is the distribution curve s3.

In FIG. 4B(d), the analysis values b4-j obtained by analyzing n data pairs a1 to an by the fourth analysis method are plotted, and the curve along each analysis value is the distribution curve s4.

Returning to FIG. 1, the appearance probability calculation unit 13 normalizes the distribution curves s1 to s4 created by a plurality of analysis methods. That is, the distribution curves s1 to s4 shown in FIGS. 4B(a) to (d) cannot be directly compared with each other. Therefore, each distribution curve s1-s4 is normalized. For example, by normalizing the distribution curve s1, a normalized curve s11 shown in FIG. 5 is obtained. That is, the occurrence probability calculation unit 13 normalizes the analysis result of the data analysis unit 12 to calculate the occurrence probability.

Based on the analysis result of the data analysis unit 12, the appearance probability calculation unit 13 calculates the appearance probability of the similarity of each data pair for each analysis method. Specifically, the appearance probability calculation unit 13 calculates the appearance probability of the analysis value (that is, similarity) by the first to fourth analysis methods for each data pair aj (j=1 to n). The appearance probability is an index indicating the rank of the analysis value in the range of "0 to 1". The closer the appearance probability is to "0", the higher the similarity between the two pieces of time-series data. The closer the appearance probability is to "1", the lower the similarity between the two pieces of time-series data. The appearance probability of the analysis value "bk-j" of the data pair aj by the k-th analysis method is indicated by "pk-j". The appearance probability of the analysis value "b1-1" of the data pair a1 by the first analysis method is "p1-1". For example, when the analysis value "b1-1" of the data pair a1 by the first analysis method belongs to the top 30%, the appearance probability "p1-1" is "0.3".

In the normalized curve s11 shown in FIG. 5, the smaller the appearance probability of the analysis value of the data pair aj (that is, the closer it is to "0"), the higher the similarity between the two pieces of time-series data included in the data pair aj.
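One way to turn analysis values into a rank-based index in the range of 0 to 1 is sketched below in Python. This is an assumption made for illustration: the sketch uses the empirical fraction of analysis values that are less than or equal to the target value, which is not necessarily the normalization used to obtain the curve s11. The function name `occurrence_probabilities` is hypothetical.

```python
from bisect import bisect_right


def occurrence_probabilities(values):
    """For each analysis value, the fraction of all values that are <= it.

    Smaller analysis values (higher similarity) map closer to 0,
    and larger analysis values (lower similarity) map closer to 1.
    """
    ordered = sorted(values)
    n = len(ordered)
    return [bisect_right(ordered, v) / n for v in values]


# Hypothetical analysis values of four data pairs for one analysis method.
print(occurrence_probabilities([0.05, 0.21, 0.33, 0.64]))
```

Because the index depends only on rank, values obtained by different analysis methods become comparable even when their raw scales differ.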

The divergence calculation unit 14 calculates the divergence of the appearance probability calculated for each analysis method for each data pair aj. Specifically, the divergence calculator 14 calculates the divergence of the occurrence probability for the target data pair aj by four analysis methods. The degree of divergence of the data pair aj by the k-th analysis method is indicated by "dk-j". For example, the deviation of the data pair a1 by the first analysis method is "d1-1".

The degree of divergence dk-j is defined as the following formulas (1) to (4) using the appearance probabilities "p1-j" to "p4-j".

d1-j={(p2-j)+(p3-j)+(p4-j)}/3-(p1-j) (1)
d2-j={(p1-j)+(p3-j)+(p4-j)}/3-(p2-j) (2)
d3-j={(p1-j)+(p2-j)+(p4-j)}/3-(p3-j) (3)
d4-j={(p1-j)+(p2-j)+(p3-j)}/3-(p4-j) (4)
As can be understood from the above formulas (1) to (4), the degree of divergence dk-j of the appearance probability pk-j by the k-th analysis method is a numerical value indicating the difference between the appearance probability obtained by the k-th analysis method and the average of the appearance probabilities obtained by the other three analysis methods. Therefore, the greater the difference between the appearance probability calculated by one analysis method and the average of the appearance probabilities calculated by the other three analysis methods, the greater the degree of divergence of the appearance probability calculated by that analysis method.
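Formulas (1) to (4) can be written as a single Python function. The sketch below generalizes them to any number of analysis methods greater than one, which is an assumption beyond the four-method case described here; the function name `divergences` and the input values are hypothetical.

```python
def divergences(p):
    """Degree of divergence for each analysis method of one data pair.

    p[k] is the appearance probability by the k-th analysis method.
    Each result is (average of the other probabilities) - p[k],
    which reduces to formulas (1) to (4) when len(p) == 4.
    """
    if len(p) < 2:
        raise ValueError("at least two analysis methods are required")
    total = sum(p)
    return [(total - pk) / (len(p) - 1) - pk for pk in p]


# Hypothetical probabilities: the first method disagrees with the other three.
d = divergences([0.1, 0.9, 0.9, 0.9])
print([round(v, 4) for v in d])
```

With this definition the degrees of divergence of one data pair always sum to zero, so a large positive value for one method is balanced by negative values for the others.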

The divergence calculation unit 14 performs a process of rearranging the data pairs aj in descending order of the absolute values of the divergence dk-j calculated by the above equations (1) to (4) for each analysis method. FIG. 6 is an explanatory diagram showing data obtained by rearranging the degrees of divergence d1-j of the appearance probabilities p1-j calculated using the first analysis method in descending order. For example, a pair of time-series data for “vegetables/seaweed” and “sushi (eating out),” a pair of time-series data for “cup noodles” and “fresh food,” and so on (item 1 and item 2) are rearranged.

The visualization unit 15 visualizes each piece of time-series data included in the data pair aj together with the degree of divergence and presents them to the user. Specifically, the visualization unit 15 has a display unit (not shown) such as a display, and displays on the display unit the data pairs for which the degree of divergence dk-j calculated by the divergence calculation unit 14 is larger than a certain value (for example, 0.6). For example, the graphs and appearance probability data shown in FIGS. 7A and 7B are displayed on the display unit. That is, the visualization unit 15 visualizes only a predetermined number of analysis results with a large degree of divergence calculated by the divergence calculation unit 14.
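The ordering and selection of data pairs for display can be sketched as follows. This is a minimal illustration, assuming the comparison uses the absolute value of the degree of divergence and a strict threshold of 0.6; the function name `select_for_display` and the sample values are hypothetical.

```python
def select_for_display(divergence_by_pair, threshold=0.6):
    """Sort data pairs by |degree of divergence| in descending order and
    keep only those whose absolute value exceeds the threshold."""
    ranked = sorted(
        divergence_by_pair.items(),
        key=lambda item: abs(item[1]),
        reverse=True,
    )
    return [(pair, d) for pair, d in ranked if abs(d) > threshold]


# Hypothetical degrees of divergence d1-j for three data pairs.
sample = {"a1": 0.93, "a2": -0.7, "a3": 0.2}
print(select_for_display(sample))
```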

FIG. 7A(a) shows a graph of data pair a11 of time-series data q11 (for example, household durable goods) and time-series data q12 (for example, furniture/household goods), and FIG. 7A(b) shows the appearance probabilities of the data pair a11 calculated using the first to fourth analysis methods. FIG. 7B(a) shows a graph of data pair a12 of time-series data q13 (for example, vegetables/seaweed) and time-series data q14 (for example, sushi (eating out)), and FIG. 7B(b) shows the appearance probabilities of the data pair a12 calculated using the first to fourth analysis methods.

The visualization unit 15 displays the data shown in FIGS. 7A and 7B on the display unit. A user can recognize the displayed information by looking at the display unit.

The input unit 16 receives a similarity or dissimilarity determination input from the user. Specifically, the input unit 16 is equipped with an operating device such as a keyboard, and receives a determination input indicating whether the information displayed by the visualization unit 15 is similar or dissimilar. For example, since the time-series data q11 and q12 diverge from each other as shown in FIG. 7A, the user inputs a determination result of "dissimilar". On the other hand, since the time-series data q13 and q14 are close to each other as shown in FIG. 7B, the user inputs a determination result of "similar".

The feature extraction unit 17 extracts features of the analysis method based on the judgment input and the degree of divergence. Specifically, the feature extraction unit 17 extracts features of each analysis method based on the determination input input by the input unit 16 . For example, the graphs of the time-series data q11 and q12 of the data pair a11 shown in FIG. 7A(a) are divergent and have a low degree of similarity. Therefore, the probability of appearance of the data pair a11 should be a large number. As shown in FIG. 7A(b), the appearance probability calculated by the third analysis method is a small numerical value. The feature extraction unit 17 extracts the feature that the third analysis method is not suitable for the analysis of the time-series data q11 and q12. That is, the feature extraction unit 17 extracts time-series data unsuitable for analysis by the analysis method as the feature of the analysis method.

Also, the graphs of the time-series data q13 and q14 of the data pair a12 shown in FIG. 7B(a) are close to each other and have a high degree of similarity. Therefore, the appearance probability of the data pair a12 should be a small numerical value. However, as shown in FIG. 7B(b), the appearance probabilities calculated by the second, third, and fourth analysis methods are large numerical values. The feature extraction unit 17 therefore extracts the feature that the second, third, and fourth analysis methods are not suitable for the analysis of the time-series data q13 and q14. The feature extraction unit 17 includes a storage device (not shown), and stores the extracted features in the storage device.

The recording unit 18 records characteristic data of time-series data. For example, for "vegetables/seaweed", it is recognized in advance that the characteristics are affected by the change of seasons, so this characteristic data is recorded. As for the "driver's license fee", since it is recognized in advance that the amount varies stepwise, this characteristic data is recorded. Further, the visualization unit 15 described above may visualize characteristics of time-series data in addition to each time-series data and the degree of divergence that constitute a data pair.

Next, the operation of the feature extraction device 1 according to the first embodiment will be described with reference to the flowchart shown in FIG. 8. First, in step S11 of FIG. 8, the combination unit 11 combines a plurality of time-series data qi (i=1 to m) stored in the database 2 to generate data pairs aj. When there are m pieces of time-series data qi, "m*(m-1)/2" data pairs are generated.

In step S12, the data analysis unit 12 analyzes each data pair aj by a plurality of analysis methods to calculate an analysis value. Specifically, the first analysis unit 21 calculates the analysis value of each data pair aj using the first analysis method. The second analysis unit 22 calculates the analysis value of each data pair aj using the second analysis method. The third analysis unit 23 calculates the analysis value of each data pair aj using the third analysis method. The fourth analysis unit 24 calculates the analysis value of each data pair aj using the fourth analysis method.

Furthermore, the data analysis unit 12 generates a distribution curve of analysis values calculated by each analysis method. Specifically, as shown in FIGS. 4B (a) to (d), the distribution curve s1 of the analysis values calculated by the first analysis method, the distribution curve s2 of the analysis values calculated by the second analysis method, A distribution curve s3 of the analysis values calculated by the third analysis method and a distribution curve s4 of the analysis values calculated by the fourth analysis method are generated.

In step S13, the occurrence probability calculation unit 13 generates normalized curves obtained by normalizing the distribution curves s1 to s4. For example, the normalized curve s11 shown in FIG. 5 is generated.

In step S14, the occurrence probability calculation unit 13 calculates the appearance probability of each data pair aj based on the normalized curve s11. For example, as shown in FIG. 5, when the target data pair belongs to the top 30% of all data pairs, the appearance probability is set to "0.3". When it belongs to the top 70%, the appearance probability is set to "0.7".

In step S15, the divergence calculation unit 14 calculates the degree of divergence of each appearance probability. Specifically, the degree of divergence of the appearance probability of the data pair aj for each analysis method is calculated using the formulas (1) to (4) described above. Furthermore, the divergence calculation unit 14 executes processing for rearranging the data pairs aj in descending order of the degree of divergence. As a result, for example, as shown in FIG. 6, data is obtained in which the degrees of divergence d1-j of the appearance probabilities p1-j calculated using the first analysis method are arranged in descending order.

For example, for the data pair of “vegetables/seaweed” and “sushi (eating out)”, the appearance probability calculated using the first analysis method is “0.0473”, while the appearance probabilities calculated using the second to fourth analysis methods are approximately "1.0000". Therefore, the appearance probability calculated by the first analysis method differs greatly from the appearance probabilities calculated by the other three analysis methods, and the degree of divergence is a high value of 0.926428.

In step S16, the visualization unit 15 displays, on a display unit (not shown), graphs of the data pairs determined to have a large degree of divergence d1-j (for example, data pairs of 0.6 or more) together with the appearance probability data. That is, it visualizes the graphs of the data pairs and the appearance probability data. For example, the information shown in FIGS. 7A and 7B is displayed on the screen.

By viewing this screen, the user determines the validity of the analysis results obtained by each analysis method. For example, in the graph shown in FIG. 7A(a), the two pieces of time-series data q11 and q12 are not similar. Therefore, the appearance probability is expected to be a large numerical value (a numerical value close to "1"). In the data shown in FIG. 7A(b), the appearance probabilities calculated by the first, second, and fourth analysis methods are close to "1", whereas the appearance probability calculated by the third analysis method is "0.16", a value that diverges from those of the other three analysis methods. In this case, the analysis value obtained by the third analysis method is presumed to be inappropriate, and the analysis values obtained by the first, second, and fourth analysis methods are presumed to be appropriate.

On the other hand, in the graph shown in FIG. 7B(a), the two pieces of time-series data q13 and q14 are similar. Therefore, the appearance probability is expected to be a small numerical value (a numerical value close to "0"). In the data shown in FIG. 7B(b), the appearance probabilities calculated by the second, third, and fourth analysis methods are close to "1", whereas the appearance probability calculated by the first analysis method is "0.05", a value that diverges from those of the other three analysis methods. In this case, the analysis values obtained by the second, third, and fourth analysis methods are presumed to be inappropriate, and the analysis value obtained by the first analysis method is presumed to be appropriate.

Furthermore, the visualization unit 15 reads the characteristic data of each piece of time-series data recorded in the recording unit 18 and displays it on the display unit. For example, if the data pair to be analyzed includes the time-series data of "vegetables/seaweed", the characteristic data "affected by seasonal change" is displayed on the display unit. If the data pair to be analyzed includes the time-series data of "driver's license fee", the characteristic data "the amount changes stepwise" is displayed on the display unit. By viewing this characteristic data, the user can use it as a reference when judging the analysis results.

In step S17, the input unit 16 receives a similarity or dissimilarity determination input from the user. The user refers to the visualized information and inputs a determination result as to whether the analysis values obtained by each analysis method are appropriate. For example, in the example shown in FIG. 7A described above, the user inputs to the input unit 16 the determination result that the analysis value obtained by the third analysis method is inappropriate and the analysis values obtained by the first, second, and fourth analysis methods are appropriate. In the example shown in FIG. 7B described above, the user inputs the determination result that the analysis values obtained by the second, third, and fourth analysis methods are inappropriate and the analysis value obtained by the first analysis method is appropriate.

That is, a high degree of divergence between the appearance probability calculated by analyzing time-series data using one analysis method and the appearance probabilities calculated by analyzing the same time-series data using the other analysis methods means that either the one analysis method or the other analysis methods are highly likely to be inappropriate for analyzing this time-series data. By acquiring the determination input from the user, it becomes possible to recognize with high accuracy the features of each analysis method (for example, that the first analysis method is not suitable for the analysis of the data pair a1).

In step S18, the feature extraction unit 17 calculates a score according to the appropriate/inappropriate determination result based on the determination input received by the input unit 16. Specifically, a score of "+1" is assigned to an analysis method determined to be appropriate, and a score of "-1" is assigned to an analysis method determined to be inappropriate. In the example shown in FIG. 7A, the score for the third analysis method is "-1", and the scores for the first, second, and fourth analysis methods are "+1". In the example shown in FIG. 7B, the scores for the second, third, and fourth analysis methods are "-1", and the score for the first analysis method is "+1". The feature extraction unit 17 accumulates the scores for each of the first to fourth analysis methods. The score values are not limited to "+1" and "-1"; they may be numerical values such as "+2", "+1", "-1", and "-2" according to the degree of "appropriate" or "inappropriate".

The feature extraction unit 17 extracts the features of each analysis method based on the integrated score values described above. For example, it extracts the feature that the analysis method with the highest score among the four analysis methods is suitable for analyzing the target time-series data. The feature extraction unit 17 records the extracted features in a storage device (not shown). Alternatively, it modifies the features already recorded in the storage device based on the extracted features.
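The score integration of step S18 and the selection of the highest-scoring analysis method can be sketched as follows. This is a minimal illustration, not the implementation of the feature extraction unit 17: the function names, the method labels `method1` to `method4`, and the dictionary layout of the determination input are assumptions introduced here. The two judgments mirror the examples of FIG. 7A and FIG. 7B.

```python
from collections import defaultdict

# Hypothetical determination inputs, one dict per visualized data pair.
# True means "appropriate", False means "inappropriate".
judgments = [
    # As in FIG. 7A: only the third analysis method is inappropriate.
    {"method1": True, "method2": True, "method3": False, "method4": True},
    # As in FIG. 7B: only the first analysis method is appropriate.
    {"method1": True, "method2": False, "method3": False, "method4": False},
]


def integrate_scores(judgments, appropriate=1, inappropriate=-1):
    """Sum the per-judgment scores for each analysis method."""
    totals = defaultdict(int)
    for judgment in judgments:
        for method, is_appropriate in judgment.items():
            totals[method] += appropriate if is_appropriate else inappropriate
    return dict(totals)


def best_method(totals):
    """Return the analysis method with the highest integrated score."""
    return max(totals, key=totals.get)


totals = integrate_scores(judgments)
print(totals)
print(best_method(totals))
```

With these two judgments the first analysis method accumulates the highest score. Passing other values for `appropriate` and `inappropriate` (or extending the input to graded judgments) would correspond to the "+2" to "-2" variation mentioned in the text.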

In step S19, the data analysis unit 12 determines whether any of the first to fourth analysis methods requires correction. For example, when the third analysis method is determined to be unsuitable for analyzing the data pair a11, as shown in FIG. 7A, the data analysis unit 12 determines that correction is necessary. If it is determined that correction is necessary (S19; YES), the process proceeds to step S20; otherwise (S19; NO), this process ends.

In step S20, the data analysis unit 12 corrects the target analysis method or treats it as unsuitable. This process then ends. In this way, the characteristics of the analysis methods used to analyze the similarity of time-series data can be extracted.

As described above, the feature extraction device 1 according to the first embodiment comprises: the combination unit 11 that generates a plurality of data pairs, each combining two pieces of time-series data; an analysis unit (data analysis unit 12) that analyzes, using a plurality of analysis methods, the similarity of the two pieces of time-series data included in each data pair; the appearance probability calculation unit 13 that calculates, for each analysis method, the appearance probability of the similarity of each data pair based on the analysis result of the analysis unit; the deviation degree calculation unit 14 that calculates, for each data pair, the degree of divergence of the appearance probabilities calculated for the respective analysis methods; the visualization unit 15 that visualizes each piece of time-series data included in a data pair together with the degree of divergence and presents them to the user; the input unit 16 that receives a similarity/dissimilarity determination input from the user; and the feature extraction unit 17 that extracts features of the analysis methods based on the determination input and the degree of divergence.

With the feature extraction device 1 configured as described above, it is possible to extract features indicating the types of time-series data for which an analysis method is or is not suitable. Therefore, when a user such as a data scientist analyzes time-series data using a data analysis device, the user can be supported in selecting an appropriate analysis method from among the multiple analysis methods available to the user.

In addition, the visualization unit 15 visualizes only a predetermined number of analysis results with a large degree of divergence as calculated by the deviation degree calculation unit 14. For example, only analysis results with a degree of divergence of 0.6 or more are visualized. Visualization of analysis results with a small degree of divergence can therefore be omitted. That is, a small degree of divergence across all four analysis methods means that the analysis values obtained by the four analysis methods are almost the same, so the need for user intervention is considered low. By visualizing only a predetermined number of analysis results with a large degree of divergence, the user's effort can be reduced.
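The filtering performed before visualization can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the `max_items` parameter, and the sample divergence values are hypothetical; only the 0.6 threshold comes from the example in the text.

```python
def select_for_visualization(divergences, threshold=0.6, max_items=None):
    """Pick the data pairs worth showing to the user.

    divergences: dict mapping a data-pair ID to its degree of divergence.
    threshold:   minimum degree of divergence to visualize (0.6 in the text).
    max_items:   optional cap on the number of results (hypothetical parameter).
    """
    selected = [(pair, d) for pair, d in divergences.items() if d >= threshold]
    # Show the largest divergences first.
    selected.sort(key=lambda item: item[1], reverse=True)
    if max_items is not None:
        selected = selected[:max_items]
    return selected


# Hypothetical degrees of divergence for four data pairs.
sample = {"a1": 0.82, "a5": 0.31, "a10": 0.6, "a11": 0.95}
print(select_for_visualization(sample))
print(select_for_visualization(sample, max_items=2))
```

Data pairs below the threshold, such as `a5` in this sample, are simply never presented, which is how the user's effort is reduced.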

Further, feature data of each piece of time-series data, recognized in advance, is recorded in the recording unit 18. By displaying this feature data on the display unit of the visualization unit 15, the user can use it as a reference when judging the appropriateness of each analysis method.

That is, as shown in FIG. 9, the feature data "affected by seasonal variation" is recorded in the recording unit 18 for the data pair a1, which includes the time-series data of "vegetables/seaweed". In addition, for the data pair a10, which includes the time-series data of the "driver's license fee", the recording unit 18 records the feature data "commodity prices change stepwise". When the user analyzes the data pairs a1 and a10, the user can refer to these feature data to determine the features of each analysis method.

[Second embodiment]
Next, a second embodiment will be described. FIG. 10 is a block diagram showing the configuration of the feature extraction device 1a according to the second embodiment and its peripherals. The second embodiment differs from the first embodiment described above in that a selection unit 19 is provided. The components other than the selection unit 19 are therefore denoted by the same reference numerals, and their description is omitted.

The selection unit 19 selects the other data pair when, among the plurality of data pairs, the appearance probability of the similarity of one data pair is close to the appearance probability of the similarity of another data pair, and the time-series data included in the one data pair is the same as or similar to the time-series data included in the other data pair.

That is, from among the data pairs generated by the combination unit 11, the selection unit 19 selects data pairs whose time-series data are similar to those of another data pair. The visualization unit 15 performs visualization with the appearance probabilities of the data pairs selected by the selection unit 19 excluded.

FIG. 11A is a diagram showing a normalized distribution curve of analysis results obtained by analyzing a plurality of data pairs with the first analysis method. FIG. 11B is a diagram showing two pieces of time-series data forming a data pair and the degree of divergence d1-j.

The data pairs x1, x2, and x3 shown in FIG. 11B all include time-series data of "university tuition". Also, in FIG. 11A, the data pairs x1, x2, and x3 are plotted at positions close to one another. Therefore, two of these three data pairs are considered redundant and unnecessary. The selection unit 19 excludes the data pairs x2 and x3 from the analysis target.

The data pair x4 shown in FIG. 11B includes time-series data of "Chinese noodles", and the data pair x5 includes time-series data of "soba". Also, in FIG. 11A, the data pairs x4 and x5 are plotted at positions close to each other. Therefore, one of these two data pairs is considered redundant and unnecessary. The selection unit 19 excludes the data pair x5 from the analysis target.

As described above, the feature extraction device 1a according to the second embodiment excludes, from the plurality of data pairs, other data pairs that are similar to one data pair, and can thereby reduce the load required for analyzing the data.

That is, when, among the plurality of data pairs, the appearance probability of the similarity of one data pair and the appearance probability of the similarity of another data pair are close to each other, and the time-series data constituting the one data pair is the same as or similar to the time-series data constituting the other data pair, the selection unit 19 selects the other data pair. The visualization unit 15 then displays the results on the display unit with the appearance probabilities of the data pairs selected by the selection unit 19 excluded. Display of unnecessary data can therefore be avoided, and the computational load can be reduced.
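The selection of redundant data pairs can be sketched as follows. This is a minimal illustration under several assumptions that are not in the original: the closeness of appearance probabilities is tested with a fixed tolerance `prob_tol`, "same or similar" time-series data is approximated by two pairs sharing at least one category label, and all probability values and the placeholder labels `series_A` to `series_F` are made up for the example.

```python
def select_redundant_pairs(pairs, prob_tol=0.05):
    """Split data pairs into those to keep and those selected as redundant.

    pairs: list of dicts with keys
        "id"          - data-pair identifier
        "probability" - appearance probability of the pair's similarity
        "categories"  - set of labels for the time-series data in the pair
    A pair is treated as redundant when an already kept pair has a close
    appearance probability AND shares at least one category label
    (an assumed stand-in for "the same or similar time-series data").
    """
    kept, redundant = [], []
    for pair in pairs:
        is_redundant = any(
            abs(pair["probability"] - other["probability"]) <= prob_tol
            and pair["categories"] & other["categories"]
            for other in kept
        )
        (redundant if is_redundant else kept).append(pair)
    return kept, redundant


# Hypothetical values loosely following FIG. 11A and FIG. 11B.
pairs = [
    {"id": "x1", "probability": 0.10, "categories": {"university tuition", "series_A"}},
    {"id": "x2", "probability": 0.11, "categories": {"university tuition", "series_B"}},
    {"id": "x3", "probability": 0.12, "categories": {"university tuition", "series_C"}},
    {"id": "x4", "probability": 0.40, "categories": {"noodles", "series_D"}},  # Chinese noodles
    {"id": "x5", "probability": 0.41, "categories": {"noodles", "series_E"}},  # soba
    {"id": "x6", "probability": 0.80, "categories": {"series_F"}},
]

kept, redundant = select_redundant_pairs(pairs)
print([p["id"] for p in kept])
print([p["id"] for p in redundant])
```

In this sample the pairs `x2`, `x3`, and `x5` are selected as redundant, matching the exclusions described for FIG. 11B. Only the kept pairs would then be passed on to visualization.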

As shown in FIG. 12, a general-purpose computer system including, for example, a CPU (Central Processing Unit, processor) 901, a memory 902, a storage 903 (HDD: Hard Disk Drive, SSD: Solid State Drive), a communication device 904, an input device 905, and an output device 906 can be used as the feature extraction device 1 of the present embodiment described above. The memory 902 and the storage 903 are storage devices. In this computer system, each function of the feature extraction device 1 is realized by the CPU 901 executing a predetermined program loaded into the memory 902.

Note that the feature extraction device 1 may be implemented by one computer, or may be implemented by a plurality of computers. Also, the feature extraction device 1 may be a virtual machine implemented on a computer.

The program for the feature extraction device 1 can be stored in a computer-readable recording medium such as an HDD, an SSD, a USB (Universal Serial Bus) memory, a CD (Compact Disc), or a DVD (Digital Versatile Disc), or it can be delivered via a network.

It should be noted that the present invention is not limited to the above embodiments, and many modifications are possible within the scope of its gist.

Reference Signs List

1, 1a Feature extraction device
2 Database
11 Combination unit
12 Data analysis unit (analysis unit)
13 Appearance probability calculation unit
14 Deviation degree calculation unit
15 Visualization unit
16 Input unit
17 Feature extraction unit
18 Recording unit
19 Selection unit
21 First analysis unit
22 Second analysis unit
23 Third analysis unit
24 Fourth analysis unit

Claims

1. A feature extraction device comprising:
a combination unit that generates a plurality of data pairs, each combining two pieces of time-series data;
an analysis unit that analyzes, using a plurality of analysis methods, the similarity of the two pieces of time-series data included in each data pair;
an appearance probability calculation unit that calculates, for each analysis method, the appearance probability of the similarity of each data pair based on the analysis result of the analysis unit;
a divergence calculation unit that calculates, for each data pair, the degree of divergence of the appearance probabilities calculated for the respective analysis methods;
a visualization unit that visualizes each piece of time-series data included in the data pair and the degree of divergence and presents them to a user;
an input unit that receives a similarity/dissimilarity determination input from the user; and
a feature extraction unit that extracts features of the analysis methods based on the determination input and the degree of divergence.

2. The feature extraction device according to claim 1, wherein the appearance probability calculation unit calculates the appearance probability by normalizing the analysis result of the analysis unit.

3. The feature extraction device according to claim 1 or 2, wherein the visualization unit visualizes only a predetermined number of analysis results with a large degree of divergence as calculated by the divergence calculation unit.

4. The feature extraction device according to any one of claims 1 to 3, further comprising a selection unit that selects another data pair when, among the plurality of data pairs, the appearance probability of the similarity of one data pair and the appearance probability of the similarity of the other data pair are close to each other and the time-series data included in the one data pair is the same as or similar to the time-series data included in the other data pair,
wherein the visualization unit performs visualization with the appearance probability of the other data pair selected by the selection unit excluded.

5. The feature extraction device according to any one of claims 1 to 4, further comprising a recording unit that records feature data of the time-series data,
wherein the visualization unit visualizes the feature data of the time-series data in addition to each piece of time-series data included in the data pair and the degree of divergence.

6. The feature extraction device according to any one of claims 1 to 5, wherein the feature extraction unit extracts, as the feature of an analysis method, time-series data that is not suitable for analysis by that analysis method.

7. A feature extraction method comprising:
a step of generating a plurality of data pairs, each combining two pieces of time-series data;
a step of analyzing, using a plurality of analysis methods, the similarity of the two pieces of time-series data included in each data pair;
a step of calculating, for each analysis method, the appearance probability of the similarity of each data pair based on the analyzed similarity;
a step of calculating, for each data pair, the degree of divergence of the appearance probabilities calculated for the respective analysis methods;
a step of visualizing each piece of time-series data included in the data pair and the degree of divergence and presenting them to a user;
a step of receiving a similarity/dissimilarity determination input from the user; and
a step of extracting features of the analysis methods based on the determination input and the degree of divergence.

8. A feature extraction program that causes a computer to function as the feature extraction device according to any one of claims 1 to 6.