WO2011027714A1

WO2011027714A1 - Data summarization system, data summarization method and storage medium

Info

Publication number: WO2011027714A1
Application number: PCT/JP2010/064538
Authority: WO
Inventors: 今井照之; 喜田弘司; 海老山知生
Original assignee: 日本電気株式会社
Priority date: 2009-09-04
Filing date: 2010-08-20
Publication date: 2011-03-10
Also published as: JPWO2011027714A1

Abstract

Provided is a data summarization system capable of efficiently compressing data that is generated sequentially with a fixed trend and that may change significantly and irregularly. Approximation formulas for calculating approximation values of data values are each defined so that the variable is the time of generation and the valid domain of the variable is an interval of time or a set of points of time. An approximate value calculation unit (61) calculates, for each approximation formula, an approximation value of the data value of new data by substituting the time of generation of the new data into the approximation formula. An approximation formula evaluation unit (62), on the basis of the data value of the new data and the approximation value calculated for each approximation formula, either selects an approximation formula suitable for approximation value calculation of the data value of the new data or determines that there is no such approximation formula. An update unit (65) updates the valid domain of the selected approximation formula so as to include the time of generation of the new data when the approximation formula evaluation unit (62) has selected an approximation formula suitable for approximation value calculation of the data value of the new data.

Description

Data summarization system, data summarization method and recording medium

The present invention relates to a data summarization system, a data summarization method, a data summarization program, and a recording medium that reduce the amount of information in sequentially generated data, and to a data structure applied to such a data summarization system, data summarization method, data summarization program, and recording medium.

Techniques for dynamically compressing and storing continuously generated data have been proposed. For example, Patent Document 1 describes a difference-based data compression method for process data. Process data is a time series of data values. In the difference-based compression method described in Patent Document 1, each piece of process data is represented on a two-dimensional plane with time on the x-axis and the process value on the y-axis, one piece of process data is taken as a reference point, and the difference in process value between the reference point and the process data to be processed is obtained. In this compression method, process data for which the absolute value of the difference exceeds the compression accuracy range calculated for the process data to be processed is stored in the time-series process data storage unit, and the other process data is thinned out as far as possible. In this way, the compression method described in Patent Document 1 compresses data. Here, the compression accuracy is set for each piece of process data and is used to determine whether or not to compress. The higher the compression accuracy, the higher the possibility of compression. That is, in the technique described in Patent Document 1, a high compression accuracy is a concept similar to an increased compression rate, and a low compression accuracy is a concept similar to a reduced compression rate. In Patent Document 1, the reference point is referred to as a pivot.
Patent Document 2 describes a technique for predicting a next input value based on a past input value, storing a difference between the actual input value and the predicted value, and performing data compression.

Patent Document 1: JP 2003-15734 A (paragraph 0039, paragraphs 0063-0065)
Patent Document 2: JP-A-2006-259937 (paragraphs 0031 to 0033)

The technique described in Patent Document 1 decides whether or not to compress process data by comparing the difference between the process value of the process data and that of the reference point with the compression accuracy. For this reason, when the process value of the process data to be compressed changes greatly and discontinuously, the difference between the process value and the reference value exceeds the compression accuracy, and the process data becomes difficult to thin out (that is, difficult to compress). Therefore, with the technique described in Patent Document 1, efficient data compression is difficult when the data value often changes suddenly and greatly.
The technique described in Patent Literature 2 realizes data compression by storing a difference between a prediction result based on a past value and an actual input value. For this reason, if the value of the data to be compressed changes at irregular timing, the difference between the predicted value and the actual input value becomes large, and a correspondingly large storage capacity is required. Therefore, the technique described in Patent Document 2 cannot perform data compression efficiently.
An example of sequentially generated data whose data value changes suddenly is a CPU usage rate. As described above, the techniques described in Patent Documents 1 and 2 are not suitable for compressing data whose values change suddenly and greatly.
Therefore, an object of the present invention is to provide a data summarization system, a data summarization method, a data summarization program, and a recording medium capable of efficiently compressing data that is sequentially generated with a certain tendency and that may change greatly and irregularly. Another object is to provide a data structure suitably applied to such a data summarization system, data summarization method, data summarization program, and recording medium.

A data summarization system according to an aspect of the present invention includes: an approximate value calculation unit that, for each approximate expression for calculating an approximate value of a data value in data including the data value and an occurrence time of the data value, the approximate expression having the occurrence time as a variable and having an effective domain of the variable defined as a time interval or a set of time points, calculates an approximate value of the data value of new data by substituting the occurrence time included in the new data into the approximate expression; an approximate expression evaluation unit that, based on the approximate value calculated for each approximate expression and the data value of the new data, either selects an approximate expression suitable for calculating the approximate value of the data value of the new data or determines that there is no approximate expression suitable for that calculation; an unconfirmed data storage unit that stores, as approximate-expression-unconfirmed data, new data for which it has been determined that there is no suitable approximate expression; a new approximate expression generation unit that, when new data is input, determines whether or not a new approximate expression can be generated from the new data and the approximate-expression-unconfirmed data and, if it can be generated, generates the new approximate expression and defines a time interval or a set of time points as the effective domain of that approximate expression; and an update unit that, when the approximate expression evaluation unit has selected an approximate expression suitable for calculating the approximate value of the data value of the new data, updates the effective domain of that approximate expression so as to include the occurrence time of the new data.
A data summarization method according to an aspect of the present invention includes: for each approximate expression for calculating an approximate value of a data value in data including the data value and an occurrence time of the data value, the approximate expression having the occurrence time as a variable and having an effective domain of the variable defined as a time interval or a set of time points, calculating an approximate value of the data value of new data by substituting the occurrence time included in the new data into the approximate expression; based on the approximate value calculated for each approximate expression and the data value of the new data, either selecting an approximate expression suitable for calculating the approximate value of the data value of the new data or determining that there is no approximate expression suitable for that calculation; storing, as approximate-expression-unconfirmed data, new data for which it has been determined that there is no suitable approximate expression; when new data is input, determining whether or not a new approximate expression can be generated from the new data and the approximate-expression-unconfirmed data and, if it can be generated, generating the new approximate expression and defining a time interval or a set of time points as the effective domain of that approximate expression; and, when an approximate expression suitable for calculating the approximate value of the data value of the new data has been selected, updating the effective domain of that approximate expression so as to include the occurrence time of the new data.
A recording medium according to an aspect of the present invention stores a data summarization program that causes a computer to execute: an approximate value calculation process of, for each approximate expression for calculating an approximate value of a data value in data including the data value and an occurrence time of the data value, the approximate expression having the occurrence time as a variable and having an effective domain of the variable defined as a time interval or a set of time points, calculating an approximate value of the data value of new data by substituting the occurrence time included in the new data into the approximate expression; an approximate expression evaluation process of, based on the approximate value calculated for each approximate expression and the data value of the new data, either selecting an approximate expression suitable for calculating the approximate value of the data value of the new data or determining that there is no approximate expression suitable for that calculation; an unconfirmed data storage process of storing, in an unconfirmed data storage unit as approximate-expression-unconfirmed data, new data for which it has been determined that there is no suitable approximate expression; a new approximate expression generation process of, when new data is input, determining whether or not a new approximate expression can be generated from the new data and the approximate-expression-unconfirmed data and, if it can be generated, generating the new approximate expression and defining a time interval or a set of time points as the effective domain of that approximate expression; and an update process of, when an approximate expression suitable for calculating the approximate value of the data value of the new data has been selected in the approximate expression evaluation process, updating the effective domain of that approximate expression so as to include the occurrence time of the new data.
In addition, a data structure according to an aspect of the present invention includes an approximate expression for calculating an approximate value of a data value by substituting a variable, and an effective domain, which is the domain of the variable over which the approximate value of the data value can be obtained. The effective domain is represented by a set of intervals of the variable or points each representing a single value of the variable.

Each form of the present invention can efficiently compress data that is sequentially generated with a certain tendency and that may change greatly irregularly. The data structure according to an aspect of the present invention can be suitably used for a data summarization system, a data summarization method, a data summarization program, and a recording medium having such advantages.

FIG. 1 is an explanatory diagram showing an example of an effective domain.
FIG. 2 is a block diagram illustrating an example of the data summarization system according to the first embodiment of this invention.
FIG. 3 is an explanatory diagram illustrating an example of one piece of data input to the data input unit 10.
FIG. 4 is an explanatory diagram illustrating an example of indeterminate points stored in the indeterminate point storage unit 13.
FIG. 5 is an explanatory diagram schematically showing indeterminate points and newly generated data.
FIG. 6 is an explanatory diagram schematically showing an approximate expression generated from each data shown in FIG.
FIG. 7 is an explanatory diagram illustrating an example of the expression format of the approximate expression.
FIG. 8 is an explanatory diagram showing an example of an approximate expression stored in the approximate expression storage unit 15 and an approximate expression ID thereof.
FIG. 9 is an explanatory diagram illustrating an example of final data information.
FIG. 10 is an explanatory diagram showing an example of processing of the new data generation time substitution unit 12.
FIG. 11 is an explanatory diagram showing an example of an effective domain for each approximate expression.
FIG. 12 is an explanatory diagram illustrating an example of selecting an approximate expression that satisfies a criterion.
FIG. 13 is an explanatory diagram illustrating an example of selecting an approximate expression that satisfies a criterion.
FIG. 14 is an explanatory diagram illustrating an example of selecting an approximate expression that satisfies a criterion.
FIG. 15 is an explanatory diagram illustrating an example of an effective domain update when a new approximate expression is generated.
FIG. 16 is a flowchart illustrating an example of the processing flow of the first embodiment.
FIG. 17 is a flowchart illustrating an example of the processing flow of step S105.
FIG. 18 is an explanatory diagram illustrating an example in which data compression is performed by applying the technique described in Patent Document 1.
FIG. 19 is an explanatory diagram illustrating an example of a situation in which irregular data is temporarily generated in succession.
FIG. 20 is a block diagram illustrating an example of a data summarization system according to the second embodiment of this invention.
FIG. 21 is an explanatory diagram showing an example of a predetermined expression.
FIG. 22 is an explanatory diagram showing an example of a case where a predetermined expression is expressed only by a constant term.
FIG. 23 is a block diagram illustrating an example of a data summarization system according to the third embodiment of this invention.
FIG. 24 is a schematic diagram in which the data stored in the unsummarized data storage unit 30 is arranged in order of occurrence time.
FIG. 25 is a block diagram illustrating an example of a data summarization system according to the fourth embodiment of this invention.
FIG. 26 is an explanatory diagram showing an example of deriving a default effective domain.
FIG. 27 is an explanatory diagram summarizing the effective domains of x = f1 (t) and x = f2 (t) illustrated in FIG.
FIG. 28 is a flowchart illustrating an example of the flow of the default update process according to the fourth embodiment.
FIG. 29 is a block diagram showing the minimum configuration of the present invention.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.
First, an outline of a data summarization system according to an embodiment of the present invention will be described. A data summarization system according to an embodiment of the present invention reduces the amount of information of data that is sequentially generated over time. The data summarization system according to an aspect of the present invention can reduce the amount of information, thereby reducing the storage capacity required for storing data as compared with the case where the data itself is stored accurately as it is. This reduction of data information is called “summary”.
Generally, summarization lowers the accuracy of data. However, a data summarization system according to an embodiment of the present invention accurately and efficiently summarizes data that is generated sequentially over time with a certain tendency and whose tendency may change irregularly. The “CPU usage rate” has been given as an example of such data, but such data is not limited to the CPU usage rate. For example, the number of accesses per unit time to a Web page under observation, and the total number of accesses since the Web page was published, also correspond to “data that is sequentially generated with a certain tendency and whose tendency may change irregularly”. Likewise, for example, the communication volume of a network device also corresponds to such data. “Sequentially generated data” is a concept that includes “sequentially observable data”.
In the following explanation, numerical values that occur with the passage of time will be described as an example as “data that occurs sequentially with a certain tendency and may change irregularly”. However, even if the data itself is not a numerical value, it can be converted into a numerical value, and any data can be used as long as the difference between the numerical data can be derived. An example of applying the present invention to such data other than numerical values will be described later.
In the example shown below, each data includes a data value and an occurrence time of the data value. In the following description, the occurrence time of this data value may be simply referred to as the occurrence time of data.
In addition, the data summarization system according to an aspect of the present invention derives, from a set of generated data, a function that calculates a data value (numerical value) using the occurrence time as a variable. This function is an approximate expression for obtaining an approximate value of the data value from the occurrence time. For this approximate expression, a domain in which an approximate value of the data value can be obtained is determined. This domain is hereinafter referred to as an effective domain. The effective domain is represented by a set of intervals (time periods) or points (specific times). A plurality of intervals or points may be defined for one effective domain. FIG. 1 is an explanatory diagram showing an example of an effective domain. In FIG. 1, the horizontal axis represents time, and the vertical axis represents data values. FIG. 1 illustrates a case where the data value changes with the same tendency in the interval a to b, the tendency of the data value changes greatly at time b, and the tendency changes again at time c. It is assumed that functions 91 and 92 are obtained as approximate expressions. In the example shown in FIG. 1, for the data generated in the intervals a to b and c to d, approximate values are obtained by the approximate expression represented by the function 91. Therefore, the effective domain of the approximate expression represented by the function 91 is the set of intervals a to b and c to d. The effective domain of the approximate expression represented by the function 92 is the interval b to c.
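As a non-limiting illustration, the pairing of an approximate expression with its effective domain might be sketched in Python as follows. The class names `EffectiveDomain` and `ApproximateExpression`, the use of closed intervals, and the concrete times a = 0, b = 10, c = 20, d = 30 are assumptions made only for this sketch and do not appear in the specification.

```python
from dataclasses import dataclass, field


@dataclass
class EffectiveDomain:
    """Set of closed time intervals; a single time point is an interval with start == end."""
    intervals: list[tuple[float, float]] = field(default_factory=list)

    def contains(self, t: float) -> bool:
        # True if t falls inside any interval of the domain.
        return any(start <= t <= end for start, end in self.intervals)


@dataclass
class ApproximateExpression:
    """Linear approximate expression x = a*t + b together with its effective domain."""
    a: float
    b: float
    domain: EffectiveDomain

    def value(self, t: float) -> float:
        return self.a * t + self.b


# Hypothetical counterpart of FIG. 1: function 91 is valid on a..b and c..d,
# function 92 on b..c (boundary points are shared because intervals are closed).
f91 = ApproximateExpression(a=1.0, b=0.0, domain=EffectiveDomain([(0, 10), (20, 30)]))
f92 = ApproximateExpression(a=0.0, b=50.0, domain=EffectiveDomain([(10, 20)]))
```

In this sketch one expression carries several disjoint intervals, which is the property that lets a recurring tendency reuse an existing expression.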
When new data is generated, the data summarization system according to an aspect of the present invention determines whether or not there is an approximate expression that appropriately yields an approximate value of its data value. If there is such an approximate expression, the occurrence time of the new data is added to the effective domain of that approximate expression. On the other hand, if there is no approximate expression that appropriately yields an approximate value of the newly generated data, the new data is stored as a point for which the corresponding approximate expression has not been determined (hereinafter referred to as an indeterminate point). When enough indeterminate points have accumulated to derive a new approximate expression, the data summarization system according to an embodiment of the present invention creates an approximate expression from those indeterminate points.
As described above, the data summarization system according to an embodiment of the present invention stores data in the form of an approximate expression and an effective domain, instead of storing the occurrence time and the data value for each piece of data. Furthermore, the data summarization system according to an aspect of the present invention allows a plurality of intervals and points (specific times) to be defined as the effective domain of one approximate expression. As a result, a data summarization system according to one aspect of the present invention efficiently compresses (that is, summarizes) data.
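The handling of one new data point outlined above can be sketched as follows. This is a minimal sketch under stated assumptions: the suitability criterion (absolute error within a fixed `tolerance`, choosing the closest expression), the dictionary layout, and the helper `fit_constant` are all hypothetical, since the selection criterion is not defined in this passage.

```python
def fit_constant(points):
    """Hypothetical fitter: approximate the points by their mean value."""
    mean = sum(x for _, x in points) / len(points)
    return lambda t: mean


def process(point, expressions, pending, tolerance, k, fit):
    """Handle one new (t, x) point.

    expressions: list of {"f": callable, "times": covered occurrence times}
    pending:     list of indeterminate points (t, x)
    k:           number of points needed to generate a new expression
    """
    t, x = point
    # Assumed criterion: the expression whose approximate value is closest to x.
    best = min(expressions, key=lambda e: abs(e["f"](t) - x), default=None)
    if best is not None and abs(best["f"](t) - x) <= tolerance:
        best["times"].append(t)  # extend the effective domain
        return "assigned"
    pending.append(point)  # no suitable expression: keep as indeterminate point
    if len(pending) >= k:
        times = [p[0] for p in pending]
        expressions.append({"f": fit(pending), "times": times})
        pending.clear()
        return "new expression"
    return "pending"
```

For instance, feeding the points (0, 5.0), (1, 5.2), (2, 5.1), (3, 90.0) with `tolerance=1.0` and `k=2` yields one expression covering times 0 to 2 and leaves the outlier as an indeterminate point.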
Embodiment 1.
FIG. 2 is a block diagram illustrating an example of the data summarization system according to the first embodiment of this invention. The data summarization system of the first embodiment includes a data input unit 10, a final time storage unit 11, a new data generation time substitution unit 12, an indeterminate point storage unit 13, a new approximate expression generation unit 14, an approximate expression storage unit 15, an accuracy constraint input unit 16, a graph evaluation unit 17, a graph update unit 18, and a confirmed graph storage unit 19. The data summarization system according to the first embodiment of the present invention summarizes data sequentially input to the data input unit 10 and stores the summarized result in the confirmed graph storage unit 19.
The data input unit 10 acquires data from a data generation source (not shown) that sequentially generates data over time. The mode of the data source differs depending on the type of data. For example, when the data is the number of web page accesses, the web server may be the data generation source. Further, when the data is the usage rate of the CPU, the unit that monitors the usage rate of the CPU may be the data generation source. Each piece of data input from the data generation source to the data input unit 10 includes at least a data value and a generation time of the data value.
FIG. 3 is an explanatory diagram illustrating an example of one data input to the data input unit 10. FIG. 3 illustrates the case where the data value is the CPU usage rate. As shown in FIG. 3, the data includes the generation time of the data value and the data value (CPU usage rate in this example). The data illustrated in FIG. 3 indicates that the data generation time is “2009/01/01 00:00:00” and the CPU usage rate at that time is 5.0%. Data generated at the data generation source and input to the data input unit 10 may be referred to as generation data.
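For illustration only, one such record might be parsed and its occurrence time converted into a numeric variable as sketched below. The text layout of the record, the function names `parse_record` and `to_variable`, and the choice of seconds elapsed since a reference time as the variable are assumptions of this sketch; the specification does not fix the unit of the time variable.

```python
from datetime import datetime


def parse_record(time_text, value_text):
    """Parse an occurrence time such as '2009/01/01 00:00:00' and a value such as '5.0%'."""
    occurred = datetime.strptime(time_text, "%Y/%m/%d %H:%M:%S")
    return occurred, float(value_text.rstrip("%"))


def to_variable(occurred, origin):
    """Assumed mapping of an occurrence time to the variable t: seconds since `origin`."""
    return (occurred - origin).total_seconds()


occurred, value = parse_record("2009/01/01 00:00:00", "5.0%")
```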
The indeterminate point storage unit 13 is a storage device that stores data for which an approximate expression for obtaining an approximate value of the data value has not yet been specified. Data for which the difference between the actual data value and the approximate value is judged to be large, no matter which existing approximate expression is used to calculate the approximate value, is stored in the indeterminate point storage unit 13 as an indeterminate point. In the first embodiment, data can be regarded as a point whose coordinates are the occurrence time and the data value. Therefore, in the first embodiment, data for which an approximate expression for obtaining an approximate value of the data value has not yet been specified is referred to as an “indeterminate point”.
The approximate expression for obtaining the approximate value of the data value using the occurrence time as a variable is derived from a plurality of indeterminate points stored in the indeterminate point storage unit 13. Until the number of pieces of data necessary for determining the approximate expression has been obtained, the indeterminate point storage unit 13 stores generated data corresponding to indeterminate points. Note that the number of pieces of data necessary to determine the approximate expression depends on the type of approximate expression (a linear expression, a quadratic expression, an expression using a trigonometric function, and so on) and on the algorithm used to determine the approximate expression. The type of approximate expression and the approximate expression determination algorithm are determined in advance, and the number of pieces of data required to determine the approximate expression is also determined in advance according to them.
FIG. 4 is an explanatory diagram showing an example of indeterminate points stored in the indeterminate point storage unit 13. Each indeterminate point includes the occurrence time of the indeterminate point (data) and a data value (CPU usage rate in this example). Generated data for which it is determined that no corresponding approximate expression can be specified becomes an indeterminate point, and therefore the data structure of each indeterminate point is the same as the data structure of the data illustrated in FIG.
When newly generated data is input to the new approximate expression generation unit 14, the new approximate expression generation unit 14 determines whether or not the total number of the generated data and the indeterminate points stored in the indeterminate point storage unit 13 has reached the number necessary for determining an approximate expression. When the new approximate expression generation unit 14 determines that the necessary number of pieces of data is available, it generates, from that data, a function (approximate expression) that calculates a data value using the occurrence time as a variable. For example, assume that the number of pieces of data required for generating the approximate expression is k, and that k−1 indeterminate points are stored in the indeterminate point storage unit 13. When one piece of newly generated data is then input to the new approximate expression generation unit 14, the new approximate expression generation unit 14 generates an approximate expression from the k pieces of data obtained by adding that data to the indeterminate points.
Furthermore, after generating the approximate expression, the new approximate expression generation unit 14 outputs the approximate expression and each piece of data used to generate it (that is, the indeterminate points stored in the indeterminate point storage unit 13 and the newly generated data) to the graph update unit 18. Instead of outputting the newly generated approximate expression itself to the graph update unit 18, the new approximate expression generation unit 14 may store the approximate expression and the ID (identification information) of the approximate expression in the approximate expression storage unit 15 and output the approximate expression ID to the graph update unit 18.
FIG. 5 is an explanatory diagram schematically showing indeterminate points used for generating an approximate expression and newly generated data. The horizontal axis in FIG. 5 represents time t, and the vertical axis represents the data value x. Here, it is assumed that the new approximate expression generation unit 14 generates a linear expression as the approximate expression, and that the generation method is the least squares method. When the new approximate expression generation unit 14 generates a linear expression by the least squares method, the number of necessary pieces of data is four. Suppose that the three indeterminate points P₄₀₀ shown in the figure are stored in the indeterminate point storage unit 13 and that data P₄₀₁ is newly input to the data input unit 10. Then, the new approximate expression generation unit 14 obtains an approximate expression by the least squares method from the three indeterminate points P₄₀₀ and the newly generated data P₄₀₁.
FIG. 6 is an explanatory view schematically showing an approximate expression generated from each data shown in FIG. The new approximate expression generation unit 14 determines the approximate expression “x = at + b”, which expresses the data value x as a linear function of the occurrence time t, from four pieces of data (that is, four pairs of occurrence time and data value) by the least squares method. That is, the new approximate expression generation unit 14 determines the coefficient “a” of the variable t and the constant term “b” by the least squares method from the four occurrence times and data values. The new approximate expression generation unit 14 generates an approximate expression by determining the coefficient and the constant term in this way.
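The determination of the coefficient a and the constant term b can be illustrated with the standard closed-form least squares solution for a straight line. The function name `fit_line` and the four sample points are hypothetical; the specification only states that the least squares method is used and does not prescribe this particular formulation.

```python
def fit_line(points):
    """Least squares fit of x = a*t + b to (t, x) pairs; returns (a, b)."""
    n = len(points)
    sum_t = sum(t for t, _ in points)
    sum_x = sum(x for _, x in points)
    sum_tt = sum(t * t for t, _ in points)
    sum_tx = sum(t * x for t, x in points)
    denom = n * sum_tt - sum_t * sum_t
    if denom == 0:
        # All occurrence times are identical, so the slope is undefined.
        raise ValueError("cannot fit a line: occurrence times are all equal")
    a = (n * sum_tx - sum_t * sum_x) / denom
    b = (sum_x - a * sum_t) / n
    return a, b


# Three stored indeterminate points plus one new point (made-up values on x = 2t + 1).
points = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
a, b = fit_line(points)
```

Only the pair (a, b) needs to be retained afterwards, which is what makes the stored form smaller than the original points.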
FIG. 7 is an explanatory diagram showing an example of the expression format of the approximate expression generated by the new approximate expression generation unit 14. The approximate expression “x = at + b” is uniquely determined if the coefficient (primary coefficient) of the variable t and the constant term are determined. Therefore, when the approximate expression is a linear expression, the approximate expression may be expressed by only the primary coefficient and the constant term of the variable t as illustrated in FIG.
In this example, the new approximate expression generation unit 14 obtains an approximate expression represented by a linear function by the least squares method using three indeterminate points and one piece of newly generated data. However, the function used by the new approximate expression generation unit 14 as an approximate expression is not limited to a linear function. For example, the new approximate expression generation unit 14 may generate an approximate expression represented by a polynomial function of second or higher degree, an exponential function, or a trigonometric function. Further, the method of generating the approximate expression is not limited to the least squares method, and the approximate expression may be generated by another method. As already described, the number of pieces of data necessary for generating the approximate expression differs depending on the type of approximate expression and the method for generating it. When the total number of indeterminate points and newly generated data reaches that number, the new approximate expression generation unit 14 may generate an approximate expression. For example, the new approximate expression generation unit 14 may generate an approximate expression as a linear function connecting two points. That is, the new approximate expression generation unit 14 may generate, as an approximate expression, a straight line connecting two points in the plane of the occurrence time t and the data value x from one indeterminate point and one piece of newly generated data. In this case, the number of pieces of data necessary for generating the approximate expression is two. Further, the new approximate expression generation unit 14 may generate the approximate expression using other methods such as spline interpolation.
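As a sketch of the two-point alternative and of the dependence of the required data count on the generation method, the following may be considered. The names `REQUIRED_POINTS`, `line_through`, and `ready` are hypothetical; the counts four and two are the ones given in the description, and no counts are assumed for other expression types.

```python
# Required number of data points per generation method, as stated in the description.
REQUIRED_POINTS = {"least_squares_line": 4, "two_point_line": 2}


def line_through(p, q):
    """Return (a, b) of the line x = a*t + b passing through points p and q."""
    (t1, x1), (t2, x2) = p, q
    if t1 == t2:
        raise ValueError("two points with the same occurrence time do not define x = a*t + b")
    a = (x2 - x1) / (t2 - t1)
    b = x1 - a * t1
    return a, b


def ready(method, pending_count):
    """True once the stored indeterminate points plus one new point reach the required count."""
    return pending_count + 1 >= REQUIRED_POINTS[method]
```

With the two-point method a new expression becomes available as soon as a single indeterminate point is waiting, at the cost of fitting the line to fewer observations.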
In the following description, a case where the new approximate expression generation unit 14 generates an approximate expression of a linear function will be described as an example.
The approximate expression storage unit 15 is a storage device that stores an approximate expression for obtaining a data value using the occurrence time as a variable, together with an ID of the approximate expression. When the new approximate expression generation unit 14 newly generates an approximate expression and outputs the approximate expression to the graph update unit 18, the graph update unit 18 stores the approximate expression in the approximate expression storage unit 15 together with the ID. In this case, the graph update unit 18 may assign the ID of the approximate expression. For example, the approximate expression storage unit 15 may store a combination of a first-order coefficient and a constant term as in the case shown in FIG. However, this storage mode is an example, and the approximate expression storage unit 15 may store the approximate expression in another form.
FIG. 8 is an explanatory diagram illustrating an example of the approximate expressions stored in the approximate expression storage unit 15 and their approximate expression IDs. As shown in FIG. 8, the approximate expression storage unit 15 stores an approximate expression ID, which is identification information of an approximate expression, in association with the approximate expression itself (in this example, expressed by a combination of a primary coefficient and a constant term).
The final time storage unit 11 is a storage device that stores a pair consisting of the generation time of the data generated last, excluding uncertain points, and the approximate expression that approximates the data value of that data. In other words, among the data for which an approximate expression that appropriately obtains an approximate value of the data value has been specified, the final time storage unit 11 stores the generation time of the most recently generated data paired with the approximate expression that appropriately obtains the approximate value of that data. Hereinafter, the combination of the data generation time and the approximate expression stored in the final time storage unit 11 is referred to as final data information. The approximate expression indicated by the final data information is referred to as the final approximate expression.
FIG. 9 is an explanatory diagram showing an example of final data information stored in the final time storage unit 11. In the example shown in FIG. 9, the final data information includes an approximate expression ID and a final time. In this example, the approximate expression is specified by the approximate expression ID. The final time is the data generation time of the data that has occurred last among the data that can be approximated by the approximate expression. The uncertain point stored in the uncertain point storage unit 13 is data that cannot be approximated by each known approximate expression, and thus is not a target to be stored in the final time storage unit 11. Therefore, even if an undetermined point occurs after the final time stored in the final time storage unit 11, the final time stored in the final time storage unit 11 is not updated.
In FIG. 9, the approximate expression is represented by the approximate expression ID, but the final time storage unit 11 may store the approximate expression in the same format as the approximate expression generated by the new approximate expression generation unit 14 and the approximate expression stored in the approximate expression storage unit 15. For example, the final time storage unit 11 may store a combination of a primary coefficient and a constant term as information representing an approximate expression instead of the approximate expression ID.
The new data generation time substitution unit 12 substitutes the generation time of the generated data newly input to the data input unit 10 into each approximate expression generated in the past, and calculates an approximate value of the data value. Then, the new data generation time substitution unit 12 outputs each pair of an approximate expression and the approximate value calculated with it to the graph evaluation unit 17. At this time, the new data generation time substitution unit 12 reads the final data information from the final time storage unit 11, and also outputs to the graph evaluation unit 17 information indicating which of the pairs of approximate expression and approximate value corresponds to the approximate expression indicated by the final data information. Further, the new data generation time substitution unit 12 also outputs the generated data newly input to the data input unit 10 to the graph evaluation unit 17.
Note that obtaining an approximate value by substituting the generation time of newly generated data into an approximate expression can be regarded as extending the line represented by the approximate expression, in a plane with time on the horizontal axis and the data value on the vertical axis, to the latest generation time on the horizontal axis.
FIG. 10 is an explanatory diagram showing an example of processing of the new data generation time substitution unit 12. In the example shown in FIG. 10, it is assumed that four approximate expressions x = f0 (t), x = f1 (t), x = f2 (t), and x = f3 (t) have been generated in the past. In the example shown in FIG. 10, the value of x obtained by x = f0 (t) is a constant 0. That is, f0 (t) = 0 · t + 0. In FIG. 10, each black circle represents each data. The data shown in the vicinity of the line representing each approximate expression is data that can approximate the data value with the approximate expression. For example, in the example shown in FIG. 10, the data values of six data are approximated by x = f1 (t).
In FIG. 10, the final time indicated by the final data information is t_last, and the data P₁₀₁₀ generated at the final time t_last is approximated by the approximate expression x = f2 (t).
Also, suppose that new generated data is input to the data input unit 10 after the final time t_last, and that its generation time is t = t_i.
The new data generation time substitution unit 12 substitutes the generation time t_i of the new generated data into each of the approximate expressions x = f0 (t), x = f1 (t), x = f2 (t), and x = f3 (t) stored in the approximate expression storage unit 15 to calculate approximate values. Denoting these approximate values by X₁₀₁₀, X₁₀₁₁, X₁₀₁₂, and X₁₀₁₃, the new data generation time substitution unit 12 calculates X₁₀₁₀ = f0 (t_i), X₁₀₁₁ = f1 (t_i), X₁₀₁₂ = f2 (t_i), and X₁₀₁₃ = f3 (t_i). Further, the new data generation time substitution unit 12 reads the final data information from the final time storage unit 11. Then, the new data generation time substitution unit 12 determines that the pair corresponding to the approximate expression indicated by the final data information and the approximate value obtained from it is (x = f2 (t), X₁₀₁₂). Then, the new data generation time substitution unit 12 outputs to the graph evaluation unit 17 each pair of approximate expression and approximate value, information indicating which of those pairs is based on the final approximate expression, and the new data generated at time t_i. At this time, the approximate expression may be expressed by an approximate expression ID or may be expressed in the form stored in the approximate expression storage unit 15.
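The substitution step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: approximate expressions are linear and stored as (primary coefficient, constant term) keyed by approximate expression ID, and the names substitute_generation_time, expressions, and final_id, as well as all coefficient values, are hypothetical.

```python
def substitute_generation_time(expressions, t_new, final_id):
    """Evaluate every stored linear expression x = a*t + b at t_new.

    expressions: dict mapping expression ID -> (a, b)
    final_id:    ID named by the final data information
    Returns a list of (expression ID, approximate value, is_final) tuples.
    """
    results = []
    for expr_id, (a, b) in expressions.items():
        approx = a * t_new + b
        results.append((expr_id, approx, expr_id == final_id))
    return results


# Made-up coefficients; f0 is the constant 0, as in the FIG. 10 example.
expressions = {"f0": (0.0, 0.0), "f1": (1.0, 2.0), "f2": (-0.5, 10.0), "f3": (0.25, 4.0)}
pairs = substitute_generation_time(expressions, t_new=8.0, final_id="f2")
```

Each tuple carries the flag that tells the evaluation stage which pair is based on the final approximate expression.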
The accuracy constraint input unit 16 receives and stores a criterion (accuracy) for judging that an approximate value calculated by an approximate expression appropriately approximates the actual data value. One example of this criterion is that the absolute value of the difference between the approximate value f (t) given by the approximate expression f and the actual generated data value x is less than a predetermined threshold ε. This criterion can be expressed as | x − f (t) | < ε. Another possible criterion is that the absolute value of the ratio of the difference between the approximate value f (t) and the actual generated data value x to the approximate value f (t) is less than the threshold ε. This criterion can be expressed as | (x − f (t)) / f (t) | < ε. In both examples the criterion requires the calculated result to be less than the threshold, but a criterion requiring the result to be equal to or less than the threshold may be used instead. These criteria are examples, and other criteria may be defined.
In the following description, an example is described in which the criterion that the absolute value of the difference between the approximate value f (t) given by the approximate expression f and the data value x of the actually generated data is less than a predetermined threshold ε (that is, | x − f (t) | < ε) is input to the accuracy constraint input unit 16, and the accuracy constraint input unit 16 stores this criterion.
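The two example criteria can be written as small predicates. The following Python sketch is illustrative; the function names are hypothetical, and the handling of f (t) = 0 in the relative criterion is an assumption, since the specification does not say what happens in that case.

```python
def within_absolute_error(x, approx, eps):
    """Criterion |x - f(t)| < eps."""
    return abs(x - approx) < eps


def within_relative_error(x, approx, eps):
    """Criterion |(x - f(t)) / f(t)| < eps.

    Assumption: when f(t) is 0 the ratio is undefined, so the
    criterion is treated as not satisfied.
    """
    if approx == 0:
        return False
    return abs((x - approx) / approx) < eps
```

Replacing `<` with `<=` gives the "equal to or less than the threshold" variant mentioned above.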
The confirmed graph storage unit 19 is a storage device that stores an effective domain for each approximate expression that approximates data generated in the past. FIG. 11 is an explanatory diagram illustrating an example of the effective domain stored for each approximate expression in the confirmed graph storage unit 19. In the example shown in FIG. 11, each approximate expression is represented by an approximate expression ID. For each approximate expression ID, a time range belonging to the effective domain, a time point belonging to the effective domain, or both are defined. In FIG. 11, a portion of the effective domain expressed as a time range (that is, a time period) is shown as a “section”, and a portion expressed as a specific time point is shown as a “point”. Since an effective domain may include a plurality of “sections” and “points”, two or more “sections” and two or more “points” may be defined for one approximate expression.
In the example shown in FIG. 11, three sections [t_01b, t_01e], [t_02b, t_02e], and [t_03b, t_03e] are defined for the approximate expression ID “f0”. Here, t_01b, t_01e, and so on are the times that are the start point or end point of each section. This means that, with the approximate expression of the approximate expression ID “f0” (x = f0 (t)), the data value at a time t satisfying t_01b ≦ t ≦ t_01e, t_02b ≦ t ≦ t_02e, or t_03b ≦ t ≦ t_03e can be approximated by the approximate value f0 (t).
Similarly, one section [t_11b, t_11e] and two time points t₁₂ and t₁₃ are defined for the approximate expression ID “f1”. This means that, with the approximate expression of the approximate expression ID “f1” (x = f1 (t)), the data value at a time t satisfying t_11b ≦ t ≦ t_11e, t = t₁₂, or t = t₁₃ can be approximated by the approximate value f1 (t).
The set of these “sections” and “points” is the effective domain. In the example shown in FIG. 11, the effective domain of the approximate expression of the approximate expression ID “f0” is [t_01b, t_01e] ∪ [t_02b, t_02e] ∪ [t_03b, t_03e] = {t | t_01b ≦ t ≦ t_01e or t_02b ≦ t ≦ t_02e or t_03b ≦ t ≦ t_03e}. Similarly, the effective domain of the approximate expression of the approximate expression ID “f1” is [t_11b, t_11e] ∪ {t₁₂} ∪ {t₁₃} = {t | t_11b ≦ t ≦ t_11e or t = t₁₂ or t = t₁₃}. The effective domain of the approximate expression of the approximate expression ID “f2” is [t_22b, t_22e] ∪ [t_23b, t_23e] ∪ {t₂₁} = {t | t_22b ≦ t ≦ t_22e or t_23b ≦ t ≦ t_23e or t = t₂₁}. The effective domains of the other approximate expressions can likewise be specified from the information stored in the confirmed graph storage unit 19.
The confirmed graph storage unit 19 may store information having the following data structure. That is, the confirmed graph storage unit 19 may store information of a data structure in which an approximate expression for calculating an approximate value of a data value by substituting a variable is associated with an effective domain, which is the range of the variable over which the approximate value of the data value can be obtained, and in which the effective domain is represented by a set of sections of the variable and points each representing one value of the variable. In the first embodiment, this variable is a variable representing time.
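One possible in-memory form of this data structure is sketched below in Python. The class name EffectiveDomain, its fields, and the numeric times are hypothetical; the sketch assumes closed sections [start, end] and counts two stored numerical values per section and one per point, in line with the way the description counts stored values.

```python
from dataclasses import dataclass, field


@dataclass
class EffectiveDomain:
    """Effective domain of one approximate expression (hypothetical layout)."""
    sections: list = field(default_factory=list)  # list of (start, end) tuples
    points: list = field(default_factory=list)    # list of single time values

    def contains(self, t):
        """True if time t lies in any section or equals any point."""
        in_section = any(start <= t <= end for start, end in self.sections)
        return in_section or t in self.points

    def stored_values(self):
        """Number of numerical values needed to represent the domain."""
        return 2 * len(self.sections) + len(self.points)


# Table keyed by approximate expression ID, with made-up times.
confirmed_graph = {
    "f0": EffectiveDomain(sections=[(5, 9), (13, 16), (25, 28)]),
    "f1": EffectiveDomain(sections=[(0, 4)], points=[11, 20]),
}
```

Because each expression keeps its own sections and points, elements belonging to different expressions may interleave freely along the time axis.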
Also, in this data structure, a section or point of the effective domain associated with another approximate expression is permitted to exist between two sections, between two points, or between a section and a point of the effective domain associated with a certain approximate expression. For example, arranged in time-series order, the sections and points of the effective domains exemplified above are [t_11b, t_11e], [t_01b, t_01e], t₁₂, [t_02b, t_02e], t₂₁, t₁₃, [t_22b, t_22e], [t_03b, t_03e], [t_31b, t_31e], [t_23b, t_23e]. In this example, the point t₁₂ of another approximate expression “f1” is permitted to exist between the sections [t_01b, t_01e] and [t_02b, t_02e] of the approximate expression “f0”. Further, for example, the section [t_02b, t_02e] of another approximate expression “f0” and the point t₂₁ of another approximate expression “f2” are permitted to exist between the points t₁₂ and t₁₃ of the approximate expression “f1”. Also, for example, the section [t_01b, t_01e] of another approximate expression “f0” is permitted to exist between the section [t_11b, t_11e] and the point t₁₂ of the approximate expression “f1”. In this way, elements (sections or points) of the effective domain of another approximate expression may exist between elements (sections or points) of the effective domain of a certain approximate expression. The data summarization system, data summarization method, and data summarization program according to each aspect of the present invention can preferably use such a data structure.
The graph evaluation unit 17 compares each approximate value, which the new data generation time substitution unit 12 calculated by substituting the generation time of the new data into each approximate expression stored in the approximate expression storage unit 15, with the actual data value of the new data. Then, the graph evaluation unit 17 specifies an approximate expression that satisfies the criterion stored in the accuracy constraint input unit 16. When a plurality of approximate expressions satisfy the criterion, the graph evaluation unit 17 identifies, among them, the approximate expression that minimizes the increase in storage capacity when its effective domain is updated, and determines that the data value of the new data is to be approximated by that approximate expression. When only one approximate expression satisfies the criterion, the graph evaluation unit 17 determines that the data value of the new data is to be approximated by that approximate expression. Then, the graph evaluation unit 17 outputs the determined approximate expression, its effective domain, and the new data (the generated data received from the new data generation time substitution unit 12) to the graph update unit 18. At this time, the graph evaluation unit 17 also outputs to the graph update unit 18 information indicating whether or not the determined approximate expression is the final approximate expression. The graph evaluation unit 17 may output, for example, the approximate expression ID to the graph update unit 18 as the determined approximate expression. The graph evaluation unit 17 may read the effective domain from the confirmed graph storage unit 19.
When the approximate expressions output from the new data generation time substitution unit 12 to the graph evaluation unit 17 are expressed in the form of approximate expression IDs, the graph evaluation unit 17 reads all the approximate expressions stored in the approximate expression storage unit 15 from the approximate expression storage unit 15.
The graph evaluation unit 17 may not be able to specify an approximate expression that satisfies the criteria stored in the accuracy constraint input unit 16. That is, there may be no approximate expression that satisfies the criteria stored in the accuracy constraint input unit 16. In that case, the graph evaluation unit 17 may output new data (generated data received from the new data generation time substitution unit 12) to the graph update unit 18 without selecting an approximate expression.
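The first stage of this evaluation, finding which approximate expressions satisfy the accuracy criterion, can be sketched in Python as follows. The function name candidates_within_accuracy and the sample values are hypothetical; the sketch uses the absolute-difference criterion | x − f (t) | < ε and returns an empty list when no approximate expression qualifies, which corresponds to the case where only the new data is passed on.

```python
def candidates_within_accuracy(approximations, x_new, eps):
    """Return IDs of expressions whose approximate value is within eps of x_new.

    approximations: dict mapping expression ID -> approximate value at t_i
    An empty result means no stored expression approximates the new data.
    """
    return [
        expr_id
        for expr_id, approx in approximations.items()
        if abs(x_new - approx) < eps
    ]


# Made-up approximate values for four stored expressions.
approximations = {"f0": 0.0, "f1": 10.0, "f2": 6.0, "f3": 6.3}
hits = candidates_within_accuracy(approximations, x_new=6.1, eps=0.5)
misses = candidates_within_accuracy(approximations, x_new=20.0, eps=0.5)
```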
An example of approximate expression selection by the graph evaluation unit 17 will be specifically described with reference to FIGS. 12 to 14. FIGS. 12, 13, and 14 are explanatory diagrams illustrating examples of selecting an approximate expression that satisfies the criterion stored in the accuracy constraint input unit 16. Note that I₁₁, I₀₁, I₁₂, and so on shown in these figures are sections and points included in the effective domains, and correspond to the sections and points of the effective domains exemplified above. In the examples shown in FIG. 12 to FIG. 14, x = f0 (t), x = f1 (t), x = f2 (t), and x = f3 (t) have been generated in the past as approximate expressions. The effective domain of x = f0 (t) is I₀₁ ∪ I₀₂ ∪ I₀₃. The effective domain of x = f1 (t) is I₁₁ ∪ I₁₂ ∪ I₁₃. The effective domain of x = f2 (t) is I₂₁ ∪ I₂₂ ∪ I₂₃. The effective domain of x = f3 (t) is I₃₁. Further, let t_i be the generation time of the new generated data received by the graph evaluation unit 17, and let x_i be the data value of that generated data.
FIG. 12 illustrates the case where data P₁₀₂₁ occurs at time t_i. In the example shown in FIG. 12, the approximate value obtained by substituting the generation time t_i into x = f1 (t) is X₁₀₁₁ = f1 (t_i). If x = f1 (t) is the only approximate expression satisfying the criterion that the absolute value of the difference between the data value x_i of the data P₁₀₂₁ and the approximate value is less than the threshold ε, the graph evaluation unit 17 selects x = f1 (t). That is, if only | x_i − X₁₀₁₁ | < ε holds, the graph evaluation unit 17 selects x = f1 (t).
When a plurality of approximate expressions satisfy the accuracy criterion, the graph evaluation unit 17 considers each of those approximate expressions individually. For the approximate expression under consideration, the graph evaluation unit 17 calculates the increase in the storage capacity for storing the effective domain that would result if the effective domain were updated on the assumption that this approximate expression represents the new generated data. Then, the graph evaluation unit 17 selects the approximate expression with the smallest increase. FIG. 13 illustrates a selection example of this aspect.
FIG. 13 illustrates the case where data P₁₀₂₂ occurs at time t_i. In the example shown in FIG. 13, the approximate value obtained by substituting the generation time t_i into x = f2 (t) is X₁₀₁₂ = f2 (t_i). The approximate value obtained by substituting t_i into x = f3 (t) is X₁₀₁₃ = f3 (t_i). Both of these approximate expressions satisfy the criterion that the absolute value of the difference between the data value x_i of the data P₁₀₂₂ and the approximate value is less than the threshold ε. That is, | x_i − X₁₀₁₂ | < ε and | x_i − X₁₀₁₃ | < ε both hold.
In this case, the graph evaluation unit 17 may select, among the approximate expressions that satisfy the criterion, the final approximate expression whose effective domain ends at the end point of a “section”. When t_i is added to the effective domain of a final approximate expression whose effective domain ends at the end point of a “section”, the graph evaluation unit 17 only needs to replace the end point of that “section” with t_i. This is because, with this process, the increase in storage capacity for representing the effective domain is zero. On the other hand, for an approximate expression other than the final approximate expression, or for a final approximate expression whose effective domain ends in a “point”, adding t_i to the effective domain increases the storage capacity for storing the effective domain by one numerical value. Therefore, when a plurality of approximate expressions satisfy the criterion, the graph evaluation unit 17 selects the final approximate expression whose effective domain ends at the end point of a “section”.
In this example, when the approximate expression representing the generated data P₁₀₂₂ is assumed to be x = f2 (t), the storage capacity necessary for expressing the effective domain does not increase when the effective domain is updated. The effective domain of x = f2 (t) before the generated data P₁₀₂₂ is input is I₂₁ ∪ I₂₂ ∪ I₂₃. Since I₂₁ = {t₂₁}, I₂₂ = [t_22b, t_22e], and I₂₃ = [t_23b, t_23e], this effective domain is expressed by the five numerical values t₂₁, t_22b, t_22e, t_23b, and t_23e. The end of this effective domain is the end point (right end) of the section I₂₃, namely t_23e = t_last. Therefore, the section obtained by adding the time t_i to the section I₂₃ can be expressed as I₂₃ ∪ {t_i} = [t_23b, t_i]. That is, when the graph evaluation unit 17 adds the time t_i to the effective domain of x = f2 (t), the updated effective domain is {t₂₁} ∪ [t_22b, t_22e] ∪ [t_23b, t_i], so the storage capacity necessary for expressing the effective domain does not increase.
On the other hand, in this example, when the approximate expression representing the generated data P₁₀₂₂ is assumed to be x = f3 (t), the storage capacity necessary for expressing the effective domain increases when the effective domain is updated. The effective domain of x = f3 (t) before the generated data P₁₀₂₂ is input is I₃₁. Since I₃₁ = [t_31b, t_31e], this effective domain is expressed by the two numerical values t_31b and t_31e. Here, t_31e < t_last < t_i. Since t_last belongs to the effective domain of x = f2 (t), the graph evaluation unit 17 cannot include t_last in the effective domain of x = f3 (t). Therefore, when the graph evaluation unit 17 adds the time t_i to the effective domain of x = f3 (t), the result is I₃₁ ∪ {t_i} = [t_31b, t_31e] ∪ {t_i}, and {t_i} is added as a point. As a result, the effective domain is expressed by the three numerical values t_31b, t_31e, and t_i, so the storage capacity increases by one numerical value.
Therefore, in the example shown in FIG. 13, the graph evaluation unit 17 selects x = f2 (t) among the approximate expressions that satisfy the criterion.
Note that, also for a final approximate expression whose effective domain ends in a “point”, adding the new generation time t_i to the effective domain increases the storage capacity for expressing the effective domain by one numerical value. This is because a section that starts at the time that was a “point” before the update and ends at t_i is newly defined. That is, the graph evaluation unit 17 has to store the portion that was stored as a “point” as a “section”.
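The storage increase rule described for these cases can be sketched in Python as follows. The function name storage_increase and the numeric times are hypothetical. The sketch assumes a non-empty effective domain held as a list of (start, end) sections plus a list of points, and counts the increase in stored numerical values when a new generation time later than every stored time is added.

```python
def storage_increase(sections, points, is_final):
    """Number of extra numerical values needed to add a new time t_i.

    0 only when the expression is the final approximate expression and
    its effective domain ends at the end point of a section; otherwise 1.
    """
    last_section_end = max((end for _, end in sections), default=None)
    last_point = max(points, default=None)
    ends_in_section = last_section_end is not None and (
        last_point is None or last_section_end > last_point
    )
    return 0 if (is_final and ends_in_section) else 1


# f2-like case: final expression whose domain ends with a section.
inc_f2 = storage_increase([(22, 25), (30, 35)], [10], is_final=True)
# f3-like case: not the final expression, so t_i is stored as a new point.
inc_f3 = storage_increase([(26, 28)], [], is_final=False)
# Final expression whose domain ends with a point: the point becomes a section.
inc_point_end = storage_increase([(1, 2)], [5], is_final=True)
```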
When a plurality of approximate expressions satisfy the accuracy criterion and the increase in storage capacity for storing the updated effective domain is the same for each of them, the graph evaluation unit 17 may select one of those approximate expressions. The method for selecting that approximate expression may be any method determined in advance.
FIG. 14 illustrates the case where data P₁₀₂₃ occurs at time t_i. In the example shown in FIG. 14, the approximate value obtained by substituting the generation time t_i into x = f1 (t) is X₁₀₁₁ = f1 (t_i). The approximate value obtained by substituting t_i into x = f3 (t) is X₁₀₁₃ = f3 (t_i). Both of these approximate expressions satisfy the criterion that the absolute value of the difference between the data value x_i of the data P₁₀₂₃ and the approximate value is less than the threshold ε. That is, | x_i − X₁₀₁₁ | < ε and | x_i − X₁₀₁₃ | < ε both hold. Also, neither x = f1 (t) nor x = f3 (t) is the final approximate expression, and the increase in storage capacity when t_i is added is the same for the two approximate expressions.
In this case, the graph evaluation unit 17 may select the approximate expression that minimizes the difference between the approximate value and the actual data value x_i. In this example, | x_i − X₁₀₁₁ | < | x_i − X₁₀₁₃ | < ε, so the graph evaluation unit 17 may select x = f1 (t). This selection method is one example of selecting one approximate expression from a plurality of approximate expressions having the same increase in storage capacity, and the approximate expression may be selected by another method.
For example, among a plurality of approximate expressions having the same increase in storage capacity when t_i is added to the effective domain, the graph evaluation unit 17 may select the approximate expression whose upper limit (end) of the effective domain is closest to the current time. In other words, the graph evaluation unit 17 may select the approximate expression whose effective domain has the largest upper limit (end). In the example shown in FIG. 14, the upper limit of the effective domain of x = f1 (t) is the point t₁₃, since I₁₃ = {t₁₃}. The upper limit of the effective domain of x = f3 (t) is t_31e, the end point of the section I₃₁. Since t₁₃ < t_31e, the graph evaluation unit 17 may select x = f3 (t). Further, the graph evaluation unit 17 may select one approximate expression from a plurality of approximate expressions by a method other than those exemplified here.
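The overall selection, including one of the tie-breaking options, can be sketched in Python as follows. The function name select_expression and the dictionary keys are hypothetical. The sketch assumes each candidate already satisfies the accuracy criterion and carries its storage increase and its absolute error | x_i − f (t_i) |; ties on the storage increase are broken by the smaller error, which is only one of the options the description allows.

```python
def select_expression(candidates):
    """Pick one expression ID from candidates that satisfy the criterion.

    candidates: list of dicts with keys "id", "increase", and "error".
    Primary key: smallest storage increase; tie-break: smallest error.
    Returns None when there are no candidates.
    """
    if not candidates:
        return None
    best = min(candidates, key=lambda c: (c["increase"], c["error"]))
    return best["id"]


# FIG. 13-like situation: f2 needs no extra storage, so it wins
# even though f3 is closer to the data value (made-up errors).
choice_13 = select_expression([
    {"id": "f2", "increase": 0, "error": 0.4},
    {"id": "f3", "increase": 1, "error": 0.1},
])
# FIG. 14-like situation: equal increase, so the smaller error decides.
choice_14 = select_expression([
    {"id": "f1", "increase": 1, "error": 0.1},
    {"id": "f3", "increase": 1, "error": 0.3},
])
```

To use the other tie-break described above, the second element of the sort key could be replaced by the negated upper limit of each candidate's effective domain.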
When there is an input from the graph evaluation unit 17, the graph update unit 18, depending on the content of the input, either updates the effective domain of the approximate expression selected by the graph evaluation unit 17 or adds an uncertain point to the uncertain point storage unit 13. In addition, when there is an input from the new approximate expression generation unit 14, the graph update unit 18 newly registers the approximate expression in the approximate expression storage unit 15.
When the approximate expression determined by the graph evaluation unit 17 and its effective domain, the new generated data, and information indicating whether the determined approximate expression is the final approximate expression are input from the graph evaluation unit 17, the graph update unit 18 updates the effective domain of the approximate expression. The graph update unit 18 may update the effective domain so as to add the generation time of the new generated data to the effective domain of the input approximate expression. The graph update unit 18 stores the updated effective domain of the approximate expression determined by the graph evaluation unit 17 in the confirmed graph storage unit 19.
At this time, the graph update unit 18 may update the effective domain according to the following cases. As a first update mode, if the approximate expression input from the graph evaluation unit 17 is the final approximate expression and the end of the effective domain of the approximate expression is the end point of a “section”, the graph update unit 18 may replace the end point of that “section” with the generation time of the new generated data.
As a second update mode, if the approximate expression input from the graph evaluation unit 17 is the final approximate expression and the end of its effective domain is a “point”, the graph update unit 18 may create a new “section” that has that point as its start point and the generation time of the new generated data as its end point. The “point” that was the end of the effective domain is then included in this “section”. Therefore, the graph update unit 18 may exclude the “point” that was the end of the effective domain from the “points” of the effective domain. In the example illustrated in FIG. 11, the graph update unit 18 removes one point from the “point” item and adds a new section to the “section” item.
As a third update mode, when the approximate expression input from the graph evaluation unit 17 is not the final approximate expression, the graph update unit 18 adds the generation time of the new generated data to the effective domain of the approximate expression as a “point”.
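The three update modes can be sketched together in Python as follows. The function name update_domain and the numeric times are hypothetical. The sketch assumes the effective domain is held as a list of (start, end) sections plus a list of points, that the new generation time is later than every stored time, and it returns new lists rather than modifying its arguments.

```python
def update_domain(sections, points, t_new, is_final):
    """Add generation time t_new to an effective domain.

    Mode 1: final expression ending in a section -> move that end point.
    Mode 2: final expression ending in a point   -> point becomes a section.
    Mode 3: not the final expression             -> add t_new as a point.
    """
    sections, points = list(sections), list(points)
    if not is_final:
        points.append(t_new)                      # third update mode
        return sections, points
    if not sections and not points:
        raise ValueError("a final expression must have a non-empty domain")
    last_end = max((end for _, end in sections), default=None)
    last_point = max(points, default=None)
    if last_point is None or (last_end is not None and last_end > last_point):
        # first update mode: replace the end point of the last section
        idx = max(range(len(sections)), key=lambda i: sections[i][1])
        sections[idx] = (sections[idx][0], t_new)
    else:
        # second update mode: the trailing point becomes the start of a section
        points.remove(last_point)
        sections.append((last_point, t_new))
    return sections, points


mode1 = update_domain([(22, 25), (30, 35)], [10], t_new=40, is_final=True)
mode2 = update_domain([(1, 2)], [5], t_new=7, is_final=True)
mode3 = update_domain([(26, 28)], [], t_new=40, is_final=False)
```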
Further, when updating the effective definition area of the approximate expression determined by the graph evaluation unit 17, the graph update unit 18 also updates the final data information. The graph update unit 18 updates the approximate expression ID of the final data information (see FIG. 9) stored in the final time storage unit 11 to the approximate expression ID of the approximate expression determined by the graph evaluation unit 17, The last time is updated to the time of occurrence of new occurrence data.
When the graph evaluation unit 17 outputs only the new generated data to the graph update unit 18 without selecting an approximate expression, the graph update unit 18 stores the generated data as an uncertain point in the uncertain point storage unit 13. In this case, the graph update unit 18 only stores one uncertain point in the uncertain point storage unit 13, and does not update the information stored in the final time storage unit 11, the approximate expression storage unit 15, and the confirmed graph storage unit 19.
Further, when the new approximate expression generation unit 14 newly generates an approximate expression from the uncertain points, the new approximate expression generation unit 14 outputs the approximate expression and each data item used for generating it (that is, each uncertain point stored in the uncertain point storage unit 13 and the newly input generated data) to the graph update unit 18. In this case, the graph update unit 18 assigns an approximate expression ID to the new approximate expression, and stores the new approximate expression and the approximate expression ID in association with each other in the approximate expression storage unit 15. At this time, the graph update unit 18 deletes the uncertain points used for generating the new approximate expression. In other words, the graph update unit 18 deletes each uncertain point stored in the uncertain point storage unit 13. Further, the graph update unit 18 determines the effective domain of the new approximate expression from the generation times of the data used for generating the approximate expression, and stores it in the confirmed graph storage unit 19 together with the approximate expression ID. When defining the effective domain, the graph update unit 18 defines it so that the number of sections whose start and end points are generation times of the data used for generating the new approximate expression is as small as possible, and so that it does not overlap the effective domain of any existing approximate expression. The graph update unit 18 may define as a “point” any time that lies in isolation between sections and points of the effective domains of existing approximate expressions and therefore cannot serve as the start point or end point of a section.
Referring to FIG. 11, FIG. 13, FIG. 12, and FIG. 15, an example of updating the valid domain by the graph update unit 18 will be described.
First, an example of effective domain update when the approximate expression determined by the graph evaluation unit 17 as an approximate expression for approximating the data value of new data is the final approximate expression will be described with reference to FIG. In the example shown in FIG. 13, the approximate expression ID that designates the approximate expression “x = f2 (t)”, the effective domain of the approximate expression, and the generated data P₁₀₂₂, “X = f2 (t)” is output from the graph evaluation unit 17 to the graph update unit 18 to the effect that it is the final approximate expression. The graph updating unit 18 uses the effective domain I of “x = f2 (t)”.₂₁∪I₂₂∪I₂₃{T_i}, The effective domain is updated to add. At this time, the end of the effective domain is the section I₂₃Since the end point of the graph, the graph updating unit 18₂₃[T_23b, T_23e] (See FIG. 11) end point t_23eT_iUpdate to. The graph update unit 18 stores the updated valid definition area in the confirmed graph storage unit 19. As a result, the section corresponding to the approximate expression ID “f2” shown in FIG._22b, T_22e], [T_23b, T_23e] To [t_22b, T_22e], [T_23b, T_i] Is updated.
Next, an example of updating the effective domain when the approximate expression determined by the graph evaluation unit 17 to approximate the data value of the new data is not the final approximate expression will be described. In the example shown in FIG. 12, the graph evaluation unit 17 outputs to the graph update unit 18 the approximate expression ID designating the approximate expression "x = f1(t)", the effective domain of that approximate expression, the generated data P₁₀₂₁, and information indicating that "x = f1(t)" is not the final approximate expression. In this case, since the approximate expression designated by the graph evaluation unit 17 is not the final approximate expression, the graph update unit 18 may add the occurrence time t_i of the generated data P₁₀₂₁ to the effective domain of x = f1(t) as a "point". In other words, in the example illustrated in FIG. 11, the graph update unit 18 updates the stored contents of the confirmed graph storage unit 19 by adding t_i to the "points" of the effective domain corresponding to the approximate expression ID "f1".
Next, an example of registering a new effective domain when the new approximate expression generation unit 14 newly generates an approximate expression is shown. In the example shown in FIG. 15, the new approximate expression generation unit 14 generates the approximate expression "x = f_new(t)" from the undetermined point P₁₀₃₀ and the new generated data P₁₀₃₁, and outputs x = f_new(t), the undetermined point P₁₀₃₀, and the new generated data P₁₀₃₁ to the graph update unit 18. The graph update unit 18 assigns an approximate expression ID to the approximate expression "x = f_new(t)" and stores "x = f_new(t)" in the approximate expression storage unit 15. The graph update unit 18 also determines the effective domain from the undetermined point P₁₀₃₀ and the new generated data P₁₀₃₁. In this example, the graph update unit 18 can define a single section I_new, running from the first data generation time to the last data generation time of P₁₀₃₀ and P₁₀₃₁, that does not overlap the effective domain of any other approximate expression. Therefore, the graph update unit 18 stores the approximate expression ID of "x = f_new(t)" and the effective domain (section I_new) in the confirmed graph storage unit 19.
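The three cases above (extending the last section, adding a point, and registering a new section) can be illustrated with a small data structure. The following Python sketch is only an illustration under stated assumptions: the class name `EffectiveDomain`, its methods, and the numeric times are hypothetical and do not appear in the specification, and the case of a final approximate expression whose effective domain ends in a point, which this passage does not describe, is simplified to adding a point.

```python
from dataclasses import dataclass, field


@dataclass
class EffectiveDomain:
    """Effective domain of one approximate expression (hypothetical model)."""
    sections: list = field(default_factory=list)  # [start, end] pairs in time order
    points: list = field(default_factory=list)    # isolated occurrence times

    def ends_in_section(self):
        """True if the latest time in the domain is the end point of a section."""
        if not self.sections:
            return False
        last_point = max(self.points, default=float("-inf"))
        return self.sections[-1][1] > last_point

    def add_time(self, t_i, is_final):
        """Add the occurrence time t_i of new data to the domain."""
        if is_final and self.ends_in_section():
            self.sections[-1][1] = t_i   # extend the last section (FIG. 13 case)
        else:
            self.points.append(t_i)      # record an isolated point (FIG. 12 case)

    def stored_numbers(self):
        """Two numbers per section and one number per point."""
        return 2 * len(self.sections) + len(self.points)


# Final approximate expression: the last section [20, 25] becomes [20, 26].
f2 = EffectiveDomain(sections=[[0, 4], [10, 12], [20, 25]])
f2.add_time(26, is_final=True)

# Non-final approximate expression: the time is added as a point.
f1 = EffectiveDomain(sections=[[5, 9]])
f1.add_time(27, is_final=False)

# Newly generated expression: one section from the first to the last data time.
f_new = EffectiveDomain(sections=[[30, 31]])
```

In this sketch, extending a section leaves the number of stored values unchanged, while adding a point increases it by one, which is why the final approximate expression is the cheaper choice when its domain ends in a section.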
The data input unit 10, the new data generation time substitution unit 12, the new approximate expression generation unit 14, the accuracy constraint input unit 16, the graph evaluation unit 17, and the graph update unit 18 are realized by, for example, a CPU of a computer that operates according to the data summarization program. In this case, a program storage device (not shown) of the computer stores the data summarization program, and the CPU reads the program and, in accordance with it, operates as the data input unit 10, the new data generation time substitution unit 12, the new approximate expression generation unit 14, the accuracy constraint input unit 16, the graph evaluation unit 17, and the graph update unit 18. Alternatively, each of these units may be realized by separate hardware.
Further, the final time storage unit 11, the uncertain point storage unit 13, the approximate expression storage unit 15, and the confirmed graph storage unit 19 may be realized by separate storage devices. Alternatively, it may be realized by the same storage device. Further, some combinations of the final time storage unit 11, the uncertain point storage unit 13, the approximate expression storage unit 15, and the confirmed graph storage unit 19 may be realized by the same storage device.
Next, the operation of the first embodiment will be described.
FIG. 16 is a flowchart illustrating an example of processing progress of the first embodiment. When a data generation source (not shown) sequentially generates data (No in step S100), data is sequentially input from the data generation source to the data input unit 10 (step S101). In this example, it is assumed that data is input to the data input unit 10 in order of generation time one by one. The data summarization system performs the subsequent operations for each piece of data. A plurality of data may be input to the data input unit 10 all together, but even in that case, the data summarization system performs the subsequent processing for each piece of data in order of data generation time.
Next, the new approximate expression generation unit 14 determines whether a new approximate expression can be generated from the one piece of generated data input to the data input unit 10 and the undetermined points stored in the undetermined point storage unit 13 (step S102). In step S102, the new approximate expression generation unit 14 may read each undetermined point stored in the undetermined point storage unit 13 and determine whether the undetermined points, together with the new generated data input to the data input unit 10, make up the number of data items necessary for generating an approximate expression. The new approximate expression generation unit 14 may determine that a new approximate expression can be generated if the necessary number of data items is available, and that a new approximate expression cannot be generated otherwise. As already described, the number necessary for generating the approximate expression is determined in advance according to the type of approximate expression and the approximate expression generation algorithm.
When the new approximate expression generation unit 14 determines that a new approximate expression can be generated (Yes in step S102), it generates, from each undetermined point and the one piece of generated data input to the data input unit 10, an approximate expression that approximates the data value with the generation time of the data as a variable (step S103).
The type of approximate expression and the approximate expression generation algorithm are determined in advance, but neither is particularly limited. As already described, assume that the approximate expression is a linear expression with the occurrence time t as a variable. When four data items are available, the new approximate expression generation unit 14 may generate the approximate expression by determining the first-order coefficient and the constant term from the four pairs of occurrence time and data value by the least squares method. When two data items are available, the new approximate expression generation unit 14 may obtain the straight line passing through the two points whose coordinates are (occurrence time, data value) as a linear expression with the occurrence time t as a variable. In this case as well, the new approximate expression generation unit 14 may determine the first-order coefficient and the constant term.
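As an illustration of step S103 for a linear approximate expression, the following Python sketch determines the first-order coefficient and the constant term by ordinary least squares. The function name `fit_linear` and the sample values are hypothetical. With exactly two data items, the least-squares line coincides with the straight line through the two points, so one routine covers both cases mentioned above.

```python
def fit_linear(samples):
    """Return (a, b) of x = a*t + b fitted to (t, x) pairs by least squares."""
    n = len(samples)
    if n < 2:
        raise ValueError("at least two data items are required")
    mean_t = sum(t for t, _ in samples) / n
    mean_x = sum(x for _, x in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("occurrence times must not all be equal")
    cov_tx = sum((t - mean_t) * (x - mean_x) for t, x in samples)
    a = cov_tx / var_t            # first-order coefficient
    b = mean_x - a * mean_t       # constant term
    return a, b


# Two data items: the line through both points.
two_point = fit_linear([(0, 1), (2, 5)])

# Four data items lying on x = 2t + 1.
four_point = fit_linear([(0, 1), (1, 3), (2, 5), (3, 7)])
```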
In this example, it is assumed that the approximate expression is a linear expression, and the approximate expression is represented by a primary coefficient and a constant term of the variable t as shown in FIG. However, the expression form of the approximate expression is not limited to this form, and the approximate expression may be expressed in another form.
In step S103, the new approximate expression generation unit 14 outputs the generated new approximate expression and each data (each indeterminate point and new generated data) used for generating the approximate expression to the graph update unit 18.
The graph update unit 18 assigns an approximate expression ID to the approximate expression received from the new approximate expression generation unit 14 and stores the approximate expression together with the approximate expression ID in the approximate expression storage unit 15 (step S104). As a result, one approximate expression is newly registered. In step S104, the graph update unit 18 further determines the effective domain of the newly generated approximate expression from the generation times of the data used for generating it, and stores the effective domain in the confirmed graph storage unit 19 together with the approximate expression ID. At this time, the graph update unit 18 determines the effective domain of the new approximate expression so that the number of sections whose start and end points are generation times of the data used for generating the approximate expression is as small as possible, and so that it does not overlap the effective domains of the existing approximate expressions. The graph update unit 18 may define, as a "point" (see FIG. 11), a time that lies in isolation between the sections and points of the effective domains of the existing approximate expressions and cannot be set as the start point or end point of a section.
When the new approximate expression generation unit 14 determines that a new approximate expression cannot be generated (No in step S102), the new data generation time substitution unit 12 substitutes the generation time of the data input to the data input unit 10 into each approximate expression already generated (that is, each approximate expression stored in the approximate expression storage unit 15), and thereby calculates an approximate value of the data input to the data input unit 10 for each approximate expression (step S105). The new data generation time substitution unit 12 then outputs each pair of an approximate expression and the approximate value calculated with it to the graph evaluation unit 17. At this time, the new data generation time substitution unit 12 refers to the final data information and also outputs to the graph evaluation unit 17 information indicating which of these pairs corresponds to the final approximate expression. Further, the new data generation time substitution unit 12 also outputs the generated data currently being processed (the data input to the data input unit 10) to the graph evaluation unit 17. Details of the processing in step S105 will be described later.
After step S105, the graph evaluation unit 17 compares the approximate value of the data calculated for each approximate expression with the actual data value of the data input to the data input unit 10 (step S106). This example assumes that the accuracy constraint input unit 16 stores the criterion that the absolute value of the difference between the approximate value and the actual data value is less than the threshold ε (that is, |x − f(t)| < ε), and that the graph evaluation unit 17 selects an approximate expression that satisfies this criterion. In this case, in step S106, the graph evaluation unit 17 calculates the absolute value of the difference between the approximate value of the data calculated for each approximate expression and the actual data value of the data input to the data input unit 10.
Next, the graph evaluation unit 17 determines whether there is an approximate expression that satisfies the criterion that the absolute value of the difference calculated in step S106 is less than the threshold ε (step S107).
When there is an approximate expression that satisfies the criterion (Yes in step S107), the graph evaluation unit 17 selects, from among the approximate expressions that satisfy the criterion, the one that minimizes the increase in storage capacity when the effective domain is updated (step S108). That is, the graph evaluation unit 17 selects the approximate expression for which the increase from the storage capacity needed for the effective domain before the update to that needed after the update is smallest.
However, if there is only one approximate expression that satisfies the criterion, the graph evaluation unit 17 may select the approximate expression.
In addition, if a plurality of approximate expressions satisfy the criterion and these include the final approximate expression, with the end of its effective domain being the end point of a "section", the graph evaluation unit 17 may select the final approximate expression. This is because selecting that final approximate expression minimizes the increase in storage capacity.
If a plurality of approximate expressions satisfy the criterion but none of them is a final approximate expression whose effective domain ends at the end point of a "section", the graph evaluation unit 17 selects one of the approximate expressions that satisfy the criterion. The method of this selection may be determined in advance.
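The selection in steps S106 to S108 can be sketched as follows. This is a minimal Python illustration, not the specification's implementation: the function name `select_expression`, the tuple layout of `candidates`, and the sample values are hypothetical. Since the text only says the fallback selection method "may be determined in advance", the smallest-error rule used here is an assumption.

```python
def select_expression(candidates, x_actual, eps, final_id):
    """Pick the ID of the approximate expression for new data, or None.

    candidates: list of (expr_id, approx_value, domain_ends_in_section).
    """
    # S106/S107: keep expressions that satisfy |x - f(t)| < eps.
    passing = [c for c in candidates if abs(x_actual - c[1]) < eps]
    if not passing:
        return None
    # S108: a final approximate expression whose domain ends in a section
    # only needs its section end point moved, so storage does not grow.
    for expr_id, _, ends_in_section in passing:
        if expr_id == final_id and ends_in_section:
            return expr_id
    # Assumed predetermined rule: the expression with the smallest error.
    return min(passing, key=lambda c: abs(x_actual - c[1]))[0]


candidates = [("f1", 10.2, True), ("f2", 10.4, True)]
chosen_final = select_expression(candidates, 10.0, 0.5, final_id="f2")
chosen_other = select_expression(candidates, 10.0, 0.5, final_id="f3")
chosen_none = select_expression(candidates, 10.0, 0.1, final_id="f2")
```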
When the graph evaluation unit 17 selects one approximate expression satisfying the criterion in step S108, the graph evaluation unit 17 outputs the selected approximate expression and its effective definition area and the generated data to be processed to the graph update unit 18. The graph evaluation unit 17 also outputs information indicating whether or not the selected approximate expression is the final approximate expression to the graph update unit 18. The graph evaluation unit 17 may output an approximate expression ID for identifying the approximate expression to the graph update unit 18 as an approximate expression.
The graph update unit 18 updates the effective domain of the approximate expression received from the graph evaluation unit 17 in step S108 and stores it in the confirmed graph storage unit 19 (step S109). The ways in which the graph update unit 18 updates the effective domain have already been described, so their description is omitted here. In step S109, the graph update unit 18 also updates the approximate expression ID of the final data information (see FIG. 9) stored in the final time storage unit 11 to the approximate expression ID of the approximate expression selected in step S108, and updates the final time (see FIG. 9) of the final data information to the generation time of the generated data being processed.
When it is determined in step S107 that there is no approximate expression that satisfies the criterion (No in step S107), the graph evaluation unit 17 outputs the generated data to be processed to the graph update unit 18, and the graph update unit 18 stores the generated data as an undetermined point in the undetermined point storage unit 13 (step S110).
When the data summarization system completes any one of steps S104, S109, and S110, it repeats the same processing for the next data (the next data in order of occurrence time) input to the data input unit 10. In this way, the processing from step S102 onward is performed individually for each piece of generated data input to the data input unit 10, in order of generation time. When the data generation source ends data generation (Yes in step S100), the data summarization system ends the process.
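The flow of FIG. 16 can be sketched end to end for the simplest configuration, in which the approximate expression is a straight line generated from one undetermined point and one new data item. The Python code below is a sketch under several assumptions that are not stated in the specification: the names `summarize` and `ends_in_section` are hypothetical, occurrence times are assumed to be distinct, a newly generated expression is treated as the final approximate expression, the first registered expression is used as the predetermined tie-break, and a final approximate expression whose domain ends in a point simply receives another point.

```python
def ends_in_section(domain):
    """True if the latest time in the domain is the end point of a section."""
    last_point = max(domain["points"], default=float("-inf"))
    return bool(domain["sections"]) and domain["sections"][-1][1] > last_point


def summarize(stream, eps):
    """Summarize (t, x) pairs given in order of occurrence time."""
    expressions = {}  # expr_id -> (a, b) for x = a*t + b
    domains = {}      # expr_id -> {"sections": [[start, end]], "points": [t]}
    pending = []      # undetermined points
    final_id = None   # ID of the final approximate expression

    for t, x in stream:                               # S101
        if pending:                                   # S102: two data items available
            t0, x0 = pending.pop()
            a = (x - x0) / (t - t0)                   # S103: line through two points
            b = x0 - a * t0
            expr_id = f"f{len(expressions)}"
            expressions[expr_id] = (a, b)             # S104
            domains[expr_id] = {"sections": [[t0, t]], "points": []}
            final_id = expr_id                        # assumption, see lead-in
            continue

        passing = [i for i, (a, b) in expressions.items()
                   if abs(x - (a * t + b)) < eps]     # S105-S107
        if not passing:
            pending.append((t, x))                    # S110
            continue

        extend = final_id in passing and ends_in_section(domains[final_id])
        chosen = final_id if extend else passing[0]   # S108 (assumed tie-break)
        if extend:                                    # S109
            domains[chosen]["sections"][-1][1] = t
        else:
            domains[chosen]["points"].append(t)
        final_id = chosen

    return expressions, domains, pending


# Data follows x = 2t + 1, jumps to 100 for two samples, then returns.
stream = [(0, 1), (1, 3), (2, 5), (3, 7), (4, 100), (5, 100), (6, 13), (7, 15)]
expressions, domains, pending = summarize(stream, eps=0.5)
```

In this run the data after the jump is attached to the original expression as points rather than stored under a second copy of the same expression, which is the behavior the multi-section effective domain is intended to enable.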
The example shown in FIG. 16 illustrates the case where the accuracy constraint input unit 16 stores the criterion that the absolute value of the difference between the approximate value and the actual data value is less than the threshold ε. However, the criterion stored in the accuracy constraint input unit 16 is not limited to this one.
Next, the operation of step S105 will be described in more detail. FIG. 17 is a flowchart illustrating an example of the processing progress of step S105. In step S105, the new data generation time substitution unit 12 determines whether there is an approximate expression that has not yet been read from the approximate expression storage unit 15 (step S201). If there is an approximate expression that has not been read, the new data generation time substitution unit 12 reads the approximate expression and the approximate expression ID stored in the approximate expression storage unit 15 (step S202). Here, the approximate expression read by the new data generation time substitution unit 12 is x = F(t), and the approximate expression ID is "f". Next, the new data generation time substitution unit 12 substitutes the generation time t_i of the data input to the data input unit 10 into the read approximate expression x = F(t) and calculates the approximate value F(t_i). The new data generation time substitution unit 12 then outputs the pair of the approximate expression ID "f" and the approximate value F(t_i) to the graph evaluation unit 17 (step S203).
After step S203, the new data generation time substitution unit 12 repeats the processing after step S201. If there is no approximate expression that has not yet been read from the approximate expression storage unit 15 (No in step S201), the new data generation time substitution unit 12 reads the final data information from the final time storage unit 11 (step S204). The approximate expression ID of the final approximate expression included in the final data information is output to the graph evaluation unit 17 (step S205). At this time, the new data generation time substitution unit 12 also outputs to the graph evaluation unit 17 the data to be processed input to the data input unit 10.
Note that the new data generation time substitution unit 12 may execute steps S204 and S205 before the loop processing of steps S201 to S203.
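Steps S201 to S205 amount to a loop over the stored approximate expressions followed by a lookup of the final data information. A minimal Python sketch, assuming linear expressions stored as coefficient pairs, is shown below; the function name, the dictionary layout, and the key `"expr_id"` are hypothetical.

```python
def substitute_occurrence_time(expressions, final_data_info, t_i):
    """Return ([(expr_id, F(t_i)), ...], ID of the final approximate expression)."""
    pairs = []
    for expr_id, (a, b) in expressions.items():  # S201/S202: read each expression
        pairs.append((expr_id, a * t_i + b))     # S203: approximate value F(t_i)
    return pairs, final_data_info["expr_id"]     # S204/S205: final expression ID


stored = {"f0": (2.0, 1.0), "f1": (0.0, 100.0)}
final_info = {"expr_id": "f1", "time": 5}
pairs, final_expr_id = substitute_occurrence_time(stored, final_info, 6)
```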
The data summarization system according to the first embodiment stores "data that is generated sequentially with a certain tendency and may change irregularly", such as the CPU data rate and the number of web page accesses, as an approximate expression with the occurrence time as a variable together with an effective domain over which approximation by that approximate expression can be regarded as appropriate. The effective domain of one approximate expression is represented as a set of sections, each indicating a time interval, and points, each indicating a time point. For new data, the data summarization system according to the first embodiment identifies an approximate expression that satisfies the criterion stored in the accuracy constraint input unit 16 and approximates the new data with that approximate expression. Therefore, the data summarization system of the first embodiment can summarize (compress) data with high accuracy.
In addition, since the effective domain of one approximate expression is allowed to include a plurality of "sections" and "points", the data summarization system of the first embodiment can keep down the storage capacity of the summarized data and realize efficient data summarization. For example, assume that after a state in which data that can be approximated by a certain approximate expression is generated continuously (first state), the tendency of the data values changes temporarily (second state), and then a state in which data that can be approximated by the original approximate expression is generated again (third state) occurs. In this case there are three states, but because the effective domain may include a plurality of "sections" and "points", the data summarization system of the first embodiment can represent the generated data in the first state and the third state by the same approximate expression, and the storage capacity can be reduced accordingly. If the effective domain could include only one section or point, the data summarization system in the above example would have to store an approximate expression and effective domain for the first state and an approximate expression and effective domain for the third state separately; the same approximate expression would be stored redundantly, increasing the storage capacity. The data summarization system of the first embodiment can prevent such an increase in storage capacity.
A specific example is shown for a case where data is sequentially generated. Assume that such data is compressed by applying the technique described in Patent Document 1. In that technique, a certain data item is set as a reference point (pivot), and when the difference between the pivot and the data value of generated data exceeds the compression accuracy, the data is not approximated. In this case, as shown in FIG. 18, the generated data group is represented by five approximate expressions, x = h1(t), x = h2(t), x = h3(t), x = h4(t), and x = h5(t), and by the points P₁₉₁, P₁₉₂, and P₁₉₃ that cannot be approximated by them. One effective time zone is determined for each approximate expression. Each approximate expression is a linear function and is represented by a first-order coefficient and a constant term, that is, by two numerical values. Further, since one section corresponds to each approximate expression, the section is represented by two numerical values, its start point and end point. Therefore, a storage capacity of four numerical values is required to store one approximate expression and its section. In the example shown in FIG. 18, since there are five approximate expressions, the storage capacity for storing the approximate expressions and their sections is 4 × 5 = 20.
Here, for convenience, storage capacity is expressed as a count of numerical values. In the example shown in FIG. 18, the points P₁₉₁, P₁₉₂, and P₁₉₃ must also be stored. The data summarization system must store the occurrence time and the data value for each point. For example, for the point P₁₉₁, it must store the occurrence time t₁ and the data value X₁₉₁. The data summarization system therefore needs a storage capacity of two numerical values per point. Accordingly, in the case illustrated in FIG. 18, the required storage capacity is 4 × 5 + 2 × 3 = 26.
On the other hand, the data summarization system of the first embodiment approximates the same generated data with four approximate expressions. In this case, the capacity required for storing the approximate expressions themselves is 2 × 4 = 8. As for the storage capacity of the effective domains, the data summarization system of the first embodiment only needs to store one section and two points for x = f1(t). A section stored as part of an effective domain requires a storage capacity of two numerical values, and a point requires one numerical value. Therefore, the effective domain of x = f1(t) requires a storage capacity of 2 × 1 + 2 = 4. The effective domain of the approximate expression x = f0(t) consists of three sections, that of x = f2(t) consists of two sections and one point, and that of x = f3(t) consists of one section. Therefore, the capacity required for storing the effective domains is (2 × 3) + (2 × 1 + 2) + (2 × 2 + 1) + (2 × 1) = 17. The data summarization system of the first embodiment can thus store the summary result with a storage capacity of 8 + 17 = 25, which is smaller than the 26 required in the case of FIG. 18.
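The two totals compared above can be reproduced with a few lines of Python. The helper name `stored_numbers` and its argument layout are hypothetical; the counting rule (two values per linear expression, two per section, one per domain point, two per point that is not approximated) is the one stated in the text.

```python
def stored_numbers(domains, unapproximated_points=0):
    """Count stored numerical values.

    domains: one (n_sections, n_points) pair per approximate expression.
    """
    expression_cost = 2 * len(domains)                 # coefficient + constant term
    domain_cost = sum(2 * s + p for s, p in domains)   # 2 per section, 1 per point
    raw_cost = 2 * unapproximated_points               # time + value per raw point
    return expression_cost + domain_cost + raw_cost


# FIG. 18 case: five expressions with one section each, plus three raw points.
prior_art_total = stored_numbers([(1, 0)] * 5, unapproximated_points=3)

# First embodiment: f0 (3 sections), f1 (1 section, 2 points),
# f2 (2 sections, 1 point), f3 (1 section).
first_embodiment_total = stored_numbers([(3, 0), (1, 2), (2, 1), (1, 0)])
```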
The present invention aims at summarizing "data that is generated sequentially with a certain tendency and whose data tendency may change irregularly", but irregular data may temporarily be generated in succession. FIG. 19 is an explanatory diagram illustrating an example of such a situation. In the period "b" shown in FIG. 19, data represented by the approximate expression x = f2(t) continues to be generated. Similarly, data represented by x = f1(t) continues to be generated in the period "c", and data represented by x = f3(t) continues to be generated in the period "d". In the period "a", however, data with differing tendencies (in other words, data approximated by different approximate expressions) is generated in succession. Because irregular data is generated in succession in the period "a", each of these data items may be accumulated as an undetermined point, and an approximate expression different from the ideal approximate expression is then obtained from those undetermined points. Thus, for data that is generated continuously with a certain tendency and whose tendency changes irregularly, an approximate expression different from the ideal one may be obtained temporarily. In the long run, however, data with a certain tendency usually occurs continuously, so an ideal approximate expression can be obtained from that data. For example, even if an undesirable approximate expression is obtained in the period "a" in FIG. 19, ideal approximate expressions such as x = f1(t), x = f2(t), and x = f3(t) are obtained in the subsequent periods "b", "c", and "d". As a result, efficient data summarization can be realized as a whole.
In addition, assume that the approximate expression is a linear expression in time t and that, when one undetermined point and one new data item are available, the data summarization system of the first embodiment generates the straight line connecting the two points as an approximate expression. In this case, the data summarization system of the first embodiment never needs more storage capacity than would be needed to store the data itself as it is. Storing one data item as it is requires two numerical values, the data value and the occurrence time, so storing two data items requires a storage capacity of four numerical values. When the straight line connecting the two points is generated as the approximate expression, the approximate expression is a linear expression, so the data summarization system of the first embodiment needs to store a first-order coefficient and a constant term. For the effective domain, it only needs to store the generation times of the two data items as the start point and the end point. Therefore, the storage capacity required for storing the approximate expression and the effective domain is four numerical values. Thus, even in this case, where the compression is not efficient, the data summarization system of the first embodiment can store the approximate expression and the effective domain with the same capacity as storing the two data items as they are.
Embodiment 2.
FIG. 20 is a block diagram illustrating an example of a data summarization system according to the second embodiment of this invention. Constituent elements similar to those in the first embodiment are denoted by the same reference numerals as those in FIG. 2, and detailed description thereof is omitted. The data summarization system of the second embodiment includes a data input unit 10, a final time storage unit 11, a new data generation time substitution unit 12, an undetermined point storage unit 13, a new approximate expression generation unit 14, an approximate expression storage unit 15, an accuracy constraint input unit 16, a graph evaluation unit 17a, a graph update unit 18a, a confirmed graph storage unit 19, a default expression input unit 20, and a default expression storage unit 21.
The default expression storage unit 21 is a storage device that stores one approximate expression that is known in advance as an approximate expression for obtaining an approximate value of a data value. For example, FIG. 10 of the first embodiment shows a case where four approximate expressions, x = f0(t) to x = f3(t), are generated; an approximate expression of this kind may already be known before the start of processing. The default expression storage unit 21 stores an approximate expression that is known before the start of processing as a default expression.
The data summarization system according to the second embodiment reduces the storage capacity needed for effective domains by not defining an effective domain for the default expression, and thereby realizes data summarization more efficiently than the first embodiment.
FIG. 21 is an explanatory diagram illustrating an example of a default expression stored in the default expression storage unit 21. In the example shown in FIG. 21, the default expression storage unit 21 stores the first-order coefficient and the constant term of the default expression. Here, as in the cases shown in FIGS. 7 and 8, a linear expression in the variable t is represented by its first-order coefficient and constant term. That is, FIG. 21 illustrates a case where a default expression that is a linear expression in the variable t is stored. Specifically, it illustrates the case where the default expression storage unit 21 stores the default expression x = a₀ × t + b₀.
FIG. 21 illustrates, as in the cases shown in FIGS. 7 and 8, the case where the default expression is represented by a combination of a first-order coefficient and a constant term, but the default expression may be represented in other forms. FIG. 22 shows an example in which the default expression is represented by the constant term alone. The example shown in FIG. 22 shows the default expression x = x₀ (a constant). This is the same as a default expression whose first-order coefficient is 0, and means that the approximate value is x₀ regardless of the variable (occurrence time). The default expression may also be represented by various other functions, such as a polynomial function of degree two or higher, an exponential function, or a trigonometric function.
The default expression input unit 20 receives a default expression input by the user and stores it in the default expression storage unit 21.
The graph evaluation unit 17a in the second embodiment compares each approximate value, calculated by the new data generation time substitution unit 12 by substituting the generation time of the new data into each approximate expression stored in the approximate expression storage unit 15, with the actual data value of the new data. This point is the same as in the graph evaluation unit 17 of the first embodiment. The graph evaluation unit 17a of the second embodiment further calculates an approximate value by substituting the generation time of the new data (the newly generated data received from the new data generation time substitution unit 12) into the default expression, and also compares that approximate value with the actual data value of the new data. The graph evaluation unit 17a then identifies the approximate expressions that satisfy the criterion stored in the accuracy constraint input unit 16. When a plurality of approximate expressions satisfy the criterion, the graph evaluation unit 17a identifies, among them, the approximate expression that minimizes the increase in storage capacity when the effective domain is updated, and determines that approximate expression as the one that approximates the data value of the new data. When only one approximate expression satisfies the criterion, the graph evaluation unit 17a determines that approximate expression as the one that approximates the data value of the new data.
Here, when a plurality of approximate expressions satisfy the criterion and the default expression is among them, the graph evaluation unit 17a selects the default expression. This is because no effective domain is defined for the default expression, so the storage capacity for representing its effective domain is zero. When a plurality of approximate expressions satisfy the criterion and the default expression is not among them, the approximate expression is selected in the same way as by the graph evaluation unit 17 in the first embodiment, and a description thereof is omitted.
If the approximate expression determined by the graph evaluation unit 17a is an approximate expression other than the default expression, the graph evaluation unit 17a includes the determined approximate expression and its effective domain, and new data (from the new data generation time substitution unit 12). The received generated data) is output to the graph updating unit 18a. At this time, the graph evaluation unit 17a also outputs information indicating whether or not the determined approximate expression is the final approximate expression to the graph update unit 18a. Further, the graph evaluation unit 17a may output, for example, the approximate expression ID to the graph update unit 18a as the determined approximate expression. This operation is the same as the operation of the graph evaluation unit 17 in the first embodiment.
On the other hand, if the determined approximate expression is a default expression, the graph evaluation unit 17a outputs information notifying that the default expression has been selected and the input new generated data to the graph update unit 18a. As information for notifying that the default formula has been selected, a default formula ID dedicated to the default formula representing the default formula may be used.
Also, there may be no approximate expression that satisfies the criteria stored in the accuracy constraint input unit 16. In that case, the graph evaluation unit 17a may output new data to the graph update unit 18a without selecting an approximate expression. This operation is the same as the operation of the graph evaluation unit 17 in the first embodiment.
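The selection rule described above for the graph evaluation unit 17a can be sketched as follows. This is only an illustrative sketch, not the implementation of the embodiment: the names `Candidate`, `select_expression`, and `DEFAULT_ID`, and the precomputed `domain_growth` field (the assumed increase in stored numerical values when the generation time is added to the effective domain), are hypothetical. Tie-breaking among equal increases follows the first embodiment in the text and is simply "first found" here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

DEFAULT_ID = "default"  # hypothetical ID reserved for the default formula


@dataclass
class Candidate:
    expr_id: str
    f: Callable[[float], float]  # approximate expression x = f(t)
    domain_growth: int           # assumed storage increase if t joins the effective domain


def select_expression(t: float, x: float,
                      candidates: List[Candidate], eps: float) -> Optional[str]:
    """Return the ID of the selected expression, or None if no candidate
    satisfies the criterion |x - f(t)| < eps."""
    ok = [c for c in candidates if abs(x - c.f(t)) < eps]
    if not ok:
        return None  # the new data becomes an uncertain point
    for c in ok:
        if c.expr_id == DEFAULT_ID:
            # No effective domain is stored for the default formula,
            # so selecting it never increases the storage capacity.
            return c.expr_id
    # Otherwise pick the candidate with the smallest storage increase.
    return min(ok, key=lambda c: c.domain_growth).expr_id
```

With a default formula x = 0 and two other expressions, a data value close to 0 would be assigned to the default formula even if another expression also satisfies the criterion.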
When the graph evaluation unit 17a has determined an approximate expression other than the default formula, and the graph update unit 18a in the second embodiment receives from the graph evaluation unit 17a the approximate expression and its effective domain, the new generated data, and the information indicating whether or not the determined approximate expression is the final approximate expression, the graph update unit 18a updates the effective domain of the approximate expression. The graph update unit 18a may update the effective domain by adding the generation time of the new generated data to the effective domain of the received approximate expression. The graph update unit 18a stores the updated effective domain of the approximate expression determined by the graph evaluation unit 17a in the confirmed graph storage unit 19. This operation is the same as the operation of the graph update unit 18 in the first embodiment.
When the graph evaluation unit 17a selects the default formula as the approximate expression that approximates the data value of the new data, and outputs the information notifying that the default formula has been selected and the input new generated data to the graph update unit 18a, the graph update unit 18a operates as follows. That is, the graph update unit 18a stores the generation time of the new data and the default formula ID, which indicates the default formula, in the final time storage unit 11 as final data information.
When the graph evaluation unit 17a outputs only the new generated data to the graph update unit 18a without selecting an approximate expression, the graph update unit 18a stores the generated data as an uncertain point in the uncertain point storage unit 13. This operation is the same as the operation of the graph update unit 18 in the first embodiment.
Further, when the new approximate expression generation unit 14 newly generates an approximate expression from the uncertain points and outputs the approximate expression and each piece of data used to generate it (that is, each uncertain point stored in the uncertain point storage unit 13 and the newly generated data) to the graph update unit 18a, the operation of the graph update unit 18a is the same as the operation of the graph update unit 18 in the first embodiment.
The graph evaluation unit 17a, the graph update unit 18a, and the default formula input unit 20 in the second embodiment are realized by, for example, a CPU of a computer that operates according to a data summarization program. In this case, a program storage device (not shown) of the computer stores the data summarization program, and the CPU reads the program and, according to the program, may operate as the data input unit 10, the new data generation time substitution unit 12, the new approximate expression generation unit 14, the accuracy constraint input unit 16, the graph evaluation unit 17a, the graph update unit 18a, and the default formula input unit 20. In addition, each of these units may be realized by separate hardware.
An example of processing progress of the second embodiment will be described with reference to FIG. However, description of the same processing as in the first embodiment will be omitted. The operation up to step S105 is the same as in the first embodiment. After step S105, the graph evaluation unit 17a compares the approximate value of the data calculated for each approximate expression with the actual data value of the data input to the data input unit 10 (step S106). In the second embodiment, however, the graph evaluation unit 17a also substitutes the generation time of the new data into the default formula stored in the default formula storage unit 21 and compares the resulting approximate value with the actual data value. For example, when the accuracy constraint input unit 16 stores the criterion |x − f(t)| < ε, the graph evaluation unit 17a may calculate the absolute value of the difference between each approximate value and the actual data value.
Next, the graph evaluation unit 17a determines whether there is an approximate expression that satisfies the criterion that the absolute value of the difference calculated in step S106 is less than the threshold ε (step S107). The operation when there is no approximate expression that satisfies the criterion (No in step S107) is the same as in the first embodiment.
When there are a plurality of approximate expressions that satisfy the criterion (Yes in step S107), the graph evaluation unit 17a selects, from the approximate expressions that satisfy the criterion, the approximate expression that minimizes the increase in storage capacity when the effective domain is updated (step S108). In step S108, if the default formula is among the approximate expressions satisfying the criterion, the graph evaluation unit 17a selects the default formula and outputs the information notifying that the default formula has been selected and the input newly generated data to the graph update unit 18a. The operation when the default formula is not among the approximate expressions satisfying the criterion is the same as in the first embodiment.
When the graph update unit 18a receives the information notifying that the default formula has been selected and the generated data from the graph evaluation unit 17a, it stores final data information including the default formula ID and the generation time of the data in the final time storage unit 11 (step S109). The operation in step S109 in other cases is the same as in the first embodiment, and a description thereof is omitted.
The data summarization system of the second embodiment does not define an effective domain for the default formula. Generated data whose data value is determined to be approximated by an approximate expression other than the default formula is associated with one of those approximate expressions. Therefore, as in the first embodiment, the data summarization system of the second embodiment can efficiently summarize, with a small storage capacity, generated data approximated by an approximate expression other than the default formula. Further, the data summarization system of the second embodiment does not store an effective domain for data approximated by the default formula, so the summarization is even more efficient. The data summarization system according to the second embodiment can therefore summarize data with a smaller storage capacity than the first embodiment.
For example, it is assumed that x = f0 (t) is the default formula among the four approximate expressions shown in FIG. In this case, the capacity necessary for storing the approximate expressions themselves is 8 (= 2 × 4) numerical values. The data summarization system according to the second embodiment does not need to store the effective domain of x = f0 (t), so the capacity necessary for storing the effective domains of the remaining expressions is (2 × 1 + 2) + (2 × 2 + 1) + (2 × 1) = 11 numerical values. Therefore, the capacity necessary for storing the summarized data is 19 (= 8 + 11) numerical values, and the data summarization system of the second embodiment can further reduce the storage capacity compared to the first embodiment.
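The arithmetic above can be reproduced with a few lines. The sketch below assumes one reading of the terms in the sum, namely that each interval of an effective domain costs two numerical values and each isolated time point costs one; the `(intervals, points)` pairs are inferred from the sum itself, not taken from the figure, and the helper name `domain_cost` is hypothetical.

```python
def domain_cost(num_intervals: int, num_points: int) -> int:
    """Numerical values needed for one effective domain, assuming
    two values per interval and one value per isolated time point."""
    return 2 * num_intervals + num_points


# Four approximate expressions, two numerical values each.
expression_cost = 2 * 4

# (intervals, isolated points) for the three expressions other than the
# default formula, read off the terms (2*1+2), (2*2+1), (2*1).
domains = [(1, 2), (2, 1), (1, 0)]
domain_total = sum(domain_cost(i, p) for i, p in domains)

total = expression_cost + domain_total
print(expression_cost, domain_total, total)  # 8 11 19
```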
Embodiment 3.
The data summarization system according to the third embodiment first stores data received from a data generation source (not shown) as it is, without summarizing it, and summarizes the stored data when the storage resources (memory resources) for storing the data become insufficient.
FIG. 23 is a block diagram illustrating an example of a data summarization system according to the third embodiment of this invention. The data summarization system according to the third embodiment includes a data input unit 10, a final time storage unit 11, a new data generation time substitution unit 12, a new approximate expression generation unit 14, an uncertain point storage unit 13, an approximate expression storage unit 15, an accuracy constraint input unit 16, a graph evaluation unit 17, a graph update unit 18, a confirmed graph storage unit 19, an unsummarized data storage unit 30, an available storage area monitoring unit 31, and a summary control unit 32.
Components similar to those in the first embodiment are denoted by the same reference numerals as those in FIG. 2, and detailed description thereof is omitted. However, when data is input to the data input unit 10 from a data generation source (not shown), the data input unit 10 outputs the input data to the summary control unit 32. Further, the new data generation time substitution unit 12 and the new approximate expression generation unit 14 receive data from the summary control unit 32.
The unsummarized data storage unit 30 is a storage device that stores data generated by a data generation source (not shown) in an unsummarized state. Since each piece of data includes a data value and a generation time, the unsummarized data storage unit 30 stores a group of data each consisting of a data value and a generation time. FIG. 4 is an explanatory diagram illustrating a plurality of uncertain points stored in the uncertain point storage unit 13; the unsummarized data storage unit 30 likewise stores a plurality of data, each including a data value and a generation time, in the same form. The process of storing data in the unsummarized data storage unit 30 is performed by the summary control unit 32.
The unsummarized data storage unit 30, the final time storage unit 11, the uncertain point storage unit 13, the approximate expression storage unit 15, and the confirmed graph storage unit 19 may be realized by separate storage devices. Alternatively, they may be realized by the same storage device. Further, some combinations of the unsummarized data storage unit 30, the final time storage unit 11, the uncertain point storage unit 13, the approximate expression storage unit 15, and the confirmed graph storage unit 19 may be realized by the same storage device. For example, the unsummarized data storage unit 30 and the final time storage unit 11 may be realized by the same storage device, and the approximate expression storage unit 15 and the confirmed graph storage unit 19 may be realized by a storage device different from that storage device. The uncertain point storage unit 13 may be realized by the storage device.
The available storage area monitoring unit 31 monitors the amount of available resources in the storage device that stores at least unsummarized data, and outputs the monitoring result to the summary control unit 32. The unsummarized data here means data stored in the unsummarized data storage unit 30 from the data generation source via the data input unit 10 and the summary control unit 32. That is, it means data that has not yet been summarized as a target of the summary process. Therefore, the available storage area monitoring unit 31 only needs to monitor the amount of resources that can be used in the unsummarized data storage unit 30. If the unsummarized data storage unit 30 is realized by the same storage device as other storage units, the available storage area monitoring unit 31 may monitor the amount of resources that can be used in the storage device.
One example of how the available storage area monitoring unit 31 monitors the amount of available resources is to monitor the remaining available memory capacity. However, this is only an example, and the available storage area monitoring unit 31 may monitor the resource amount in another way. For example, when the unsummarized data storage unit 30 is a disk storage device, the unused rate of the disk may be monitored. Although the case where the available storage area monitoring unit 31 monitors the amount of available resources is shown here, the available storage area monitoring unit 31 may instead monitor the amount of resources already used (for example, the disk usage rate). In the following description, the case where the available storage area monitoring unit 31 monitors the amount of available resources in the unsummarized data storage unit 30 will be described as an example.
Further, the available storage area monitoring unit 31 may monitor the unsummarized data storage unit 30 at regular intervals. Alternatively, for example, the available storage area monitoring unit 31 may monitor the unsummarized data storage unit 30 when an instruction to perform monitoring is input at an arbitrary timing from a user or the like.
Depending on the monitoring result of the available storage area monitoring unit 31, the summary control unit 32 either stores the generated data input to the data input unit 10 in the unsummarized data storage unit 30 in an unsummarized state, or performs summary control for summarizing the data stored in the unsummarized data storage unit 30. If the amount of available resources (for example, the remaining memory capacity or the disk unused rate) in the unsummarized data storage unit 30 is larger than a threshold, the summary control unit 32 stores the generated data input to the data input unit 10 in the unsummarized data storage unit 30 without summarizing it. On the other hand, if the amount of available resources in the unsummarized data storage unit 30 is equal to or less than the threshold, the summary control unit 32 performs summary control for summarizing the data stored in the unsummarized data storage unit 30. Specifically, the summary control unit 32 outputs the data stored in the unsummarized data storage unit 30 to the new approximate expression generation unit 14 and the new data generation time substitution unit 12 to start the data summarization process. When the summary control unit 32 has output data stored in the unsummarized data storage unit 30 to the new approximate expression generation unit 14 and the new data generation time substitution unit 12, it erases that data from the unsummarized data storage unit 30.
In addition, when the available storage area monitoring unit 31 monitors the amount of resources already used, the summary control unit 32 may store the generated data in the unsummarized data storage unit 30 when the amount of used resources is less than a threshold, and may perform the summary control when the amount of used resources is equal to or greater than the threshold.
Also, when performing summary control, the summary control unit 32 may store new generated data in the unsummarized data storage unit 30 and simultaneously perform summary control on the data.
When performing summary control, the summary control unit 32 outputs the data stored in the unsummarized data storage unit 30 to the new approximate expression generation unit 14 and the new data generation time substitution unit 12 one piece at a time. The summary control unit 32 outputs the same data simultaneously to the new approximate expression generation unit 14 and the new data generation time substitution unit 12, for example. The summary control unit 32 only needs to output the data one by one in an order such that data output later has a later generation time than data output earlier. However, when erasing output data from the unsummarized data storage unit 30, the summary control unit 32 does not need to erase the data in order of generation time, as long as data that has already been output is never output again.
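The threshold-based behavior of the summary control unit 32 might be sketched as below. This is a minimal sketch under stated assumptions: the class name `SummaryController`, the use of a simple item count as the "amount of available resources", and the `summarize` callback (standing in for the new data generation time substitution unit 12 and the new approximate expression generation unit 14) are all hypothetical and are not taken from the embodiment.

```python
from typing import Callable, List, Tuple

Datum = Tuple[int, float]  # (generation time, data value)


class SummaryController:
    def __init__(self, capacity: int, threshold: int,
                 summarize: Callable[[Datum], None]) -> None:
        self.capacity = capacity    # assumed: number of raw data items that fit
        self.threshold = threshold  # summarize when free slots <= threshold
        self.summarize = summarize  # stand-in for units 12 and 14
        self.unsummarized: List[Datum] = []

    def available(self) -> int:
        """Free slots left in the unsummarized data store."""
        return self.capacity - len(self.unsummarized)

    def on_input(self, datum: Datum) -> None:
        if self.available() > self.threshold:
            # Plenty of room: keep the data as it is, unsummarized.
            self.unsummarized.append(datum)
            return
        # Resources are short: pass the stored data and the new datum
        # to summarization in generation-time order, then erase them.
        pending = sorted(self.unsummarized + [datum])
        self.unsummarized.clear()
        for d in pending:
            self.summarize(d)
```

In this sketch the new datum is summarized together with the stored data; a variant that monitors used resources would simply invert the comparison.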
In addition, the summary control unit 32 does not necessarily have to output all of the data stored in the unsummarized data storage unit 30 to the new approximate expression generation unit 14 and the new data generation time substitution unit 12 as summary targets. FIG. 24 is a schematic diagram in which the data stored in the unsummarized data storage unit 30 is arranged in order of generation time. It is assumed that the data 51 is the data that was first input to the data input unit 10 and stored in the unsummarized data storage unit 30, and that the data 52 and subsequent data were then stored in the unsummarized data storage unit 30 in sequence. The summary control unit 32 may output the data to the new approximate expression generation unit 14 and the new data generation time substitution unit 12 in order of generation time starting from the data 51. Alternatively, the summary control unit 32 may output the data in order of generation time starting from the middle of the generated data (for example, from the data 55). In this case, the data 51 to 54 are not subject to summarization and are kept in the unsummarized data storage unit 30 without being erased.
As long as the condition is satisfied that data output later to the new approximate expression generation unit 14 and the new data generation time substitution unit 12 has a later generation time than data output earlier, the summary control unit 32 may skip some data when outputting. For example, after outputting the data 51 to 54, the summary control unit 32 may skip the data 55 and output the data 56 to 59. In this case as well, the skipped data is not included in the summary targets and is kept in the unsummarized data storage unit 30.
Further, it is assumed that data summarization is started and final data information is stored in the final time storage unit 11. In this case, the summary control unit 32 may output data satisfying the condition that the generation time is after the final time indicated by the final data information to the new approximate expression generation unit 14 and the new data generation time substitution unit 12.
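The ordering conditions described above (increasing generation time, optional start in the middle, optional skipping, and only data after the final time) can be illustrated with a small generator. This is a sketch only; the function name `summary_targets` and its parameters `final_time`, `start_time`, and `skip` are hypothetical, and data that is not yielded is assumed to remain in the unsummarized data storage unit 30.

```python
from typing import FrozenSet, Iterable, Iterator, Optional, Tuple

Datum = Tuple[int, float]  # (generation time, data value)


def summary_targets(stored: Iterable[Datum],
                    final_time: Optional[int] = None,
                    start_time: Optional[int] = None,
                    skip: FrozenSet[int] = frozenset()) -> Iterator[Datum]:
    """Yield stored data in strictly increasing generation-time order,
    omitting data that is not a summary target."""
    for t, x in sorted(stored):
        if final_time is not None and t <= final_time:
            continue  # not after the final time in the final data information
        if start_time is not None and t < start_time:
            continue  # summarization starts from the middle of the data
        if t in skip:
            continue  # skipped data stays unsummarized
        yield (t, x)
```

For example, with data generated at times 1 to 5, a final time of 1, and time 3 skipped, the data at times 2, 4, and 5 would be output in that order.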
When the available resources become even scarcer, the summary control unit 32 outputs newly input generated data as it is to the new data generation time substitution unit 12 and the new approximate expression generation unit 14 so that the new data is summarized directly. A threshold for deciding whether to summarize new data in this way, without storing it in the unsummarized data storage unit 30, may be set in advance separately from the threshold described above.
When the summary control unit 32 outputs data to the new approximate expression generation unit 14 and the new data generation time substitution unit 12, the operations of the new approximate expression generation unit 14, the new data generation time substitution unit 12, the graph evaluation unit 17, and the graph update unit 18 are the same as in the first embodiment. That is, these units perform the operations from step S102 onward shown in FIG. 16 for each piece of data output from the summary control unit 32. When the process proceeds from step S102 to step S103, the new data generation time substitution unit 12 does not perform processing.
The available storage area monitoring unit 31 and the summary control unit 32 in the third embodiment are realized by, for example, a CPU of a computer that operates according to a data summarization program. In this case, the program storage device (not shown) of the computer stores the data summarization program, and the CPU reads the program, and according to the program, the data input unit 10, the available storage area monitoring unit 31, the summarization control unit 32, the new The data generation time substitution unit 12, the new approximate expression generation unit 14, the accuracy constraint input unit 16, the graph evaluation unit 17, and the graph update unit 18 may be operated. In addition, each of these units may be realized by separate hardware.
The data summarization system of the third embodiment stores data in the unsummarized data storage unit 30 without summarizing it as long as ample resources are available for storing the data. In this case, because the data is not summarized, the data summarization system according to the third embodiment can hold the data with high accuracy. When the resources available for storing the data run low, the data summarization system according to the third embodiment summarizes the data stored in the unsummarized data storage unit 30, so that, as in the first embodiment, the data can be stored efficiently with a small storage capacity in the form of approximate expressions and their effective domains. Therefore, the data summarization system of the third embodiment can realize efficient data summarization as in the other embodiments, and can hold data with high accuracy while ample resources are available.
Note that the components shown in FIG. 23 may be realized by a plurality of devices instead of by one device. For example, the data input unit 10, the available storage area monitoring unit 31, the summary control unit 32, the unsummarized data storage unit 30, and the final time storage unit 11 may be realized by a first information processing device. The new data generation time substitution unit 12, the new approximate expression generation unit 14, the uncertain point storage unit 13, the accuracy constraint input unit 16, the graph evaluation unit 17, and the graph update unit 18 may be realized by a second information processing device. The approximate expression storage unit 15 and the confirmed graph storage unit 19 may be realized by a database device. The data summarization system may be configured to include the first information processing device, the second information processing device, and the database device.
In addition, the data summarization system of the third embodiment includes a default formula input unit 20 and a default formula storage unit 21 (see FIG. 20), as in the second embodiment, and includes a graph evaluation unit 17 and a graph update unit 18. Instead, a configuration including a graph evaluation unit 17a and a graph update unit 18a (see FIG. 20) may be employed. The data summarization system including such a configuration can perform data summarization more efficiently as in the second embodiment.
Embodiment 4.
In the fourth embodiment, a default formula is used as in the second embodiment. However, in the data summarization system of the fourth embodiment, if the storage capacity required for the effective domain of an approximate expression other than the default formula is larger than the storage capacity that would be required if an effective domain were defined for the default formula, that approximate expression is made the new default formula.
FIG. 25 is a block diagram illustrating an example of a data summarization system according to the fourth embodiment of this invention. The same components as those of the second embodiment are denoted by the same reference numerals as those in FIG. 20, and detailed description thereof is omitted. The data summarization system according to the fourth embodiment includes a data input unit 10, a final time storage unit 11, a new data generation time substitution unit 12, a new approximate expression generation unit 14, an uncertain point storage unit 13, an approximate expression storage unit 15, an accuracy constraint input unit 16, a graph evaluation unit 17a, a graph update unit 18a, a confirmed graph storage unit 19, a default formula input unit 20, a default formula storage unit 21, a default formula valid domain calculation unit 40, a valid domain capacity evaluation unit 41, and a valid domain update unit 42.
As described in the second embodiment, no effective domain is defined for the default formula stored in the default formula storage unit 21. The default formula valid domain calculation unit 40 calculates the effective domain that the default formula would have if one were defined for it, so that the storage capacity required for the effective domain of each approximate expression other than the default formula can be compared with the storage capacity that the effective domain of the default formula would require, and outputs that effective domain to the valid domain capacity evaluation unit 41. The default formula valid domain calculation unit 40 calculates the effective domain of the default formula as follows. Let $U$ denote the time period from the generation time of the first generated data to the generation time of the most recently generated data at the present time. Let $S_0, S_1, \ldots, S_n$ denote the effective domains of the approximate expressions x = f0(t), x = f1(t), ..., x = fn(t), respectively, and let $S_{\text{default}}$ denote the effective domain of the default formula. The default formula valid domain calculation unit 40 obtains the effective domain $S_{\text{default}}$ of the default formula by calculating the following formula (1).
$$S_{\text{default}} = U - (S_0 \cup S_1 \cup \cdots \cup S_n) \tag{1}$$
That is, the default formula valid domain calculation unit 40 calculates the effective domain $S_{\text{default}}$ of the default formula by removing the union of the effective domains of the approximate expressions from the time period from the generation time of the first generated data to the generation time of the most recently generated data at the present time.
FIG. 26 is an explanatory diagram showing an example of deriving the effective domain of the default formula. The horizontal axis shown in FIG. 26 represents the data generation time t, and the vertical axis represents the data value x. Each piece of data is assumed to be generated in the time period from $T_0$ to $T_7$. In FIG. 26, the default formula is x = f0(t), and time is assumed to be an integer greater than or equal to $T_0$. x = f1(t) and x = f2(t) are approximate expressions other than the default formula. The effective domain of the approximate expression x = f1(t) is $I_{11} \cup I_{12} \cup I_{13}$, more specifically $[T_0, T_1] \cup [T_2+1, T_3] \cup [T_5+1, T_6]$. The effective domain of the approximate expression x = f2(t) is $I_{21} \cup I_{22}$, more specifically $[T_3+1, T_4] \cup [T_6+1, T_7]$. Here, the section from time $T_0$ to time $T_1$ is written $[T_0, T_1]$, and the other sections are written in the same manner. FIG. 27 is an explanatory diagram summarizing the effective domains of x = f1(t) and x = f2(t).
In this example, the whole time period is $U = [T_0, T_7]$. The default formula valid domain calculation unit 40 may calculate the effective domain of the default formula as shown in the following formula (2).
$$\begin{aligned} S_{\text{default}} &= [T_0, T_7] - \bigl( ([T_0, T_1] \cup [T_2+1, T_3] \cup [T_5+1, T_6]) \cup ([T_3+1, T_4] \cup [T_6+1, T_7]) \bigr) \\ &= [T_1+1, T_2] \cup [T_4+1, T_5] \end{aligned} \tag{2}$$
$[T_1+1, T_2]$ and $[T_4+1, T_5]$ are the sections $I_{01}$ and $I_{02}$ shown in the figure, respectively.
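The set difference in formula (1) can be computed directly on closed integer intervals. The following is an illustrative sketch, not the method of the embodiment: the function name `subtract_intervals` is hypothetical, intervals are represented as `(start, end)` pairs of integers, and the concrete values 0, 10, ..., 70 substituted for T0 to T7 are chosen only for the example.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[int, int]  # closed integer interval [start, end]


def subtract_intervals(universe: Interval,
                       domains: Sequence[Sequence[Interval]]) -> List[Interval]:
    """Return the parts of `universe` not covered by any interval in `domains`."""
    lo, hi = universe
    covered = sorted(iv for d in domains for iv in d)
    result: List[Interval] = []
    cursor = lo  # smallest time not yet known to be covered
    for a, b in covered:
        if a > cursor:
            # Gap between the cursor and the next covered interval.
            result.append((cursor, min(a - 1, hi)))
        cursor = max(cursor, b + 1)
        if cursor > hi:
            break
    if cursor <= hi:
        result.append((cursor, hi))
    return result


# Hypothetical values T0..T7 = 0, 10, ..., 70 for the example of FIG. 26.
s1 = [(0, 10), (21, 30), (51, 60)]   # effective domain of x = f1(t)
s2 = [(31, 40), (61, 70)]            # effective domain of x = f2(t)
s_default = subtract_intervals((0, 70), [s1, s2])
print(s_default)  # [(11, 20), (41, 50)]
```

With these values the result corresponds to [T1+1, T2] ∪ [T4+1, T5], matching formula (2).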
The valid domain capacity evaluation unit 41 refers to the effective domain of the default formula received from the default formula valid domain calculation unit 40 and to the effective domain of each approximate expression stored in the confirmed graph storage unit 19, and identifies the approximate expression whose effective domain requires the largest storage capacity. The valid domain capacity evaluation unit 41 outputs that approximate expression and the effective domain of the default formula to the valid domain update unit 42. Note that the valid domain capacity evaluation unit 41 may output an approximate expression ID or the default formula ID to the valid domain update unit 42 instead of outputting the approximate expression itself.
In the example shown in FIG. 27, the effective domain of the approximate expression with the approximate expression ID "f1" is $[T_0, T_1] \cup [T_2+1, T_3] \cup [T_5+1, T_6]$, so a storage capacity of six numerical values is required. The effective domain of the approximate expression with the approximate expression ID "f2" is $[T_3+1, T_4] \cup [T_6+1, T_7]$, so a storage capacity of four numerical values is required. Furthermore, the effective domain calculated for the default formula is $[T_1+1, T_2] \cup [T_4+1, T_5]$, so a storage capacity of four numerical values is required. Therefore, the valid domain capacity evaluation unit 41 outputs the approximate expression x = f1(t), whose effective domain requires the largest storage capacity, and the effective domain $[T_1+1, T_2] \cup [T_4+1, T_5]$ calculated for the default formula to the valid domain update unit 42.
When the effective domain calculated for the default formula and the approximate expression whose effective domain requires the largest storage capacity are input to the valid domain update unit 42, the valid domain update unit 42 updates the default formula according to the input. However, if the approximate expression whose effective domain requires the largest storage capacity is the current default formula, the valid domain update unit 42 ends the process without updating.
The default formula valid domain calculation unit 40, the valid domain capacity evaluation unit 41, and the valid domain update unit 42 in the fourth embodiment are realized by, for example, a CPU of a computer that operates according to a data summarization program. In this case, a program storage device (not shown) of the computer stores the data summarization program, and the CPU reads the program and, according to the program, may operate as the data input unit 10, the new data generation time substitution unit 12, the new approximate expression generation unit 14, the accuracy constraint input unit 16, the graph evaluation unit 17a, the graph update unit 18a, the default formula input unit 20, the default formula valid domain calculation unit 40, the valid domain capacity evaluation unit 41, and the valid domain update unit 42. In addition, each of these units may be realized by separate hardware.
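The capacity comparison in this example can be sketched as follows. The sketch assumes that an interval costs two numerical values and that a single time point (represented here as an interval whose start equals its end) costs one; the names `domain_cost` and `largest_domain`, the dictionary layout, and the concrete values 0, 10, ..., 70 standing in for T0 to T7 are hypothetical.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # closed integer interval [start, end]


def domain_cost(intervals: List[Interval]) -> int:
    """Numerical values needed to store an effective domain:
    two per interval, one per isolated time point (assumed encoding)."""
    return sum(1 if a == b else 2 for a, b in intervals)


def largest_domain(domains: Dict[str, List[Interval]]) -> str:
    """Return the ID of the expression whose effective domain costs the most."""
    return max(domains, key=lambda expr_id: domain_cost(domains[expr_id]))


# Effective domains from FIG. 27, with T0..T7 = 0, 10, ..., 70 (example values).
domains = {
    "f0": [(11, 20), (41, 50)],           # computed for the default formula
    "f1": [(0, 10), (21, 30), (51, 60)],
    "f2": [(31, 40), (61, 70)],
}
costs = {k: domain_cost(v) for k, v in domains.items()}
print(costs)                    # {'f0': 4, 'f1': 6, 'f2': 4}
print(largest_domain(domains))  # f1
```

Because "f1" is not the current default formula, it would become the candidate for the new default formula in the update process that follows.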
Next, the operation of the fourth embodiment will be described.
FIG. 28 is a flowchart showing an example of the progress of the default formula update process by the default formula valid domain calculation unit 40, the valid domain capacity evaluation unit 41, and the valid domain update unit 42 in the fourth embodiment.
Note that the data summarization system performs this default formula update process by performing the data input unit 10, the final time storage unit 11, the new data generation time substitution unit 12, the new approximate formula generation unit 14, and the uncertain point storage unit 13. Data summarization processing (same data summarization processing as in the second embodiment) by the approximate expression storage unit 15, the graph evaluation unit 17a, the graph update unit 18a, the confirmed graph storage unit 19, and the default formula storage unit 21 ) And asynchronous. For example, the data summarization system may execute a predefined update process shown in FIG. 28 at regular time intervals. Alternatively, each time a predetermined number of generated data is input to the data input unit 10, the data summarization system may execute a default update process. Alternatively, when a new approximate expression is added to the approximate expression storage unit 15, the data summarization system may execute a default expression update process. These are illustrations of the execution timing of the default update process shown in FIG. 28, and the execution timing of the default update process is not limited to the above examples.
When the default formula update process is started, the default formula valid domain calculation unit 40 first reads the valid domains of all approximate expressions from the confirmed graph storage unit 19 (step S401). The default formula valid domain calculation unit 40 then evaluates the above-described formula (1) to calculate the valid domain S_default of the default formula (step S402), and outputs this valid domain to the valid domain capacity evaluation unit 41.
Next, the valid domain capacity evaluation unit 41 refers to the valid domain S_default of the default formula and to the valid domain of each approximate expression, and identifies the approximate expression whose valid domain requires the largest storage capacity (step S403). The valid domain capacity evaluation unit 41 then outputs the identified approximate expression and the valid domain of the default formula to the valid domain update unit 42.
Next, the valid domain update unit 42 determines whether or not the approximate expression identified in step S403, that is, the one whose valid domain requires the largest storage capacity, is the default formula (step S404).
If the approximate expression identified in step S403 is not the default formula (No in step S404), the valid domain update unit 42 makes that approximate expression the new default formula (step S405). Specifically, the valid domain update unit 42 performs the following processing.
The valid domain update unit 42 takes the approximate expression whose valid domain requires the largest storage capacity as the new default formula, and replaces the default formula stored in the default formula storage unit 21 with it. The valid domain update unit 42 then deletes the approximate expression that has become the new default formula, together with its approximate expression ID, from the approximate expression storage unit 15. Further, the valid domain update unit 42 assigns an approximate expression ID to the approximate expression that had been the default formula until then, and stores that approximate expression together with its approximate expression ID in the approximate expression storage unit 15.
The valid domain update unit 42 also deletes, from the confirmed graph storage unit 19, the approximate expression ID of the approximate expression that has become the new default formula and its valid domain. Further, the valid domain update unit 42 stores, in the confirmed graph storage unit 19, the approximate expression ID assigned to the approximate expression that had been the default formula until then, together with its valid domain (the valid domain calculated by the default formula valid domain calculation unit 40).
Further, if the approximate expression ID of the approximate expression that has become the new default formula is stored in the final time storage unit 11 as the ID of the final approximate expression, the valid domain update unit 42 updates that ID to the default formula ID. Conversely, if the default formula ID is stored in the final time storage unit 11 as the ID of the final approximate expression, the valid domain update unit 42 updates it to the approximate expression ID assigned to the approximate expression that had been the default formula until then.
If the approximate expression specified in step S403 is a default expression (Yes in step S404), the valid domain update unit 42 does not update the default expression and ends the process as it is. That is, the valid domain update unit 42 ends the process without updating the contents stored in the default formula storage unit 21, the approximate formula storage unit 15, the confirmed graph storage unit 19, and the final time storage unit 11.
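The flow of steps S401 to S405 can be sketched in Python roughly as follows. This is only an illustrative sketch under several assumptions that are not stated in this section: time is treated as an integer, formula (1) is assumed to yield the times in the observed range that no other approximate expression covers, storage capacity is counted as two numbers per interval and one per single time point, and the names `storage_cost`, `complement`, and `update_default` as well as the string IDs are hypothetical.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # closed interval [start, end]; a single point is (t, t)


def storage_cost(domain: List[Interval]) -> int:
    """Numbers needed to store a domain (assumed: 2 per interval, 1 per point)."""
    return sum(1 if a == b else 2 for a, b in domain)


def complement(domains: List[List[Interval]], t_min: int, t_max: int) -> List[Interval]:
    """Integer times in [t_min, t_max] covered by none of the given domains.

    Assumed stand-in for formula (1), which is defined elsewhere in the document.
    """
    covered = sorted(iv for d in domains for iv in d)
    result, cursor = [], t_min
    for a, b in covered:
        if a > cursor:
            result.append((cursor, a - 1))
        cursor = max(cursor, b + 1)
    if cursor <= t_max:
        result.append((cursor, t_max))
    return result


def update_default(default_id: str, graph: Dict[str, List[Interval]],
                   t_min: int, t_max: int) -> Tuple[str, Dict[str, List[Interval]]]:
    """Return the (possibly new) default ID and the valid domains that stay stored."""
    s_default = complement(list(graph.values()), t_min, t_max)   # S401, S402
    costs = {default_id: storage_cost(s_default)}
    costs.update({k: storage_cost(v) for k, v in graph.items()})
    worst = max(costs, key=costs.get)                            # S403
    if worst == default_id:                                      # S404: Yes
        return default_id, graph
    # S405: the costliest expression becomes the default; the old default is stored.
    new_graph = {k: v for k, v in graph.items() if k != worst}
    new_graph[default_id] = s_default
    return worst, new_graph


graph = {"f1": [(0, 10), (21, 30), (51, 60)], "f2": [(31, 40)]}
new_default, new_graph = update_default("f0", graph, 0, 60)
print(new_default, new_graph)
```

Because the default formula is listed first when costs are compared, a tie leaves the current default formula in place, which matches the "Yes" branch of step S404 doing nothing.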
The data summarization system of the fourth embodiment compares the valid domains of the approximate expressions, including the default formula, and makes the approximate expression whose valid domain requires the largest storage capacity the new default formula. Because no valid domain is stored for the default formula, updating the default formula in this way reduces the storage capacity required for storing valid domains, so data can be summarized more efficiently.
For example, assume that data is input as illustrated in FIG. 26 and that x = f0(t) is the default formula. In this case, as shown in FIG. 27, storing the valid domains of x = f1(t) and x = f2(t) requires a storage capacity of 10 numerical values. If the data summarization system of the fourth embodiment makes x = f1(t) the default formula and stores the former default formula x = f0(t) together with its valid domain, then instead of [T₀, T₁] ∪ [T₂+1, T₃] ∪ [T₅+1, T₆] (six values), only [T₁+1, T₂] ∪ [T₄+1, T₅] (four values) needs to be stored. Therefore, in this example the data summarization system of the fourth embodiment can reduce the capacity required for storing valid domains by two numerical values.
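The count in this example can be checked with a few lines of Python. The concrete times are hypothetical placeholders for T₀ to T₇, and the valid domain of x = f2(t) is not spelled out in this passage; two intervals are assumed for it here only so that the total matches the ten values stated above.

```python
T = [0, 10, 20, 30, 40, 50, 60, 70]  # hypothetical values for T0..T7


def cost(domain):
    """Numbers needed to store a domain (assumed: 2 per interval, 1 per point)."""
    return sum(1 if a == b else 2 for a, b in domain)


f1 = [(T[0], T[1]), (T[2] + 1, T[3]), (T[5] + 1, T[6])]
f0 = [(T[1] + 1, T[2]), (T[4] + 1, T[5])]
f2 = [(T[3] + 1, T[4]), (T[6] + 1, T[7])]  # assumed, not given in the text

before = cost(f1) + cost(f2)  # x = f0(t) is the default formula
after = cost(f0) + cost(f2)   # x = f1(t) is the default formula
print(before, after, before - after)
```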
As in the third embodiment, the data summarization system of the fourth embodiment may include an unsummarized data storage unit 30, an available storage area monitoring unit 31, and a summary control unit 32, so that data is stored as is while ample storage resources are available and is summarized when those resources run low.
In each of the above embodiments, the case where the data includes a data value that is a numerical value has been taken as an example. However, instead of a numerical value itself, the data may include a data value that can be converted into a numerical value so that the difference between converted values can be derived. For example, text information may be used as a data value if a rule for converting it into a numerical value is defined. When the sequentially generated data includes such text information and an occurrence time, the data input unit 10 may, for example, convert the text information into a numerical value. The subsequent processing is the same as in the above embodiments.
In each of the above embodiments, the case where the data includes a scalar quantity as the data value has been illustrated, but the data value may be a vector. That is, the data may include a vector and an occurrence time. In this case, the new approximate expression generation unit 14 may generate, from a plurality of data items (unconfirmed points and newly generated data), an approximate expression that derives an approximate value of the vector. In step S106 (see FIG. 16), when the graph evaluation unit 17 or the graph evaluation unit 17a compares the vector calculated as the approximate value with the vector actually included in the data, it may calculate the distance between the two in the vector space. In step S107, the graph evaluation unit 17 or the graph evaluation unit 17a may then determine whether there is an approximate expression for which this distance is no greater than the threshold value ε (or is strictly less than ε).
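A minimal sketch of the vector comparison in steps S106 and S107 might look like this. The text only says "distance in the vector space", so the Euclidean distance used here is an assumption, and the function name `expressions_within_threshold` and the example expressions are hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def expressions_within_threshold(exprs: Dict[str, Callable[[float], Vector]],
                                 t: float, x: Vector, eps: float) -> List[str]:
    """IDs of expressions whose approximate vector at time t lies within eps of x.

    Euclidean distance is an assumed choice of metric.
    """
    return [k for k, f in exprs.items() if math.dist(f(t), x) <= eps]


exprs = {
    "a": lambda t: (t, 2.0 * t),  # hypothetical vector-valued approximate expression
    "b": lambda t: (0.0, 0.0),
}
print(expressions_within_threshold(exprs, 1.0, (1.0, 2.1), 0.2))
```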
Although embodiments of the present invention have been described above, the present invention is not limited to those embodiments. Various additions and modifications that can be understood by those skilled in the art may be made to the configuration and details of the present invention within its scope.
Next, the minimum configuration of the present invention will be described. FIG. 29 is a block diagram showing the minimum configuration of the present invention. The data summarization system of the present invention includes an approximate value calculation unit 61, an approximate expression evaluation unit 62, an unconfirmed data storage unit 63, a new approximate expression generation unit 64, and an update unit 65.
The approximate value calculation unit 61 (for example, the new data generation time substitution unit 12) operates on approximate expressions, each of which calculates an approximate value of a data value in data including the data value and the occurrence time of the data value, takes the occurrence time as its variable, and has the valid domain of that variable defined as a time interval or a set of time points. By substituting the occurrence time included in new data into each such approximate expression, the unit calculates, for each approximate expression, an approximate value of the data value of the new data.
The approximate expression evaluation unit 62 (for example, the graph evaluation unit 17) either selects an approximate expression suitable for calculating the approximate value of the data value of the new data, or determines that there is no such approximate expression, based on the approximate value calculated for each approximate expression and the data value of the new data.
The unconfirmed data storage unit 63 (for example, the uncertain point storage unit 13) stores new data for which it has been determined that there is no approximate expression suitable for calculating the approximate value of its data value, as approximate-expression-unconfirmed data (for example, unconfirmed points).
When new data is input, the new approximate expression generation unit 64 (for example, the new approximate expression generation unit 14) determines whether or not a new approximate expression can be generated from the new data and the approximate-expression-unconfirmed data. If it can, the unit generates the new approximate expression and defines a time interval or a set of time points as the valid domain of that approximate expression.
When the approximate expression evaluation unit 62 selects an approximate expression suitable for calculating the approximate value of the data value of the new data, the update unit 65 (for example, the graph update unit 18) updates the valid domain of that approximate expression so that it includes the occurrence time of the new data.
The data summarization system including the above configuration stores each data in the form of an approximate expression and its effective domain, and defines the effective domain of one approximate expression as a time interval or a set of time points. Therefore, the data summarization system requires a small storage capacity for storing the approximate expression and its effective domain. Therefore, the data summarization system can efficiently summarize (compress) the data. In particular, this advantage is remarkably obtained when summarizing data that occurs sequentially in a certain tendency and that may vary greatly irregularly.
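How units 61 to 65 could cooperate is sketched below in Python. This is a simplified illustration, not the embodiment itself: the criterion |approximation − value| ≤ ε, the first-fit selection, integer time, and in particular the rule in `_try_generate` (a straight line through the first and last of at least three points) are all assumptions introduced for the example, as are the class and method names.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Expression:
    f: Callable[[float], float]
    domain: List[Tuple[int, int]] = field(default_factory=list)  # closed intervals

    def extend(self, t: int) -> None:
        """Update unit 65: include t, growing the last interval when t is adjacent."""
        if self.domain and self.domain[-1][1] + 1 == t:
            self.domain[-1] = (self.domain[-1][0], t)
        else:
            self.domain.append((t, t))


class Summarizer:
    def __init__(self, eps: float) -> None:
        self.eps = eps
        self.expressions: List[Expression] = []
        self.unconfirmed: List[Tuple[int, float]] = []  # unit 63

    def add(self, t: int, x: float) -> None:
        # Unit 61: approximate value of the new data from every expression.
        approx = [(e, e.f(t)) for e in self.expressions]
        # Unit 62: choose a fitting expression (first fit, an assumed simplification).
        chosen = next((e for e, v in approx if abs(v - x) <= self.eps), None)
        if chosen is not None:
            chosen.extend(t)  # unit 65
            return
        # Unit 64: try to generate a new expression from unconfirmed points + new data.
        points = self.unconfirmed + [(t, x)]
        new_expr = self._try_generate(points)
        if new_expr is None:
            self.unconfirmed.append((t, x))  # unit 63
            return
        for pt, _ in points:
            new_expr.extend(pt)
        self.expressions.append(new_expr)
        self.unconfirmed.clear()

    def _try_generate(self, points: List[Tuple[int, float]]) -> Optional[Expression]:
        """Hypothetical rule: a line through the first and last point must fit all points."""
        if len(points) < 3:
            return None
        (t0, x0), (t1, x1) = points[0], points[-1]
        slope = (x1 - x0) / (t1 - t0)

        def line(t: float) -> float:
            return x0 + slope * (t - t0)

        if all(abs(line(pt) - px) <= self.eps for pt, px in points):
            return Expression(line)
        return None


s = Summarizer(eps=0.5)
for t in range(5):
    s.add(t, 2.0 * t)   # steady trend
s.add(5, 100.0)         # irregular spike, kept as an unconfirmed point
s.add(6, 12.0)          # trend resumes
print(s.expressions[0].domain, s.unconfirmed)
```

In this run a single expression ends up covering an interval plus one separate time point, while the spike stays unconfirmed, which illustrates why the valid domain is allowed to be a set of intervals and points.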
In each of the above embodiments, a data summarization system having the following configuration is described.
(1) A data summarization system comprising: an approximate value calculation unit (for example, the new data generation time substitution unit 12) that, for each approximate expression that calculates an approximate value of a data value in data including the data value and the occurrence time of the data value, that takes the occurrence time as its variable, and for which the valid domain of the variable is defined as a time interval or a set of time points, calculates an approximate value of the data value of new data by substituting the occurrence time included in the new data into the approximate expression; an approximate expression evaluation unit (for example, the graph evaluation unit 17) that, based on the approximate value calculated for each approximate expression and the data value of the new data, selects an approximate expression suitable for calculating the approximate value of the data value of the new data or determines that there is no such approximate expression; an unconfirmed data storage unit (for example, the uncertain point storage unit 13) that stores new data for which it has been determined that there is no suitable approximate expression, as approximate-expression-unconfirmed data (for example, unconfirmed points); a new approximate expression generation unit (for example, the new approximate expression generation unit 14) that, when new data is input, determines whether or not a new approximate expression can be generated from the new data and the approximate-expression-unconfirmed data and, if it can, generates the new approximate expression and defines a time interval or a set of time points as the valid domain of that approximate expression; and an update unit (for example, the graph update unit 18) that, when the approximate expression evaluation unit selects an approximate expression suitable for calculating the approximate value of the data value of the new data, updates the valid domain of that approximate expression so that it includes the occurrence time of the new data.
(2) The data summarization system in which the approximate expression evaluation unit identifies approximate expressions for which the relationship between the approximate value and the data value of the new data satisfies a predetermined criterion (for example, the criterion stored in the accuracy constraint input unit 16); selects that approximate expression when exactly one approximate expression satisfies the criterion; when a plurality of approximate expressions satisfy the criterion, selects from among them the approximate expression that minimizes the increase in storage capacity required to store the valid domain when the valid domain is updated to include the occurrence time of the new data; and, when no approximate expression satisfies the criterion, determines that there is no approximate expression suitable for calculating the approximate value of the data value of the new data.
(3) The data summarization system that includes a default formula storage unit (for example, the default formula storage unit 21) storing a default formula, which is an approximate expression that calculates an approximate value of a data value and for which no valid domain is defined, and in which: the approximate value calculation unit calculates, for each approximate expression including the default formula, an approximate value of the data value of the new data by substituting the occurrence time included in the new data; and the approximate expression evaluation unit identifies, from among the approximate expressions including the default formula, approximate expressions for which the relationship between the approximate value and the data value of the new data satisfies a predetermined criterion, selects that approximate expression when exactly one approximate expression satisfies the criterion, selects the default formula when a plurality of approximate expressions satisfy the criterion and the default formula is among them, selects, when a plurality of approximate expressions satisfy the criterion and the default formula is not among them, the approximate expression that minimizes the increase in storage capacity required to store the valid domain when the valid domain is updated to include the occurrence time of the new data, and, when no approximate expression satisfies the criterion, determines that there is no approximate expression suitable for calculating the approximate value of the data value of the new data.
(4) The data summarization system comprising: a default formula valid domain calculation unit (for example, the default formula valid domain calculation unit 40) that calculates the valid domain of the default formula; a valid domain capacity evaluation unit (for example, the valid domain capacity evaluation unit 41) that identifies, from among the approximate expressions including the default formula, the approximate expression that maximizes the storage capacity required to store the valid domain; and a default formula update unit (for example, the valid domain update unit 42) that, when the approximate expression that maximizes the storage capacity required to store the valid domain is not the default formula, stores that approximate expression in the default formula storage unit as a new default formula and excludes that approximate expression and its valid domain.
(5) The data summarization system comprising: a new data storage unit (for example, the unsummarized data storage unit 30) that stores input new data; a monitoring unit (for example, the available storage area monitoring unit 31) that monitors the resources available for storing new data in the new data storage unit; and a summary control unit (for example, the summary control unit 32) that, when the resources available for storing new data fall below a predetermined amount, outputs the new data stored in the new data storage unit one item at a time to the approximate value calculation unit and the new approximate expression generation unit.
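The selection rule described in configurations (2) and (3) can be sketched as follows. The criterion is taken here to be |approximation − value| ≤ ε, which is only one possible "predetermined criterion"; storage capacity is again counted as two numbers per interval and one per single point, new data is assumed to arrive later than every stored time, and `DEFAULT`, `cost`, `extended`, and `select` are hypothetical names.

```python
from typing import Dict, List, Optional, Tuple

Interval = Tuple[int, int]   # closed interval; a single point is (t, t)
DEFAULT = "default"          # hypothetical ID for the default formula


def cost(domain: List[Interval]) -> int:
    """Numbers needed to store a domain (assumed: 2 per interval, 1 per point)."""
    return sum(1 if a == b else 2 for a, b in domain)


def extended(domain: List[Interval], t: int) -> List[Interval]:
    """Domain after adding time t (t is assumed later than every stored time)."""
    if domain and domain[-1][1] + 1 == t:
        return domain[:-1] + [(domain[-1][0], t)]
    return domain + [(t, t)]


def select(approx: Dict[str, float], domains: Dict[str, List[Interval]],
           t: int, x: float, eps: float) -> Optional[str]:
    """Pick the expression ID for new data (t, x), or None if nothing fits."""
    ok = [k for k, v in approx.items() if abs(v - x) <= eps]
    if not ok:
        return None          # no suitable approximate expression
    if len(ok) == 1:
        return ok[0]
    if DEFAULT in ok:
        return DEFAULT       # the default formula stores no valid domain
    # Otherwise minimize the growth in storage needed for the valid domain.
    return min(ok, key=lambda k: cost(extended(domains[k], t)) - cost(domains[k]))


domains = {"f1": [(0, 4)], "f2": [(0, 2)]}
print(select({"f1": 10.0, "f2": 10.2}, domains, 5, 10.1, 0.5))
```

Here `f1` is chosen because extending [0, 4] to [0, 5] costs nothing extra, whereas `f2` would need a new isolated point.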
In each of the above embodiments, the following data structure is described.
(1) A data structure in which an approximate expression that calculates an approximate value of a data value by substitution of a variable is associated with a valid domain, which is the domain of the variable over which the approximate value of the data value can be obtained, and in which the valid domain is represented by an interval of the variable or by a set of points each representing a single value of the variable.
(2) The data structure in which an interval or a point of the valid domain associated with another approximate expression is allowed to exist between intervals, between points, or between an interval and a point of the valid domain associated with one approximate expression.
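One way to picture this data structure is the following Python sketch. The class names `ValidDomain` and `SummaryEntry` and the example expressions are hypothetical; the sketch only illustrates that a valid domain may mix intervals and single points, and that the domains of different approximate expressions may interleave.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class ValidDomain:
    intervals: List[Tuple[float, float]] = field(default_factory=list)  # closed intervals
    points: List[float] = field(default_factory=list)                   # single values

    def contains(self, t: float) -> bool:
        return t in self.points or any(a <= t <= b for a, b in self.intervals)


@dataclass
class SummaryEntry:
    expression: Callable[[float], float]
    domain: ValidDomain

    def value_at(self, t: float) -> Optional[float]:
        """Approximate value at t, or None when t is outside the valid domain."""
        return self.expression(t) if self.domain.contains(t) else None


# The point t = 5 of `spike` sits between two intervals of `trend`.
trend = SummaryEntry(lambda t: 2.0 * t, ValidDomain(intervals=[(0, 4), (6, 9)]))
spike = SummaryEntry(lambda t: 100.0, ValidDomain(points=[5]))
print(trend.value_at(3), trend.value_at(5), spike.value_at(5))
```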
This application claims priority based on Japanese Patent Application No. 2009-205012, filed on Sep. 4, 2009, the entire disclosure of which is incorporated herein.

Each form of the present invention can be suitably applied to a data summarization apparatus that summarizes data that is sequentially generated with a certain tendency and may change irregularly.

DESCRIPTION OF SYMBOLS
10 Data input unit
11 Final time storage unit
12 New data generation time substitution unit
13 Uncertain point storage unit
14 New approximate expression generation unit
15 Approximate expression storage unit
16 Accuracy constraint input unit
17, 17a Graph evaluation unit
18, 18a Graph update unit
19 Confirmed graph storage unit
20 Default formula input unit
21 Default formula storage unit
30 Unsummarized data storage unit
31 Available storage area monitoring unit
32 Summary control unit
40 Default formula valid domain calculation unit
41 Valid domain capacity evaluation unit
42 Valid domain update unit
61 Approximate value calculation unit
62 Approximate expression evaluation unit
63 Unconfirmed data storage unit
64 New approximate expression generation unit
65 Update unit

Claims

A data summarization system comprising:
an approximate value calculation unit that, for each approximate expression that calculates an approximate value of a data value in data including the data value and the occurrence time of the data value, that takes the occurrence time as its variable, and for which the valid domain of the variable is defined as a time interval or a set of time points, calculates an approximate value of the data value of new data by substituting the occurrence time included in the new data into the approximate expression;
an approximate expression evaluation unit that, based on the approximate value calculated for each approximate expression and the data value of the new data, selects an approximate expression suitable for calculating the approximate value of the data value of the new data or determines that there is no approximate expression suitable for calculating the approximate value of the data value of the new data;
an unconfirmed data storage unit that stores new data for which it has been determined that there is no approximate expression suitable for calculating the approximate value of its data value, as approximate-expression-unconfirmed data;
a new approximate expression generation unit that, when new data is input, determines whether or not a new approximate expression can be generated based on the new data and the approximate-expression-unconfirmed data and, if it can, generates the new approximate expression and defines a time interval or a set of time points as the valid domain of that approximate expression; and
an update unit that, when the approximate expression evaluation unit selects an approximate expression suitable for calculating the approximate value of the data value of the new data, updates the valid domain of that approximate expression so as to include the occurrence time of the new data.
The data summarization system according to claim 1, wherein the approximate expression evaluation unit:
identifies approximate expressions for which the relationship between the approximate value and the data value of the new data satisfies a predetermined criterion;
when exactly one approximate expression satisfies the criterion, selects that approximate expression;
when a plurality of approximate expressions satisfy the criterion, selects from among them the approximate expression that minimizes the increase in storage capacity required to store the valid domain when the valid domain is updated to include the occurrence time of the new data; and
when no approximate expression satisfies the criterion, determines that there is no approximate expression suitable for calculating the approximate value of the data value of the new data.
The data summarization system according to claim 2, further comprising a default formula storage unit that stores a default formula, which is an approximate expression that calculates an approximate value of a data value and for which no valid domain is defined, wherein:
the approximate value calculation unit calculates, for each approximate expression including the default formula, an approximate value of the data value of the new data by substituting the occurrence time included in the new data; and
the approximate expression evaluation unit identifies, from among the approximate expressions including the default formula, approximate expressions for which the relationship between the approximate value and the data value of the new data satisfies a predetermined criterion,
when exactly one approximate expression satisfies the criterion, selects that approximate expression,
when a plurality of approximate expressions satisfy the criterion and the default formula is among them, selects the default formula,
when a plurality of approximate expressions satisfy the criterion and the default formula is not among them, selects from among them the approximate expression that minimizes the increase in storage capacity required to store the valid domain when the valid domain is updated to include the occurrence time of the new data, and
when no approximate expression satisfies the criterion, determines that there is no approximate expression suitable for calculating the approximate value of the data value of the new data.
The data summarization system according to claim 3, further comprising:
a default formula valid domain calculation unit that calculates the valid domain of the default formula;
a valid domain capacity evaluation unit that identifies, from among the approximate expressions including the default formula, the approximate expression that maximizes the storage capacity required to store the valid domain; and
a default formula update unit that, when the approximate expression that maximizes the storage capacity required to store the valid domain is not the default formula, stores that approximate expression in the default formula storage unit as a new default formula and excludes that approximate expression and its valid domain.
The data summarization system according to any one of claims 1 to 4, further comprising:
a new data storage unit that stores input new data;
a monitoring unit that monitors the resources available for storing new data in the new data storage unit; and
a summary control unit that, when the resources available for storing new data fall below a predetermined amount, outputs the new data stored in the new data storage unit one item at a time to the approximate value calculation unit and the new approximate expression generation unit.
A data summarization method comprising:
for each approximate expression that calculates an approximate value of a data value in data including the data value and the occurrence time of the data value, that takes the occurrence time as its variable, and for which the valid domain of the variable is defined as a time interval or a set of time points, calculating an approximate value of the data value of new data by substituting the occurrence time included in the new data into the approximate expression;
based on the approximate value calculated for each approximate expression and the data value of the new data, selecting an approximate expression suitable for calculating the approximate value of the data value of the new data or determining that there is no approximate expression suitable for calculating the approximate value of the data value of the new data;
storing new data for which it has been determined that there is no approximate expression suitable for calculating the approximate value of its data value, as approximate-expression-unconfirmed data;
when new data is input, determining whether or not a new approximate expression can be generated based on the new data and the approximate-expression-unconfirmed data and, if it can, generating the new approximate expression and defining a time interval or a set of time points as the valid domain of that approximate expression; and
when an approximate expression suitable for calculating the approximate value of the data value of the new data is selected, updating the valid domain of that approximate expression so as to include the occurrence time of the new data.
The data summarization method according to claim 6, wherein, when selecting an approximate expression suitable for calculating the approximate value of the data value of the new data:
approximate expressions for which the relationship between the approximate value and the data value of the new data satisfies a predetermined criterion are identified;
when exactly one approximate expression satisfies the criterion, that approximate expression is selected;
when a plurality of approximate expressions satisfy the criterion, the approximate expression that minimizes the increase in storage capacity required to store the valid domain when the valid domain is updated to include the occurrence time of the new data is selected from among them; and
when no approximate expression satisfies the criterion, it is determined that there is no approximate expression suitable for calculating the approximate value of the data value of the new data.
The data summarization method according to claim 7, wherein:
a default formula, which is an approximate expression that calculates an approximate value of a data value and for which no valid domain is defined, is stored;
for each approximate expression including the default formula, an approximate value of the data value of the new data is calculated by substituting the occurrence time included in the new data; and
when selecting an approximate expression suitable for calculating the approximate value of the data value of the new data, approximate expressions for which the relationship between the approximate value and the data value of the new data satisfies a predetermined criterion are identified from among the approximate expressions including the default formula,
when exactly one approximate expression satisfies the criterion, that approximate expression is selected,
when a plurality of approximate expressions satisfy the criterion and the default formula is among them, the default formula is selected,
when a plurality of approximate expressions satisfy the criterion and the default formula is not among them, the approximate expression that minimizes the increase in storage capacity required to store the valid domain when the valid domain is updated to include the occurrence time of the new data is selected from among them, and
when no approximate expression satisfies the criterion, it is determined that there is no approximate expression suitable for calculating the approximate value of the data value of the new data.
The data summarization method according to claim 8, wherein:
the valid domain of the default formula is calculated;
the approximate expression that maximizes the storage capacity required to store the valid domain is identified from among the approximate expressions including the default formula; and
when the approximate expression that maximizes the storage capacity required to store the valid domain is not the default formula, that approximate expression is stored as a new default formula, and that approximate expression and its valid domain are excluded.
The data summarization method according to any one of claims 6 to 9, wherein:
input new data is stored;
the resources available for storing new data are monitored; and
when the resources available for storing new data fall below a predetermined amount, the stored new data is made the target of the approximate value calculation and of the determination of whether or not a new approximate expression can be generated.
A recording medium storing a data summarization program that causes a computer to execute:
an approximate value calculation process of, for each approximate expression that calculates an approximate value of a data value in data including the data value and the occurrence time of the data value, that takes the occurrence time as its variable, and for which the valid domain of the variable is defined as a time interval or a set of time points, calculating an approximate value of the data value of new data by substituting the occurrence time included in the new data into the approximate expression;
an approximate expression evaluation process of, based on the approximate value calculated for each approximate expression and the data value of the new data, selecting an approximate expression suitable for calculating the approximate value of the data value of the new data or determining that there is no approximate expression suitable for calculating the approximate value of the data value of the new data;
an unconfirmed data storage process of storing, in an unconfirmed data storage unit, new data for which it has been determined that there is no approximate expression suitable for calculating the approximate value of its data value, as approximate-expression-unconfirmed data;
a new approximate expression generation process of, when new data is input, determining whether or not a new approximate expression can be generated based on the new data and the approximate-expression-unconfirmed data and, if it can, generating the new approximate expression and defining a time interval or a set of time points as the valid domain of that approximate expression; and
an update process of, when an approximate expression suitable for calculating the approximate value of the data value of the new data is selected in the approximate expression evaluation process, updating the valid domain of that approximate expression so as to include the occurrence time of the new data.
A data structure in which an approximate expression that calculates an approximate value of a data value by substitution of a variable
is associated with a valid domain, which is the domain of the variable over which the approximate value of the data value can be obtained,
wherein the valid domain is represented by an interval of the variable or by a set of points each representing a single value of the variable.
The data structure according to claim 12, wherein an interval or a point of the valid domain associated with another approximate expression is allowed to exist between intervals, between points, or between an interval and a point of the valid domain associated with one approximate expression.