WO2010095459A1

WO2010095459A1 - Analysis preprocessing system, analysis preprocessing method, and analysis preprocessing program

Info

Publication number: WO2010095459A1
Application number: PCT/JP2010/001108
Authority: WO
Inventors: 喜田弘司; 藤山健一郎; 今井照之; 中村暢達
Original assignee: 日本電気株式会社
Priority date: 2009-02-20
Filing date: 2010-02-19
Publication date: 2010-08-26
Also published as: JPWO2010095459A1

Abstract

Provided is an analysis preprocessing system capable of passing large amounts of data to a means for analyzing data at high speed while preventing the data from overflowing even if the data is transmitted from a plurality of data generation sources. A data acquisition means (71) acquires the data constellation generated by the plurality of data generation sources. A data clipping means (72) clips each data from the data constellation acquired by the data acquisition means (71). A filtering means (73) determines whether or not a predetermined condition is satisfied for each of the data clipped by the data clipping means (72), stores data which satisfies the predetermined condition in a buffer (74), and discards data which does not satisfy the predetermined condition. An analysis data determination means (75) determines an analysis data constellation which is a set of data used for analysis from the data stored in the buffer (74). An analysis data output means (76) transmits the analysis data constellation to a data analysis means for analyzing data.

Description

Analysis preprocessing system, analysis preprocessing method, and analysis preprocessing program

The present invention relates to an analysis preprocessing system, an analysis preprocessing method, and an analysis preprocessing program for performing preprocessing on data to be analyzed.

There is a time series analysis device that analyzes data in a time series for multiple sensors and geographically distributed server logs. In such a time series analysis apparatus, data to be analyzed is temporarily stored as a database or a file and analyzed by batch processing or the like.

Non-patent document 1 describes a database for storing such data. In the technique described in Non-Patent Document 1, sensor data observed by a sensor network is stored in a single database on the network. When referring to past data, the data is referred to by making an inquiry using SQL.

In addition, an example of analyzing the logs of Apache (Apache Software Foundation), which is widely used as a Web server, will be described. Usually, a plurality of Web servers are prepared to distribute access from clients. Each Web server independently stores its access and error logs as files. In the default configuration of Apache, error logs are recorded in the /usr/local/apache/logs/error.log file. To analyze these logs, the analysis apparatus collects the logs recorded on the plurality of servers by using FTP (File Transfer Protocol) or the like, and then analyzes them.

FIG. 22 shows an example of a general configuration for collecting data to be analyzed. Each Web server 202 serving as a data generation source generates data (a log) when it is accessed by a client 201. Each Web server 202 transmits the log to the log collection unit 203. Upon receipt of the data, the log collection unit 203 stores the data in a storage unit as a database or a file. Then, the log collection unit 203 converts the data into a data format for data analysis and passes it to the data analysis device 204, and the data analysis device 204 performs data analysis.

A simple way to let a data generation source (the Web server 202 in the example shown in FIG. 22) and a data analysis device operate independently is a configuration in which the generated data is saved as a database or a file and the data analysis device then analyzes the saved data. By contrast, in a configuration in which the data generation source and the data analysis device advance their processing asynchronously while communicating with each other, each party must determine whether or not there is a communication request from the other party, which results in a complicated system. To avoid such complexity, a configuration in which the generated data is stored as a database or a file is employed.

Also, there are many license-free libraries that can be used for the process of transmitting data from the data generation source, the process of receiving the data, and the process of temporarily storing the received data. For example, when transferring a file, an FTP server may be used. Further, an ODBC (Open Database Connectivity) driver may be used in the database. Since such a library can be used, a configuration in which generated data is stored as a database or a file is employed.

Patent Document 1 describes a configuration in which a microcomputer collects data measured by a plurality of sensors such as a vibration sensor and a pulse sensor, and the microcomputer outputs the data to a PDA or the like. The microcomputer performs processing for removing disturbance signals from the original data of the biological signal, totaling processing in units of seconds and minutes, and the like, and generates processed data. The microcomputer transmits the processed data to the PDA. Patent Document 1 also describes that, when it is determined that there is no change in the measurement data and the subject is not yet in a state in which a biological signal should be measured, the measurement operation of the biological signal is put on hold until a predetermined time elapses.

Patent Document 2 describes a process for suppressing the amount of data per unit time output by a sensor in a sensor network. Specifically, it describes reducing the amount of data transmitted per unit time by increasing the measurement interval of sensor nodes, by transmitting observation information in batches, or by performing communication between sensor nodes and router nodes.

Patent Document 3 describes that when the received data is received again in the subsequent stream, the subsequent data stream is interrupted. Further, it is described that filtering related to a customer organization or a user organization is performed on a data stream.

Patent Document 4 describes a charged beam length measuring device that deletes measurement data when the absolute value of the difference between the first measurement data and the second measurement data exceeds a predetermined value.

JP 2003-30775 A (paragraphs 0037, 0048-0050, 0063, FIG. 1)
JP 2008-42458 A (paragraph 0051)
JP 2002-77277 A (paragraphs 0033, 0035)
JP 2002-62123 A (paragraph 0021)

In a configuration in which there are a plurality of data generation sources such as sensors and Web servers, and their data is temporarily stored as a database or a file and then passed to the data analysis device (for example, the configuration shown in FIG. 22), an increase in the number of data generation sources concentrates access on the data collecting means (for example, the log collecting means 203 shown in FIG. 22), and the data collecting means may be unable to keep up with the processing. For example, when data is stored as a database or a file, the I/O for data storage is slow, so the process of storing the data may not keep up with the incoming data.

Further, when the number of data generation sources increases, the amount of data sent to the data collecting means (for example, the log collecting means 203 shown in FIG. 22) also increases and may exceed the storable data capacity. Patent Document 2 describes that a sensor node increases its measurement interval, that communication is performed between a sensor node and a router node, and the like. Patent Document 1 describes putting measurement by a sensor on hold. However, if the number of data generation sources such as sensor nodes is large, it is difficult to control the data generation sources individually. For example, if probe cars are the data generation sources, it is difficult, from the viewpoint of processing load, to individually instruct tens of thousands of probe cars to hold off on data transmission and the like.

Therefore, an object of the present invention is to provide an analysis preprocessing system, an analysis preprocessing method, and an analysis preprocessing program capable of passing data to a means for analyzing data at high speed while preventing the data from overflowing, even when a large amount of data is transmitted from a large number of data generation sources.

An analysis preprocessing system according to the present invention includes: data acquisition means for acquiring a data group generated by a plurality of data generation sources; data cutout means for cutting out individual data from the data group acquired by the data acquisition means; a buffer for storing data used for analysis; filtering means for determining, for each piece of data cut out by the data cutout means, whether or not a predetermined condition is satisfied, storing data that satisfies the predetermined condition in the buffer, and discarding data that does not satisfy the predetermined condition; analysis data determining means for determining, from the data stored in the buffer, an analysis data group that is a set of data used for analysis; and analysis data output means for sending the analysis data group to data analysis means for analyzing data.

In addition, an analysis preprocessing method according to the present invention includes: acquiring a data group generated by a plurality of data generation sources; cutting out individual data from the acquired data group; determining, for each piece of cut-out data, whether or not a predetermined condition is satisfied; storing data that satisfies the predetermined condition in a buffer and discarding data that does not satisfy the predetermined condition; determining, from the data stored in the buffer, an analysis data group that is a set of data used for analysis; and sending the analysis data group to data analysis means for analyzing data.

An analysis preprocessing program according to the present invention causes a computer to execute: a data acquisition process for acquiring a data group generated by a plurality of data generation sources; a data cutout process for cutting out individual data from the data group acquired by the data acquisition process; a filtering process for determining, for each piece of data cut out by the data cutout process, whether or not a predetermined condition is satisfied, storing data that satisfies the predetermined condition in a buffer, and discarding data that does not satisfy the predetermined condition; an analysis data determination process for determining, from the data stored in the buffer, an analysis data group that is a set of data used for analysis; and an analysis data output process for sending the analysis data group to data analysis means for analyzing data.

According to the present invention, even when a large amount of data is transmitted from a large number of data generation sources, the data can be transferred at high speed to the means for analyzing the data while preventing the data from overflowing.

FIG. 1 is a block diagram showing an example of the analysis preprocessing system of the first embodiment of the present invention.
FIG. 2 is a block diagram showing a configuration example of the data stream generation means.
FIG. 3 is an explanatory diagram showing an example of the physical configuration of the analysis preprocessing system.
FIG. 4 is an explanatory diagram showing an example of data generated by a time-series data generation source.
FIG. 5 is an explanatory diagram showing an example of data transmitted by the data transmission means.
FIG. 6 is an explanatory diagram schematically showing an analysis window.
FIG. 7 is an explanatory diagram showing an example of the input/output of the data stream generation means.
FIG. 8 is an explanatory diagram showing an example of cut-out data.
FIG. 9 is a schematic diagram showing an example of a memory image in the transmission data buffer.
FIG. 10 is a block diagram showing a configuration example of the filtering means.
FIG. 11 is a flowchart showing an example of the processing flow of the first embodiment of the present invention.
FIG. 12 is a flowchart showing an example of the processing flow of the filtering process.
FIG. 13 is a block diagram showing a configuration example of the filtering means in the second embodiment.
FIG. 14 is an explanatory diagram showing an example of the criteria stored by the effective data definition means.
FIG. 15 is a flowchart showing an example of the processing flow of the filtering process in the second embodiment.
FIG. 16 is an explanatory diagram showing a specific example of a situation in which duplication of data occurs.
FIG. 17 is a block diagram showing a configuration example of the filtering means in the third embodiment.
FIG. 18 is an explanatory diagram showing an example of data identification information.
FIG. 19 is a flowchart showing an example of the processing flow of the filtering process in the third embodiment.
FIG. 20 is a block diagram showing a configuration example of the data stream generation means in a reference embodiment.
FIG. 21 is an explanatory diagram showing the minimum configuration of the present invention.
FIG. 22 is a block diagram showing a general configuration example of a system that collects data to be analyzed.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.
Embodiment 1.
FIG. 1 is a block diagram illustrating an example of an analysis preprocessing system according to the first embodiment of this invention. The analysis preprocessing system 7 of the present invention includes a data receiving means 3 for receiving data generated by the time-series data generation source 1, and a data stream generating means 4 for processing the received data and sending it to the time-series data analyzing means 5.

The time-series data generation source 1 is a data generation source that sequentially generates data with the passage of time. The data transmission means 2 transmits the data generated by the time series data generation source 1 to the analysis preprocessing system 7. The time-series data analysis unit 5 performs analysis processing on the data input from the data stream generation unit 4. As shown in FIG. 1, a plurality of time-series data generation sources 1 and data transmission means 2 may be provided.

The data receiving means 3 receives the data generated by the time-series data generation source 1 from each data transmitting means 2. The data stream generation means 4 performs a filtering process on the received data. From the data obtained by the filtering, the data stream generation means 4 determines, for each analysis performed by the time-series data analysis means 5, the set of data to be analyzed in that one analysis, and sends the set to the time-series data analysis means 5. The time-series data analysis means 5 performs analysis using this data. The operation of the data stream generation means 4 corresponds to preprocessing for analysis.

The time-series data generation source 1 and the data transmission means 2 may be included in the analysis preprocessing system. Similarly, the time series data analysis means 5 may be included in the analysis preprocessing system.

FIG. 2 is a block diagram showing a configuration example of the data stream generation means 4. The same elements as those shown in FIG. 1 are denoted by the same reference numerals as those in FIG. 1. The data stream generation unit 4 includes a stream data generation unit 401, a filtering unit 407, a transmission data buffer 402, an analysis window generation unit 403, and a stream data transmission unit 404. The stream data generating unit 401 converts the data received by the data receiving unit 3 into a data format for analysis. The filtering unit 407 performs a filtering process on the data, and stores the data obtained by the filtering in the transmission data buffer 402. The transmission data buffer 402 is a memory that temporarily stores data. When notified that data has been registered in the transmission data buffer 402, the analysis window generation means 403 generates a set of data that the time-series data analysis means 5 analyzes at one time. The stream data transmission unit 404 transmits data from the transmission data buffer 402 to the time-series data analysis unit 5 in response to a command from the analysis window generation unit 403.

FIG. 3 is an explanatory diagram showing an example of a physical configuration of the analysis preprocessing system. Typically, the time-series data generation source 1 exists at physically dispersed positions, and the server collects and analyzes the data. In the example shown in FIG. 3, each of the n clients PC1, PC2,..., PCn includes a time-series data generation source 1 and a data transmission unit 2. Each client is an information processing apparatus such as a PC (personal computer). In the example shown in FIG. 3, the data receiving means 3, the data stream generating means 4, and the time series data analyzing means 5 are provided in the server PC 8 that performs data analysis.

However, the physical configuration shown in FIG. 3 is an example, and the configuration is not limited to the example shown in FIG. 3. For example, a plurality of time-series data generation sources may be realized by a single computer. Further, the data receiving means 3, the data stream generating means 4, and the time-series data analyzing means 5 may be realized by different computers. Which apparatus implements each unit shown in FIG. 3 may be determined as appropriate according to the amount of data to be generated, the processing capability of the computers, and the physical distribution of the time-series data generation sources 1. The time-series data generation source 1, the data transmission means 2, the data reception means 3, the data stream generation means 4, and the time-series data analysis means 5 may also be provided in one computer.

In the following description, a case where a plurality of clients generate data, transmit this data to the server PC, and the server PC performs preprocessing and analysis will be described as an example.

Details of each means will be explained.

The time series data generation source 1 continuously generates data to be analyzed. The time-series data generation source 1 may be a sensor, and sensor data to be analyzed may be continuously generated. Further, the time-series data generation source 1 may be a server device such as a Web server, and a log to be analyzed may be continuously generated. In the present embodiment, a case where the time series data generation source 1 is mounted on a vehicle (probe car) and is a sensor that measures, for example, speed, position, traveling direction, and the like will be described as an example. Traffic information can be generated by running tens of thousands of probe cars and collecting and analyzing data from the sensors of each probe car. However, the present invention is applicable to other than data analysis of probe cars. FIG. 3 shows a case where each PC operates as the time-series data generation source 1 and the data transmission unit 2. In this example, a base station provided separately from the probe car corresponds to the data transmission unit 2.

FIG. 4 is an explanatory diagram showing an example of data generated by a sensor (time-series data generation source 1) provided in each probe car. In this example, the time-series data generation source 1 provided in each probe car generates data including date and time, vehicle ID, latitude, longitude, and speed. The date and time is the date and time when the data was generated. The vehicle ID is an ID (identification information) of the probe car on which the time-series data generation source 1 is mounted. Each probe car is assigned a unique vehicle ID. The latitude is the latitude of the probe car position, and the longitude is the longitude of the probe car position. The speed is the speed of the probe car, expressed in kilometers per hour in the example shown in FIG. 4. Therefore, the data shown in FIG. 4 is data generated at "2008/7/20 12:00:00" and indicates that the probe car "CID0001" is located at "latitude 35.000", "longitude 135.000" and is traveling at 60.0 km/h. In this example, a set of date and time, vehicle ID, latitude, longitude, and speed constitutes one piece of data.
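As an illustration only, one piece of data of the kind shown in FIG. 4 could be represented as follows. The comma-separated text layout, the class name ProbeRecord, the function name parse_record, and the km/h speed unit are assumptions made for this sketch; the source does not specify a wire format.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ProbeRecord:
    """One piece of data: date/time, vehicle ID, latitude, longitude, speed."""
    timestamp: datetime
    vehicle_id: str
    latitude: float
    longitude: float
    speed_kmh: float  # unit assumed to be km/h


def parse_record(text: str) -> ProbeRecord:
    """Parse a hypothetical comma-separated record into a ProbeRecord."""
    date_time, vehicle_id, lat, lon, speed = [f.strip() for f in text.split(",")]
    return ProbeRecord(
        timestamp=datetime.strptime(date_time, "%Y/%m/%d %H:%M:%S"),
        vehicle_id=vehicle_id,
        latitude=float(lat),
        longitude=float(lon),
        speed_kmh=float(speed),
    )


# The values below are the ones described for FIG. 4.
record = parse_record("2008/7/20 12:00:00,CID0001,35.000,135.000,60.0")
```

Making the record immutable (frozen) allows it to be compared and hashed by content, which is convenient for the duplicate check performed later by the filtering means.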

The data transmission means 2 transmits the data generated by the time-series data generation source 1 to the analysis preprocessing system (server PC). In this example, a base station provided separately from the probe car corresponds to the data transmission means 2. The probe car is also provided with transmission means (not shown) for transmitting data to the base station. The transmitting means (not shown) provided in the probe car transmits data to the base station (data transmitting means 2) via a wireless LAN, and the base station (data transmitting means 2) transmits the data to the server PC. The base station (data transmission means 2) is connected to the server PC via a wired LAN, for example. The present invention is also applicable to cases other than data collected from probe cars, and the data transmission method of the data transmission means 2 is not particularly limited. For example, data may be transmitted using FTP (File Transfer Protocol, RFC 959).

FIG. 5 is an explanatory diagram showing an example of data transmitted by the data transmission means 2. It is preferable that the data transmission means 2 does not transmit each piece of data individually to the server PC, but transmits a certain number of pieces of data collectively. Transmitting a plurality of pieces of data collectively in this way reduces the communication cost. As illustrated in FIG. 5, the data transmission unit 2 concatenates the pieces of data with delimiters 107 between them, adds a header 106, and transmits the result to the server PC. The header 106 is a header defined by a communication protocol, and includes parameters such as the size of the transmission data, for example. The delimiter 107 is information indicating the boundary between individual pieces of data.

The data receiving unit 3 receives the data transmitted by the data transmitting unit 2 (for example, data illustrated in FIG. 5). The data receiving unit 3 may receive data according to the same communication protocol as the data transmitting unit 2. For example, data may be received by FTP.

The data stream generating unit 4 divides the data received by the data receiving unit 3 into individual pieces of data and groups them for analysis by the time-series data analysis means 5. The data stream generation means 4 performs a filtering process on the data, and generates an analysis window from the data obtained as a result. Usually, the time-series data analysis means 5 does not analyze data one piece at a time, but repeatedly analyzes a set of data. The analysis window is the set of data to be analyzed in one such analysis. FIG. 6 is an explanatory diagram schematically showing an analysis window. Each circle shown in FIG. 6 represents data generated over time. A set of data 110 is an analysis window 120, and the time-series data analysis means 5 performs one analysis process using one analysis window. The data stream generation unit 4 performs processing for determining an analysis window from the data obtained by filtering, and sends the analysis window to the time-series data analysis unit 5.

Examples of analysis window types include the time-based window (Time-Based Window) and the tuple-based window (Tuple-Based Window). The time-based window is an analysis window in which data belonging to a certain period of time is collected. The tuple-based window is an analysis window in which a specified number of pieces of data are collected in time-series order. FIG. 6 shows an example of a tuple-based window in which an analysis window is generated for every two pieces of data.

The data stream generation means 4 determines an ID (window ID) for identifying the analysis window for each analysis window, inserts the window ID into the data, and passes it to the time-series data analysis means 5.

FIG. 7 is an explanatory diagram showing an example of input / output of the data stream generating means 4. The data stream generating unit 4 receives, from the data receiving unit 3, data that includes a communication header 106 and in which a plurality of pieces of data are concatenated with delimiters 107. The data stream generation means 4 cuts out each piece of data from the input data, assigns a window ID, and passes the data to which the window ID has been assigned to the time-series data analysis means 5. The data stream generation means 4 assigns a common window ID to each piece of data to be included in one analysis window. A set of data to which a common window ID is assigned is analyzed together in one analysis. The individual data to which the window ID is assigned is data generated by the time-series data generation source 1, and in this example includes date and time, vehicle ID, latitude, longitude, and speed.

Each element provided in the data stream generation unit 4 will be described with reference to FIG. 2. The stream data generating means 401 performs format conversion on the data received by the data receiving means 3 from the data transmitting means 2 (not shown in FIG. 2, refer to FIG. 1), and divides the data into individual pieces of data. The stream data generation unit 401 may identify the header 106 and the delimiters 107 (see FIG. 7), and cut out the data between the header 106 and the first delimiter 107 and the data between successive delimiters 107, respectively. The format of the data is standardized by an RFC (Request for Comments) or the like. When the received data conforms to the RFC specification, the boundary between the header and the data and the delimiters between pieces of data can be determined according to the specification, and each piece of data can be cut out. FIG. 8 shows an example of data cut out by the stream data generating unit 401. When the data illustrated in FIG. 5 is input, the stream data generation unit 401 cuts out three pieces of data as shown in FIG. 8.
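The batching performed by the data transmission means 2 and the cut-out performed by the stream data generation means 401 may be sketched as follows. The framing used here (a 4-byte big-endian length header followed by newline-delimited records) is purely hypothetical and stands in for whatever header 106 and delimiter 107 the actual communication protocol defines; the function names are likewise assumptions.

```python
import struct

DELIMITER = b"\n"      # hypothetical stand-in for delimiter 107
HEADER_FORMAT = ">I"   # hypothetical stand-in for header 106: payload size
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)


def build_message(records):
    """Concatenate records with delimiters and prepend a size header."""
    payload = DELIMITER.join(records)
    return struct.pack(HEADER_FORMAT, len(payload)) + payload


def cut_out(message):
    """Strip the header and split the payload into individual records."""
    (size,) = struct.unpack(HEADER_FORMAT, message[:HEADER_SIZE])
    payload = message[HEADER_SIZE:HEADER_SIZE + size]
    if len(payload) != size:
        raise ValueError("message is shorter than the size given in its header")
    # Ignore empty fragments, e.g. those caused by a trailing delimiter.
    return [piece for piece in payload.split(DELIMITER) if piece]


message = build_message([
    b"2008/7/20 12:00:00,CID0001,35.000,135.000,60.0",
    b"2008/7/20 12:00:10,CID0001,35.001,135.000,58.0",
    b"2008/7/20 12:00:20,CID0001,35.002,135.000,55.0",
])
pieces = cut_out(message)
```

In this sketch a batch of three records yields three cut-out pieces, in the same order in which they were concatenated.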

The filtering unit 407 performs a filtering process on each piece of data cut out by the stream data generating unit 401 from the data received by the data receiving unit 3. In other words, the filtering unit 407 determines, for each piece of data cut out by the stream data generation unit 401, whether the data satisfies a predetermined condition, stores data that satisfies the predetermined condition in the transmission data buffer 402, and discards data that does not satisfy the predetermined condition. The predetermined condition is a condition indicating that the data is useful for analysis.

As an example of the predetermined condition, a condition that "the data content is different from any data already stored in the transmission data buffer 402" may be used. Assume that data having the same content as data already stored in the transmission data buffer 402 were also stored in the transmission data buffer 402. In this case, the stream data transmission unit 404 would transmit a plurality of pieces of data having the same content to the time-series data analysis unit 5. However, the time-series data analysis means 5 may not require a plurality of pieces of data having the same content when performing analysis.

For example, assume that a sensor (time-series data generation source 1) provided in each probe car generates data (see FIG. 4) including the position, speed, and vehicle ID of the probe car at regular time intervals, and that the time-series data analysis means 5 performs analysis on the data. In this case, a stopped probe car repeatedly generates data having the same content in position, speed, and vehicle ID. On the other hand, the analysis process of the time-series data analysis means 5 may require the changed content only when the situation (position or speed) of a certain probe car changes, and may not need to refer to data whose content has not changed. In such a case, data with the same position, speed, and vehicle ID is redundant and is not used for analysis. As a specific example, when the analysis calculates the average speed of each vehicle, the data of a stopped vehicle is not necessary for calculating the average speed, and there is no need to send a plurality of such pieces of data to the time-series data analysis means 5.

The filtering unit 407 stores data satisfying the condition that "the content of the data is different from any data already stored in the transmission data buffer 402" in the transmission data buffer 402, and discards data that does not satisfy the condition (that is, data having the same content as data already stored in the transmission data buffer 402). As a result, it is possible to prevent redundant data from being sent to the time-series data analysis means 5.

Hereinafter, a case will be described as an example in which the condition that "the data content is different from any data already stored in the transmission data buffer 402" is used as the predetermined condition. This condition is referred to as the first condition. The first condition is one example of a predetermined condition indicating that the data is useful for analysis, and other conditions may be used, as will be described in the second embodiment and the third embodiment.
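A minimal sketch of filtering under the first condition is given below, assuming that "content" means the vehicle ID, position, and speed (so that the date and time is ignored in the comparison, in line with the stopped probe car example). The names FirstConditionFilter, content_key, offer, and drain are hypothetical, and a Python list stands in for the transmission data buffer 402.

```python
def content_key(record):
    """Content compared by the first condition (assumed: all but date/time)."""
    _date_time, vehicle_id, latitude, longitude, speed = record
    return (vehicle_id, latitude, longitude, speed)


class FirstConditionFilter:
    def __init__(self, key=content_key):
        self.key = key
        self.buffer = []    # stands in for transmission data buffer 402
        self._keys = set()  # contents currently held in the buffer

    def offer(self, record):
        """Store the record if its content is new; otherwise discard it."""
        k = self.key(record)
        if k in self._keys:
            return False    # same content already buffered: discard
        self._keys.add(k)
        self.buffer.append(record)
        return True

    def drain(self):
        """Hand over the buffered records and empty the buffer."""
        sent, self.buffer = self.buffer, []
        self._keys.clear()
        return sent


f = FirstConditionFilter()
stopped_1 = ("2008/7/20 12:00:00", "CID0001", 35.000, 135.000, 0.0)
stopped_2 = ("2008/7/20 12:00:10", "CID0001", 35.000, 135.000, 0.0)
moving = ("2008/7/20 12:00:20", "CID0001", 35.001, 135.000, 20.0)
results = [f.offer(stopped_1), f.offer(stopped_2), f.offer(moving)]
```

Keeping a set of buffered contents alongside the buffer avoids scanning the whole buffer for every incoming piece of data; this is a design choice of the sketch, not something stated in the source.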

The transmission data buffer 402 is a memory that stores data determined by the filtering unit 407 as satisfying a predetermined condition. FIG. 9 is a schematic diagram illustrating an example of a memory image in the transmission data buffer 402. FIG. 9 illustrates a case where a list structure is employed. One data is stored in the memory area 131 for storing one data. In addition, a pointer 132 that connects the memory areas is defined. The filtering unit 407 notifies each pointer to the analysis window generation unit 403 via the stream data generation unit 401 when each data is stored. Alternatively, the pointer may be notified directly to the analysis window generation unit 403. By following the pointer, each data can be accessed in order. However, the manner in which the transmission data buffer 402 stores data is not limited to the example of FIG. For example, the transmission data buffer 402 may store data in a table structure instead of a list structure.
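The list structure of FIG. 9 can be pictured with the following sketch, in which each node plays the role of a memory area 131 and its next attribute plays the role of a pointer 132. The class and method names are assumptions for illustration; the value returned by store corresponds to the pointer that is notified to the analysis window generation means 403.

```python
class Node:
    """One memory area holding a single piece of data and a link to the next."""

    def __init__(self, data):
        self.data = data
        self.next = None


class TransmissionBuffer:
    """Singly linked list standing in for transmission data buffer 402."""

    def __init__(self):
        self.head = None
        self.tail = None

    def store(self, data):
        """Append data and return the node, used here as the notified pointer."""
        node = Node(data)
        if self.tail is None:
            self.head = node
        else:
            self.tail.next = node
        self.tail = node
        return node

    def __iter__(self):
        """Follow the pointers to visit each piece of data in stored order."""
        node = self.head
        while node is not None:
            yield node.data
            node = node.next


buf = TransmissionBuffer()
first = buf.store("data-1")
second = buf.store("data-2")
```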

The analysis window generation unit 403 receives a notification of a pointer to the memory area storing the data at the timing when the filtering unit 407 stores the data in the transmission data buffer 402, and generates an analysis window based on the pointer. The specification of the analysis window is set in the analysis window generation means 403 in advance. The analysis window specification includes the type of the analysis window and the size of the window. The type of the analysis window specifies whether the analysis is performed with a time-based window or a tuple-based window. The window size is specified as a length of time in the case of a time-based window and as a number of pieces of data in the case of a tuple-based window.

The analysis window generation means 403 generates an analysis window according to the defined specification. For example, assume that analysis is performed using a time-based window, and that a length of time is defined as the window size. In this case, when generating an analysis window, the analysis window generating unit 403 stores the generation date and time of the analysis window, and calculates the timing for generating the next analysis window by adding the window size to that date and time. When the notification of the pointer is received from the filtering unit 407 as new data is added, the analysis window generation unit 403 accesses the date and time field of the data in the memory area indicated by the notified pointer, and determines whether or not a date and time later than the generation timing of the next analysis window is stored there. When such a date and time is stored, the analysis window generation means 403 defines one analysis window by assigning a new window ID to each piece of data stored in the transmission data buffer 402, and issues a transmission command for that set of data (the analysis window) to the stream data transmission unit 404.
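The time-based procedure described above may be sketched as follows. Several details are assumptions of this sketch rather than statements of the source: the initial generation date and time is supplied by the caller, the piece of data that triggers the window is included in that window (a literal reading of "each data stored in the transmission data buffer"), and the date and time of the triggering data is taken as the generation date and time of the window. All names are hypothetical.

```python
from datetime import datetime, timedelta
from itertools import count


class TimeBasedWindowGenerator:
    def __init__(self, window_size, start):
        self.window_size = window_size
        self.next_deadline = start + window_size  # next generation timing
        self.buffer = []                          # (timestamp, record) pairs
        self._ids = count(1)                      # source of new window IDs

    def on_stored(self, timestamp, record):
        """Handle a 'data stored' notification.

        Returns (window_id, records) when a window is defined, else None.
        """
        self.buffer.append((timestamp, record))
        if timestamp <= self.next_deadline:
            return None
        window_id = next(self._ids)
        window = [rec for _, rec in self.buffer]
        self.buffer.clear()                       # data is handed over for sending
        self.next_deadline = timestamp + self.window_size
        return window_id, window


gen = TimeBasedWindowGenerator(timedelta(minutes=5), datetime(2008, 7, 20, 12, 0, 0))
out_1 = gen.on_stored(datetime(2008, 7, 20, 12, 1, 0), "a")
out_2 = gen.on_stored(datetime(2008, 7, 20, 12, 4, 0), "b")
out_3 = gen.on_stored(datetime(2008, 7, 20, 12, 6, 0), "c")
```

With a five-minute window starting at 12:00:00, the first two notifications do not define a window, and the notification carrying 12:06:00 defines window 1 from all three buffered pieces of data.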

As another example, assume that analysis is performed using a tuple base window and that the number of data is defined as the window size. Each time a pointer notification is received as new data is added, the analysis window generation unit 403 counts the number of notifications received. This count equals the number of data added to the transmission data buffer 402. Upon receiving the number of notifications determined by the window size, the analysis window generation means 403 assigns a new window ID to each piece of data stored in the transmission data buffer, thereby defining one analysis window, and issues a command to the stream data transmission means 404 to transmit that data set (the analysis window). At this time, the notification count is reset to zero.

In both the time base window and the tuple base window, a set of pointers to memory areas for storing data belonging to the newly defined analysis window is issued as a data set transmission command.
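The two window types described above can be sketched as follows. This is only an illustration under stated assumptions: the names `Record` and `AnalysisWindowGenerator` are hypothetical, the list `pending` stands in for the pointers to data currently held in the transmission data buffer, and the date and time of the triggering data is assumed to serve as the generation date and time of the new window.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional, Union


@dataclass
class Record:
    timestamp: datetime              # date/time field of the data
    payload: dict
    window_id: Optional[int] = None  # assigned when a window is defined


@dataclass
class AnalysisWindowGenerator:
    """Hypothetical stand-in for the analysis window generation means 403."""
    window_type: str                       # "time" or "tuple"
    window_size: Union[timedelta, int]     # length of time or number of data
    started_at: Optional[datetime] = None  # generation time of the last window
    pending: List[Record] = field(default_factory=list)
    next_window_id: int = 0

    def notify(self, record: Record) -> Optional[List[Record]]:
        """Called once per stored record; returns a defined window or None."""
        self.pending.append(record)
        if self.window_type == "tuple":
            # Tuple base window: count notifications up to the window size.
            due = len(self.pending) >= self.window_size
        else:
            # Time base window: compare the date/time field with the next
            # generation timing (generation time + window size).
            if self.started_at is None:
                self.started_at = record.timestamp
            due = record.timestamp > self.started_at + self.window_size
        if not due:
            return None
        window, self.pending = self.pending, []   # also resets the count
        for r in window:
            r.window_id = self.next_window_id     # common window ID
        self.next_window_id += 1
        if self.window_type == "time":
            self.started_at = record.timestamp
        return window
```

The returned list plays the role of the set of pointers issued as the data set transmission command.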

When the stream data transmission unit 404 receives a transmission command for a set of data (that is, pointers to the memory areas storing the data to be transmitted) from the analysis window generation unit 403, the stream data transmission unit 404 transmits the data stored in the memory area indicated by each pointer to the time series data analysis means 5. After transmitting the data, the stream data transmission unit 404 deletes the data from the transmission data buffer 402.

The time series data analysis means 5 analyzes the data received from the data stream generation means 4. The time series data analysis means 5 includes storage means (not shown) for storing the data received from the data stream generation means 4, and stores the received data in the storage means. Then, the time-series data analysis unit 5 reads the data to which the same window ID is assigned and analyzes the data. The read data is deleted from the storage means. When analyzing probe car data, the time-series data analysis means 5, for example, matches the probe car data against a road map and generates traffic jam information indicating at which positions congestion is occurring, based on the average speed of the probe cars. This process is performed at regular intervals (for example, every 5 minutes). In this case, the analysis may be specified to be performed with a time base window. The processing performed by the time-series data analysis unit 5 may be determined according to the data generated by the data generation source 1 and the purpose of the analysis, and is not limited to a specific analysis process.
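The grouping by window ID described above might be sketched as follows. The class name `TimeSeriesAnalyzer`, the dictionary keys, and the use of a plain average speed as the analysis are illustrative assumptions; road map matching is omitted.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Optional


class TimeSeriesAnalyzer:
    """Hypothetical stand-in for the time series data analysis means 5."""

    def __init__(self) -> None:
        # Storage means: received data grouped by window ID.
        self.storage: Dict[int, List[dict]] = defaultdict(list)

    def receive(self, record: dict) -> None:
        self.storage[record["window_id"]].append(record)

    def analyze(self, window_id: int) -> Optional[float]:
        """Read the data of one window, delete it, and return the mean speed."""
        records = self.storage.pop(window_id, [])
        if not records:
            return None
        return mean(r["speed"] for r in records)
```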

FIG. 10 is a block diagram illustrating a configuration example of the filtering unit 407. The filtering unit 407 includes a data selection unit 40701 and an identity determination unit 40702.

The identity determination unit 40702 determines whether or not the contents are the same between the data input from the stream data generation unit 401 and the data already stored in the transmission data buffer 402. Each piece of data input from the stream data generation unit 401 is subject to the filtering determination, and is hereinafter referred to as filtering determination target data.

In this example, for data to be regarded as having the same content, it is essential that the time-series data generation source 1 is the same. For example, in the case of the probe car data illustrated in FIG. 4, it is essential that the vehicle IDs are the same. Data with different vehicle IDs do not have the same content even if the latitude, longitude, and speed match. In addition, even among data from the same time-series data generation source 1 that should be regarded as the same, the date and time differ because the data are generated as time passes. Therefore, when determining whether or not the contents are the same, whether or not the dates and times match may be ignored. In this way, the data may include items, such as date and time, whose agreement can be ignored.

Also, items that contain measurement errors (for example, the latitude, longitude, and speed illustrated in FIG. 4) do not need to match exactly. In this case, the identity determination unit 40702 may calculate the difference between the value included in the data stored in the transmission data buffer 402 and the value included in the filtering determination target data, and determine whether or not the difference is within a predetermined range. For example, regarding the speed, the difference between the speed in the data stored in the transmission data buffer 402 and the speed in the filtering determination target data is calculated, and if the difference is within the range of −5 to +5, the speeds are determined to be the same. The unit of −5 and +5 in this example is km/h. Likewise, for latitude and longitude, it may be determined whether or not the difference in value between the data is within a predetermined range, and if so, the contents may be determined to be the same.

As described above, the identity determination unit 40702 may determine that the filtering determination target data and data stored in the transmission data buffer 402 have the same content when the IDs of the time-series data generation source 1 (for example, the vehicle IDs) match and the contents of the other items (for example, latitude, longitude, and speed) are determined to be the same. Conversely, when the IDs of the time-series data generation source 1 do not match, or when any of the other items (for example, latitude, longitude, or speed) is determined not to have the same content, the identity determination unit 40702 may determine that the data do not have the same content.
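The identity determination summarized above can be sketched as a single predicate. The function name, the dictionary keys, and the latitude and longitude tolerances are hypothetical; only the ±5 km/h speed range is taken from the description.

```python
from typing import Mapping

# Hypothetical tolerances; only the +/-5 km/h speed range comes from the text.
TOLERANCES = {"latitude": 0.001, "longitude": 0.001, "speed": 5.0}


def is_same_content(candidate: Mapping, stored: Mapping,
                    tolerances: Mapping[str, float] = TOLERANCES) -> bool:
    """Return True if both records are regarded as having the same content.

    The source ID must match exactly, the date and time are ignored, and
    each error-prone item must differ by no more than its tolerance.
    """
    if candidate["vehicle_id"] != stored["vehicle_id"]:
        return False
    return all(abs(candidate[item] - stored[item]) <= tol
               for item, tol in tolerances.items())
```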

The data selection means 40701 checks for each filtering determination target data whether or not the content of the filtering determination target data is determined not to be the same as any data in the transmission data buffer 402. Then, the data selection means 40701 stores the filtering determination target data in the transmission data buffer 402 or discards it according to the confirmation result.

When it is determined that the content of the filtering determination target data is not the same as that of any data in the transmission data buffer 402, the filtering determination target data satisfies the first condition. In this case, the data selection unit 40701 stores the filtering determination target data in the transmission data buffer 402. When the filtering determination target data is stored in the transmission data buffer 402, the data selection unit 40701 notifies the analysis window generation unit 403 of the pointer to the memory area.

On the other hand, when it is determined that the content of the filtering determination target data is the same as that of some data in the transmission data buffer 402, the filtering determination target data does not satisfy the first condition. In this case, the data selection unit 40701 discards the filtering determination target data.
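The store-or-discard decision for the first condition can be sketched as follows. The function name `select_data` is hypothetical, the identity determination is passed in as the callable `is_same`, and the returned list index merely stands in for the pointer notified to the analysis window generation unit 403.

```python
from typing import Callable, List, Optional


def select_data(candidate: dict, buffer: List[dict],
                is_same: Callable[[dict, dict], bool]) -> Optional[int]:
    """Store candidate if it differs from all buffered data (first condition).

    Returns the index of the stored entry, or None if the data is discarded.
    """
    if any(is_same(candidate, stored) for stored in buffer):
        return None              # same content already buffered: discard
    buffer.append(candidate)     # first condition satisfied: store
    return len(buffer) - 1
```

Note that this comparison scans the whole buffer for every input, so its cost grows with the amount of data awaiting transmission; the description does not specify how the comparison is organized.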

In this embodiment, the data receiving means 3 and, within the data stream generating means 4, the stream data generating means 401, the filtering means 407 (data selecting means 40701 and identity determining means 40702), the analysis window generating means 403, and the stream data transmitting means 404 are realized by, for example, a CPU of a computer that operates according to an analysis preprocessing program. In this case, the analysis preprocessing system includes program storage means (not shown) for storing the analysis preprocessing program, and the CPU reads the program and, in accordance with the program, operates as the data receiving means 3 and as the stream data generating means 401, the filtering means 407, the analysis window generating means 403, and the stream data transmitting means 404 of the data stream generating means 4. Alternatively, each of these means may be realized by a separate dedicated circuit.

Further, the time-series data generation source 1, the data transmission means 2, and the time-series data analysis means 5 are also realized by a CPU that operates according to a program, for example.

Next, the operation will be described.
FIG. 11 is a flowchart illustrating an example of processing progress according to the first embodiment of this invention. A process in which each time-series data generation source 1 generates data and the data transmission means 2 transmits the data to the analysis preprocessing system is referred to as a time-series data generation and transmission step (step S1). A process in which the analysis preprocessing system (for example, a server PC) receives the data, performs the filtering process on it, stores it in the transmission data buffer 402, and generates an analysis window is referred to as a data stream generation step (step S2). A process in which the time series data analyzing means 5 analyzes the data is referred to as a time-series data reception analysis step (step S3). Steps S1, S2, and S3 are independent processes and are executed in parallel. That is, steps S1, S2, and S3 are executed asynchronously.

In the time-series data generation and transmission step (step S1), each time-series data generation source 1 continuously generates data as time passes (step S101). Each time-series data generation source 1 may include the generation time in the data it generates. Each time-series data generation source 1 sends data to the data transmission unit 2, and the data transmission unit 2 stores the data in a buffer (not shown) in order to transmit the data collectively (step S102). This buffer is used for buffering data on the data transmission means 2 side. Further, the data transmission means 2 determines whether or not it is time to transmit the data accumulated in the buffer (step S103). For example, it may determine that the data is to be transmitted if a predetermined number of data has accumulated, and that it is not to be transmitted otherwise. Alternatively, it may determine that the data is to be transmitted if a certain period has elapsed since the previous transmission, and that it is not to be transmitted otherwise. If it is determined that it is time to transmit data (Yes in step S103), the data transmission unit 2 combines the data and transmits it to the analysis preprocessing system 7 (step S104), and deletes the transmitted data from the buffer (step S105). If it is not time to transmit data, steps S101 and S102 are repeated.
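The count-based variant of steps S102 to S105 can be sketched as follows. The class name `DataTransmitter` and the `send` callback are hypothetical; the time-based transmission criterion mentioned above is not shown.

```python
from typing import Callable, List


class DataTransmitter:
    """Hypothetical stand-in for the data transmission means 2."""

    def __init__(self, send: Callable[[List[dict]], None], batch_size: int):
        self.send = send
        self.batch_size = batch_size
        self.buffer: List[dict] = []

    def accept(self, data: dict) -> bool:
        """Buffer one piece of data (S102) and transmit when the batch is full."""
        self.buffer.append(data)
        if len(self.buffer) < self.batch_size:   # S103: not yet time to send
            return False
        self.send(list(self.buffer))             # S104: send the combined data
        self.buffer.clear()                      # S105: delete the sent data
        return True
```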

In addition, when the time series data generation source 1 and the data transmission means 2 are realized in the same device, the time series data generation source 1 may execute the processes of steps S101, S102, S103, and S105.

In the data stream generation step (step S2), the data reception means 3 receives the data transmitted by the data transmission means 2 (step S201). The data receiving means 3 also includes a buffer (not shown), and temporarily stores the received data in the buffer. Then, the data in the buffer is input to the data stream generation means 4 asynchronously with the data reception timing. For this reason, step S2 can be performed asynchronously with step S1.

The stream data generating unit 401 converts the format of the data input from the data receiving unit 3 and cuts out each piece of data from the combined data (step S202). The stream data generation unit 401 inputs the individual pieces of data that have been cut out to the filtering unit 407. The filtering unit 407 performs a filtering process on the input data (step S203). That is, the filtering unit 407 determines whether or not the input data satisfies a predetermined condition, stores data that satisfies the predetermined condition in the transmission data buffer 402, and discards data that does not satisfy the predetermined condition. The filtering unit 407 notifies the analysis window generation unit 403 of a pointer to the memory area in which the data is stored.

When the analysis window generation unit 403 is notified of the pointer, it determines whether or not the condition for generating an analysis window is satisfied (step S204). For example, when the analysis is specified to be performed with a tuple base window, it determines whether or not the number of notifications determined by the window size has been received. Alternatively, when the analysis is specified to be performed with a time base window, it determines whether or not the period determined by the window size has elapsed since the last analysis window was generated. If the condition for generating an analysis window is satisfied (Yes in step S204), a common window ID is assigned to each piece of data to be included in the analysis window, and an analysis window transmission command is issued (step S205). In response to the transmission command, the stream data transmission unit 404 transmits the data group to which the common window ID is assigned (that is, the analysis window) to the time-series data analysis unit 5 (step S206). Then, the stream data transmission unit 404 deletes the data transmitted in step S206 from the transmission data buffer 402 (step S207).

The process of cutting out each piece of data and using it as an analysis window corresponds to the pre-processing of analysis.

In the time-series data reception analysis step (step S3), the time-series data analysis unit 5 receives the data (analysis window) transmitted by the stream data transmission unit 404 (step S301). The time-series data analysis unit 5 includes an analysis buffer (not shown), and temporarily stores the data transmitted by the stream data transmission unit 404 in the analysis buffer. Then, the time-series data analysis means 5 analyzes the data stored in the analysis buffer asynchronously with the data reception timing (step S302). For this reason, step S2 and step S3 can also be performed asynchronously. Specifically, data analysis can be performed asynchronously with the operation in which the stream data transmission unit 404 transmits the analysis window. The time-series data analyzing unit 5 deletes the data that has been analyzed in step S302 from the buffer of the time-series data analyzing unit 5 (step S303).
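The decoupling of steps S1, S2, and S3 through intermediate buffers can be illustrated with threads and queues. This is a sketch only: the filtering and analysis functions are placeholders, and the `STOP` sentinel and the function names are assumptions made so that the example terminates.

```python
import queue
import threading

STOP = object()  # sentinel marking the end of the stream (illustrative only)


def stage(inbox: queue.Queue, outbox: queue.Queue, work) -> None:
    """Consume items from inbox, process them, and pass results downstream."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)
            return
        result = work(item)
        if result is not None:
            outbox.put(result)


def run_pipeline(batches):
    received, windows, results = queue.Queue(), queue.Queue(), queue.Queue()
    # Step S2 placeholder: keep non-negative values, drop empty windows.
    s2 = threading.Thread(target=stage, args=(
        received, windows, lambda b: [x for x in b if x >= 0] or None))
    # Step S3 placeholder: the analysis is a simple average.
    s3 = threading.Thread(target=stage, args=(
        windows, results, lambda w: sum(w) / len(w)))
    s2.start()
    s3.start()
    for batch in batches:        # Step S1: the sender pushes at its own pace
        received.put(batch)
    received.put(STOP)
    s2.join()
    s3.join()
    out = []
    while True:
        item = results.get()
        if item is STOP:
            return out
        out.append(item)
```

Each queue plays the role of one of the buffers (not shown) that let a stage proceed without waiting for its neighbours.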

FIG. 12 is a flowchart showing an example of processing progress of the filtering process (step S203). When the stream data generation unit 401 cuts out each piece of data (see step S202, FIG. 11), the stream data generation unit 401 inputs the data to the filtering unit 407. Each piece of data is filtering determination target data.

When the filtering determination target data is input, the identity determination unit 40702 determines, for each piece of filtering determination target data, whether or not its content is the same as that of each piece of data stored in the transmission data buffer 402 (step S701).

The data selection unit 40701 stores, in the transmission data buffer 402, the filtering determination target data determined not to have the same content as any data in the transmission data buffer 402, and notifies the analysis window generation unit 403 of the pointer to the memory area in which the data is stored (step S702). On the other hand, the filtering determination target data determined to have the same content as some data in the transmission data buffer 402 is discarded (step S702). Then, the operations from step S204 onward shown in FIG. 11 are performed.

According to this embodiment, when the data receiving means 3 receives the data generated by each time-series data generation source 1, the data is stored in memory (the transmission data buffer 402), not in a database or a file. Accessing a database through SQL or accessing a file takes processing time. In the present invention, however, the data is stored in memory, so it can be sent to the time-series data analysis means 5 quickly.

In particular, in the present embodiment, not all data received by the data receiving unit 3 is stored in the transmission data buffer 402; only data selected by the filtering process is stored there. Therefore, even if there are a large number of time-series data generation sources 1 and a large amount of data is received, it is possible to prevent the data from overflowing in the analysis preprocessing system and to send the preprocessed data to the time-series data analysis means 5.

Also, the filtering unit 407 discards redundant data that is not used in the analysis. Accordingly, it is possible to prevent redundant data from being stored in the transmission data buffer 402, and the transmission data buffer 402 can be used effectively.

Further, instead of causing each data transmission means 2 or time-series data generation source 1 to perform the filtering process, the filtering means 407 included in the analysis preprocessing system executes the filtering process asynchronously with the data transmission means 2 and the time-series data generation sources 1. Therefore, it is not necessary to perform control for causing each data transmission means 2 or time-series data generation source 1 to perform the filtering process individually.

Embodiment 2.
As in the first embodiment, the analysis preprocessing system of the second embodiment of the present invention includes a data receiving means 3 and a data stream generating means 4 (see FIG. 1). When data generated by a time-series data generation source 1 is received from the data transmission means 2, the system preprocesses the data and sends it to the time series data analysis means 5.

Also in the second embodiment, as in the first embodiment, the data stream generation unit 4 includes a stream data generation unit 401, a filtering unit 407, a transmission data buffer 402, an analysis window generation unit 403, and a stream data transmission means 404 (see FIG. 2). However, the operation of the filtering unit 407 is different from that of the first embodiment. The other means are the same as those in the first embodiment.

The first embodiment described a case where the condition (first condition) that "the content of the data is different from any data already stored in the transmission data buffer 402" is used as the predetermined condition in the filtering process. In the second embodiment, a different condition is used as the predetermined condition.

In the second embodiment, the condition that "the content of the data satisfies a predetermined criterion" is used as the predetermined condition in the filtering process. This condition is referred to as a second condition. For example, the contents of the data may include errors. Even data that includes an error can be used effectively for analysis if it satisfies the criterion. Accordingly, a criterion for discriminating valid data that can be used for analysis is determined in advance, and the filtering unit 407 determines whether or not the content of the filtering determination target data satisfies this criterion and discards data that does not satisfy it.

Referring to an example of data generated by a sensor (time-series data generation source 1) provided in each probe car, the data often includes position, speed, direction, and the like. However, these values include errors. In particular, the position (for example, latitude and longitude) is generally acquired by GPS (Global Positioning System), and if it is affected by a building or the like, the position calculation may include a large error. Since the data including such a large error cannot be used for analysis, the filtering unit 407 eliminates it.

FIG. 13 is a block diagram illustrating a configuration example of the filtering unit 407 according to the second embodiment. The filtering means 407 in the second embodiment includes valid data definition means 40713, validity determination means 40712, and data selection means 40711.

The valid data definition unit 40713 is a storage device that stores criteria for data contents that can be used effectively. FIG. 14 is an explanatory diagram illustrating an example of the criteria stored in the valid data definition unit 40713. The criteria illustrated in FIG. 14 correspond to the data illustrated in FIG. 4 and indicate the criteria that should be satisfied by the date and time, vehicle ID, latitude, longitude, and speed. "Minimum" and "maximum" shown in FIG. 14 define the range of values of these items. If the value of an item included in the data falls within the range from "minimum" to "maximum", the value of the item is valid. For example, in the example shown in FIG. 14, the date and time are valid if they fall within the range from "one day before the current time" to "the current time". Similarly, the vehicle ID is valid if it falls within the range of "CID0001" to "CID9999". Thus, when the value of an item is a combination of a character string and a numerical value, the range of the numerical part may be defined. The latitude is valid if it falls within the range of 34.000 to 36.000. The longitude is valid if it falls within the range of 134.000 to 136.000. The speed is valid if it falls within the range of 0 to 120. In this example, both "minimum" and "maximum" are defined, but only one of them may be defined.

The “difference” shown in FIG. 14 is a criterion that defines the relationship with the immediately preceding data (the immediately preceding data from the same time-series data generation source). For example, in the example shown in FIG. 14, the date and time are valid if the difference in date and time from the immediately preceding data with the same vehicle ID is within one hour. For the vehicle ID, “difference” is not defined. The latitude is valid if the difference in latitude from the immediately preceding data with the same vehicle ID is 0.01 or less. The longitude is valid if the difference in longitude from the immediately preceding data with the same vehicle ID is 0.01 or less. The speed is valid if the difference in speed from the immediately preceding data with the same vehicle ID is 120 or less.

The standards defined by “Minimum” and “Maximum” are absolute standards that should be satisfied by the items included in the data. “Difference” is a relative standard that items included in data should satisfy in relation to other data. In the example shown in FIG. 14, an absolute reference (minimum, maximum) and a relative reference (difference) are set, but only one of them may be set.
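The absolute and relative criteria can be represented, as a sketch, by a small record per item. The names `Criterion`, `meets_absolute`, and `meets_relative` are hypothetical; the numeric values are those described for FIG. 14, and the date/time and vehicle ID rows are omitted for brevity. Treating a missing criterion or missing preceding data as "satisfied" is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Criterion:
    minimum: Optional[float] = None     # absolute lower bound, if defined
    maximum: Optional[float] = None     # absolute upper bound, if defined
    difference: Optional[float] = None  # allowed change from preceding data


# Numeric rows of FIG. 14 as described in the text.
CRITERIA = {
    "latitude": Criterion(34.000, 36.000, 0.01),
    "longitude": Criterion(134.000, 136.000, 0.01),
    "speed": Criterion(0, 120, 120),
}


def meets_absolute(value: float, c: Criterion) -> bool:
    """Check the value against whichever of minimum/maximum is defined."""
    if c.minimum is not None and value < c.minimum:
        return False
    if c.maximum is not None and value > c.maximum:
        return False
    return True


def meets_relative(value: float, previous: Optional[float],
                   c: Criterion) -> bool:
    """Check the change from the preceding data of the same source."""
    if c.difference is None or previous is None:
        return True  # assumption: nothing to compare against counts as valid
    return abs(value - previous) <= c.difference
```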

When the filtering determination target data is input from the stream data generation unit 401, the validity determination unit 40712 determines, for each item in the filtering determination target data, whether or not the item satisfies the corresponding criterion stored in the valid data definition unit 40713. For example, assume that the criteria illustrated in FIG. 14 are stored. The validity determination unit 40712 determines whether the date and time, vehicle ID, latitude, longitude, and speed in the filtering determination target data each fall within the range from the minimum value to the maximum value. Further, for each of the date and time, latitude, longitude, and speed, it calculates the difference from the value in the immediately preceding filtering determination target data and determines whether or not the result satisfies the criterion defined as "difference".

In order to evaluate the relative criteria, after determining the validity of a given piece of filtering determination target data, the validity determination means 40712 retains that data until the next filtering determination target data generated by the same time-series data generation source is input. Alternatively, the relative criteria may be evaluated with reference to the immediately preceding data stored in the transmission data buffer 402.

The data selection unit 40711 confirms the determination result by the validity determination unit 40712 for each filtering determination target data. Then, the data selection unit 40711 stores the filtering determination target data in the transmission data buffer 402 or discards it according to the confirmation result.

When it is determined that every item of the filtering determination target data satisfies the criteria defined in the valid data definition unit 40713, the filtering determination target data satisfies the second condition. In this case, the data selection unit 40711 stores the filtering determination target data in the transmission data buffer 402. When the filtering determination target data is stored in the transmission data buffer 402, the data selection unit 40711 notifies the analysis window generation unit 403 of the pointer to the memory area.

On the other hand, when it is determined that any item of the filtering determination target data does not satisfy the criteria defined in the valid data definition unit 40713, the filtering determination target data does not satisfy the second condition. In this case, the data selection unit 40711 discards the filtering determination target data. For example, if it is determined that any item does not satisfy the absolute criterion or the relative criterion, the data selection unit 40711 discards the filtering determination target data.

The data selection unit 40711 and the validity determination unit 40712 of the filtering unit 407 of the second embodiment are realized by, for example, a CPU of a computer that operates according to an analysis preprocessing program. In this case, the CPU may operate as the data selection unit 40711, the validity determination unit 40712, and the other units according to the analysis preprocessing program. Alternatively, the data selection means 40711 and the validity determination means 40712 may be realized by separate dedicated circuits.

The processing progress of the second embodiment is the same as that of the first embodiment (see FIG. 11). However, the filtering process (step S203) is different. FIG. 15 is a flowchart illustrating an example of processing progress of the filtering process in the second embodiment. When the filtering determination target data is input from the stream data generation unit 401, the validity determination unit 40712 determines whether each item in the filtering determination target data satisfies the absolute criterion (step S711). For example, when the criteria illustrated in FIG. 14 are defined, it determines whether the date and time, vehicle ID, latitude, longitude, and speed each fall within the range from the minimum value to the maximum value. When it is determined that all items satisfy the absolute criterion (Yes in step S712), the validity determination unit 40712 determines whether each item in the filtering determination target data satisfies the relative criterion (step S713). For example, for the date and time, latitude, longitude, and speed, it calculates the difference from the immediately preceding filtering determination target data having the same vehicle ID, and determines whether or not the difference satisfies the predetermined criterion ("difference" illustrated in FIG. 14).

The data selection means 40711 confirms the determination result regarding the absolute criterion and the determination result regarding the relative criterion. When any item is determined not to satisfy the criterion in the determination regarding the absolute criterion (step S711) or the determination regarding the relative criterion (step S713) (No in step S712 or No in step S714), the data selection means 40711 discards the filtering determination target data (step S716). When every item is determined to satisfy the criteria in both the determination regarding the absolute criterion (step S711) and the determination regarding the relative criterion (step S713) (Yes in step S714), the data selection unit 40711 stores the filtering determination target data in the transmission data buffer 402 and notifies the analysis window generation unit 403 of the pointer to the memory area in which the filtering determination target data is stored (step S715). As a result, data satisfying the predetermined condition (the second condition in the present embodiment) is selected.
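The order of determinations in steps S711 to S716 can be sketched as follows. The class name `ValidityFilter` is hypothetical, and the absolute and relative checks are supplied as callables so that only the control flow is shown. It is assumed here that the preceding data of each source is retained regardless of whether it was judged valid, and that data with no preceding data passes the relative check.

```python
from typing import Callable, Dict, List, Optional


class ValidityFilter:
    """Hypothetical sketch of the filtering flow of the second embodiment."""

    def __init__(self,
                 absolute_ok: Callable[[dict], bool],
                 relative_ok: Callable[[dict, dict], bool]) -> None:
        self.absolute_ok = absolute_ok
        self.relative_ok = relative_ok
        self.buffer: List[dict] = []         # transmission data buffer 402
        self.previous: Dict[str, dict] = {}  # preceding data per vehicle ID

    def process(self, data: dict) -> Optional[int]:
        key = data["vehicle_id"]
        prev = self.previous.get(key)
        self.previous[key] = data            # retain for the next comparison
        if not self.absolute_ok(data):       # S711, No in S712
            return None                      # S716: discard
        if prev is not None and not self.relative_ok(data, prev):
            return None                      # S713, No in S714 -> S716
        self.buffer.append(data)             # S715: store
        return len(self.buffer) - 1          # index stands in for the pointer
```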

The operations after the filtering process (step S203, see FIG. 11) are the same as those in the first embodiment.

Also in the second embodiment, the same effect as in the first embodiment can be obtained.

Embodiment 3.
Next, as a third embodiment, an embodiment in which a condition “not a copy of any data already input from the stream data generation unit 401” is used in the filtering process will be described. This condition is referred to as a third condition.

As in the above-described embodiments, the analysis preprocessing system according to the third embodiment of the present invention includes a data receiving means 3 and a data stream generating means 4 (see FIG. 1). When data generated by a time-series data generation source 1 is received from the data transmission means 2, the system preprocesses the data and sends it to the time series data analysis means 5.

Also in the third embodiment, as in the above-described embodiments, the data stream generation unit 4 includes a stream data generation unit 401, a filtering unit 407, a transmission data buffer 402, an analysis window generation unit 403, and a stream data transmission means 404 (see FIG. 2). However, the operation of the filtering unit 407 is different from that of the first embodiment and the second embodiment. The other means are the same as those in the first embodiment.

Between the time the time-series data generation source 1 generates data and the time the data reception means 3 receives it, the data may be duplicated, so that the data reception means 3 receives multiple copies of the same data. For example, this occurs when a plurality of data transmission means 2 receive the same data from the same time-series data generation source 1 and each of them transmits the data to the analysis preprocessing system. FIG. 16 is an explanatory diagram showing a specific example of this situation. The time-series data generation source 1 is a sensor provided in a probe car, and the data transmission means 2a and 2b are base stations that relay data between the time-series data generation source 1 and the data reception means 3. A base station is provided for each area, but the base stations are arranged so that their areas partially overlap. When a probe car is located where the areas of the base stations overlap and transmits data wirelessly from that position, the base stations 2a and 2b corresponding to those areas both receive the same data. Since both base stations 2a and 2b transmit the received data to the analysis preprocessing system, the data receiving means 3 receives multiple copies of the same data. Data duplicated in this way is unnecessary for the analysis by the time series data analysis means 5 and is therefore excluded by the filtering means 407.

FIG. 17 is a block diagram illustrating a configuration example of the filtering unit 407 according to the third embodiment. The filtering unit 407 according to the third embodiment includes a processed data storage unit 40723, a validity determination unit 40722, and a data selection unit 40721.

The processed data storage unit 40723 is a storage device that stores data identification information for identifying each data input from the stream data generation unit 401. FIG. 18 shows an example of data identification information stored in the processed data storage unit 40723. When there are two or more data having the same data generation source and the same generation time, the second and subsequent data are duplicates. Therefore, as shown in FIG. 18, the combination of the date and time and the ID of the time series data generation source (for example, vehicle ID) may be used as the data identification information. The first record in FIG. 18 means that the data generated by the probe car “CID0001” on the date “2008/7/20 12:00:00” has already been received.

When the filtering determination target data is input from the stream data generation unit 401, the validity determination unit 40722 refers to the data identification information stored in the processed data storage unit 40723 and determines whether or not the filtering determination target data is data that has not yet been input. If the filtering determination target data has not yet been input, the validity determination unit 40722 stores the data identification information of the filtering determination target data (for example, the combination of the date and time and the vehicle ID) in the processed data storage unit 40723.

The data selection unit 40721 confirms the determination result by the validity determination unit 40722 for each filtering determination target data. Then, the data selection unit 40721 stores or discards the filtering determination target data in the transmission data buffer 402 according to the confirmation result.

When it is determined that the filtering determination target data has not yet been input, the data has been input for the first time, and the third condition is satisfied. In this case, the data selection unit 40721 stores the filtering determination target data in the transmission data buffer 402. Then, having stored the filtering determination target data in the transmission data buffer 402, the data selection unit 40721 notifies the analysis window generation unit 403 of a pointer to the memory area in which the data is stored.

On the other hand, when it is determined that the filtering determination target data is already input data, the third condition is not satisfied. In this case, the data selection unit 40721 discards the filtering determination target data.

The data selection unit 40721 and the validity determination unit 40722 of the filtering unit 407 of the third embodiment are realized by, for example, a CPU of a computer that operates according to an analysis preprocessing program. In this case, the CPU may operate as the data selection unit 40721, the validity determination unit 40722, and the other units according to the analysis preprocessing program. Alternatively, the data selection unit 40721 and the validity determination unit 40722 may be realized by separate dedicated circuits.

The process progress of the third embodiment is the same as that of the first embodiment and the second embodiment (see FIG. 11). However, the process in the filtering process (step S203) is different. FIG. 19 is a flowchart illustrating an example of processing progress of filtering processing according to the third embodiment.

When the filtering determination target data is input from the stream data generation unit 401, the validity determination unit 40722 determines whether the filtering determination target data is data that has not yet been input (step S721). Specifically, it determines whether or not the data identification information (for example, the combination of the date and time and the vehicle ID) of the input filtering determination target data is already stored in the processed data storage unit 40723. If that data identification information is not stored (No in step S722), the filtering determination target data has not been input yet (that is, it has been input for the first time). On the other hand, if the data identification information is stored (Yes in step S722), the filtering determination target data has already been input.

If the filtering determination target data is data input for the first time (No in step S722), the validity determination unit 40722 additionally stores the data identification information of the filtering determination target data in the processed data storage unit 40723 (step S723).

The data selection unit 40721 confirms the determination result of the validity determination unit 40722. If the input filtering determination target data has already been input (Yes in step S722), the data selection unit 40721 discards the filtering determination target data (step S725). If the input filtering determination target data is data input for the first time (No in step S722), the data selection unit 40721 stores the filtering determination target data in the transmission data buffer 402 and notifies the analysis window generation means 403 of a pointer to the memory area in which the data is stored (step S724). As a result, data satisfying a predetermined condition (the third condition in the present embodiment) is selected.
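The filtering of steps S721 to S725 can be illustrated with a short sketch. This is only one possible reading, not the implementation of the embodiment: the class name `DuplicateFilter`, the dictionary record layout, and the `speed` field are assumptions made for illustration, and a Python set and list stand in for the processed data storage unit 40723 and the transmission data buffer 402.

```python
from datetime import datetime


class DuplicateFilter:
    """Hypothetical sketch of the third-embodiment filtering (steps S721 to S725)."""

    def __init__(self):
        self.processed = set()         # stands in for processed data storage unit 40723
        self.transmission_buffer = []  # stands in for transmission data buffer 402

    def filter(self, record):
        # Data identification information: generation date and time plus source ID.
        key = (record["timestamp"], record["vehicle_id"])
        if key in self.processed:                # Yes in step S722
            return False                         # step S725: discard the duplicate
        self.processed.add(key)                  # step S723: remember the identification
        self.transmission_buffer.append(record)  # step S724: store for analysis
        return True


f = DuplicateFilter()
rec = {"timestamp": datetime(2008, 7, 20, 12, 0, 0), "vehicle_id": "CID0001", "speed": 42}
first = f.filter(rec)         # e.g. the copy relayed by base station 2a
second = f.filter(dict(rec))  # the same data relayed by base station 2b
```

In this sketch the first copy is stored and the second is discarded, so the buffer holds the record only once. Notifying the analysis window generation means 403 is omitted.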

The operations after the filtering process (step S203, see FIG. 11) are the same as those in the first embodiment and the second embodiment.

Also in the third embodiment, the same effect as in the first embodiment can be obtained.

Further, the filtering unit 407 may be configured to combine two or more of the first to third conditions described above, store only data satisfying all of the combined conditions in the transmission data buffer 402, and discard other data. For example, only the data satisfying both the first and second conditions may be stored in the transmission data buffer 402, and other data may be discarded. The method of combining conditions is not particularly limited.
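One simple way to combine conditions is to treat each as a predicate and accept data only when every predicate holds. The following is a minimal sketch under that assumption. The helper `combine` and the two example predicates are hypothetical stand-ins and do not reproduce the first to third conditions of the embodiments.

```python
def combine(*conditions):
    """Return a predicate that holds only when every given condition holds."""
    def combined(record):
        return all(cond(record) for cond in conditions)
    return combined


# Hypothetical stand-ins for two filtering conditions.
def has_valid_speed(record):
    return 0 <= record["speed"] <= 200


def has_vehicle_id(record):
    return bool(record.get("vehicle_id"))


accept = combine(has_valid_speed, has_vehicle_id)
transmission_buffer = []
incoming = [
    {"vehicle_id": "CID0001", "speed": 40},
    {"vehicle_id": "", "speed": 40},
    {"vehicle_id": "CID0002", "speed": 999},
]
for record in incoming:
    if accept(record):
        transmission_buffer.append(record)  # store data satisfying all conditions
    # data failing any condition is discarded
```

The predicates here are stateless. A condition that records what it has seen, such as the third condition, makes the evaluation order matter, because `all` stops at the first condition that fails.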

Each of the above-described embodiments illustrates the case where the time-series data generation source 1 provided in the probe car generates data, and preprocessing that filters the data and creates analysis windows is performed on it. Such an analysis window can be used, for example, to generate warning information using a near-miss map in addition to generating traffic jam information. Similarly, the present invention can be used for an analysis in which a person carries a sensor serving as the time-series data generation source 1 and the person is warned using a near-miss map. The type of data is not limited to the data used for the analyses described above, and the present invention can be applied to preprocessing of various data to be analyzed.

Also, an embodiment in which filtering processing is not performed is conceivable, and this embodiment will be described below. Similar to the first embodiment shown in FIG. 1, the pre-analysis processing system of the present embodiment includes a data receiving unit 3 and a data stream generating unit 4. FIG. 20 is a block diagram illustrating a configuration example of the data stream generation unit 4 in the embodiment that does not perform the filtering process. In this embodiment, the data stream generation unit 4 includes a stream data generation unit 401, a transmission data buffer 402, an analysis window generation unit 403, and a stream data transmission unit 404. Each of these means is the same as in the first embodiment. However, the filtering unit 407 is not provided, and the stream data generation unit 401 stores all the cut out data in the transmission data buffer 402. In addition, when the data is stored in the transmission data buffer 402, the stream data generation unit 401 notifies the analysis window generation unit 403 of, for example, a pointer to the stored memory area as a notification to that effect.

In the case of this configuration, step S203 (filtering processing) is not performed in the data stream generation step (see step S2, FIG. 11), but the other points are the same as those in the first embodiment.

Even with the configuration shown in FIG. 20, the data can be sent to the time-series data analysis means 5 more quickly than when the data is first stored as a database or a file. However, in order to prevent data overflow in the transmission data buffer 402, it is preferable to provide the filtering means 407 as shown in the first to third embodiments.

Next, the minimum configuration of the present invention will be described. FIG. 21 is an explanatory diagram showing the minimum configuration of the present invention. The analysis preprocessing system of the present invention includes data acquisition means 71, data cutout means 72, buffer 74, filtering means 73, analysis data determination means 75, and analysis data output means 76.

Data acquisition means 71 (for example, data reception means 3) acquires a data group generated by a plurality of data generation sources.

The data cutout unit 72 (for example, the stream data generation unit 401) cuts out individual data from the data group acquired by the data acquisition unit 71.

The buffer 74 (for example, the transmission data buffer 402) stores data used for analysis.

The filtering unit 73 (for example, the filtering unit 407) determines whether or not a predetermined condition is satisfied for each piece of data cut out by the data cutout unit 72, stores data satisfying the predetermined condition in the buffer 74, and discards data that does not satisfy the predetermined condition.

Analysis data determination means 75 (for example, analysis window generation means 403) determines an analysis data group (for example, analysis window), which is a set of data used for analysis, from the data stored in the buffer 74.

The analysis data output means 76 (for example, the stream data transmission means 404) sends the analysis data group to the data analysis means for analyzing the data (for example, the time series data analysis means 5).

With such a configuration, even if a large amount of data is transmitted from a large number of data generation sources, the data can be passed to the means for analyzing the data at high speed while preventing the data from overflowing.
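The flow through the minimum configuration can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the claimed system: the function name `preprocess` and its parameters are hypothetical, each acquired data group is modeled as a list of items, and a count-based rule is assumed for determining the analysis data group.

```python
def preprocess(data_groups, condition, window_size, analyze):
    """Hypothetical sketch of the minimum configuration (means 71 to 76)."""
    buffer = []                                 # buffer 74
    for group in data_groups:                   # data acquisition means 71
        for item in group:                      # data cutout means 72
            if condition(item):                 # filtering means 73
                buffer.append(item)
            # items that fail the condition are discarded
            if len(buffer) >= window_size:      # analysis data determination means 75
                window = buffer[:window_size]
                del buffer[:window_size]
                analyze(window)                 # analysis data output means 76
    return buffer                               # data not yet sent for analysis


received = []
leftover = preprocess([[1, -2, 3], [4, 5, -6, 7]], lambda x: x > 0, 2, received.append)
# received is [[1, 3], [4, 5]] and leftover is [7]
```

Because discarded items never enter the buffer and sent items are removed from it, the buffer in this sketch never holds more than `window_size` items.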

The above-described embodiment discloses a configuration in which the filtering unit 73 includes: content match/mismatch determination means (for example, identity determination means 40702) that determines, for each piece of data cut out by the data cutout unit 72, whether or not the condition that the content of the data differs from that of any data already stored in the buffer 74 is satisfied; and data selection means (for example, data selection means 40701) that discards data that does not satisfy the condition and stores data that satisfies the condition in the buffer 74.

Further, the above embodiment discloses a configuration in which the filtering unit 73 includes: a reference storage unit (for example, valid data definition unit 40713) that stores a criterion indicating that the content included in data is valid; a reference determination unit (for example, validity determination unit 40712) that determines, for each piece of data cut out by the data cutout unit 72, whether or not the content of the data satisfies the criterion; and data selection means (for example, data selection means 40711) that discards data whose content does not satisfy the criterion and stores data that satisfies the criterion in the buffer 74.

Further, the above embodiment discloses a configuration in which the filtering unit 73 includes: a data identification information storage unit (for example, processed data storage unit 40723) that stores data identification information of each piece of data input from the data cutout unit 72; duplicate determination means (for example, validity determination means 40722) that, when data is input from the data cutout unit 72, determines whether or not the data identification information of the data is stored in the data identification information storage unit and, if it is not stored, stores the data identification information of the data in the data identification information storage unit; and data selection means (for example, data selection means 40721) that discards data whose data identification information is determined to be stored in the data identification information storage unit and stores in the buffer 74 data whose data identification information is determined not to be stored in the data identification information storage unit.

Further, the above embodiment discloses a configuration in which the analysis data determining means 75 determines a set of data stored in the buffer 74 within a certain period as an analysis data group every certain period.
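A period-based determination of analysis data groups might look like the following sketch. It assumes that each buffered entry carries the time at which it was stored, expressed as a number of seconds. The function name `time_windows` and the `(stored_at, data)` pair layout are hypothetical.

```python
def time_windows(entries, period):
    """Group (stored_at, data) pairs into analysis data groups, one per period.

    stored_at is assumed to be a number of seconds; periods in which nothing
    was stored produce no group.
    """
    windows = {}
    for stored_at, data in entries:
        index = int(stored_at // period)  # index of the period the entry falls in
        windows.setdefault(index, []).append(data)
    return [windows[i] for i in sorted(windows)]


entries = [(0.5, "a"), (1.2, "b"), (9.9, "c"), (10.0, "d"), (25.0, "e")]
groups = time_windows(entries, 10)
# groups is [["a", "b", "c"], ["d"], ["e"]]
```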

Further, the above embodiment discloses a configuration in which the analysis data determination means 75 determines a set of a predetermined number of data as an analysis data group every time the number of data stored in the buffer 74 reaches the predetermined number.
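A count-based determination can be sketched as follows. The class name `CountWindow` is hypothetical, and for brevity the sketch empties the buffer at the moment the group is determined, whereas in the embodiments deletion from the buffer is performed by the analysis data output means after sending.

```python
class CountWindow:
    """Hypothetical sketch: emit an analysis data group every `size` stored items."""

    def __init__(self, size):
        self.size = size
        self.buffer = []

    def store(self, data):
        """Store one item; return a group once `size` items are buffered, else None."""
        self.buffer.append(data)
        if len(self.buffer) < self.size:
            return None
        window, self.buffer = self.buffer, []
        return window


w = CountWindow(3)
results = [w.store(x) for x in range(7)]
# results is [None, None, [0, 1, 2], None, None, [3, 4, 5], None]
```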

Further, the above embodiment discloses a configuration in which the analysis data output means 76 deletes each data belonging to the analysis data group sent to the data analysis means from the buffer 74.

Further, the above embodiment discloses a configuration that includes data analysis means (for example, time-series data analysis means 5) for analyzing data, in which the data analysis means holds the analysis data groups output by the analysis data output means 76 and deletes each analysis data group whose analysis has been completed, thereby performing analysis asynchronously with the analysis data output means 76.
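Asynchronous hand-off between the output side and the analysis side can be sketched with a thread-safe queue. This is an assumption about one possible realization, not the mechanism of the embodiment: the function name `run_async` is hypothetical, and Python's standard `queue.Queue` and `threading.Thread` stand in for whatever holding area and execution model the data analysis means actually uses.

```python
import queue
import threading


def run_async(windows, analyze):
    """Hypothetical sketch: the analysis side holds groups and works through them
    independently of the output side."""
    handoff = queue.Queue()  # analysis data groups held until analyzed
    results = []

    def analysis_worker():
        while True:
            window = handoff.get()  # taking a group removes it from the holding queue
            if window is None:      # sentinel: no more groups will arrive
                break
            results.append(analyze(window))

    worker = threading.Thread(target=analysis_worker)
    worker.start()
    for window in windows:
        handoff.put(window)  # the output side does not wait for the analysis
    handoff.put(None)
    worker.join()
    return results


totals = run_async([[1, 2], [3, 4, 5]], sum)
# totals is [3, 12]
```

With a single worker the groups are analyzed in the order in which they were sent.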

In the above embodiment, the characteristic configuration of the analysis preprocessing system as shown in the following (1) to (9) is shown.

(1) An analysis preprocessing system comprising: a data acquisition unit that acquires a data group generated by a plurality of data generation sources; a data cutout unit that cuts out individual data from the data group acquired by the data acquisition unit; a buffer that stores data used for analysis; a filtering unit that determines, for each piece of data cut out by the data cutout unit, whether or not a predetermined condition is satisfied, stores data that satisfies the predetermined condition in the buffer, and discards data that does not satisfy the predetermined condition; an analysis data determination unit that determines, from the data stored in the buffer, an analysis data group that is a set of data used for analysis; and an analysis data output unit that sends the analysis data group to a data analysis unit that analyzes data.

(2) An analysis preprocessing system in which the filtering unit includes: a content match/mismatch determination unit that determines, for each piece of data cut out by the data cutout unit, whether or not the condition that the content of the data differs from that of any data already stored in the buffer is satisfied; and a data selection unit that discards data that does not satisfy the condition and stores data that satisfies the condition in the buffer.

(3) An analysis preprocessing system in which the filtering unit includes: a reference storage unit that stores a criterion indicating that the content included in data is valid; a reference determination unit that determines, for each piece of data cut out by the data cutout unit, whether or not the content of the data satisfies the criterion; and a data selection unit that discards data whose content does not satisfy the criterion and stores data that satisfies the criterion in the buffer.

(4) An analysis preprocessing system in which the filtering unit includes: a data identification information storage unit that stores data identification information of each piece of data input from the data cutout unit; a duplicate determination unit that, when data is input from the data cutout unit, determines whether or not the data identification information of the data is stored in the data identification information storage unit and, if it is not stored, stores the data identification information of the data in the data identification information storage unit; and a data selection unit that discards data whose data identification information is determined to be stored in the data identification information storage unit and stores in the buffer data whose data identification information is determined not to be stored in the data identification information storage unit.

(5) An analysis preprocessing system in which the analysis data determination unit determines a set of data stored in the buffer within a certain period as an analysis data group for each certain period.

(6) An analysis preprocessing system in which the analysis data determination unit determines a set of a predetermined number of data as an analysis data group each time the number of data stored in the buffer reaches a predetermined number.

(7) An analysis preprocessing system in which the analysis data output unit deletes each data belonging to the analysis data group sent to the data analysis unit from the buffer.

(8) An analysis preprocessing system that includes a data analysis unit for analyzing data, in which the data analysis unit holds the analysis data group output by the analysis data output unit and deletes each analysis data group whose analysis has been completed, thereby performing analysis asynchronously with the analysis data output unit.

(9) Data acquisition means for acquiring a data group generated by a plurality of data generation sources, data cutout means for cutting out individual data from the data group acquired by the data acquisition means, and a buffer for storing data used for analysis For each piece of data cut out by the data cutout means, it is determined whether or not a predetermined condition is satisfied, data that satisfies the predetermined condition is stored in the buffer, and data that does not satisfy the predetermined condition is discarded. Filtering means, analysis data determining means for determining an analysis data group that is a set of data used for analysis from the data stored in the buffer, and sending the analysis data group to the data analysis means for analyzing the data An analysis preprocessing system comprising data output means.

The present invention has been described above with reference to the embodiments, but the present invention is not limited to the above embodiments. Various changes that can be understood by those skilled in the art can be made to the configuration and details of the present invention within the scope of the present invention.

This application claims priority based on Japanese Patent Application No. 2009-038413 filed on Feb. 20, 2009, the entire disclosure of which is incorporated herein.

Industrial applicability

The present invention is preferably applied to an analysis preprocessing system that collects data to be analyzed.

DESCRIPTION OF SYMBOLS
1 Time series data generation source
2 Data transmission means
3 Data reception means
4 Data stream generation means
5 Time series data analysis means
7 Analysis preprocessing system
401 Stream data generation means
402 Transmission data buffer
403 Analysis window generation means
404 Stream data transmission means
407 Filtering means
40701 Data selection means
40702 Identity determination means
40711, 40721 Data selection means
40712, 40722 Validity determination means
40713 Valid data definition means
40723 Processed data storage means

Claims

An analysis preprocessing system comprising:
data acquisition means for acquiring a data group generated by a plurality of data generation sources;
data cutout means for cutting out individual data from the data group acquired by the data acquisition means;
a buffer for storing data used for analysis;
filtering means for determining, for each piece of data cut out by the data cutout means, whether or not a predetermined condition is satisfied, storing data that satisfies the predetermined condition in the buffer, and discarding data that does not satisfy the predetermined condition;
analysis data determination means for determining, from the data stored in the buffer, an analysis data group that is a set of data used for analysis; and
analysis data output means for sending the analysis data group to data analysis means for analyzing data.
The analysis preprocessing system according to claim 1, wherein the filtering means includes:
content match/mismatch determination means for determining, for each piece of data cut out by the data cutout means, whether or not a condition that the content of the data differs from that of any data already stored in the buffer is satisfied; and
data selection means for discarding data that does not satisfy the condition and storing data that satisfies the condition in the buffer.
The analysis preprocessing system according to claim 1, wherein the filtering means includes:
reference storage means for storing a criterion indicating that the content included in data is valid;
reference determination means for determining, for each piece of data cut out by the data cutout means, whether or not the content of the data satisfies the criterion; and
data selection means for discarding data whose content does not satisfy the criterion and storing data that satisfies the criterion in the buffer.
The analysis preprocessing system according to any one of claims 1 to 3, wherein the filtering means includes:
data identification information storage means for storing data identification information of each piece of data input from the data cutout means;
duplicate determination means for determining, when data is input from the data cutout means, whether or not the data identification information of the data is stored in the data identification information storage means and, if it is not stored, storing the data identification information of the data in the data identification information storage means; and
data selection means for discarding data whose data identification information is determined to be stored in the data identification information storage means and storing in the buffer data whose data identification information is determined not to be stored in the data identification information storage means.
The analysis preprocessing system according to claim 1, wherein the analysis data determination means determines, for each predetermined period, the set of data stored in the buffer within that period as an analysis data group.
The analysis preprocessing system described above, wherein the analysis data determination means determines a set of a predetermined number of data as an analysis data group every time the number of data stored in the buffer reaches the predetermined number.
The analysis preprocessing system according to any one of claims 1 to 6, wherein the analysis data output means deletes each data belonging to the analysis data group sent to the data analysis means from the buffer.
The analysis preprocessing system according to any one of items 7 to 9, further comprising data analysis means for analyzing data,
wherein the data analysis means holds the analysis data group output by the analysis data output means and deletes each analysis data group whose analysis has been completed, thereby performing analysis asynchronously with the analysis data output means.
An analysis preprocessing method comprising:
acquiring a data group generated by a plurality of data generation sources;
cutting out individual data from the acquired data group;
determining, for each piece of cut-out data, whether or not a predetermined condition is satisfied, storing data that satisfies the predetermined condition in a buffer, and discarding data that does not satisfy the predetermined condition;
determining, from the data stored in the buffer, an analysis data group that is a set of data used for analysis; and
sending the analysis data group to data analysis means for analyzing data.
The analysis preprocessing method according to claim 9, wherein, for each piece of cut-out data, it is determined whether or not a condition that the content of the data differs from that of any data already stored in the buffer is satisfied, and
data that does not satisfy the condition is discarded and data that satisfies the condition is stored in the buffer.
The analysis preprocessing method according to claim 9 or 10, wherein, for each piece of cut-out data, it is determined whether or not the content of the data satisfies a criterion indicating that the content included in data is valid, and
data whose content does not satisfy the criterion is discarded and data that satisfies the criterion is stored in the buffer.
The analysis preprocessing method according to claim 11, wherein, when each piece of data is cut out, it is determined whether or not the data identification information of the data is stored in data identification information storage means and, if it is not stored, the data identification information of the data is stored in the data identification information storage means, and
data whose data identification information is determined to be stored in the data identification information storage means is discarded, and data whose data identification information is determined not to be stored in the data identification information storage means is stored in the buffer.
An analysis preprocessing program for causing a computer to execute:
data acquisition processing for acquiring a data group generated by a plurality of data generation sources;
data cutout processing for cutting out individual data from the data group acquired in the data acquisition processing;
filtering processing for determining, for each piece of data cut out by the data cutout processing, whether or not a predetermined condition is satisfied, storing data that satisfies the predetermined condition in a buffer, and discarding data that does not satisfy the predetermined condition;
analysis data determination processing for determining, from the data stored in the buffer, an analysis data group that is a set of data used for analysis; and
analysis data output processing for sending the analysis data group to data analysis means for analyzing data.
The analysis preprocessing program according to claim 13, causing the computer to execute, in the filtering processing:
content match/mismatch determination processing for determining, for each piece of data cut out by the data cutout processing, whether or not a condition that the content of the data differs from that of any data already stored in the buffer is satisfied; and
data selection processing for discarding data that does not satisfy the condition and storing data that satisfies the condition in the buffer.
The analysis preprocessing program according to claim 13 or 14, causing the computer to execute, in the filtering processing:
reference determination processing for determining, for each piece of data cut out by the data cutout processing, whether or not the content of the data satisfies a criterion indicating that the content included in data is valid; and
data selection processing for discarding data whose content does not satisfy the criterion and storing data that satisfies the criterion in the buffer.
The analysis preprocessing program according to any one of claims 13 to 15, causing the computer to execute, in the filtering processing:
duplicate determination processing for determining, for each piece of data cut out by the data cutout processing, whether or not the data identification information of the data is stored in data identification information storage means and, if it is not stored, storing the data identification information of the data in the data identification information storage means; and
data selection processing for discarding data whose data identification information is determined to be stored in the data identification information storage means and storing in the buffer data whose data identification information is determined not to be stored in the data identification information storage means.