WO2012131927A1 - Computer system and data management method - Google Patents
Computer system and data management method
- Publication number
- WO2012131927A1 PCT/JP2011/057940
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data
- division
- key
- data set
- divided
- Prior art date
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/08—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
- H04L43/0876—Network utilisation, e.g. volume of load or congestion level
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/11—File system administration, e.g. details of archiving or snapshots
- G06F16/113—Details of archiving
Definitions
- the present invention relates to a technique for combining data in a computer system that processes a large amount of data.
- the sort / merge join technique is a method of sorting the tables to be joined based on the key values, reading the rows of each table from the top, and merging the rows having the corresponding key values.
- Patent Document 1 describes joining tables by dividing each table at a position corresponding to the same key value to generate a divided region for each table, and applying the sort/merge join technique to each divided region. Patent Document 1 further describes allocating the divided regions to processors so that the processor load in the system is not biased.
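- The sort/merge join technique described above can be sketched as follows. This is a minimal illustration, not the method of any cited document: the function name `sort_merge_join` and the representation of a row as a `(key, value)` tuple are assumptions made for the example.

```python
def sort_merge_join(left, right):
    """Join two lists of (key, value) rows on equal keys."""
    # Sort both tables on the key (sorted() is stable, so row order within a key is kept).
    left = sorted(left, key=lambda row: row[0])
    right = sorted(right, key=lambda row: row[0])
    i = j = 0
    joined = []
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Find the run of rows sharing this key on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == left_key:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == left_key:
                j_end += 1
            # Emit every pairing of the two runs.
            for left_row in left[i:i_end]:
                for right_row in right[j:j_end]:
                    joined.append((left_key, left_row[1], right_row[1]))
            i, j = i_end, j_end
    return joined
```

- Because both inputs are read once from the top after sorting, the same merge loop can be applied independently to each pair of divided regions that share a key range.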
- Patent Document 2 describes a matrix index that associates data storage positions with combinations of two or more keys.
- Patent Document 3 describes a method of leveling the usage amount of each storage area while suppressing the amount of data movement from the existing storage area to the newly added storage area when adding the storage area.
- data acquired periodically is accumulated, and analysis processing is executed by combining the accumulated data as necessary.
- FIG. 20 is an explanatory diagram showing an example of data processed in a conventional data analysis system.
- FIG. 21 is an explanatory diagram showing an example of a schema in conventional data.
- FIGS. 22A to 22C are explanatory diagrams illustrating an example of data processed in the conventional analysis processing.
- the movement history of the user is shown.
- the data includes a user ID for identifying the user, a position X and a position Y that are coordinate information specifying the position of the user, and a time stamp that is a time when the user moves to the position.
- The data is converted based on the schema as shown in FIG. 21. Further, the converted data is grouped for each user ID as shown in FIG. 22A, and analysis processing such as tabulation is executed.
- data composed of one or more records is described as a data set. Further, a data set as shown in FIG. 20 is described as raw data, and data having a structure as shown in FIG. 21 is described as structured data.
- Data in the format shown in FIG. 20 is collected periodically (for example, monthly), converted into data in the format shown in FIG. 22A, and then accumulated in the data analysis system. For this reason, when an analysis spans a plurality of data sets, such as an analysis of one year of data or of a specific month in each year, a plurality of data sets in the format shown in FIG. 22A must be combined.
- the data analysis system combines two pieces of data as shown in FIG. 22A and FIG. 22B to become data as shown in FIG. 22C.
- the data that is periodically accumulated may have a different size distribution for each data. For example, in the data of a user whose service usage count differs from month to month, a difference in the size distribution of the data for each month occurs.
- Patent Document 1 does not describe a method of determining a position (division position) for dividing a table. Generally, in order to divide a table equally, distribution information of keys included in the table is required. When acquiring the key distribution information, the method of scanning the entire table takes time to complete the processing.
- A typical example of the invention disclosed in the present application is as follows. That is, a computer system in which a plurality of computers execute an analysis process on a data set including a plurality of data composed of keys and data values, wherein each of the computers includes a processor, a memory connected to the processor, a storage device connected to the processor, and a network interface connected to the processor.
- Each of the computers holds, for each data set, division information for managing division position keys, which are keys indicating the division positions of the divided areas obtained by dividing the data set for each predetermined key range. All the division position keys included in the division information of each data set are the same, and the data sets are stored in the storage areas of the plurality of computers.
- In the analysis process, the computer system generates a plurality of tasks for each of the divided areas, allocates the generated tasks to the respective computers, and performs the analysis processing by combining the data included in the divided areas of the respective data sets.
- The computer system determines whether a target area, that is, a divided area having a data size larger than a predetermined threshold, exists, and when it is determined that the target area exists, divides the target area into a plurality of new divided areas.
- FIG. 1 is a block diagram illustrating a system configuration of a data analysis system according to the first embodiment of the present invention.
- the data analysis system includes a client node 10, a master node 20, and a slave node 30, and the nodes are connected to each other via a network 40.
- the network 40 may be a SAN, LAN, WAN, or the like, but may be any network as long as each node can communicate. Each node may be directly connected.
- In the following description, a computer is referred to as a node.
- the client node 10 is a node used by a user of the data analysis system.
- the user transmits various instructions to the master node 20 and the slave node 30 using the client node 10.
- the master node 20 is a node that manages the entire data analysis system.
- the slave node 30 is a node that executes each process (task) in accordance with an instruction transmitted from the master node 20.
- This data analysis system is a kind of parallel distributed processing system, and the processing performance of the system can be improved by increasing the number of slave nodes 30.
- The hardware configuration of the client node 10, the master node 20, and the slave node 30 is the same, and details will be described later with reference to FIG. 2.
- Each node is connected to storage devices 11, 21, 31 such as HDDs.
- Each of the storage devices 11, 21, and 31 stores a program for realizing a function included in each node such as an OS.
- Each program is read from the storage devices 11, 21, and 31 and executed by the CPU (see FIG. 2).
- FIG. 2 is a block diagram illustrating the hardware configuration of the node according to the first embodiment of the present invention.
- FIG. 2 illustrates the client node 10 as an example, but the master node 20 and the slave node 30 also have the same hardware configuration.
- The client node 10 includes a CPU 101, a network I/F 102, an input/output I/F 103, a memory 104, and a disk I/F 105, and the respective components are connected to each other via an internal bus or the like.
- The CPU 101 executes a program stored in the memory 104.
- The memory 104 stores a program executed by the CPU 101 and information necessary for executing the program. Note that the program stored in the memory 104 may instead be stored in the storage device 11. In this case, the program is read from the storage device 11 onto the memory 104 by the CPU 101.
- The network I/F 102 is an interface for connecting to other nodes via the network 40.
- The disk I/F 105 is an interface for connecting to the storage device 11.
- The input/output I/F 103 is an interface for connecting input/output devices such as a keyboard 106, a mouse 107, and a display 108.
- the user transmits an instruction to the data analysis system using the input / output device and confirms the analysis result.
- the master node 20 and the slave node 30 do not have to include the keyboard 106, the mouse 107, and the display 108.
- FIG. 3A is a block diagram illustrating the software configuration of the master node 20 in the first embodiment of the present invention.
- the master node 20 includes a data management unit 21, a process management unit 22, and a file server (master) 23.
- The data management unit 21, the process management unit 22, and the file server (master) 23 are programs stored on the memory 104, and are executed by the CPU 101.
- In the following, when a process is described with a program as the subject, it is assumed that the CPU 101 is executing the program.
- the data management unit 21 manages data processed by the data analysis system.
- The data management unit 21 includes a data management table T100, a division table T200, and a key size table T400.
- The data management table T100 stores management information of data sets processed by the data analysis system. Details of the data management table T100 will be described later with reference to FIG. 4.
- the data set indicates data composed of a plurality of records.
- the division table T200 stores management information of divided areas obtained by dividing the data set.
- A divided area is a group of records obtained by dividing the data set for each predetermined key range. Details of the division table T200 will be described later with reference to FIGS. 5A and 5B.
- the key size table T400 stores management information on the data size of each divided area in the data set.
- One key size table T400 corresponds to one data set.
- There is also a key size table T400 that manages the data size of the data sets of the entire data analysis system. Details of the key size table T400 will be described later with reference to FIGS. 7A and 7B.
- the process management unit 22 manages parallel processing executed in a distributed manner on each slave node 30.
- the process management unit 22 includes a program repository 24 that manages a program that generates processes (tasks) to be executed in parallel. That is, the process management unit 22 generates a task to be executed in each slave node 30 from the program repository 24 and instructs the slave node 30 to execute the generated task.
- the file server (master) 23 manages files that store actual data.
- the software configuration of the master node 20 may be realized using hardware.
- FIG. 3B is a block diagram illustrating the software configuration of the slave node 30 according to the first embodiment of the present invention.
- The slave node 30 includes a process execution unit 31 and a file server (slave) 32.
- The process execution unit 31 and the file server (slave) 32 are programs stored on the memory 104 and are executed by the CPU 101.
- When a process is described with the program as the subject, it is assumed that the CPU 101 is executing the program.
- the process execution unit 31 receives a process (task) execution instruction from the process management unit 22 of the master node 20 and executes a predetermined process (task). That is, the process execution unit 31 generates a process for executing the process (task) based on the received execution instruction of the process (task). By executing the generated process, a plurality of tasks are executed on each slave node 30, and parallel distributed processing is realized.
- the process execution unit 31 of the present embodiment includes a data addition unit (Map) 33 and a data addition unit (Reduce) 34 that execute the above-described task.
- the data adding unit (Map) 33 reads data in record units from the input raw data (see FIG. 20), and outputs the read raw data to the data adding unit (Reduce) 34 for each key range.
- For each data adding unit (Reduce) 34, the key range it is in charge of processing is set in advance.
- the data adding unit (Map) 33 includes a partition table T300.
- The data adding unit (Map) 33 specifies, based on the partition table T300, the data adding unit (Reduce) 34 to which the read data is output.
- The partition table T300 will be described later with reference to FIG. 6.
- the data adding unit (Reduce) 34 converts the input raw data into a predetermined format, that is, structured data (see FIG. 21), and outputs the structured data to the distributed file system.
- the data adding unit (Reduce) 34 includes a key size table T400.
- the key size table T400 is the same as the key size table T400 included in the data management unit 21. However, in the key size table T400, only the management information related to the divided area of the key range handled by the data adding unit (Reduce) 34 is stored.
- the file server (slave) 32 manages distributed files.
- the file server (master) 23 has a function of managing file metadata (directory structure, size, update date and time) and providing one file system in cooperation with the file server (slave) 32.
- the data adding unit (Map) 33 and the data adding unit (Reduce) 34 access the file server (master) 23 to use the files on the file system and execute various tasks. That is, the data adding unit (Map) 33 and the data adding unit (Reduce) 34 can access the same file system.
- FIG. 4 is an explanatory diagram illustrating an example of the data management table T100 according to the first embodiment of this invention.
- the data management table T100 includes a data ID (T101) and a division table name T102.
- the data ID (T101) stores the identifier of the data set.
- the division table name T102 stores the name of the division table T200 corresponding to the data set.
- Each entry of the data management table T100 corresponds to one data set managed by the data analysis system.
- the data set corresponds to one table (relation) in a normal database.
- FIGS. 5A and 5B are explanatory diagrams illustrating an example of the division table T200 according to the first embodiment of the present invention.
- FIG. 5A shows an example of a division table T200 of a data set whose division table name T102 is “log01.part”.
- FIG. 5B shows an example of a division table T200 whose division table name T102 is “log02.part”.
- the division table T200 stores management information indicating the division method of each data set processed by the data analysis system.
- the division table T200 includes a division table name T201, a data file name T202, a key (T203), and an offset T204.
- the division table name T201 stores the name of the division table T200.
- the division table name T201 is the same as the division table name T102.
- the data file name T202 stores the name of the file that stores the data corresponding to the divided area.
- Key (T203) stores a key value indicating the key range of the divided area, that is, a key value indicating the division position of the data set.
- the key (T203) stores a key value representing the end point in the divided area.
- the offset T204 stores an offset corresponding to the value of the division position in the data set.
- the offset of the key corresponding to the key (T203) is stored in the offset T204.
- the start position of the divided area corresponds to the key (T203) and offset T204 of the previous entry. Since the key indicating the start position of the first divided area and the key indicating the end position of the last divided area are not defined, they are not described in the division table T200.
- Each entry of each division table T200 corresponds to one division area managed by the data analysis system.
- For the data set whose division table name T102 is “log01.part”, the corresponding division table T200 is the one shown in FIG. 5A.
- the first entry of the division table T200 shown in FIG. 5A corresponds to the first division area.
- The first entry indicates that the data of the divided area is stored in a file whose data file name T202 is “log01/001.dat”.
- Since the key (T203) of the first entry is “034a”, the key range of the first divided area is less than “034a”. Further, since the offset T204 of the first entry is “280”, the data of the first divided area is stored in the offset range “0 to 279” on the file.
- The second entry indicates that the key range of the corresponding divided area is “034a” or more and less than “172d”, and that the data file name T202 is “log01/002.dat”. Since the data file name T202 differs from that of the first entry, the offset is counted from “0”. The data of the divided area is therefore stored in the offset range “0 to 218”.
- The third entry indicates that the key range of the corresponding divided area is “172d” or more and less than “328b”, and that the data file name T202 is “log01/002.dat”. Since the data file name T202 matches that of the second entry, the data of the divided area is stored in the offset range “219 to 455” on the file.
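- The way a key range and an offset range are read off the division table T200 can be sketched as follows. This is only an illustration under assumptions: the type `DivisionEntry`, the function `area_ranges`, and the use of `None` for the undefined start key of the first divided area are hypothetical, and only the three entries whose values are stated in the text are included.

```python
from typing import NamedTuple

class DivisionEntry(NamedTuple):
    data_file: str  # data file name T202
    key: str        # key (T203): end point of the divided area (exclusive)
    offset: int     # offset T204: offset of the division position in the file

# The three entries of FIG. 5A whose values are given in the description.
LOG01 = [
    DivisionEntry("log01/001.dat", "034a", 280),
    DivisionEntry("log01/002.dat", "172d", 219),
    DivisionEntry("log01/002.dat", "328b", 456),
]

def area_ranges(entries):
    """Return (start_key, end_key, file, first_offset, last_offset) per entry."""
    ranges = []
    previous = None
    for entry in entries:
        # The start key is the previous entry's key; the first area has none.
        start_key = previous.key if previous is not None else None
        # Offsets restart from 0 whenever the data file changes.
        same_file = previous is not None and previous.data_file == entry.data_file
        first_offset = previous.offset if same_file else 0
        ranges.append((start_key, entry.key, entry.data_file,
                       first_offset, entry.offset - 1))
        previous = entry
    return ranges
```

- Applied to `LOG01`, this reproduces the offset ranges 0 to 279, 0 to 218, and 219 to 455 stated above.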
- For the data set whose division table name T102 is “log02.part”, the corresponding division table T200 is the one shown in FIG. 5B.
- the data file name T202 and offset T204 of each entry stored in the division table T200 shown in FIG. 5B are different from the entries in the division table T200 shown in FIG. 5A.
- the keys (T203) representing the division positions of both division tables T200 coincide with each other.
- the division positions of the divided areas in the data sets that may be combined are managed so as to be consistent.
- As a result, two or more data sets can be combined in parallel. That is, entries with the same key (T203) in the division tables T200 of the data sets to be combined can be associated with each other, and the combining process can be executed in parallel for each divided region.
- the file includes a plurality of records composed of one key and one or more values as shown in FIG. 22A.
- Each file is stored in the distributed file system in a format sorted based on the key. As a result, when combining processing is performed for each divided region, it becomes possible to merge and combine the same keys.
- The files that store the data of different divided areas may be the same.
- In the example described above, the second entry and the third entry refer to the same file, although the key range of each entry is different.
- In that example the number of files is three while the number of divided areas is four, so the two numbers differ.
- the number of files matches the degree of parallelism of data addition processing in the data analysis system.
- the number of divided areas depends on the parallelism of data analysis processing. Accordingly, since the number of files and the number of divided areas depend on different processes, there is no dependency between them, and the number may be determined in any way.
- FIG. 6 is an explanatory diagram illustrating an example of the partition table T300 according to the first embodiment of this invention.
- The partition table T300 stores information used to divide a newly added data set (raw data) and distribute the data to the data adding units (Reduce) 34 that execute tasks.
- the partition table T300 includes a key (T301) and a destination T302.
- The key (T301) stores a key value representing a division position of the input data set.
- the destination T302 stores destination information indicating the position of the data adding unit (Reduce) 34 in charge of processing the divided data set.
- the node and the data addition unit (Reduce) 34 are specified by destination information including an IP address and a port.
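- How a Map-side task might look up the destination for a record key in the partition table T300 can be sketched as follows. This is a sketch under assumptions: the class `PartitionTable`, the lexicographic comparison of keys as strings, and the example addresses and port are hypothetical; the text only states that each entry holds a division-position key and destination information made up of an IP address and a port.

```python
import bisect

class PartitionTable:
    """Maps a record key to the destination in charge of its key range."""

    def __init__(self, keys, destinations):
        # keys: sorted division-position keys (T301); the last area has no end key,
        # so there is one more destination (T302) than there are keys.
        if len(destinations) != len(keys) + 1:
            raise ValueError("need exactly one destination per divided area")
        self.keys = list(keys)
        self.destinations = list(destinations)

    def destination_for(self, key):
        # Area i covers keys >= keys[i-1] and < keys[i], so a key equal to a
        # division position belongs to the area that starts at that position.
        return self.destinations[bisect.bisect_right(self.keys, key)]

# Hypothetical two-way partition split at key "172d".
table = PartitionTable(
    ["172d"],
    [("192.0.2.10", 9000), ("192.0.2.11", 9000)],
)
```

- With this table, `table.destination_for("125d")` selects the first destination and `table.destination_for("328b")` the second.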
- FIGS. 7A and 7B are explanatory diagrams illustrating an example of the key size table T400 in the first embodiment of the present invention.
- the key size table T400 stores the data size of the divided area.
- the key size table T400 includes a key (T401) and a size T402.
- The key (T401) is the same as the key (T203).
- the size T402 stores the data size of the divided area with the key (T401) as the division position.
- the size T402 stores the total value of the data sizes of the divided areas to be combined.
- the key size table T400 is dynamically generated at the time of executing a combination process and an analysis process described later, and a data addition process.
- FIG. 8 is a flowchart for explaining data combining processing and analysis processing in the first embodiment of the present invention.
- the join process is always executed together with the analysis process. That is, after the data for one record is combined by the combining process, the analysis process is executed on the data.
- the combination processing and analysis processing are executed by the data management unit 21 that has received an instruction from the user.
- the instruction from the user includes the data ID of the data set to be combined.
- the master node 20 creates a key size table T400 corresponding to the data set to be processed (step S101).
- the data management unit 21 searches the data management table T100 based on the data ID included in the instruction transmitted from the user, and acquires the division table name T102 from the corresponding entry.
- The data management unit 21 acquires the division table T200 corresponding to the acquired division table name T102.
- the data management unit 21 specifies a key value indicating a division position for each division area based on the obtained division table T200, and calculates a data size of a data set to be combined.
- the data management unit 21 creates a key size table T400 based on the above processing result.
- For example, when the corresponding division tables T200 are as shown in FIGS. 5A and 5B, the data management unit 21 adds the data sizes of the two data sets for each divided region by executing the above-described processing, and creates a key size table T400 as shown in FIG. 7A.
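- The per-region addition in step S101 can be sketched as follows. This is a sketch under assumptions: the function `build_key_size_table` is hypothetical, the last divided area (which has no end key) is keyed by `None`, and the size of that last area as well as all sizes for the second data set are made-up values; the first three sizes for `log01` follow from the offset ranges given for FIG. 5A.

```python
def build_key_size_table(*data_sets):
    """Sum the per-area data sizes of data sets that share division keys."""
    if not data_sets:
        return {}
    keys = list(data_sets[0])
    table = dict.fromkeys(keys, 0)
    for sizes in data_sets:
        # Combining in parallel relies on identical division positions.
        if list(sizes) != keys:
            raise ValueError("division positions differ between data sets")
        for key, size in sizes.items():
            table[key] += size
    return table

# Sizes per divided area, keyed by the end key of the area.
log01 = {"034a": 280, "172d": 219, "328b": 237, None: 150}  # None entry is hypothetical
log02 = {"034a": 120, "172d": 310, "328b": 90, None: 200}   # all values hypothetical
key_size_table = build_key_size_table(log01, log02)
```

- The check on the key lists reflects the requirement that the division positions of data sets that may be combined are kept consistent.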
- the master node 20 generates a plurality of tasks composed of a combination process and an analysis process, and activates each task by assigning each generated task to each slave node 30 (step S102).
- the process management unit 22 reads a program necessary for the process from the program repository 24 and generates tasks for the number of parallels designated by the user. Further, the process management unit 22 causes the generated task to be executed on each slave node 30.
- If the parallel number is smaller than the number of entries in the key size table T400 created in step S101, the number of entries is set as the parallel number, and tasks corresponding to the number of entries are executed on the slave node 30.
- the master node 20 assigns a divided area to each task (step S103).
- the data management unit 21 assigns a divided area corresponding to each entry of the key size table T400 created in step S101 to each task generated in step S102.
- the data management unit 21 allocates a divided area to each task so that the data sizes are equal based on the size T402 of the key size table T400.
- For example, one conceivable allocation method is for the data management unit 21 to sort the entries in the key size table T400 based on the size T402 and, in descending order of data size, allocate each divided area to the task whose allocated data size is currently the smallest.
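- The allocation method mentioned above (largest divided area first, each to the task with the smallest allocated size so far) can be sketched as follows. The function `assign_areas` and the use of a heap to track task loads are assumptions made for illustration, and the sizes in the usage note are made-up values.

```python
import heapq

def assign_areas(key_sizes, num_tasks):
    """Greedily assign divided areas to tasks so that data sizes stay balanced."""
    # Heap of (allocated size, task id); the least-loaded task is always on top.
    loads = [(0, task) for task in range(num_tasks)]
    heapq.heapify(loads)
    assignment = {task: [] for task in range(num_tasks)}
    # Visit the divided areas in descending order of data size.
    by_size = sorted(key_sizes.items(), key=lambda item: item[1], reverse=True)
    for key, size in by_size:
        load, task = heapq.heappop(loads)
        assignment[task].append(key)
        heapq.heappush(loads, (load + size, task))
    return assignment
```

- For instance, `assign_areas({"a": 50, "b": 30, "c": 20, "d": 10}, 2)` gives task 0 the areas `a` and `d` and task 1 the areas `b` and `c`, for 60 and 50 units respectively.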
- the data management unit 21 transmits the data file name and the offset position of the file to be combined to the slave node 30 to which the task is allocated after the allocation of the divided areas is completed.
- the master node 20 transmits a task execution instruction to each slave node 30 to which the task is assigned, and ends the process (step S104).
- the data management unit 21 transmits a task execution instruction to each slave node 30 to which the task is assigned.
- The slave node 30 that has received the instruction from the master node 20 accesses the file server (master) 23 and, based on the data file name and offset position received from the data management unit 21, reads the designated file from the designated offset position.
- Each slave node 30 matches the key of each read file and executes a combination process. Further, the slave node 30 outputs the result of the combination processing for each record to the analysis processing task being executed in the same slave node 30.
- a task is generated for each of the four divided areas, and the above-described combining process is executed by each task.
- The data addition process adds a new data set in a state where the data management table T100 and the division table T200 have already been created, that is, where existing data sets are stored in the distributed file system.
- The data size of each divided area differs from one data set to another. For this reason, if the divided areas of the respective data sets are combined without correcting the division positions, the data size varies between the divided areas. As a result, the processing amount of the tasks that execute the analysis processing varies, and the efficiency of parallel processing decreases.
- The division position is therefore controlled so that the data size of each divided area is equal to or less than a predetermined reference value when all data sets that can be combined are combined.
- As a result, the data size of each divided area stays at or below the reference value, and the difference in processing amount between analysis tasks is leveled.
- If the divided areas are subdivided so finely that task control overhead arises in the combining and analysis processing, a plurality of divided areas can be allocated to a single task, which increases the amount of processing executed by one task.
- the above-mentioned predetermined reference value affects the difference in task throughput, so it is desirable to determine it based on the allowable task throughput difference.
- the predetermined reference value may be a data amount such that an execution time when one task processes a predetermined data amount is equal to or less than a time allowed as a difference in processing time between tasks.
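- One way to read the rule above is that the reference value equals the amount of data one task can process within the allowed time difference. The following is a minimal sketch of that arithmetic, assuming a constant per-task throughput; the functions `reference_size` and `oversized_areas`, the throughput figure, and the sizes are hypothetical.

```python
def reference_size(throughput_bytes_per_sec, allowed_skew_sec):
    """Data amount one task processes within the allowed time difference."""
    return int(throughput_bytes_per_sec * allowed_skew_sec)

def oversized_areas(key_sizes, reference):
    """Keys of the divided areas whose combined data size exceeds the reference."""
    return [key for key, size in key_sizes.items() if size > reference]

# Hypothetical figures: 50 MB/s per task, 2 seconds of allowed skew.
limit = reference_size(50_000_000, 2.0)
targets = oversized_areas({"034a": 120_000_000, "172d": 80_000_000}, limit)
```

- Here `limit` is 100,000,000 bytes, so only the area ending at key `034a` would be a candidate for further division.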
- The data added by the data addition process is input in a format as shown in FIG. 20.
- The data is converted into the format shown in FIG. 22A, grouped by user ID, and stored in the distributed file system.
- As before, the data set in the format shown in FIG. 20 is referred to as raw data, and the converted data set is referred to as structured data.
- FIG. 9 is a flowchart for explaining data addition processing in the first embodiment of the present invention.
- the data addition processing is executed when the user inputs raw data to the distributed file system realized by the file server (master) 23 and the file server (slave) 32.
- the data management unit 21 samples the input raw data and analyzes the appearance frequency of the key (step S201).
- the data management unit 21 samples the records included in the raw data at random.
- the data management unit 21 creates a key list in which the first field of the read record is the key.
- the data management unit 21 can read data for one record by detecting a line feed code.
- the data management unit 21 may execute the sampling processes in parallel. In this case, the data management unit 21 divides the raw data into a plurality of pieces so that the data sizes are equal, and a sampling process is executed for each divided raw data.
- The data management unit 21 assigns the execution tasks of the sampling process to the slave nodes 30, and further assigns the divided raw data to each execution task.
- the data management unit 21 receives the result of the sampling process from the process execution unit 31 of each slave node 30 and totals the results of the sampling process received from all the slave nodes 30 to create a key list.
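- Step S201 can be sketched as follows: records are sampled at random, the first field of each sampled record is taken as the key, and partial results from parallel sampling tasks are totalled. This is an illustration under assumptions: the functions `sample_keys` and `merge_key_lists`, the comma field delimiter, the sampling rate parameter, and the sample records are all hypothetical.

```python
import random
from collections import Counter

def sample_keys(lines, rate, seed=None, delimiter=","):
    """Count the keys of randomly sampled records (one record per line)."""
    rng = random.Random(seed)
    keys = Counter()
    for line in lines:
        # Keep each record with probability `rate`.
        if rng.random() < rate:
            # The first field of the record is treated as the key.
            keys[line.split(delimiter, 1)[0]] += 1
    return keys

def merge_key_lists(partial_results):
    """Total the key counts returned by the individual sampling tasks."""
    total = Counter()
    for partial in partial_results:
        total.update(partial)
    return total
```

- Each slave node would run `sample_keys` on its share of the raw data, and the master would pass the returned counters to `merge_key_lists`.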
- the data management unit 21 determines a key value to be a division position of the raw data based on the created key list (step S202).
- The division determined here is used to distribute the input raw data in step S204 described later, and is different from the division managed in the division table T200.
- the existing division position is not changed in the process of step S204. Therefore, the division position of the raw data needs to match the division position of the division table T200 of the existing data set.
- the data management unit 21 refers to the division table T200 and creates a key size table T400 including division positions of all existing data sets. For example, a key size table T400 as shown in FIG. 7A is created. However, at this time, no value is stored in the size T402.
- The data management unit 21 specifies the divided area corresponding to each sampled key, and adds the data size of the data corresponding to the key to the size T402 of the corresponding entry in the key size table T400.
- the data management unit 21 can obtain the distribution of the sampled keys.
- For example, when the sampled key is “125d”, the size T402 of the entry whose key (T401) is “172d” is incremented.
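- Accumulating the sampled keys into the key size table T400 can be sketched as follows. The function `accumulate_samples`, the list-based layout (one size slot per divided area, with the last slot for the area that has no end key), and the sample sizes are assumptions made for illustration.

```python
import bisect

def accumulate_samples(division_keys, samples):
    """Add each sampled record's size to the divided area containing its key."""
    # One slot per divided area: len(division_keys) + 1, the last being open-ended.
    sizes = [0] * (len(division_keys) + 1)
    for key, record_size in samples:
        # Area i covers keys >= division_keys[i-1] and < division_keys[i].
        sizes[bisect.bisect_right(division_keys, key)] += record_size
    return sizes

division_keys = ["034a", "172d", "328b"]
# Hypothetical sampled (key, record size) pairs.
sizes = accumulate_samples(division_keys, [("125d", 40), ("0100", 25), ("4000", 10)])
```

- The key `125d` lands in the second slot, which is the entry whose end key is `172d`, matching the example in the text.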
- the data management unit 21 merges adjacent divided areas of the key size table T400 so that the parallel number specified by the user matches the number of divided areas. At this time, it is desirable that the data size of each divided area after merging is equal.
- For example, the key size table T400 whose key distribution is shown in FIG. 7B has four divided areas, so when the parallel number is two they must be merged into two divided areas. In that case, the data management unit 21 merges the entry whose key (T401) is “034a” and the entry “172d” into one divided area, and merges the entry whose key (T401) is “328b” and the blank entry into another divided area.
- the data management unit 21 stores the merge result in the key (T301) of the partition table T300.
- In the merge process described above, if the number of entries in the key size table T400 is equal to or less than the parallel number specified by the user, the merge process is not executed and the number of entries is used as the parallel number.
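- The text does not specify how adjacent divided areas are chosen for merging beyond the goal of roughly equal sizes. One possible approach, shown purely as an assumption, is to close a merged area whenever the running total reaches the next ideal cut point, while always leaving enough areas for the remaining groups. The function `merge_adjacent` and the sizes used are hypothetical.

```python
def merge_adjacent(keys, sizes, parallel):
    """Return the end keys of the merged areas (at most parallel - 1 keys).

    keys:  end keys of every divided area except the last (which has none)
    sizes: sampled data size of every divided area, len(keys) + 1 values
    """
    n = len(sizes)
    if n <= parallel:
        return list(keys)  # nothing to merge
    total = sum(sizes)
    boundaries = []
    running = 0
    for i, size in enumerate(sizes[:-1]):
        running += size
        groups_after = parallel - len(boundaries) - 1  # groups still to form after this one
        if groups_after == 0:
            break
        areas_after = n - (i + 1)
        ideal_cut = total * (len(boundaries) + 1) / parallel
        # Cut when the ideal share is reached, or when waiting any longer
        # would leave fewer areas than groups.
        if running >= ideal_cut or areas_after == groups_after:
            boundaries.append(keys[i])
    return boundaries
```

- With keys `["034a", "172d", "328b"]`, hypothetical sizes `[10, 40, 30, 20]`, and a parallel number of 2, the result is `["172d"]`, which corresponds to merging the first two and the last two divided areas as in the example above.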
- The above is the processing in step S202.
- the data management unit 21 calculates the data size of all data sets that may be combined in the analysis process (step S203). Further, the data management unit 21 creates a key size table T400 based on the calculation result.
- The data management unit 21 refers to the data management table T100 and acquires the division table name T102 of each data set. Furthermore, the data management unit 21 acquires a list of the corresponding division tables T200 based on the acquired division table names T102.
- the data management unit 21 creates a key size table T400 including the key (T203) of the acquired division table T200. Further, the data management unit 21 calculates the data size of each divided area for each division table T200, and adds the calculated data size to the size (T402) of the created key size table T400.
- the key size table T400 regarding all the existing data sets existing on the distributed file system can be created by executing the same processing for all the obtained divided tables T200.
- a key size table T400 as shown in FIG. 7A is created by executing the above-described processing on the division table T200 shown in FIGS. 5A and 5B.
- The above is the processing in step S203.
- the data management unit 21 performs a grouping process on the raw data based on the partition table T300 representing the merge result in step S202 (step S204).
- the grouping process is a process of collecting records included in the raw data for each key (user ID in the example shown in FIG. 20).
- the data management unit 21, the data addition unit (Map) 33, and the data addition unit (Reduce) 34 execute processing in cooperation.
- the data addition unit (Map) 33 and the data addition unit (Reduce) 34 each execute parallel processing in accordance with an instruction from the data management unit 21.
- the number of entries in the partition table T300 is the degree of parallelism of the data adding unit (Reduce) 34 to which tasks are allocated.
- the degree of parallelism of the data adding unit (Map) 33 to which tasks are assigned is irrelevant to the number of entries in the partition table T300 and is specified by the user.
- the data management unit 21 divides the raw data so that the data size is constant according to the parallel number specified by the user. Further, the data management unit 21 calculates an offset position, which is each division position of the division area generated by dividing the raw data, and a data size of the division area. The offset position is adjusted by scanning a part of the raw data so as to coincide with the record boundary.
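A minimal sketch of this division follows. It assumes that records are terminated by a single-byte newline separator, which the text does not specify; the function name `split_raw_data` is also hypothetical.

```python
def split_raw_data(data: bytes, parallel: int, sep: bytes = b"\n"):
    """Return (offset, size) pairs for roughly equal splits of `data`.

    Each tentative division position is moved forward to the next record
    boundary by scanning a small part of the data. A single-byte record
    separator is assumed.
    """
    size = len(data)
    chunk = max(1, size // parallel)
    starts = [0]
    for i in range(1, parallel):
        pos = i * chunk
        if pos >= size:
            break
        # A position is a boundary when the byte before it is the separator.
        hit = data.find(sep, pos - 1)
        boundary = size if hit == -1 else hit + len(sep)
        if starts[-1] < boundary < size:
            starts.append(boundary)
    ends = starts[1:] + [size]
    return [(s, e - s) for s, e in zip(starts, ends)]


raw = b"aa\nbbbb\ncc\ndddddd\n"
splits = split_raw_data(raw, 2)
```

Each returned pair corresponds to the offset position and data size that are sent to one Map task.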
- the data management unit 21 generates Map tasks for the number of parallels designated by the user in cooperation with the process management unit 22 and assigns the generated Map tasks to the data addition units (Map) 33. At this time, the offset position of the divided area, the data size of the divided area, and the file name of the raw data are transmitted to each data adding unit (Map) 33.
- the data management unit 21 generates Reduce tasks for the number of entries in the partition table T300 in cooperation with the process management unit 22.
- the data management unit 21 associates each entry of the partition table T300 with the data addition unit (Reduce) 34.
- the data management unit 21 assigns a Reduce task for processing a divided region in the key range corresponding to the key (T301) to each associated data addition unit (Reduce) 34.
- the data management unit 21 transmits the entry corresponding to the transmitted key range in the key size table T400 created in step S202, to the data addition unit (Reduce) 34.
- the corresponding entries of the key size table T400 are the first entry and the second entry of FIG. 7A. Therefore, the data management unit 21 transmits the first entry and the second entry to the corresponding data addition unit (Reduce) 34.
- the data management unit 21 acquires the destination information (address: port number) of the data adding unit (Reduce) 34 and stores the acquired destination information in the destination T302 of the corresponding entry in the partition table T300.
- the process management unit 22 transmits the completed partition table T300 to all the data addition units (Maps) 33.
- The above is the processing in step S204.
- the data adding unit (Map) 33 and the data adding unit (Reduce) 34 in step S204 execute the data output process after the grouping process is executed. Details of the grouping process will be described later with reference to FIG. 10, and details of the data output process will be described later with reference to FIG.
- the data management unit 21 updates the division table T200 and ends the process (step S205).
- the data management unit 21 updates the division table T200 managed by itself based on the division table T200 received from each data addition unit (Reduce) 34.
- the received division table T200 is a table after the data adding unit (Reduce) 34 has executed processing (see FIGS. 10 and 11) described later.
- the data adding unit (Reduce) 34 processes only a data set in a part of the key range.
- the present embodiment is characterized in that all the division tables T200 in the data analysis system are updated based on the division table T200 updated by one data addition unit (Reduce) 34.
- the data management unit 21 merges the division tables T200 of the input raw data received from each data addition unit (Reduce) 34 into one, and manages the merged table as the division table T200 of the raw data.
- the data management unit 21 adds an entry corresponding to the raw data division table T200 to the data management table T100.
- next, details of the grouping process in step S204 will be described.
- FIG. 10 is a flowchart for explaining the details of the grouping process in the first embodiment of the present invention.
- the slave node 30 performs a sort process on the input raw data (step S301).
- the data adding unit (Map) 33 reads records one by one from the raw data.
- the data adding unit (Map) 33 acquires the destination information of the data adding unit (Reduce) 34 from the partition table T300 based on the key of the read record. That is, the data adding unit (Reduce) 34 for processing the read record is specified.
- the data adding unit (Map) 33 classifies each record read for each destination.
- the record group classified for each destination is referred to as a segment.
- the data adding unit (Map) 33 reads all the records included in the divided raw data that it is in charge of, and then sorts the records included in each segment based on the key.
- The above is the processing in step S301.
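The Map-side classification and sort in step S301 can be sketched as follows. The representation of the partition table T300 as a list of (upper-bound key, destination) pairs, the destination strings, and the function name are assumptions made for illustration.

```python
from bisect import bisect_right
from collections import defaultdict


def build_segments(records, partition_table):
    """Classify records by destination and sort each segment by key.

    records: iterable of (key, value).
    partition_table: list of (upper_bound_key, destination) sorted by key;
                     the last entry is open-ended (key None).
    """
    bounds = [key for key, _ in partition_table[:-1]]
    segments = defaultdict(list)
    for key, value in records:
        # A key range is "less than the bound", so a key equal to a bound
        # belongs to the next entry.
        _, destination = partition_table[bisect_right(bounds, key)]
        segments[destination].append((key, value))
    for segment in segments.values():
        segment.sort(key=lambda record: record[0])
    return dict(segments)


t300 = [("172d", "slave1:5001"), (None, "slave2:5001")]  # hypothetical destinations
records = [("034a", "r1"), ("200a", "r2"), ("015d", "r3"), ("172d", "r4")]
segments = build_segments(records, t300)
```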
- the slave node 30 transmits the sorted segments to the data adding unit (Reduce) 34 (step S302).
- the data adding unit (Map) 33 transmits the sorted segments to the data adding unit (Reduce) 34 corresponding to the destination information acquired in Step S301.
- Each data adding unit (Reduce) 34 receives a segment transmitted from the data adding unit (Map) 33 of each slave node 30.
- the slave node 30 that has received the segment from the data adding unit (Map) 33 merges the received segment based on the key, and ends the process (step S303).
- the data adding unit (Reduce) 34 sequentially reads all received segments, and merges and joins the segments having the same key.
- the data adding unit (Reduce) 34 converts the records included in the merged segment into structured data as shown in FIG. By the process described above, a plurality of records are collected into one record having the same key.
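The Reduce-side merge in step S303 might look like the sketch below. The structured output format is not reproduced here, so a (key, list of values) pair is used as a stand-in, and the function name is hypothetical.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def merge_segments(segments):
    """Merge key-sorted segments and collect records with the same key.

    segments: list of lists of (key, value), each already sorted by key.
    Returns a list of (key, [values]) with one entry per key.
    """
    merged = heapq.merge(*segments, key=itemgetter(0))
    return [
        (key, [value for _, value in group])
        for key, group in groupby(merged, key=itemgetter(0))
    ]


received = [
    [("015d", "a"), ("034a", "b")],  # segment from one Map task
    [("015d", "c"), ("020c", "d")],  # segment from another Map task
]
grouped = merge_segments(received)
```

Because every segment arrives sorted, the merge only needs a single sequential pass over the inputs.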
- FIG. 11 is a flowchart for explaining data output processing in the first embodiment of the present invention.
- the data adding unit (Reduce) 34 outputs structured data having a format as shown in FIG. 22A to the distributed file system by executing a data output process. The data adding units (Reduce) 34 execute as many tasks as the degree of parallelism, and each data adding unit (Reduce) 34 outputs a file with a different name.
- the data adding unit (Reduce) 34 adds the data size of the raw data to the key size table T400, and calculates the data size of each divided area after the raw data is added.
- the data adding unit (Reduce) 34 executes a division process of the divided area when there is a divided area having a data size equal to or larger than a predetermined threshold.
- the data adding unit (Reduce) 34 also updates the division table T200 of the existing data set managed by itself when the division process of the divided area is executed. Further, the data adding unit (Reduce) 34 transmits the updated division table T200 to the data management unit 21. Based on the updated division table T200, the data management unit 21 executes an update process (step S205) of the division table T200.
- the data adding unit (Reduce) 34 creates a division table T200 for the input raw data, and transmits the division table T200 created after the processing is completed to the data management unit 21.
- the data adding unit (Reduce) 34 creates a key size table T400 in which only the key included in the key size table T400 received from the data management unit 21 in step S204 is stored.
- the created key size table T400 is a table in which the data size of a predetermined divided area of the raw data is stored.
- the created key size table T400 is also referred to as an additional key size table T400.
- the initial value of the size T402 is set to “0”.
- the key size table T400 received from the data management unit 21 is a table for managing the data sizes of all data sets on the distributed file system included in the key range handled by the data addition unit (Reduce) 34.
- the key size table T400 is referred to as an all data key size table T400.
- when the data output process is started, the data adding unit (Reduce) 34 outputs the record created in step S303 and determines whether or not the record is included in a divided area different from that of the previously output record (step S401).
- the data adding unit (Reduce) 34 refers to the key (T401) of the key size table T400 for addition, and determines whether the output record is included in a divided area different from that of the previously output record.
- because the records sorted based on the keys are output sequentially, it can be determined whether or not an output record is included in a predetermined key range, that is, a predetermined divided area.
- the first output record is determined to be included in the same divided area.
- the data adding unit (Reduce) 34 executes a process for confirming the data size of the divided area to which the previous record was added (step S405), and proceeds to step S402.
- the data size confirmation process will be described later with reference to FIG.
- the data adding unit (Reduce) 34 writes the record created in Step S303 to the distributed file system (Step S402).
- the data adding unit (Reduce) 34 creates record statistical information including the key value of the written record, the offset position in the file to which the record is written, and the data size of the record, and saves the created record statistical information. This is the record statistical information of the raw data.
- the data adding unit (Reduce) 34 updates the key size table T400 (step S403).
- the data adding unit (Reduce) 34 specifies a divided area of the key range including the key of the record written in step S402.
- the data adding unit (Reduce) 34 searches the addition key size table T400 and the all data key size table T400 for an entry corresponding to the specified divided area. Further, the data adding unit (Reduce) 34 adds the data size of the written record to the size T402 of the corresponding entry in each key size table T400.
- the data adding unit (Reduce) 34 determines whether all records have been output (step S404).
- the data adding unit (Reduce) 34 returns to Step S401 and executes the same processing.
- the data size confirmation processing in step S406 is the same as the processing in step S405.
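The loop of steps S401 to S406 can be outlined as below. This is a simplified sketch: writing to the distributed file system is replaced by an in-memory offset counter, the data size confirmation is represented by a callback, and all names (`output_records`, `on_area_done`) are hypothetical.

```python
from bisect import bisect_right


def output_records(records, bounds, all_data_sizes, on_area_done):
    """Output key-sorted records and keep the key size tables up to date.

    records: key-sorted list of (key, payload bytes).
    bounds: sorted upper-bound keys of the divided areas (last area open-ended).
    all_data_sizes: per-area sizes of all data sets; updated in place.
    on_area_done: called with (area index, size) when output moves to the
                  next area (S405) and once more at the end (S406).
    Returns (sizes added per area, record statistical information).
    """
    added_sizes = [0] * (len(bounds) + 1)  # key size table for addition
    stats, offset, prev_area = [], 0, None
    for key, payload in records:
        area = bisect_right(bounds, key)
        if prev_area is not None and area != prev_area:      # S401
            on_area_done(prev_area, all_data_sizes[prev_area])  # S405
        stats.append((key, offset, len(payload)))            # S402
        offset += len(payload)
        added_sizes[area] += len(payload)                    # S403
        all_data_sizes[area] += len(payload)
        prev_area = area
    if prev_area is not None:
        on_area_done(prev_area, all_data_sizes[prev_area])   # S406
    return added_sizes, stats


checked = []
sizes = [480, 300]  # hypothetical sizes of the existing data sets
added, stats = output_records(
    [("010a", b"xxxx"), ("020b", b"yy"), ("100c", b"zzz")],
    ["034a"],
    sizes,
    lambda area, size: checked.append((area, size)),
)
```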
- FIG. 12 is a flowchart for explaining the data size confirmation processing in the first embodiment of the present invention.
- the data adding unit (Reduce) 34 refers to the all data key size table T400 updated in step S403, and determines whether or not the data size of the target divided area is larger than a predetermined reference value (step S501). That is, it is determined whether or not the divided area to which the raw data has been added is larger than the predetermined reference value.
- the target divided area is a divided area including the record input last time.
- the target divided area is also referred to as a target area.
- the data adding unit (Reduce) 34 refers to the size T402 of the corresponding entry in the all data key size table T400, and determines whether or not the data size of the target area is larger than a predetermined reference value.
- the data adding unit (Reduce) 34 proceeds to step S506.
- the data adding unit (Reduce) 34 acquires the existing data set division table T200 from the master node 20 (step S502).
- the data adding unit (Reduce) 34 may store the division table T200 acquired from the master node 20 as a cache.
- the data adding unit (Reduce) 34 specifies an end position of the target area in each acquired division table T200, that is, an offset (step S503).
- the data adding unit (Reduce) 34 refers to each acquired division table T200 based on the key of the target area, and acquires an entry corresponding to the target area. That is, the data file name T202 and the offset T204 of the data corresponding to the target area are acquired. This process is executed for all the divided tables T200 acquired in step S502.
- the data adding unit (Reduce) 34 acquires information from the first entry of the division table T200 shown in FIGS. 5A and 5B.
- the acquired (data file name, offset) pairs are (/log01/001.dat, 280) in FIG. 5A and (/log02/002.dat, 200) in FIG. 5B.
- the acquired offset is the end position of the target area in each division table T200.
- the data adding unit (Reduce) 34 analyzes the records included in the target area of each existing data set (step S504).
- the data adding unit (Reduce) 34 reads the records included in the target area of each existing data set. For example, when there are data sets whose data IDs (T101) are “log01” and “log02”, records are read from the target area of the “log01” data set and from the target area of the “log02” data set.
- the data adding unit (Reduce) 34 acquires record statistical information including the key of the read record, the data size of the record, and the offset position of the record in the file.
- the analysis processing of the record may be executed in parallel for each data set.
- the data adding unit (Reduce) 34 combines the record statistical information of the raw data acquired in step S402 and the record statistical information of the existing data set to obtain record statistical information of all data sets on the distributed file system.
- the data adding unit (Reduce) 34 determines the value of the key to be a division position for re-division based on the record statistical information of all the created data sets (step S505).
- the data adding unit (Reduce) 34 calculates the data size in the target area based on the record statistical information of all data sets.
- the data adding unit (Reduce) 34 calculates the number of divisions in the target area based on the calculated data size and a predetermined reference value.
- the data adding unit (Reduce) 34 divides the data size of the target area by the calculated number of divisions, and calculates the data size of the divided areas after re-division.
- the data adding unit (Reduce) 34 sorts the record statistical information entries of all data sets by key, and then calculates the cumulative value distribution of the data size of the records. That is, the distribution of the data size of each record included in the predetermined key range in the distributed file system is calculated.
- the data adding unit (Reduce) 34 determines, based on the calculated cumulative value distribution, a point where the data size of the record is an integral multiple of the data size of the divided area after the division, as a division position for re-division. If it is not an integer multiple, the record closest to the data size is determined as the division position.
- the key at the re-division position may be a key that exists as data, or a key that does not exist as data.
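The determination of the division positions in step S505 can be sketched as follows. Several details are assumptions: the number of divisions is taken as the total size divided by the reference value, rounded up; records with the same key are kept in the same area; and the returned key is the first key of the following area, because a key range is expressed as "less than" its division key. The function name and the sizes are hypothetical.

```python
import math
from itertools import accumulate


def choose_division_keys(record_stats, reference_size):
    """Pick division keys so that the new areas have roughly equal size.

    record_stats: iterable of (key, size) for all records of all data sets
                  in the target area.
    reference_size: predetermined reference value for the size of an area.
    """
    per_key = {}
    for key, size in record_stats:
        per_key[key] = per_key.get(key, 0) + size  # same key stays together
    if not per_key:
        return []
    keys = sorted(per_key)
    cumulative = list(accumulate(per_key[k] for k in keys))
    total = cumulative[-1]
    divisions = math.ceil(total / reference_size)  # assumed rounding rule
    area_size = total / divisions
    division_keys = []
    for i in range(1, divisions):
        goal = i * area_size
        # Record whose cumulative size is closest to a multiple of area_size.
        idx = min(range(len(cumulative)), key=lambda j: abs(cumulative[j] - goal))
        if idx + 1 < len(keys):
            candidate = keys[idx + 1]
            if not division_keys or candidate > division_keys[-1]:
                division_keys.append(candidate)
    return division_keys


stats = [("001a", 100), ("010b", 150), ("015d", 120), ("020c", 130)]
division_keys = choose_division_keys(stats, 300)
```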
- the data adding unit (Reduce) 34 specifies the offset corresponding to each determined key range with reference to the record statistical information of all data sets.
- the data adding unit (Reduce) 34 adds an entry corresponding to the divided area after the re-division to each division table T200. Further, the data adding unit (Reduce) 34 deletes the entry corresponding to the divided area before the re-division from each division table T200.
- a divided region whose key range is less than “034a” is divided into two divided regions: a divided region whose key range is less than “015d” and a divided region whose key range is “015d” or more and less than “034a”.
- the division table T200 shown in FIGS. 5A and 5B is changed as shown in FIGS. 14A and 14B.
- a portion indicated by a thick line in the figure is a changed portion.
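The update of one division table T200 after re-division can be illustrated with the sketch below. It assumes a table of (key, end offset) entries for a single data file and record statistical information sorted by key. The offsets other than 280 are made up, and the function name is hypothetical.

```python
def redivide_table(table, target_key, division_keys, record_stats):
    """Replace the entry of the target area with entries for the new areas.

    table: list of (key, end_offset) entries of one data set.
    target_key: key (upper bound) of the divided area being re-divided.
    division_keys: new division keys inside the target area, sorted.
    record_stats: key-sorted list of (key, offset, size) for the records
                  of this data set in the target area.
    """
    updated = []
    for key, end_offset in table:
        if key != target_key:
            updated.append((key, end_offset))
            continue
        for division_key in division_keys:
            # A new area ends where the first record at or above its key starts.
            end = next(
                (off for k, off, _ in record_stats if k >= division_key),
                end_offset,
            )
            updated.append((division_key, end))
        updated.append((key, end_offset))  # last new area keeps the old bound
    return updated


table = [("034a", 280), ("172d", 500)]
stats = [("001a", 0, 100), ("015d", 100, 120), ("020c", 220, 60)]
new_table = redivide_table(table, "034a", ["015d"], stats)
```

The same division keys are applied to the division table of every data set, so the division positions remain identical across data sets.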
- the data adding unit (Reduce) 34 also changes the addition key size table T400 and the all data key size table T400 based on the record statistical information.
- when the all data key size table T400 before re-division is the table shown in FIG. 13, the table is changed as shown in FIG. A portion indicated by a thick line in the figure is a changed portion.
- the data adding unit (Reduce) 34 updates the division table T200 (step S506).
- the data adding unit (Reduce) 34 stores the entry of the divided area corresponding to the raw data division table T200 based on the key size table for addition and the record statistical information of the raw data. That is, the raw data division table T200 is generated.
- the data adding unit (Reduce) 34 deletes the record statistical information used in the above-described processing, and ends the processing (step S507).
- the present invention can also support a storage method (column division storage method) in which data items are stored in different files.
- the configuration of the data analysis system is the same as that of the first embodiment, and thus the description thereof is omitted.
- the hardware configuration and software configuration of the master node 20 and the slave node 30 are the same as those in the first embodiment, and thus description thereof is omitted.
- FIG. 16 is an explanatory diagram showing a record schema in the second embodiment of the present invention.
- FIG. 17 is an explanatory diagram illustrating an example of a record according to the second embodiment of this invention.
- the record of the second embodiment newly includes the age of the user.
- the user ID is used as the key.
- FIGS. 18A, 18B, and 18C are explanatory diagrams showing files in the second embodiment of the present invention.
- FIGS. 18A, 18B, and 18C show examples in which the above-described data is stored in files using the column division method.
- the user ID is stored in a file named log/001.key.dat (FIG. 18A), the movement history in a file named log/001.rec.dat (FIG. 18B), and the age in a file named log/001.age.dat (FIG. 18C).
- FIG. 19 is an explanatory diagram showing an example of a division table T200 according to the second embodiment of the present invention.
- the division table T200 in the second embodiment is different from the first embodiment in that the data file name T202 and the offset T204 are stored for each item (user ID, movement history, and age).
- a key value representing a division position is stored in the key (T203).
- in step S101, when the key size table T400 is created, the data management unit 21 refers to the offsets of the items used for the analysis processing in the division table T200 and calculates the size of each divided area.
- the size of the key size table is obtained using only the “uid” offset and the “age” offset. At this time, the offset for “rec” is not used.
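For the column division storage method, the size calculation can be sketched as follows. The table layout (one end offset per item in each entry), the offset values, and the function name are assumptions. The item names “uid”, “rec”, and “age” are taken from the text.

```python
def area_sizes_for_items(division_table, items):
    """Compute the size of each divided area from the offsets of some items.

    division_table: list of (key, {item: end_offset}) entries.
    items: names of the items used by the analysis processing.
    """
    sizes = {}
    previous = {item: 0 for item in items}
    for key, offsets in division_table:
        # Only the files of the items in use contribute to the size.
        sizes[key] = sum(offsets[item] - previous[item] for item in items)
        previous = {item: offsets[item] for item in items}
    return sizes


t200 = [
    ("034a", {"uid": 40, "rec": 280, "age": 20}),
    ("172d", {"uid": 90, "rec": 600, "age": 45}),
]
sizes = area_sizes_for_items(t200, ["uid", "age"])
```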
- each slave node 30 to which a task is assigned reads a number of files equal to the product of the number of files used for the analysis processing and the number of items used for the analysis processing.
- in step S203, the data management unit 21 creates the key size table T400 of the existing data sets from the offsets for each item in the division tables T200 of all data sets that may be combined.
- in step S402, when a record is output, each item of the record is output to a separate file. Therefore, in step S402, record statistical information including the key value of the written record, the offset in the written file, and the data size is stored for each item.
- in step S403, the sum of the sizes of the divided areas of all items is added to the corresponding entry in the key size table T400.
- the data adding unit (Reduce) 34 determines the key of the division position using the sum of the data sizes of the divided areas of all items as the data size of the data set.
- the data adding unit (Reduce) 34 calculates the offset of the division position for each item using the determined key and record statistical information, and updates the division table T200.
- the case where three items are processed has been described.
- the number of items managed in the division table T200 can be changed to any number of items.
- the data analysis system can execute the combining process in the analysis process in parallel because the division positions of the respective data sets are the same.
- the divided area can be subdivided so that the processing amount between tasks becomes uniform. As a result, it is possible to eliminate processing imbalance between tasks and to combine records for each distributed area during the combining process.
Claims (16)
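- A computer system in which a plurality of computers execute, in parallel, analysis processing on data sets each including a plurality of pieces of data composed of a key and a data value, wherein
each of the computers has a processor, a memory connected to the processor, a storage device connected to the processor, and a network interface connected to the processor,
each of the computers holds, for each data set, division information for managing division position keys, each of which is a key indicating a division position of the divided areas obtained by dividing the data set for each predetermined key range,
all the division position keys included in the division information of the respective data sets are identical,
a file system that stores the data sets is configured on the storage areas of the plurality of computers, and
the computer system:
generates a plurality of tasks for each of the divided areas when executing the analysis processing;
assigns the generated tasks to the computers, and executes the analysis processing by joining the data included in the divided areas of the respective data sets;
when a new data set is stored in the file system, determines, based on the data size of each divided area after the new data set is stored, whether there is a target area, which is a divided area whose data size is larger than a predetermined threshold; and
when it is determined that the target area exists, divides the target area into a plurality of new divided areas.
- The computer system according to claim 1, wherein, when storing a new data set in the file system, the computer system analyzes the key distribution of the new data set, and
generates, based on the analysis result, the division information of the new data set so that its division position keys are identical to all the division position keys included in the division information of the existing data sets.
- The computer system according to claim 2, wherein, after the target area is divided, the computer system updates the division position keys in the division information of the existing data sets.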
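- The computer system according to claim 3, wherein the computer system:
when determining whether the target area exists, calculates a first data size, which is the data size of a divided area in the computer system, by summing the data sizes of that divided area in all the data sets;
determines whether there is a divided area for which the calculated first data size is larger than the predetermined threshold;
when dividing the target area, calculates a second data size, which is the data size of the target area in the computer system, by summing the data sizes of the target area in all the data sets;
calculates the number of divisions of the target area based on the predetermined threshold and the calculated second data size;
determines new division position keys in the target area based on the calculated number of divisions;
when updating the division position keys in the division information of the existing data sets, deletes the information corresponding to the target area from the division information of the existing data sets, and adds information associating the determined division position keys with the new divided areas; and
when generating the division information of the new data set, generates the division information of the new data set so that it is identical to the division keys in the updated division information of the existing data sets.
- The computer system according to claim 4, wherein, when dividing the target area, the computer system calculates a third data size by dividing the data size of the target area by the calculated number of divisions, and
determines, as the division position key, the key of the data corresponding to the calculated third data size.
- The computer system according to claim 4, wherein the predetermined threshold is a data size at which the processing time of a task to which a new divided area is assigned is equal to or less than a preset allowable time.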
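- The computer system according to claim 4, wherein the data includes a data value for each of a plurality of items, and
when calculating the first data size, the computer system calculates the first data size by summing the data sizes of all the items in the divided area.
- The computer system according to claim 2, wherein, when analyzing the key distribution of the new data set, the computer system divides the new data set at division position keys that match any of the division position keys included in the division information of the existing data sets to generate a plurality of processing divided areas, and
generates, for each of the generated processing divided areas, a task for analyzing the key distribution of the new data set, and executes the tasks in parallel.
- A data management method in a computer system in which a plurality of computers execute, in parallel, analysis processing on data sets each including a plurality of pieces of data composed of a key and a data value, wherein
each of the computers has a processor, a memory connected to the processor, a storage device connected to the processor, and a network interface connected to the processor,
each of the computers holds, for each data set, division information for managing division position keys, each of which is a key indicating a division position of the divided areas obtained by dividing the data set for each predetermined key range,
all the division position keys included in the division information of the respective data sets are identical,
a file system that stores the data sets is configured on the storage areas of the plurality of computers, and
the method comprises:
a first step in which at least one of the computers generates a plurality of tasks for each of the divided areas when executing the analysis processing;
a second step in which the computer that generated the tasks assigns the generated tasks to the computers and causes them to execute the analysis processing by joining the data included in the divided areas of the respective data sets;
a third step in which at least one of the computers, when a new data set is stored in the file system, determines, based on the data size of each divided area after the new data set is stored, whether there is a target area, which is a divided area whose data size is larger than a predetermined threshold; and
a fourth step in which the computer that executed the determination divides the target area into a plurality of new divided areas when it is determined that the target area exists.
- The data management method according to claim 9, wherein the third step includes:
a fifth step of analyzing the key distribution of the new data set; and
a sixth step of generating, based on the analysis result, the division information of the new data set so that its division position keys are identical to all the division position keys included in the division information of the existing data sets.
- The data management method according to claim 10, wherein the fourth step includes a seventh step of updating the division position keys in the division information of the existing data sets after the target area is divided.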
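- The data management method according to claim 11, wherein the third step includes:
an eighth step of calculating a first data size, which is the data size of a divided area in the computer system, by summing the data sizes of that divided area in all the data sets; and
a ninth step of determining whether there is a divided area for which the calculated first data size is larger than the predetermined threshold,
the fourth step includes:
a tenth step of calculating a second data size, which is the data size of the target area in the computer system, by summing the data sizes of the target area in all the data sets;
an eleventh step of calculating the number of divisions of the target area based on the predetermined threshold and the calculated second data size; and
a twelfth step of determining new division position keys in the target area based on the calculated number of divisions,
the seventh step includes a thirteenth step of deleting the information corresponding to the target area from the division information of the existing data sets and adding information associating the determined division position keys with the new divided areas, and
the sixth step includes a fourteenth step of generating the division information of the new data set so that it is identical to the division keys in the updated division information of the existing data sets.
- The data management method according to claim 12, wherein the twelfth step includes:
a step of calculating a third data size by dividing the data size of the target area by the calculated number of divisions; and
a step of determining, as the division position key, the key of the data corresponding to the calculated third data size.
- The data management method according to claim 12, wherein the predetermined threshold is a data size at which the processing time of a task to which a new divided area is assigned is equal to or less than a preset allowable time.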
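- The data management method according to claim 12, wherein the data includes a data value for each of a plurality of items, and
in the eighth step, the first data size is calculated by summing the data sizes of all the items in the divided area.
- The data management method according to claim 10, wherein the fifth step includes:
a step of dividing the new data set at division position keys that match any of the division position keys included in the division information of the existing data sets to generate a plurality of processing divided areas; and
a step of generating, for each of the generated processing divided areas, a task for analyzing the key distribution of the new data set, and causing the tasks to be executed in parallel on the computers.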
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2011/057940 WO2012131927A1 (ja) | 2011-03-30 | 2011-03-30 | Computer system and data management method |
US13/977,849 US20130297788A1 (en) | 2011-03-30 | 2011-03-30 | Computer system and data management method |
JP2013506934A JP5342087B2 (ja) | 2011-03-30 | 2011-03-30 | Computer system and data management method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2011/057940 WO2012131927A1 (ja) | 2011-03-30 | 2011-03-30 | Computer system and data management method |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2012131927A1 (ja) | 2012-10-04 |
Family
ID=46929753
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2011/057940 WO2012131927A1 (ja) | Computer system and data management method | 2011-03-30 | 2011-03-30 |
Country Status (3)
Country | Link |
---|---|
US (1) | US20130297788A1 (ja) |
JP (1) | JP5342087B2 (ja) |
WO (1) | WO2012131927A1 (ja) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2014195818A1 (en) * | 2013-06-07 | 2014-12-11 | International Business Machines Corporation | Method and system for efficient sorting in a relational database |
JP2018036885A (ja) * | 2016-08-31 | 2018-03-08 | ヤフー株式会社 | 情報処理装置、情報処理システム、情報処理プログラムおよび情報処理方法 |
JP2020107010A (ja) * | 2018-12-27 | 2020-07-09 | 富士通株式会社 | 情報処理プログラム、情報処理装置及び情報処理方法 |
CN118060744A (zh) * | 2024-04-16 | 2024-05-24 | 成都沃特塞恩电子技术有限公司 | 用于物料切割的可视化系统及方法 |
Families Citing this family (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP5825122B2 (ja) * | 2012-01-31 | 2015-12-02 | 富士通株式会社 | 生成プログラム、生成方法、および生成システム |
US10218775B2 (en) * | 2013-08-28 | 2019-02-26 | Usablenet Inc. | Methods for servicing web service requests using parallel agile web services and devices thereof |
US20150286409A1 (en) * | 2014-04-08 | 2015-10-08 | Netapp, Inc. | Storage system configuration analysis |
US10223379B2 (en) | 2014-09-04 | 2019-03-05 | International Business Machines Corporation | Parallel processing of a keyed index file system |
US10157218B2 (en) | 2015-06-30 | 2018-12-18 | Researchgate Gmbh | Author disambiguation and publication assignment |
US10133807B2 (en) * | 2015-06-30 | 2018-11-20 | Researchgate Gmbh | Author disambiguation and publication assignment |
US9928291B2 (en) * | 2015-06-30 | 2018-03-27 | Researchgate Gmbh | Author disambiguation and publication assignment |
CN106201673B (zh) * | 2016-06-24 | 2019-07-09 | 中国石油天然气集团公司 | 一种地震数据处理方法及装置 |
JP6844414B2 (ja) * | 2017-05-23 | 2021-03-17 | 富士通株式会社 | 分散データ管理プログラム、分散データ管理方法及び分散データ管理装置 |
US9934287B1 (en) * | 2017-07-25 | 2018-04-03 | Capital One Services, Llc | Systems and methods for expedited large file processing |
US10873529B2 (en) * | 2017-12-01 | 2020-12-22 | Futurewei Technologies, Inc. | Method and apparatus for low latency data center network |
US10715499B2 (en) * | 2017-12-27 | 2020-07-14 | Toshiba Memory Corporation | System and method for accessing and managing key-value data over networks |
US10855767B1 (en) * | 2018-03-05 | 2020-12-01 | Amazon Technologies, Inc. | Distribution of batch data to sharded readers |
CN109033355B (zh) * | 2018-07-25 | 2021-07-06 | 北京易观智库网络科技有限公司 | 进行漏斗分析的方法、装置以及存储介质 |
CN111045825A (zh) * | 2019-12-12 | 2020-04-21 | 深圳前海环融联易信息科技服务有限公司 | 批处理性能优化方法、装置、计算机设备及存储介质 |
CN113934361B (zh) * | 2020-06-29 | 2024-05-03 | 伊姆西Ip控股有限责任公司 | 用于管理存储系统的方法、设备和计算机程序产品 |
CN112799820B (zh) * | 2021-02-05 | 2024-06-11 | 拉卡拉支付股份有限公司 | 数据处理方法、装置、电子设备、存储介质及程序产品 |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH1097544A (ja) * | 1996-09-20 | 1998-04-14 | Hitachi Ltd | データベース処理システム |
JP2010092222A (ja) * | 2008-10-07 | 2010-04-22 | Internatl Business Mach Corp <Ibm> | 更新頻度に基づくキャッシュ機構 |
JP2010128721A (ja) * | 2008-11-26 | 2010-06-10 | Nippon Telegr & Teleph Corp <Ntt> | 分散インデックス結合方法及びシステム |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5307485A (en) * | 1991-05-31 | 1994-04-26 | International Business Machines Corporation | Method and apparatus for merging sorted lists in a multiprocessor shared memory system |
US5671405A (en) * | 1995-07-19 | 1997-09-23 | International Business Machines Corporation | Apparatus and method for adaptive logical partitioning of workfile disks for multiple concurrent mergesorts |
US5842208A (en) * | 1997-04-09 | 1998-11-24 | International Business Machines Corporation | High performance recover/build index system by unloading database files in parallel |
US6728694B1 (en) * | 2000-04-17 | 2004-04-27 | Ncr Corporation | Set containment join operation in an object/relational database management system |
JP5238219B2 (ja) * | 2007-10-29 | 2013-07-17 | 株式会社東芝 | 情報処理システム及びパイプライン処理制御方法 |
CN101916261B (zh) * | 2010-07-28 | 2013-07-17 | 北京播思软件技术有限公司 | 一种分布式并行数据库系统的数据分区方法 |
-
2011
- 2011-03-30 WO PCT/JP2011/057940 patent/WO2012131927A1/ja active Application Filing
- 2011-03-30 JP JP2013506934A patent/JP5342087B2/ja not_active Expired - Fee Related
- 2011-03-30 US US13/977,849 patent/US20130297788A1/en not_active Abandoned
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2014195818A1 (en) * | 2013-06-07 | 2014-12-11 | International Business Machines Corporation | Method and system for efficient sorting in a relational database |
JP2016524758A (ja) * | 2013-06-07 | 2016-08-18 | インターナショナル・ビジネス・マシーンズ・コーポレーションInternational Business Machines Corporation | リレーショナル・データベースにおける効率的なソートのための方法およびシステム |
US9916339B2 (en) | 2013-06-07 | 2018-03-13 | International Business Machines Corporation | Efficient sorting in a relational database |
JP2018036885A (ja) * | 2016-08-31 | 2018-03-08 | ヤフー株式会社 | 情報処理装置、情報処理システム、情報処理プログラムおよび情報処理方法 |
JP2020107010A (ja) * | 2018-12-27 | 2020-07-09 | 富士通株式会社 | 情報処理プログラム、情報処理装置及び情報処理方法 |
JP7174245B2 (ja) | 2018-12-27 | 2022-11-17 | 富士通株式会社 | 情報処理プログラム、情報処理装置及び情報処理方法 |
CN118060744A (zh) * | 2024-04-16 | 2024-05-24 | 成都沃特塞恩电子技术有限公司 | 用于物料切割的可视化系统及方法 |
Also Published As
Publication number | Publication date |
---|---|
US20130297788A1 (en) | 2013-11-07 |
JPWO2012131927A1 (ja) | 2014-07-24 |
JP5342087B2 (ja) | 2013-11-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP5342087B2 (ja) | | Computer system and data management method |
Taylor | | An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics |
Ramakrishnan et al. | | Balancing reducer skew in MapReduce workloads using progressive sampling |
US12105712B2 (en) | | Distinct value estimation for query planning |
US8417991B2 (en) | | Mitigating reduction in availability level during maintenance of nodes in a cluster |
US9886441B2 (en) | | Shard aware near real time indexing |
US10381106B2 (en) | | Efficient genomic read alignment in an in-memory database |
US11012806B2 (en) | | Multi-adapter support in the cloud |
US12026159B2 (en) | | Transient materialized view rewrite |
US20090030880A1 (en) | | Model-Based Analysis |
US20200065415A1 (en) | | System For Optimizing Storage Replication In A Distributed Data Analysis System Using Historical Data Access Patterns |
CN118318230A (zh) | | Columnar cache query using a hybrid query execution plan |
JP6204753B2 (ja) | | Distributed query processing apparatus, processing method, and processing program |
US11449521B2 (en) | | Database management system |
JP4747213B2 (ja) | | System and program for collecting documents |
CN110851515B (zh) | | Big data ETL model execution method and medium based on a Spark distributed environment |
US11200213B1 (en) | | Dynamic aggregation of data from separate sources |
CN108595552A (zh) | | Data cube publishing method and apparatus, electronic device, and storage medium |
Liu et al. | | Planning your SQL-on-Hadoop deployment using a low-cost simulation-based approach |
JP6401617B2 (ja) | | Data processing apparatus, data processing method, and large-scale data processing program |
US9158767B2 (en) | | Lock-free indexing of documents |
US8990612B2 (en) | | Recovery of a document serving environment |
Chaudhary et al. | | AdMap: a framework for advertising using MapReduce pipeline |
Shi et al. | | PECC: parallel expansion based on clustering coefficient for efficient graph partitioning |
JP2011103106A (ja) | | Job management computer and job management method |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 11862701; Country of ref document: EP; Kind code of ref document: A1 |
| | ENP | Entry into the national phase | Ref document number: 2013506934; Country of ref document: JP; Kind code of ref document: A |
| | WWE | WIPO information: entry into national phase | Ref document number: 13977849; Country of ref document: US |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | EP: PCT application non-entry in European phase | Ref document number: 11862701; Country of ref document: EP; Kind code of ref document: A1 |