WO2021090357A1 - Data analysis device, data analysis method, and program
- Publication number
- WO2021090357A1 (PCT/JP2019/043242)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- attribute
- samples
- data
- subset
- items
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
Definitions
- the present invention relates to a data analyzer, a data analysis method and a program.
- Multidimensional data analysis is one of the common functions used in business intelligence technology.
- A feature of multidimensional data analysis is that the analyst can switch among multiple dimensions and examine the data from various angles. It is therefore indispensable to construct a dedicated database called a MOLAP (Multidimensional OnLine Analytical Processing) cube (hereinafter referred to as “cube”) in which all patterns are aggregated in advance.
- MOLAP: Multidimensional OnLine Analytical Processing
- The present invention has been made in view of the above circumstances, and an object of the present invention is to provide a data analysis device, a data analysis method, and a program capable of reducing the burden required for prior aggregation processing in multidimensional data analysis.
- One aspect of the present invention is a data analysis device having a data processing unit and a storage unit. The data processing unit executes: a first process of obtaining the required number of subset samples from multidimensional data including a plurality of attribute values for a plurality of attribute items and from the allowable sampling error with respect to the number of subset samples; a second process of calculating, from subsets whose numbers of samples are known, the estimated number of samples of each subset that matches attribute values of a plurality of attribute items; and a third process of aggregating the multidimensional data for those combinations of attribute values of a plurality of attribute items whose estimated number of subset samples, as calculated in the second process, satisfies the required number of subset samples obtained in the first process. The storage unit stores the result of aggregation by the data processing unit.
- FIG. 1 is a block diagram showing a configuration of an entire system including a data analysis server according to an embodiment.
- FIG. 2 is a flowchart showing the processing contents in the pre-aggregation phase according to the embodiment.
- FIG. 3 is a flowchart showing the processing contents in the analysis phase according to the embodiment.
- FIG. 4 is a diagram illustrating the relationship among the error with respect to the number of samples of the population, the population ratio, and the like according to the embodiment.
- FIG. 5 is a diagram illustrating the correct values and the estimated values of the number of population samples for the two attribute items “age” and “residential place (region)” according to the embodiment.
- FIG. 1 is a block diagram showing the configuration of the entire system including the data analysis server 10 according to the embodiment.
- the data analysis server 10 is connected to the network NW including the Internet.
- The data analysis server 10 is built around a data processing unit 11 that includes a processor. It has a database (DB) 12 that stores multidimensional data including a plurality of attribute items and attribute values, summary table data obtained by pre-aggregation, and the like; an input unit 13 for inputting multidimensional data and instruction commands for data analysis; and a display unit 14 for displaying a summary table or the like as a result of data analysis.
- DB database
- the specific circuit configuration of the data analysis server 10 as a hardware circuit is the same as that of a general database server, and its illustration and description will be omitted.
- The input/output of various data and the like in the data analysis server 10 is not limited to the input unit 13 and the display unit 14, and naturally includes the case where the input/output is performed by a client terminal (not shown) such as a personal computer or a smartphone via the network NW.
- the operation of the data analysis server 10 in this embodiment will be described.
- the operation will be described below assuming that the operation is divided into two phases, a “pre-aggregation phase” and an “analysis phase”.
- FIG. 2 is a flowchart showing the processing contents mainly centered on the data processing unit 11 including the processor in the pre-aggregation phase.
- The data processing unit 11 receives, for example via the network NW, the multidimensional data to be analyzed and the allowable sampling error (step S101).
- The data processing unit 11 calculates the lower limit of the number of samples in the population, which is a subset, based on the input multidimensional data and the allowable sampling error (step S102). If the purpose of the data analysis is to obtain the population ratio, then for a confidence coefficient of 95% the (lower limit) number of samples n of the population, that is, of the subset, required to keep the sampling error within ε is given by the following equation. That is,
- n: the required number of population samples
- P: population ratio
- ε: sampling error
- Z_x: the upper 100x% point of the standard normal distribution.
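- The equation itself is not reproduced in this text. As a hedged reconstruction, the standard sample-size formula for estimating a population ratio is consistent with the symbol definitions above. Writing the sampling error as ε and taking Z_{0.025} ≈ 1.96 for the 95% confidence coefficient are assumptions, not the source's confirmed notation:

$$
n = \left(\frac{Z_{0.025}}{\varepsilon}\right)^{2} P\,(1 - P), \qquad Z_{0.025} \approx 1.96
$$

- Under this form, P = 0.5 and ε = 0.025 give n ≈ 1,536.6, which agrees with the figure of 1,536 that appears later in the description.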
- the required population sample number n is largest when the population ratio P, that is, the ratio of the attribute values to be analyzed is 50%.
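- The dependence of n on P can be checked numerically. The following is a minimal sketch that assumes the standard sample-size formula for a population ratio, n = Z²P(1−P)/ε²; the function name and the value 1.96 for the 95% confidence coefficient are illustrative assumptions, not taken from the source.

```python
import math

Z_95 = 1.96  # assumed: upper 2.5% point of the standard normal distribution


def required_sample_size(p: float, eps: float, z: float = Z_95) -> int:
    """Smallest integer n for which z * sqrt(p * (1 - p) / n) <= eps."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if eps <= 0.0:
        raise ValueError("eps must be positive")
    return math.ceil(z * z * p * (1.0 - p) / (eps * eps))


if __name__ == "__main__":
    # The required n peaks at p = 0.5 and falls off symmetrically on both sides.
    for p in (0.1, 0.3, 0.5, 0.7, 0.9):
        print(p, required_sample_size(p, eps=0.025))
```

- Because P(1−P) is maximal at P = 0.5, sizing for P = 0.5 is the conservative choice when the ratio is not known in advance.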
- Because the sampling error ε is an error with respect to the number of samples of the population, the error becomes relatively large when the population ratio P is small, that is, when few samples have the attribute value to be analyzed.
- The required population sample size can also be expressed by the following equation using the error δ with respect to the number of samples m having the specified attribute value in the specified attribute item. That is,
- When the number of samples n in the population is 1,536 or less, it is not necessary to create a summary table for data analysis.
- Instead of the sampling error ε with respect to the number of samples n of the population, the sampling error δ with respect to the number of samples m having the specified attribute value in the specified attribute item is used.
- the sampling error becomes relatively large when the population ratio P is small and the number of attribute values to be analyzed is small.
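- The effect of a small population ratio can be made concrete. Since m = nP, an absolute error eps on the ratio corresponds to eps·n samples, which is a relative error of eps/P on m. The sketch below illustrates this; the second function substitutes eps = delta·P into the standard sample-size formula to bound the relative error on m instead. That substitution, the symbol names eps and delta, and the function names are assumptions for illustration and are not confirmed to be the source's own equation.

```python
import math


def relative_error_on_attribute_count(eps: float, p: float) -> float:
    """Relative error on m = n * p implied by an absolute error eps on the ratio."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    return eps / p


def required_n_for_relative_error(delta: float, p: float, z: float = 1.96) -> int:
    """Required n when the error is bounded relative to m (assumed: eps = delta * p)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    eps = delta * p
    return math.ceil(z * z * p * (1.0 - p) / (eps * eps))


if __name__ == "__main__":
    # With eps fixed at 2.5%, the error relative to m grows as p shrinks.
    for p in (0.5, 0.2, 0.05):
        print(p, relative_error_on_attribute_count(0.025, p))
```

- Under these assumptions, holding the relative error on m constant requires a larger n as P decreases, which is the trade-off the description points to.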
- FIG. 4 is a diagram illustrating the relationship among the sampling error (ε/δ), the population ratio P, the required number of population samples n, the number of samples m having the specified attribute value in the specified attribute item, and the number of samples corresponding to the sampling error for the population.
- FIG. 4B also shows, for reference, the error ε with respect to the number of samples of the population alongside the error δ with respect to the number of samples m having the specified attribute value in the specified attribute item.
- The error δ with respect to the number of samples m having the specified attribute value in the specified attribute item is increasing.
- By making settings suited to the situation through the input unit 13 connected to the data processing unit 11, the operator of the data analysis server 10 can have the aggregation processing of multidimensional data executed appropriately for that situation.
- After obtaining the lower-limit number of population samples in step S102, the data processing unit 11 selects one combination of a plurality of attribute items for the multidimensional data to be analyzed (step S103).
- the data processing unit 11 estimates the number of samples of the population from the multidimensional data in the combination of the selected attribute items (step S104).
- the population estimate can be expressed by the following formula.
- FIG. 5 is a diagram illustrating the correct values and the estimation result for the number of population samples when the subset A' of the total set A of attribute items is “age” and “residential place (region)”.
- FIG. 5A shows the attribute values “0-19”, “20-39”, “40-59”, “60-79”, and “80-” in the attribute item “age” and the attribute item “residential area (region)”.
- FIG. 5 (B) shows the result of estimating, by equation (4), the number of population samples for the subsets whose correct values are unknown, which are indicated by dotted shading in FIG. 5 (A).
- In the figure, hatching that rises to the right marks the number of population samples corresponding to the attribute value “Kanto” in the attribute item “residential place” and the attribute value “40-59” in the attribute item “age”, which is the result obtained as the maximum value. That is, the number for “40-59 (years)” in “Kanto” corresponds to 11.825 million (11,825 × 1,000) and exceeds the required population sample number n.
- other attribute items used in the original multidimensional data may be added for estimation.
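- The estimation formula (equation (4)) is not reproduced in this text. One plausible way to estimate the size of a subset defined by several attribute values from known single-attribute counts is to assume that the attribute items are statistically independent. The sketch below does exactly that; the independence assumption, the function name, and the toy counts are illustrative and are not taken from the source.

```python
from math import prod
from typing import Sequence


def estimate_subset_size(total: int, marginal_counts: Sequence[int]) -> float:
    """Estimate the joint count as total * product(count_i / total).

    Assumes the attribute items are independent; each count is the known
    number of samples having one attribute value of one attribute item.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return total * prod(count / total for count in marginal_counts)


if __name__ == "__main__":
    # Toy numbers: 1,000 samples, 300 in one age band, 400 in one region.
    print(estimate_subset_size(1000, [300, 400]))
```

- Adding a further attribute item simply multiplies in one more ratio, so the estimate can only shrink or stay the same as items are added.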
- Having completed the estimation of the population sample number in step S104, the data processing unit 11 determines whether or not the estimated population sample number is smaller than the lower limit population sample number n calculated in step S102, and thereby determines whether or not to execute the aggregation process for the combination of the plurality of attribute items selected at this time (step S105).
- When it is determined that the estimated population sample number is equal to or greater than the lower limit n (NO in step S105), the data processing unit 11 executes the aggregation process on the multidimensional data for the combination of the plurality of attribute items selected at that time, and stores the resulting summary table data in the database 12 (step S106).
- When it is determined in step S105 that the estimated population sample number is smaller than the lower limit population sample number n (YES in step S105), the data processing unit 11 omits the aggregation process for the combination of the plurality of attribute items selected at that time; neither the aggregation process in step S106 nor the storage of the aggregation result in the database 12 is executed.
- The data processing unit 11 then determines whether or not pre-aggregation of the multidimensional data needs to continue, depending on whether or not there are combinations of a plurality of attribute items that have not yet been selected besides the combination selected at that time (step S107).
- When it is determined that unselected combinations remain (YES in step S107), the data processing unit 11 assumes that the pre-aggregation of the multidimensional data needs to be continued and returns to the process from step S103.
- In this way, steps S103 to S106 are repeatedly executed, and for all combinations of a plurality of attribute items of the multidimensional data, the aggregation process is executed sequentially only when the estimated population sample number is determined to be equal to or greater than the lower limit n.
- When the processing for all combinations of the plurality of attribute items of the multidimensional data is completed and it is determined in step S107 that no unselected combinations remain (NO in step S107), the data processing unit 11 completes the process in the pre-aggregation phase of FIG. 2.
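- The loop over steps S103 to S107 can be sketched as follows. This is a minimal illustration, not the source's implementation: the in-memory row format, the function name, and in particular the estimator used for step S104 (the largest cell size that would result if the attribute items were independent) are assumptions.

```python
from collections import Counter
from itertools import combinations
from math import prod


def pre_aggregate(rows, items, n_min):
    """Return {item combination: Counter of value tuples} for combinations worth aggregating.

    rows  : list of dicts mapping attribute item -> attribute value
    items : attribute item names to combine
    n_min : lower-limit number of population samples (result of step S102)
    """
    total = len(rows)
    if total == 0:
        return {}
    # Single-item counts are cheap to obtain and serve as the known quantities.
    marginals = {item: Counter(row[item] for row in rows) for item in items}
    tables = {}
    for k in range(2, len(items) + 1):
        for combo in combinations(items, k):  # step S103: pick one combination
            # Step S104 (assumed estimator): size of the largest cell if the
            # attribute items were independent.
            estimate = total * prod(max(marginals[i].values()) / total for i in combo)
            if estimate < n_min:  # step S105: too few samples, skip aggregation
                continue
            # Step S106: aggregate and keep the summary table.
            tables[combo] = Counter(tuple(row[i] for i in combo) for row in rows)
    return tables  # running out of combinations corresponds to NO in step S107
```

- Skipped combinations never touch the full data set a second time, which is where the saving in pre-aggregation cost would come from under these assumptions.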
- the number of attribute items is 10
- the number of attribute values is 50
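- To give a sense of scale for these two figures, the sketch below counts how many cells a full pre-aggregation would contain. It assumes that each of the 10 attribute items takes 50 attribute values and that every non-empty combination of items is aggregated; the source fragments state only the two numbers, so this reading and the function name are assumptions.

```python
from math import comb


def count_cells(num_items: int, values_per_item: int) -> int:
    """Total cells over all non-empty item combinations.

    A combination of k items has values_per_item ** k cells, and there are
    comb(num_items, k) such combinations.
    """
    return sum(
        comb(num_items, k) * values_per_item ** k
        for k in range(1, num_items + 1)
    )


if __name__ == "__main__":
    print(count_cells(10, 50))
```

- By the binomial theorem this sum equals (values_per_item + 1) ** num_items − 1, which for 10 items of 50 values is on the order of 10^17 cells.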
- FIG. 3 is a flowchart showing the processing contents in the analysis phase, which is mainly executed by the data processing unit 11.
- The data processing unit 11 of the data analysis server 10 receives, for example, a request for multidimensional data analysis via the network NW, and accepts as input a combination of a plurality of attribute items (step S201).
- The data processing unit 11 searches the summary table data stored in the database 12 and determines whether or not there is summary table data including the combination of the plurality of input attribute items (step S202).
- When it is determined that summary table data including the combination of the plurality of input attribute items does not exist (NO in step S202), the data processing unit 11 outputs display data for displaying an error and transmits it, for example via the network NW, to the client-side terminal (not shown) that sent the data analysis request with the combination of attribute items in step S201 (step S206). The process of FIG. 3 is then ended for the time being in preparation for the next data analysis request.
- When the combination of a plurality of attribute items was input in step S201 through the input unit 13 of the data analysis server 10 as a request for multidimensional data analysis, the error is displayed in step S206 on the display unit 14 as processing internal to the data analysis server 10.
- When it is determined in step S202 that summary table data including the combination of the plurality of input attribute items exists (YES in step S202), the data processing unit 11 reads that summary table data from the database 12 (step S203).
- The data processing unit 11 newly creates summary table data limited to the input attribute items based on the summary table data read from the database 12 (step S204), and then displays the created summary table data (step S205).
- The data processing unit 11 outputs the created summary table data and transmits it to the terminal, and the process of FIG. 3 is ended for the time being in preparation for the next data analysis request.
- When the combination of a plurality of attribute items was input in step S201 through the input unit 13 of the data analysis server 10 as a request for multidimensional data analysis, the created summary table data is displayed in step S205 on the display unit 14 as processing internal to the data analysis server 10, after which the process of FIG. 3 is ended for the time being in preparation for the next data analysis request.
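- The analysis phase (steps S201 to S206) can be sketched as a lookup followed by a projection. The storage layout used here, a dict that maps a tuple of attribute items to a Counter of value tuples, and the function name are assumptions for illustration; the source does not specify how summary table data is stored.

```python
from collections import Counter


def analyze(tables, requested):
    """Return a summary table limited to the requested attribute items.

    tables    : dict mapping a tuple of attribute items to a Counter whose
                keys are value tuples in the same item order
    requested : attribute items named in the analysis request (step S201)
    """
    requested = tuple(requested)
    for combo, table in tables.items():
        # Step S202: look for a stored table that covers every requested item.
        if set(requested) <= set(combo):
            # Step S203: the stored table is "read"; find the requested columns.
            positions = [combo.index(item) for item in requested]
            # Step S204: collapse the cells onto the requested items only.
            result = Counter()
            for cell, count in table.items():
                result[tuple(cell[i] for i in positions)] += count
            return result  # step S205: hand the table over for display
    # Step S206: no stored table covers the request.
    raise LookupError(f"no summary table covers {requested}")
```

- Raising an exception stands in for the error display of step S206; a server would catch it and send the error to the requesting terminal or to the display unit.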
- The burden required for the preliminary aggregation processing in multidimensional data analysis can be reduced.
- the present invention is not limited to the above-described embodiment, and can be variously modified at the implementation stage without departing from the gist thereof.
- The operation control process according to the embodiment may be stored in advance as a program in a storage medium (not shown) inside the data analysis server 10 or the data processing unit 11 of the data analysis server 10, and the processor inside the data analysis server 10 or the data processing unit 11 may execute it according to the installed program.
- The embodiments may be combined as appropriate wherever possible, in which case the combined effect can be obtained.
- The embodiments include inventions at various stages, and various inventions can be extracted by appropriately combining the plurality of disclosed constituent requirements. For example, even if some constituent requirements are deleted from all the constituent requirements shown in the embodiment, a configuration from which those constituent requirements are deleted can be extracted as an invention, as long as the problem described in the section on the problem to be solved by the invention can be solved and the effect described in the section on the effect of the invention is obtained.
- 10 ... Data analysis server, 11 ... Data processing unit, 12 ... Database (DB), 13 ... Input unit, 14 ... Display unit, NW ... Network.
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The purpose of the present invention is to reduce the burden of constructing an aggregation database that is effective for multidimensional data analysis. This data analysis device includes: a data processing unit (11) that obtains a required population sample size from an allowable sampling error and from data including attribute values for a plurality of attribute items, calculates a population sample size for each combination of a plurality of attribute items, and aggregates data for the combinations of a plurality of attribute items that satisfy the required population sample size; and a database (12) that stores the aggregation results.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2019/043242 WO2021090357A1 (fr) | 2019-11-05 | 2019-11-05 | Data analysis device, data analysis method, and program |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021090357A1 (fr) | 2021-05-14 |
Family
ID=75849654
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2019/043242 WO2021090357A1 (fr) | Data analysis device, data analysis method, and program | 2019-11-05 | 2019-11-05 |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2021090357A1 (fr) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2007052754A (ja) * | 2005-08-18 | 2007-03-01 | Shizuo Nagashima | Data analysis, aggregation, and display processing control device |
WO2019026134A1 (fr) * | 2017-07-31 | 2019-02-07 | Mitsubishi Electric Corporation | Information processing device and method |
- 2019-11-05: WO application PCT/JP2019/043242, published as WO2021090357A1 (fr), status: active, Application Filing
Non-Patent Citations (1)
Title |
---|
OGASAWARA, ASATO ET AL.: "Efficient search algorithm for local exception partial data using statistical confidence intervals", THE 10TH FORUM ON DATA ENGINEERING AND INFORMATION MANAGEMENT (THE 16TH ANNUAL MEETING OF THE DATABASE SOCIETY OF JAPAN), IEICE TECHNICAL COMMITTEE ON DATA ENGINEERING, THE DATABASE SOCIETY OF JAPAN, IPSJ SIG NOTES, 6 March 2018 (2018-03-06), XP055821901, Retrieved from the Internet <URL:http://db-event.jpn.org/deim2018/data/papers/31.pd> [retrieved on 20200514] * |
Similar Documents
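Publication | Publication Date | Title |
---|---|---|
CN108733639B (zh) | | Configuration parameter adjustment method, apparatus, terminal device, and storage medium |
US11403303B2 (en) | | Method and device for generating ranking model |
CN108280091B (zh) | | Task request execution method and apparatus |
CN109903105B (zh) | | Method and apparatus for completing target commodity attributes |
US20180239496A1 (en) | | Clustering and analysis of commands in user interfaces |
US20170300584A1 (en) | | Customized and Automated Dynamic Infographics |
US20160149948A1 (en) | | Automated Cyber Threat Mitigation Coordinator |
EP3706012A1 (fr) | | Data selection system and method |
CN109857791B (zh) | | Data batch processing method and apparatus |
CN113656315B (zh) | | Data testing method, apparatus, electronic device, and storage medium |
CN112947919A (zh) | | Method and apparatus for constructing a business model and processing business requests |
WO2021090357A1 (fr) | | Data analysis device, data analysis method, and program |
US20230409929A1 (en) | | Methods and apparatuses for training prediction model |
US20180365341A1 (en) | | Three-Dimensional Cad System Device, and Knowledge Management Method Used in Three-Dimensional Cad |
CN109902196B (zh) | | Trademark category recommendation method, apparatus, computer device, and storage medium |
WO2016053382A1 (fr) | | Job posting normalization and deduplication |
JP2010020617A (ja) | | Design case search device and design case search program |
CN111078671A (zh) | | Method, apparatus, device, and medium for modifying data table fields |
CN116303461A (zh) | | Component library creation method, apparatus, electronic device, and storage medium |
CN115329150A (zh) | | Method, apparatus, electronic device, and storage medium for generating a search condition tree |
CN115687717A (zh) | | Grok expression acquisition method, apparatus, device, and computer-readable storage medium |
CN112887426B (zh) | | Information flow pushing method, apparatus, electronic device, and storage medium |
CN115169316A (zh) | | Data processing template generation method, apparatus, electronic device, and storage medium |
CN104778253B (zh) | | Method and apparatus for providing data |
JP6491806B1 (ja) | | Quantitative recipe information creation support server, information processing terminal, quantitative recipe information creation support system, quantitative recipe information creation support method, and quantitative recipe information creation support program |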
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 19952032 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 19952032 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: JP |