WO2021090357A1 - Data analysis device, data analysis method, and program - Google Patents

Data analysis device, data analysis method, and program

Info

Publication number
WO2021090357A1
Authority
WO
WIPO (PCT)
Prior art keywords
attribute
samples
data
subset
items
Prior art date
Application number
PCT/JP2019/043242
Other languages
English (en)
Japanese (ja)
Inventor
雄貴 蔵内
治 松田
瀬下 仁志
Original Assignee
日本電信電話株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電信電話株式会社
Priority to PCT/JP2019/043242
Publication of WO2021090357A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models

Definitions

  • the present invention relates to a data analyzer, a data analysis method and a program.
  • Multidimensional data analysis is one of the common functions used in business intelligence technology.
  • The feature of this multidimensional data analysis is that it is possible to switch between multiple dimensions and analyze from various angles. Therefore, it is indispensable to construct a special database called a MOLAP (Multidimensional OnLine Analytical Processing) cube (hereinafter referred to as “cube”) in which all patterns are aggregated in advance.
  • The present invention has been made in view of the above circumstances, and an object of the present invention is to provide a data analysis device, a data analysis method, and a program capable of reducing the burden required for prior aggregation processing in multidimensional data analysis.
  • One aspect of the present invention is a data analysis device having: a data processing unit that executes a first process of obtaining the required number of subset samples from multidimensional data including a plurality of attribute values for a plurality of attribute items and from the allowable sampling error with respect to the number of subset samples, a second process of calculating, using the numbers of samples that are known, the estimated number of samples of each subset that matches the attribute values of a plurality of attribute items, and a third process of executing aggregation of the multidimensional data for those combinations of attribute values of the plurality of attribute items whose estimated number of samples calculated in the second process satisfies the required number of subset samples obtained in the first process; and a storage unit that stores the result of aggregation by the data processing unit.
  • FIG. 1 is a block diagram showing a configuration of an entire system including a data analysis server according to an embodiment.
  • FIG. 2 is a flowchart showing the processing contents in the pre-aggregation phase according to the embodiment.
  • FIG. 3 is a flowchart showing the processing contents in the analysis phase according to the embodiment.
  • FIG. 4 is a diagram illustrating the relationship between the error with respect to the number of samples of the population according to the same embodiment, the population ratio, and the like.
  • FIG. 5 is a diagram illustrating the correct answer of the sample number of the population of the two attribute items “age” and “residential place (region)” according to the embodiment and the estimation result obtained by the estimation.
  • FIG. 1 is a block diagram showing the configuration of the entire system including the data analysis server 10 according to the embodiment.
  • the data analysis server 10 is connected to the network NW including the Internet.
  • The data analysis server 10 is built around a data processing unit 11 including a processor, and has a database (DB) 12 that stores multidimensional data including a plurality of attribute items and attribute values, summary table data obtained by pre-aggregation, and the like; an input unit 13 for inputting multidimensional data and instruction commands for data analysis; and a display unit 14 for displaying a summary table or the like as a result of data analysis.
  • the specific circuit configuration of the data analysis server 10 as a hardware circuit is the same as that of a general database server, and its illustration and description will be omitted.
  • The input / output of various data and the like in the data analysis server 10 is not limited to the input unit 13 and the display unit 14, and naturally includes the case where input / output is performed by a client terminal (not shown) such as a personal computer or a smartphone via the network NW.
  • the operation of the data analysis server 10 in this embodiment will be described.
  • the operation will be described below assuming that the operation is divided into two phases, a “pre-aggregation phase” and an “analysis phase”.
  • FIG. 2 is a flowchart showing the processing contents mainly centered on the data processing unit 11 including the processor in the pre-aggregation phase.
  • the data processing unit 11 inputs, for example, the multidimensional data to be analyzed and the sampling error indicating an acceptable range via the network NW (step S101).
  • The data processing unit 11 calculates the lower-limit number of samples in the population, which is a subset, based on the input multidimensional data and the sampling error indicating an acceptable range (step S102). If the purpose of the data analysis is to obtain the population ratio, then at a confidence coefficient of 95% the (lower-limit) number of samples n of the population, which is the subset, required to keep the sampling error at ⁇ is given by an equation in terms of the following quantities.
  • n: the required number of population samples
  • P: population ratio
  • ⁇: sampling error
  • Zx: the upper 100x% point of the standard normal distribution
  • the required population sample number n is largest when the population ratio P, that is, the ratio of the attribute values to be analyzed is 50%.
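The dependence of n on P described above can be checked numerically. The sketch below uses the standard sample-size formula for estimating a proportion, n = (Z / ε)² · P · (1 − P), which is presumably what the equation in the original expresses; the equation itself is not reproduced in this text, so treat the formula, the function name `required_sample_size`, and the symbol ε for the sampling error as assumptions.

```python
import math


def required_sample_size(eps: float, p: float = 0.5, z: float = 1.96) -> int:
    """Lower-limit sample count for estimating a proportion p within +/- eps.

    Assumption: the standard formula n = (z / eps)^2 * p * (1 - p), with
    z = 1.96 for a 95% confidence coefficient. Not taken from the patent text.
    """
    if eps <= 0.0 or not 0.0 < p < 1.0:
        raise ValueError("eps must be positive and p must lie strictly between 0 and 1")
    return math.ceil((z / eps) ** 2 * p * (1.0 - p))


if __name__ == "__main__":
    # n peaks at p = 0.5 and shrinks toward either end of the range.
    for p in (0.1, 0.3, 0.5, 0.7, 0.9):
        print(p, required_sample_size(eps=0.025, p=p))
```

With ε = 0.025 and P = 0.5 the formula gives about 1,537, which is close to the figure of 1,536 quoted in the text; the parameter values behind that figure are not stated here, so the match is illustrative only.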
  • Because the sampling error ⁇ is an error with respect to the number of samples of the population, there is a problem that the error becomes relatively large when the population ratio P is small and the number of samples having the attribute value to be analyzed is small.
  • the required population sample size can be expressed by the following equation using the error ⁇ with respect to the sample number m having the specified attribute value in the specified attribute item. That is,
  • When the number of samples n in the population is 1,536 or less, it is not necessary to create a summary table for data analysis.
  • Instead of the sample error ⁇ for the sample number n of the population, the sample error ⁇ for the sample number m having the attribute value specified in the specified attribute item is used.
  • the sampling error becomes relatively large when the population ratio P is small and the number of attribute values to be analyzed is small.
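The relation between the two error measures can be made explicit under a stated assumption. Write ε for the error relative to the population sample number n and δ for the error relative to the sample number m = nP; both symbols are placeholders chosen here, since the original characters did not survive extraction. Assuming that the same absolute error is simply re-expressed relative to m, so that δ = ε / P, the standard sample-size formula for a proportion becomes:

```latex
\begin{aligned}
n &= \left(\frac{Z}{\varepsilon}\right)^{2} P\,(1-P)
   && \text{(error relative to } n\text{)} \\
\varepsilon &= \delta P
   && \text{(assumed: same absolute error, relative to } m = nP\text{)} \\
n &= \left(\frac{Z}{\delta P}\right)^{2} P\,(1-P)
   = \left(\frac{Z}{\delta}\right)^{2} \frac{1-P}{P}
   && \text{(error relative to } m\text{)}
\end{aligned}
```

Under this reading the required n grows as P shrinks, which is consistent with the concern above that rare attribute values suffer a relatively large error. This is a sketch, not the equation from the original publication.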
  • FIG. 4 is a diagram illustrating the relationship among the sample error ( ⁇ / ⁇ ), the population ratio P, the required number of samples n of the population, the number of samples m having the attribute value specified in the specified attribute item, and the number of samples ⁇ corresponding to the sample error for the population.
  • FIG. 4B also shows the error ⁇ with respect to the number of samples of the population as a reference with respect to the error ⁇ with respect to the number of samples m having the specified attribute value in the specified attribute item.
  • the error ⁇ with respect to the number of samples m having the specified attribute value in the specified attribute item is increasing.
  • The operator of the data analysis server 10 can execute aggregation processing of the multidimensional data suited to the situation by making the appropriate settings through the input unit 13 connected to the data processing unit 11.
  • After obtaining the lower-limit number of population samples in step S102, the data processing unit 11 selects one combination of a plurality of attribute items for the multidimensional data to be analyzed (step S103).
  • the data processing unit 11 estimates the number of samples of the population from the multidimensional data in the combination of the selected attribute items (step S104).
  • the population estimate can be expressed by the following formula.
  • FIG. 5 is a diagram illustrating the correct values and the estimated values of the population sample size when the subset A' of the full set A of attribute items consists of “age” and “residential area (region)”.
  • FIG. 5A shows the attribute values “0-19”, “20-39”, “40-59”, “60-79”, and “80-” in the attribute item “age” and the attribute item “residential area (region)”.
  • FIG. 5 (B) shows the result of estimating, using equation (4), the number of samples of the population, which is a subset whose correct value is unknown, shown by the dotted shading in FIG. 5 (A).
  • The number of population samples corresponding to the attribute value “Kanto” in the attribute item “residential area” and the attribute value “40-59” in the attribute item “age”, shown by hatching that rises to the right in the figure, is obtained as the maximum value. That is, the number of people aged “40-59” in “Kanto” corresponds to 11.825 million (11,825 x 1000) and exceeds the required population sample number n.
  • other attribute items used in the original multidimensional data may be added for estimation.
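The estimation step can be illustrated with a small sketch. Equation (4) is not reproduced in this text, so the code below substitutes a common technique, estimating each cell of a cross-tabulation from the per-item totals under an independence assumption (total × product of the marginal shares). The function name `estimate_joint_counts` and the sample figures are hypothetical and are not taken from FIG. 5.

```python
from itertools import product


def estimate_joint_counts(total, marginals):
    """Estimate cross-tabulated counts from per-item counts.

    marginals maps attribute item -> {attribute value: count}.
    Assumption: attribute items are statistically independent, so each cell is
    total * product(count / total). This is not the patent's equation (4).
    """
    items = list(marginals)
    estimates = {}
    for combo in product(*(marginals[item].items() for item in items)):
        estimate = float(total)
        for _value, count in combo:
            estimate *= count / total
        estimates[tuple(value for value, _count in combo)] = estimate
    return estimates


if __name__ == "__main__":
    # Made-up counts for two attribute items.
    marginals = {
        "age": {"0-19": 20, "20-39": 30, "40-59": 50},
        "region": {"Kanto": 60, "Kansai": 40},
    }
    cells = estimate_joint_counts(100, marginals)
    largest = max(cells, key=cells.get)
    print(largest, cells[largest])
```

The largest estimated cell is the natural candidate to compare against the required sample number n, since if even that cell falls short, no cell in the combination can reach it.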
  • Having completed the estimation of the population sample number in step S104, the data processing unit 11 determines whether or not the estimated population sample number is smaller than the lower-limit population sample number n calculated in step S102, and thereby determines whether or not to execute the aggregation process for the combination of the plurality of attribute items selected at this time (step S105).
  • When it is determined that the estimated population sample number is not smaller than the lower-limit population sample number n (NO in step S105), the data processing unit 11 executes the aggregation process for the combination of the plurality of attribute items selected at that time for the multidimensional data, and stores the summary table data resulting from the processing in the database 12 (step S106).
  • When it is determined in step S105 that the estimated population sample number is smaller than the lower-limit population sample number n (YES in step S105), the data processing unit 11 omits the aggregation process for the combination of the plurality of attribute items selected at that time, so neither the aggregation process in step S106 nor the storage of the aggregation result in the database 12 is executed.
  • The data processing unit 11 then determines whether or not pre-aggregation of the multidimensional data needs to be continued, depending on whether or not there are combinations of a plurality of attribute items that have not yet been selected in addition to the combination selected at that time (step S107).
  • When it is determined that combinations that have not yet been selected remain (YES in step S107), the data processing unit 11 determines that the pre-aggregation of the multidimensional data needs to be continued and returns to the process from step S103.
  • In this way, steps S103 to S106 are executed repeatedly, and the aggregation process is executed sequentially, over the combinations of all the plurality of attribute items of the multidimensional data, only for those combinations for which the estimated population sample number is determined to be n, the lower limit of the population sample number, or more.
  • When the processing for all the combinations of the plurality of attribute items of the multidimensional data is completed and it is determined in step S107 that there are no combinations of the plurality of attribute items that have not yet been selected (NO in step S107), the data processing unit 11 completes the process in the pre-aggregation phase of FIG. 2.
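The loop over steps S103 to S107 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation from the publication: records are plain dictionaries, the in-memory dictionary `summary_tables` stands in for database 12, the estimate in step S104 is the largest cell expected under independence of the attribute items, and the names `pre_aggregate` and `max_items` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pre_aggregate(records, items, n_required, max_items=2):
    """Build summary tables only for attribute-item combinations worth keeping."""
    total = len(records)
    # Per-item counts, which are cheap to obtain and treated as known.
    marginals = {item: Counter(r[item] for r in records) for item in items}
    summary_tables = {}
    for size in range(2, max_items + 1):
        for combo in combinations(items, size):           # S103: pick a combination
            estimate = float(total)                        # S104: estimate sample count
            for item in combo:
                estimate *= max(marginals[item].values()) / total
            if estimate < n_required:                      # S105: YES, too few samples
                continue                                   # skip aggregation
            summary_tables[combo] = Counter(               # S106: aggregate and store
                tuple(r[item] for item in combo) for r in records
            )
    return summary_tables                                  # S107: no combinations left


if __name__ == "__main__":
    records = [
        {"age": "a" if j < 6 else "b",
         "region": "x" if j < 6 else "y",
         "sex": "m" if j % 2 == 0 else "f"}
        for j in range(8)
    ]
    tables = pre_aggregate(records, ["age", "region", "sex"], n_required=4)
    print(sorted(tables))
```

Skipping a combination saves a full pass over the data for that combination, which is where the reduction in pre-aggregation burden comes from.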
  • the number of attribute items is 10
  • the number of attribute values is 50
  • FIG. 3 is a flowchart showing the processing contents in the analysis phase, which is mainly executed by the data processing unit 11.
  • The data processing unit 11 of the data analysis server 10 receives, for example, a request for multidimensional data analysis via the network NW, and accepts a combination of a plurality of attribute items as input (step S201).
  • the data processing unit 11 searches the summary table data stored in the database 12 and determines whether or not there is summary table data including a combination of a plurality of input attribute items (step S202).
  • When it is determined that no summary table data including the combination of the input plurality of attribute items exists (NO in step S202), the data processing unit 11 outputs display data for displaying an error and transmits it, for example via the network NW, to the client-side terminal (not shown) that sent the data analysis request containing the combination of the plurality of attribute items input in step S201 (step S206). The process of FIG. 3 is then ended for the time being to prepare for the next data analysis request.
  • When the combination of the plurality of attribute items was input in step S201 as a request for multidimensional data analysis through the input unit 13 of the data analysis server 10, the error is displayed on the display unit 14 in step S206 as internal processing of the data analysis server 10.
  • When it is determined in step S202 that summary table data including the combination of the plurality of input attribute items exists (YES in step S202), the data processing unit 11 reads the summary table data including the combination of the input attribute items from the database 12 (step S203).
  • The data processing unit 11 newly creates summary table data limited to only the input attribute items based on the summary table data read from the database 12 (step S204), and then displays the created summary table data (step S205).
  • The data processing unit 11 outputs the created summary table data, transmits it to the terminal, and ends the process of FIG. 3 for the time being to prepare for the next data analysis request.
  • When the combination of the plurality of attribute items was input in step S201 as a request for multidimensional data analysis through the input unit 13 of the data analysis server 10, the created summary table data is displayed on the display unit 14 in step S205 as internal processing of the data analysis server 10, after which the process of FIG. 3 is ended for the time being to prepare for the next data analysis request.
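The analysis phase in steps S201 to S206 can be sketched in the same style. The code assumes that summary tables are held in a dictionary keyed by the tuple of attribute items they cover, with each table mapping a tuple of attribute values to a count; this layout and the function name `analyze` are hypothetical, and raising `LookupError` stands in for the error display of step S206.

```python
from collections import Counter


def analyze(summary_tables, requested_items):
    """Answer a request from pre-aggregated tables, or fail if none covers it."""
    requested = tuple(requested_items)                      # S201: requested items
    for items, table in summary_tables.items():             # S202: search stored tables
        if set(requested) <= set(items):
            positions = [items.index(i) for i in requested]  # S203: table found, read it
            result = Counter()
            for key, count in table.items():                 # S204: keep requested items only
                result[tuple(key[p] for p in positions)] += count
            return result                                    # S205: hand back for display
    raise LookupError(f"no summary table covers {requested}")  # S206: error


if __name__ == "__main__":
    tables = {
        ("age", "region"): Counter({("a", "x"): 6, ("a", "y"): 1, ("b", "y"): 2}),
    }
    print(analyze(tables, ["age"]))
    print(analyze(tables, ["region", "age"]))
```

A request for an item combination that was skipped during pre-aggregation ends in the error branch, which is the trade-off accepted in exchange for the lighter pre-aggregation.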
  • The burden required for the preliminary aggregation processing in multidimensional data analysis can be reduced.
  • the present invention is not limited to the above-described embodiment, and can be variously modified at the implementation stage without departing from the gist thereof.
  • The operation control process according to the embodiment may be stored in advance as a program in a storage medium (not shown) inside the data analysis server 10 or the data processing unit 11 of the data analysis server 10, and may be executed by the data analysis server 10, or by the processor inside the data processing unit 11 of the data analysis server 10, according to the installed program.
  • The embodiments may be carried out in combination as appropriate wherever possible, in which case the combined effect can be obtained.
  • The embodiments include inventions at various stages, and various inventions can be extracted by an appropriate combination of the plurality of disclosed constituent requirements. For example, even if some constituent requirements are deleted from all the constituent requirements shown in the embodiment, if the problem described in the column of the problem to be solved by the invention can be solved and the effect described in the column of effect of the invention is obtained, a configuration from which these constituent requirements are deleted can be extracted as an invention.
  • 10 ... Data analysis server, 11 ... Data processing unit, 12 ... Database (DB), 13 ... Input unit, 14 ... Display unit, NW ... Network.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The purpose of the present invention is to reduce the burden of constructing an aggregation database that is effective for multidimensional data analysis. This data analysis device comprises: a data processing unit (11) that obtains a required population sample size from an allowable sampling error and data including attribute values for a plurality of attribute items, calculates a population sample size for each combination of a plurality of attribute items, and aggregates the data of the combinations of a plurality of attribute items that satisfy the required population sample size; and a database (12) that stores the results of the aggregation.
PCT/JP2019/043242 2019-11-05 2019-11-05 Dispositif d'analyse de données, procédé d'analyse de données et programme WO2021090357A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2019/043242 WO2021090357A1 (fr) 2019-11-05 2019-11-05 Dispositif d'analyse de données, procédé d'analyse de données et programme

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2019/043242 WO2021090357A1 (fr) 2019-11-05 2019-11-05 Dispositif d'analyse de données, procédé d'analyse de données et programme

Publications (1)

Publication Number Publication Date
WO2021090357A1 true WO2021090357A1 (fr) 2021-05-14

Family

ID=75849654

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2019/043242 WO2021090357A1 (fr) 2019-11-05 2019-11-05 Dispositif d'analyse de données, procédé d'analyse de données et programme

Country Status (1)

Country Link
WO (1) WO2021090357A1 (fr)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007052754A (ja) * 2005-08-18 2007-03-01 Shizuo Nagashima データ分析集計表示処理制御装置
WO2019026134A1 (fr) * 2017-07-31 2019-02-07 三菱電機株式会社 Dispositif et procédé de traitement d'informations

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
OGASAWARA, ASATO ET AL.: "Efficient search algorithm for local exception partial data using statistical confidence intervals", THE 10TH FORUM ON DATA ENGINEERING AND INFORMATION MANAGEMENT (THE 16TH ANNUAL MEETING OF THE DATABASE SOCIETY OF JAPAN), IEICE TECHNICAL COMMITTEE ON DATA ENGINEERING, THE DATABASE SOCIETY OF JAPAN, IPSJ SIG NOTES, 6 March 2018 (2018-03-06), XP055821901, Retrieved from the Internet <URL:http://db-event.jpn.org/deim2018/data/papers/31.pd> [retrieved on 20200514] *

Similar Documents

Publication Publication Date Title
CN108733639B (zh) 一种配置参数调整方法、装置、终端设备及存储介质
US11403303B2 (en) Method and device for generating ranking model
CN108280091B (zh) 一种任务请求执行方法和装置
CN109903105B (zh) 一种完善目标商品属性的方法和装置
US20180239496A1 (en) Clustering and analysis of commands in user interfaces
US20170300584A1 (en) Customized and Automated Dynamic Infographics
US20160149948A1 (en) Automated Cyber Threat Mitigation Coordinator
EP3706012A1 (fr) Système et procédé de sélection de données
CN109857791B (zh) 一种数据批量处理方法与装置
CN113656315B (zh) 数据测试方法、装置、电子设备和存储介质
CN112947919A (zh) 构建业务模型和处理业务请求的方法和装置
WO2021090357A1 (fr) Dispositif d'analyse de données, procédé d'analyse de données et programme
US20230409929A1 (en) Methods and apparatuses for training prediction model
US20180365341A1 (en) Three-Dimensional Cad System Device, and Knowledge Management Method Used in Three-Dimensional Cad
CN109902196B (zh) 一种商标类别推荐方法、装置、计算机设备及存储介质
WO2016053382A1 (fr) Normalisation et déduplication d'annonce d'emploi
JP2010020617A (ja) 設計事例検索装置,設計事例検索プログラム
CN111078671A (zh) 数据表字段的修改方法、装置、设备和介质
CN116303461A (zh) 一种部件库创建方法、装置、电子设备和存储介质
CN115329150A (zh) 生成搜索条件树的方法、装置、电子设备及存储介质
CN115687717A (zh) Grok表达式获取方法、装置、设备及计算机可读存储介质
CN112887426B (zh) 信息流的推送方法、装置、电子设备以及存储介质
CN115169316A (zh) 数据处理模板生成方法、装置、电子设备及存储介质
CN104778253B (zh) 一种提供数据的方法和装置
JP6491806B1 (ja) 定量的レシピ情報作成支援サーバ、情報処理端末、定量的レシピ情報作成支援システム、定量的レシピ情報作成支援方法及び定量的レシピ情報作成支援プログラム

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19952032

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19952032

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP