CN110471954B

CN110471954B - Data mining method and device

Info

Publication number: CN110471954B
Application number: CN201910693030.8A
Authority: CN
Inventors: 刘译璟; 苏萌; 代其锋; 肖洋; 徐林杰; 刘钰
Original assignee: Beijing Percent Technology Group Co ltd
Current assignee: Beijing Percent Technology Group Co ltd
Priority date: 2019-07-29
Filing date: 2019-07-29
Publication date: 2022-05-03
Anticipated expiration: 2039-07-29
Also published as: CN110471954A

Abstract

The application discloses a data mining method and device. The method comprises the following steps: acquiring metadata of a data table and structure data of a chart generated in advance from data stored in the data table, wherein the structure data describes the structure of the chart; generating at least one target metadata according to the metadata of the data table and the structure data of the chart; and screening a corresponding sub-data table out of the data table by means of each target metadata, for use in data mining. Because the structure data of the chart and the metadata of the data table are combined when the target metadata is generated, the sub-data tables screened by the target metadata reflect the mined information in finer detail and more comprehensively than the prior-art approach of analyzing and mining the data table directly, and the problems in the prior art can therefore be solved.

Description

Data mining method and device

Technical Field

The present application relates to the field of computer technologies, and in particular, to a data mining method and apparatus.

Background

A large amount of data is generated as the internet develops, and this data can be stored under corresponding fields of a data table. For example, transaction data generated by a user while shopping online can be stored as a record under the corresponding fields of the data table.

In practical applications, information can be mined and applied by analyzing the data stored in relevant fields of the data table; for example, a target market for a product can be identified by analyzing the user data in the region field. However, this data mining approach is often inefficient and may miss a large amount of information.

Disclosure of Invention

The embodiment of the application provides a data mining method and device, which can be used for solving the problems in the prior art.

The embodiment of the application provides a data mining method, which comprises the following steps:

acquiring metadata of a data table and structural data of a chart generated in advance according to data stored in the data table, wherein the structural data is used for describing the structure of the chart;

generating at least one target metadata according to the metadata of the data table and the structure data of the chart;

and screening out corresponding sub data tables from the data tables respectively through each target metadata for data mining.

An embodiment of the present application further provides a data mining apparatus, which comprises an acquisition unit, a generation unit and a screening unit, wherein:

an acquisition unit that acquires metadata of a data table and structure data of a chart generated in advance from data stored in the data table, wherein the structure data is used to describe a structure of the chart;

the generating unit is used for generating at least one target metadata according to the metadata of the data table and the structure data of the chart;

and the screening unit is used for screening the corresponding sub data tables from the data tables respectively through each target metadata for data mining.

The embodiment of the application adopts at least one technical scheme which can achieve the following beneficial effects:

With the data mining method provided by the embodiment of the application, after the metadata of the data table and the structure data of the chart generated in advance from data stored in the data table are obtained, at least one target metadata is generated according to the metadata of the data table and the structure data of the chart, and a corresponding sub-data table is screened out of the data table by means of each target metadata, for use in data mining. Because the structure data of the chart and the metadata of the data table are combined when the target metadata is generated, the sub-data tables screened by the target metadata reflect the mined information in finer detail and more comprehensively than the prior-art approach of analyzing and mining the data table directly, and the problems in the prior art can therefore be solved.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:

fig. 1 is a schematic flowchart of a specific implementation of a data mining method provided in embodiment 1 of the present application;

fig. 2 is a schematic diagram of a specific scenario of a data mining method provided in embodiment 1 of the present application;

fig. 3 is a schematic structural diagram of a data mining device according to embodiment 2 of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The technical solutions provided by the embodiments of the present application are described in detail below with reference to the accompanying drawings.

Example 1

As mentioned above, the prior art directly analyzes the data stored in the relevant fields of the data table in order to mine information for application. For example, the user data in the region field (which stores the region of each user) is analyzed to count the number of users in each region, so that the target market of a product can be identified. However, such a data mining method is generally limited by the personal experience of the data mining personnel and cannot fully mine the information in the data, so a lot of information is missed and the mining effect is poor.

Based on this, embodiment 1 of the present application provides a data mining method that can be used to solve the above-described problems. The specific flow diagram of the method is shown in fig. 1, and the method comprises the following steps:

step S11: metadata of a data table and structure data of a pre-generated chart are acquired.

The metadata of the data table may be used to describe the structure of the data table, including the field name and field type of each field in the data table. The field types generally fall into three categories: numeric fields, time fields, and text fields. A numeric field stores specific numeric values (e.g., 20), a time field stores times (e.g., May 20, 2019), and a text field stores text (e.g., Hubei).

For example, the field name is "province", the field is used for storing the region data of the user, and the type of the stored data is text, such as Hubei, Zhejiang, and the like.
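The table metadata described above can be pictured as a small data structure. The following is only a minimal sketch: the class names `FieldType`, `FieldMeta` and `TableMeta`, the table name `orders`, and every field other than "province" are assumptions made for illustration and do not come from the application.

```python
from dataclasses import dataclass
from enum import Enum


class FieldType(Enum):
    """The three field types named in the description."""
    NUMERIC = "numeric"
    TIME = "time"
    TEXT = "text"


@dataclass(frozen=True)
class FieldMeta:
    name: str
    field_type: FieldType


@dataclass(frozen=True)
class TableMeta:
    table: str
    fields: tuple[FieldMeta, ...]

    def fields_of_type(self, field_type: FieldType) -> list[str]:
        """Return the names of all fields of the given type, in table order."""
        return [f.name for f in self.fields if f.field_type is field_type]


# Hypothetical metadata for an online-shopping table.
orders_meta = TableMeta(
    table="orders",
    fields=(
        FieldMeta("province", FieldType.TEXT),
        FieldMeta("gender", FieldType.TEXT),
        FieldMeta("order_date", FieldType.TIME),
        FieldMeta("user_count", FieldType.NUMERIC),
    ),
)

text_fields = orders_meta.fields_of_type(FieldType.TEXT)
```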

The structure data of the chart can be used to describe the structure of the chart, including dimensions of the chart (such as abscissa), metrics (such as ordinate), aggregation functions acting on the metrics field, and the like.

In practical applications, a chart may be generated in advance according to data stored in the data table, for example, the data table includes a region field such as "province" and a user number field, the region field such as "province" may be used as an abscissa, and the user number field may be used as an ordinate, so as to generate a chart of a histogram, where the height of a column in the chart of the histogram reflects the number of users in each region.
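As an illustration of the histogram example, the structure data of such a chart could be recorded as a dimension, a metric and an aggregation function. This is a sketch under assumptions: the class `ChartStructure`, the helper `bar_heights`, the field name `user_count` and the sample rows are hypothetical, and only the SUM aggregation is handled.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ChartStructure:
    chart_type: str   # e.g. "bar" for a histogram-style chart
    dimension: str    # field used as the abscissa
    metric: str       # field used as the ordinate
    aggregation: str  # aggregation function applied to the metric field

    def used_fields(self) -> set[str]:
        """Fields of the data table that the chart was generated from."""
        return {self.dimension, self.metric}


def bar_heights(rows: list[dict], chart: ChartStructure) -> dict[str, int]:
    """Compute the column heights of the chart (SUM aggregation only)."""
    if chart.aggregation != "SUM":
        raise ValueError("this sketch only supports SUM")
    totals: dict[str, int] = defaultdict(int)
    for row in rows:
        totals[row[chart.dimension]] += row[chart.metric]
    return dict(totals)


chart = ChartStructure("bar", dimension="province", metric="user_count", aggregation="SUM")
rows = [
    {"province": "Hubei", "user_count": 3},
    {"province": "Zhejiang", "user_count": 5},
    {"province": "Hubei", "user_count": 2},
]
heights = bar_heights(rows, chart)
```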

In addition, the metadata of the data table and the structure data of the chart may be obtained in real time, or may be obtained periodically (for example, the obtaining period is 10 minutes or other), or may be obtained by specifying a certain time point (for example, 9 am or other time points every day), or may be obtained when it is monitored that a trigger condition is satisfied, where the trigger condition may be that the user clicks or moves a mouse to a trigger button, or the like.

Step S12: at least one target metadata is generated according to the metadata of the data table and the structure data of the chart.

The target metadata may be generated in multiple ways, several of which are described below:

In the first mode, new filter conditions are added on the basis of the structure data of the chart in combination with the metadata of the data table. For example, based on the structure data of the chart, a target field and a target enumeration value in the target field are determined from the data table as the filter condition, and the target metadata is generated by combining the structure data of the chart, the target field and the target enumeration value.

Since the chart is generated from the data in the data table, when determining the target field, the target field is selected from fields other than the field used for generating the chart. For example, one field may be arbitrarily selected as the target field from fields other than the fields used for generating the chart, or a field may be selected as the target field from fields other than the fields used for generating the chart according to actual needs.

In practical applications, the target field may be a text-type field, for example, the target field may be "province", and the "province" field and some enumerated values thereof (e.g., Hubei) may be used as a filtering condition, and combined with the structure data of the chart to generate the target metadata.

In this generation mode, the structure data of the chart is largely retained and a new filter condition is simply added on top of it, so the target metadata is generated efficiently and the dimension and metric of the chart are preserved.
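The first mode can be sketched as follows. The sketch assumes a chart whose dimension is "province" and picks "gender" as the target field; the class `TargetMetadata`, the function `filter_targets` and the enumeration values are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TargetMetadata:
    dimension: str
    metric: str
    aggregation: str
    filters: tuple[tuple[str, str], ...] = ()  # (field, enumeration value) pairs


def filter_targets(
    chart: TargetMetadata,
    text_fields: list[str],
    enum_values: dict[str, list[str]],
) -> list[TargetMetadata]:
    """Create one target metadata per (target field, enumeration value) pair.

    Target fields are taken only from fields that the chart does not already use.
    """
    used = {chart.dimension, chart.metric}
    targets = []
    for field in text_fields:
        if field in used:
            continue  # skip fields the chart was generated from
        for value in enum_values.get(field, []):
            # Keep dimension, metric and aggregation; only add a filter.
            targets.append(replace(chart, filters=chart.filters + ((field, value),)))
    return targets


chart = TargetMetadata(dimension="province", metric="user_count", aggregation="SUM")
targets = filter_targets(
    chart,
    text_fields=["province", "gender"],
    enum_values={"province": ["Hubei"], "gender": ["female", "male"]},
)
```

Each resulting target keeps the chart's dimension and metric unchanged, which matches the efficiency argument made above.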

In the second mode, on the basis of the structure data of the chart and in combination with the metadata of the data table, the field corresponding to the dimension of the chart (such as the abscissa), that is, the field in the data table to which the dimension corresponds, is expanded by drill-down analysis to determine a sub-field, and the target metadata is then generated based on the sub-field. In this process, the field corresponding to the dimension of the chart may be fixed at a specific value, and a subdivided field (sub-field) is then added under that value in combination with the metadata of the data table. This mode keeps the metric of the chart (such as the ordinate), drills down only into the dimension of the chart under the specific value, and uses the metadata of the data table to obtain fields and data of finer dimensional granularity.

For example, if the dimension of the chart is "province", drill-down analysis is performed on the metadata of the data table for the value "Hubei", which yields the finer-grained sub-field "city" and data such as "Wuhan" and "Huangshi", from which the target metadata is generated.
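A minimal sketch of the drill-down mode is given here. It assumes that a mapping from each dimension field to its finer-grained sub-field (here "province" to "city") is available; the application does not say how this hierarchy is obtained, so the `hierarchy` argument, the class `DrillDownTarget` and the function `drill_down_targets` are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrillDownTarget:
    dimension: str      # finer-grained sub-field, e.g. "city"
    metric: str         # metric of the original chart, kept unchanged
    aggregation: str
    parent_field: str   # original chart dimension, e.g. "province"
    parent_value: str   # specific value being drilled into, e.g. "Hubei"


def drill_down_targets(
    dimension: str,
    metric: str,
    aggregation: str,
    dimension_values: list[str],
    hierarchy: dict[str, str],
    table_fields: set[str],
) -> list[DrillDownTarget]:
    """Create one target per value of the chart dimension, using its sub-field."""
    sub_field = hierarchy.get(dimension)
    if sub_field is None or sub_field not in table_fields:
        return []  # nothing to drill into
    return [
        DrillDownTarget(sub_field, metric, aggregation, dimension, value)
        for value in dimension_values
    ]


targets = drill_down_targets(
    dimension="province",
    metric="user_count",
    aggregation="SUM",
    dimension_values=["Hubei"],
    hierarchy={"province": "city"},
    table_fields={"province", "city", "user_count"},
)
```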

Step S13: and screening out corresponding sub data tables from the data tables through each target metadata for data mining.

After the target metadata is determined, a query statement to be applied to the data table may be generated from the target metadata, so that sub-data tables usable for data mining are screened out of the data table. A query statement may be generated from the metadata in various ways, for example by running a script that converts the metadata into the query statement.

In addition, in the process of screening out the corresponding sub data tables from the data tables through the target metadata, the target metadata can be used as the current target metadata, and the corresponding query statement is generated through the current target metadata, so as to screen out the corresponding sub data tables from the data tables.
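One possible way to turn a target metadata into a query statement is sketched below using SQL and an in-memory SQLite database. The application does not name a query language or database, so SQL, the function `build_query`, the table `orders` and the sample rows are assumptions. Field names are checked against the table metadata and filter values are passed as parameters; the table name is assumed to come from trusted metadata.

```python
import sqlite3

ALLOWED_AGGREGATIONS = {"SUM", "COUNT", "AVG", "MIN", "MAX"}


def build_query(table, dimension, metric, aggregation, filters, known_fields):
    """Build a parameterized SQL statement from one target metadata."""
    agg = aggregation.upper()
    if agg not in ALLOWED_AGGREGATIONS:
        raise ValueError(f"unsupported aggregation: {aggregation}")
    for name in (dimension, metric, *[field for field, _ in filters]):
        if name not in known_fields:
            raise ValueError(f"unknown field: {name}")
    sql = f'SELECT "{dimension}", {agg}("{metric}") AS value FROM "{table}"'
    if filters:
        sql += " WHERE " + " AND ".join(f'"{field}" = ?' for field, _ in filters)
    sql += f' GROUP BY "{dimension}" ORDER BY "{dimension}"'
    return sql, [value for _, value in filters]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (province TEXT, city TEXT, user_count INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("Hubei", "Wuhan", 30), ("Hubei", "Huangshi", 5),
     ("Zhejiang", "Hangzhou", 20), ("Hubei", "Wuhan", 10)],
)

# Target metadata from a drill-down into province = "Hubei".
sql, params = build_query(
    "orders", "city", "user_count", "SUM",
    filters=[("province", "Hubei")],
    known_fields={"province", "city", "user_count"},
)
sub_table = conn.execute(sql, params).fetchall()
```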

By using the data mining method provided in embodiment 1 of the present application, after metadata of a data table and structure data of a chart generated in advance according to data stored in the data table are obtained, at least one target metadata is generated according to the metadata of the data table and the structure data of the chart, and a corresponding sub-data table is screened out of the data table by means of each target metadata, for use in data mining. Because the structure data of the chart and the metadata of the data table are combined when the target metadata is generated, the sub-data tables screened by the target metadata reflect the mined information in finer detail and more comprehensively than the prior-art approach of analyzing and mining the data table directly, so that fewer pieces of information are missed in data mining, the data mining effect is improved, and the problems in the prior art can be solved.

For example, the target metadata generated in the first mode of step S12 combines the structure data of the chart, the target field and the target enumeration value, and can therefore provide mining information from aspects that the prior art does not cover. The target metadata generated in the second mode of step S12 performs drill-down analysis on the field corresponding to the dimension of the chart, and can therefore provide mining information in more detail than the prior art.

In practical applications, the method may further include step S14: performing data mining on each sub-data table. Specifically, the sub-data tables meeting preset conditions may be screened from all the sub-data tables, and data mining may be performed on the sub-data tables meeting the preset conditions.

A sub-data table meeting the preset conditions is a sub-data table that meets any one of the following preset conditions 1 to 4: the ratio of the maximum value in the sub-data table to the other values in the sub-data table is greater than a preset threshold (preset condition 1); the ratio of the maximum value in the sub-data table to the sum of the values in the sub-data table is greater than a second preset threshold (preset condition 2); at least one of the correlation coefficients between every two fields of the sub-data table is greater than a third preset threshold (preset condition 3); and a nonparametric test determines that the magnitude of the rise or fall of the values in the sub-data table exceeds a fourth preset threshold (preset condition 4).

To determine whether a sub-data table meets preset condition 1, the values in the sub-data table may be sorted by size, and it is then determined whether the ratio of the maximum value to the second-largest value is greater than the preset threshold; if so, preset condition 1 is met. Because only the ratio of the maximum value to the second-largest value is compared with the preset threshold, this check is efficient. The preset threshold may be set according to actual conditions. When the sub-data table meets preset condition 1, this indicates that the sub-data table contains an abnormal value.
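The check for preset condition 1 can be sketched in a few lines. The function name `has_outlier_max` and the threshold value are assumptions; the sketch also assumes positive values and simply reports no outlier when the second-largest value is not positive, a case the application does not address.

```python
def has_outlier_max(values: list[float], threshold: float) -> bool:
    """Preset condition 1: max / second-largest value exceeds the threshold."""
    if len(values) < 2:
        return False
    largest, second = sorted(values, reverse=True)[:2]
    if second <= 0:
        return False  # assumption: ratio is not meaningful for non-positive values
    return largest / second > threshold


outlier_found = has_outlier_max([120, 10, 8], threshold=3.0)
no_outlier = has_outlier_max([12, 10, 8], threshold=3.0)
```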

To determine whether a sub-data table meets preset condition 2, the values in the sub-data table are summed, the ratio of the maximum value to the sum is calculated, and it is determined whether the ratio is greater than the second preset threshold. The second preset threshold may be set according to actual conditions. When the sub-data table meets preset condition 2, this indicates that the proportion accounted for by the maximum value in the sub-data table is abnormal.
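A corresponding sketch for preset condition 2 follows. The function name `has_dominant_share` and the threshold of 0.5 are assumptions, as is the choice to return no finding when the sum is not positive.

```python
def has_dominant_share(values: list[float], threshold: float) -> bool:
    """Preset condition 2: max / sum of all values exceeds the threshold."""
    if not values:
        return False
    total = sum(values)
    if total <= 0:
        return False  # assumption: a share is not meaningful without a positive sum
    return max(values) / total > threshold


dominant = has_dominant_share([70, 20, 10], threshold=0.5)   # share is 0.7
balanced = has_dominant_share([40, 30, 30], threshold=0.5)   # share is 0.4
```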

When determining whether the sub-data table meets the preset condition 3, it may be determined whether a correlation coefficient between every two fields in the sub-data table is greater than a third preset threshold, and if the correlation coefficient between two fields in the sub-data table is greater than the third preset threshold (the size may be set according to actual needs), it is indicated that the fields of the sub-data table have strong correlation.
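The application does not say which correlation coefficient is used. The sketch below assumes the Pearson coefficient and compares the signed coefficient with the threshold, as the text does; an implementation that also wants strong negative correlations might compare the absolute value instead. The function names and sample columns are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equally long columns."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no defined correlation
    return cov / sqrt(var_x * var_y)


def strongly_correlated_pairs(columns: dict[str, list[float]], threshold: float):
    """Preset condition 3: list field pairs whose coefficient exceeds the threshold."""
    return [
        (a, b)
        for a, b in combinations(columns, 2)
        if pearson(columns[a], columns[b]) > threshold
    ]


columns = {
    "orders": [1, 2, 3, 4],
    "revenue": [2, 4, 6, 8.5],
    "returns": [5, 1, 4, 2],
}
pairs = strongly_correlated_pairs(columns, threshold=0.9)
```

The sub-data table meets preset condition 3 when the returned list is not empty.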

Preset condition 4 generally applies to a time-series sub-data table (one that includes a time-type field). A nonparametric test, such as the Mann-Kendall (MK) test, is performed on the sub-data table to determine whether the magnitude of the rise or fall of its values exceeds a fourth preset threshold; if so, preset condition 4 is met. The fourth preset threshold may be set according to actual conditions. When the sub-data table meets preset condition 4, the overall trend of the data in the sub-data table shows a significant rise or fall.

It should be noted that data mining may be performed on the sub-data tables meeting the preset conditions in a variety of ways. For example, a corresponding chart may be generated from each such sub-data table and displayed. Because a sub-data table meeting the preset conditions exhibits an abnormal value, an abnormal proportion, a correlation, or an overall trend, a chart generated from it reflects this information intuitively.
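A simplified Mann-Kendall trend check is sketched below. It is an illustration under assumptions: the variance formula omits the correction for tied values, and the fourth preset threshold is interpreted as a bound on the standardized test statistic Z (1.96 is a conventional choice, not a value from the application).

```python
from math import sqrt


def mann_kendall_z(series: list[float]) -> float:
    """Standardized Mann-Kendall statistic (no tie correction)."""
    n = len(series)
    if n < 3:
        return 0.0
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = series[j] - series[i]
            s += (diff > 0) - (diff < 0)  # +1 for a rise, -1 for a fall, 0 for a tie
    var_s = n * (n - 1) * (2 * n + 5) / 18
    if s > 0:
        return (s - 1) / sqrt(var_s)
    if s < 0:
        return (s + 1) / sqrt(var_s)
    return 0.0


def trend_direction(series: list[float], threshold: float = 1.96):
    """Preset condition 4: return 'rising', 'falling', or None if no clear trend."""
    z = mann_kendall_z(series)
    if z > threshold:
        return "rising"
    if z < -threshold:
        return "falling"
    return None


rising = trend_direction([1, 2, 3, 4, 5, 6, 7, 8])
falling = trend_direction([8, 7, 6, 5, 4, 3, 2, 1])
no_trend = trend_direction([5, 1, 4, 2, 6, 3])
```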

Of course, the corresponding natural language description information may also be generated according to the type of the preset condition satisfied by the sub data table satisfying the preset condition, and the generated natural language description information may be displayed. For example, if a certain sub-data table satisfies the preset condition 1, the natural language description information may be generated according to the template corresponding to the preset condition 1 and displayed. For the sub data tables of the types meeting other preset conditions, the natural language description information can be generated according to the corresponding templates so as to be displayed.
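Template-based description generation might look like the following sketch. The template wording, the placeholder names and the function `describe` are all hypothetical; the application only states that each preset condition has a corresponding template.

```python
# One hypothetical template per preset condition (keys 1 to 4).
TEMPLATES = {
    1: "{top_label} stands out in {subject}: its value {top_value} is more than "
       "{threshold} times the next largest value.",
    2: "{top_label} accounts for {share:.0%} of the total in {subject}.",
    3: "In {subject}, {field_a} and {field_b} are strongly correlated "
       "(coefficient {coefficient:.2f}).",
    4: "The values in {subject} show a clear {direction} trend over time.",
}


def describe(condition: int, **facts) -> str:
    """Fill the template of the satisfied preset condition with the mined facts."""
    return TEMPLATES[condition].format(**facts)


message = describe(2, top_label="Hubei", share=0.8, subject="female users by province")
```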

Of course, the corresponding chart generated by the sub-data table meeting the preset condition and the generated natural language description information can be displayed at the same time, so that the information required by the user can be better provided in a mode of combining pictures and texts.

To facilitate understanding of the data mining method provided in this embodiment 1, the method may be further described with reference to the specific scenario in fig. 2, where the steps included in fig. 2 are as follows:

step S21: metadata of the data table and structure data of the chart are obtained.

Step S22: based on the above metadata and structural data, at least one target metadata is generated by adding a filter condition or drill-down analysis.

Step S23: a corresponding retrieval statement is generated from each generated target metadata.

Step S24: the data table is queried with each generated retrieval statement, and a corresponding sub-data table is screened out of the data table for each statement.

Step S25: each sub-data table is preprocessed, for example by data cleaning.

Step S26: it is determined whether each preprocessed sub-data table contains an abnormal condition, such as an abnormal value, an abnormal proportion, a correlation, or a rising or falling overall trend.

Step S27: the abnormal conditions are displayed as charts, together with corresponding natural language description information.
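The flow of steps S21 to S27 can be illustrated end to end with a compact sketch. It makes several simplifying assumptions: target metadata is generated only by adding a filter on one field, only the proportion check (preset condition 2) is applied, chart display is left out, and the field and table names are trusted values taken from metadata. The function `run_pipeline`, the table `orders` and the sample data are hypothetical.

```python
import sqlite3


def run_pipeline(conn, table, dimension, metric, filter_field, share_threshold):
    """Return natural language findings for sub-tables with a dominant value."""
    # S21-S22: one target metadata per enumeration value of the filter field.
    values = [row[0] for row in conn.execute(
        f'SELECT DISTINCT "{filter_field}" FROM "{table}" ORDER BY 1')]
    findings = []
    for value in values:
        # S23-S24: build the retrieval statement and screen out the sub-data table.
        sql = (f'SELECT "{dimension}", SUM("{metric}") FROM "{table}" '
               f'WHERE "{filter_field}" = ? GROUP BY "{dimension}" ORDER BY "{dimension}"')
        rows = conn.execute(sql, (value,)).fetchall()
        # S25: minimal cleaning, dropping rows with a missing label or total.
        rows = [(label, total) for label, total in rows
                if label is not None and total is not None]
        # S26: check whether the largest value has an abnormal proportion.
        total = sum(t for _, t in rows)
        if not rows or total <= 0:
            continue
        top_label, top_value = max(rows, key=lambda r: r[1])
        share = top_value / total
        if share > share_threshold:
            # S27: describe the abnormal condition in natural language.
            findings.append(
                f"For {filter_field} = {value}, {top_label} accounts for "
                f"{share:.0%} of {metric}.")
    return findings


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (province TEXT, gender TEXT, user_count INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    ("Hubei", "female", 80), ("Zhejiang", "female", 10), ("Jiangsu", "female", 10),
    ("Hubei", "male", 30), ("Zhejiang", "male", 35), ("Jiangsu", "male", 35),
])
findings = run_pipeline(conn, "orders", "province", "user_count", "gender", 0.6)
```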

Example 2

Based on the same inventive concept as embodiment 1 of the present application, embodiment 2 of the present application provides a data mining apparatus that can also be used to solve the problems in the prior art. For any points that are unclear in embodiment 2, reference may be made to embodiment 1. As shown in fig. 3, the apparatus 20 includes: an acquisition unit 201, a generation unit 202, and a screening unit 203, wherein:

an acquisition unit 201 that acquires metadata of a data table and structure data of a chart generated in advance from data stored in the data table, the structure data being used to describe a structure of the chart;

a generating unit 202 for generating at least one target metadata according to the metadata of the data table and the structure data of the chart;

and the screening unit 203 screens out corresponding sub data tables from the data tables through each target metadata for data mining.

Since the apparatus 20 adopts the same inventive concept as the data mining in embodiment 1, in the case that embodiment 1 can solve the technical problem, the apparatus 20 in embodiment 2 can also solve the technical problem, and thus the description thereof is omitted.

In practice, the apparatus 20 may also include a mining unit. The mining unit screens the sub-data tables meeting preset conditions from all the sub-data tables and performs data mining on the sub-data tables meeting the preset conditions.

A sub-data table meeting the preset conditions is a sub-data table that meets any one of the following preset conditions: the ratio of the maximum value in the sub-data table to the other values in the sub-data table is greater than a preset threshold; the ratio of the maximum value in the sub-data table to the sum of the values in the sub-data table is greater than a second preset threshold; at least one of the correlation coefficients between every two fields of the sub-data table is greater than a third preset threshold; and a nonparametric test determines that the magnitude of the rise or fall of the values in the sub-data table exceeds a fourth preset threshold.

The mining unit may specifically include a mining subunit. The mining subunit generates a corresponding chart based on the sub-data table meeting the preset conditions and displays the generated chart; and it generates corresponding natural language description information according to the type of the preset condition met by that sub-data table and displays the generated natural language description information.

In practical applications, the generating unit 202 may specifically include a generating subunit. The generation subunit determines a sub-field of a field corresponding to a dimension of the data table by performing drill-down analysis on the field corresponding to the dimension in combination with the metadata of the data table, and generates target metadata based on the sub-field.

The generating unit 202 may further specifically include a second generating subunit. The second generation subunit determines a target field and a target enumeration value in the target field according to metadata of the data table, wherein the target field is selected from fields other than a field in the data table used for generating the chart, and then generates target metadata according to the target field, the target enumeration value and the structure data of the chart.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.

The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims

1. A data mining method for mining transaction data generated by users during online shopping, wherein at least region and gender data of the users are stored as a record under corresponding fields of a data table, the method being characterized by comprising the following steps:

acquiring metadata of a data table and structural data of a chart generated in advance according to data stored in the data table, wherein the structural data is used for describing the structure of the chart; the metadata of the data table is used for describing the structure of the data table, and comprises field names and field types of all fields in the data table; the structure data of the chart is used for describing the structure of the chart, and comprises the dimension, the index and the aggregation function acting on the index field of the chart;

screening out corresponding sub data tables from the data tables respectively through each target metadata for data mining;

generating at least one target metadata according to the metadata of the data table and the structure data of the chart, specifically comprising:

determining a target field and a target enumerated value in the target field according to metadata of the data table, wherein the target field is selected from fields other than a field in the data table used for generating the chart;

generating target metadata according to the target field, the target enumeration value and the structure data of the chart;

or

determining sub-fields of fields corresponding to the dimensions of the data table by combining metadata of the data table and performing drill-down analysis on the fields corresponding to the dimensions;

target metadata is generated based on the subfields.

2. The method of claim 1, wherein the method further comprises: screening the sub data tables meeting preset conditions from the sub data tables, and performing data mining on the sub data tables meeting the preset conditions, wherein the sub data tables meeting the preset conditions specifically comprise that the sub data tables meet any one of the following preset conditions:

the ratio of the maximum numerical value in the sub data table to other numerical values in the sub data table is larger than a preset threshold value;

the ratio of the maximum numerical value in the sub data table to the sum of the numerical values in the sub data table is larger than a second preset threshold;

at least one correlation coefficient in correlation coefficients between every two fields of the sub data table is larger than a third preset threshold;

and determining, by a nonparametric test, that the magnitude of the rise or fall of the values in the sub data table exceeds a fourth preset threshold.

3. The method of claim 2, wherein the data mining of the sub-data table satisfying the preset condition specifically comprises: generating a corresponding chart based on the sub-data table meeting the preset conditions, and displaying the generated chart.

4. The method according to claim 2 or 3, wherein data mining the sub-data table satisfying the preset condition specifically includes: generating corresponding natural language description information according to the type of the preset condition met by the sub data table meeting the preset condition, and displaying the generated natural language description information.

5. A data mining device for mining transaction data generated by users during online shopping, wherein at least region and gender data of the users are stored as a record under corresponding fields of a data table, the device being characterized by comprising an acquisition unit, a generation unit and a screening unit, wherein:

an acquisition unit that acquires metadata of a data table and structure data of a chart generated in advance from data stored in the data table, wherein the structure data is used to describe a structure of the chart; the metadata of the data table is used for describing the structure of the data table, and comprises field names and field types of all fields in the data table; the structure data of the chart is used for describing the structure of the chart, and comprises the dimension, the index and the aggregation function acting on the index field of the chart;

the screening unit is used for screening corresponding sub data tables from the data tables through each target metadata for data mining;

wherein, the generating unit specifically comprises:

a generating subunit, which determines the sub-field of the field corresponding to the dimension by performing drill-down analysis on the field corresponding to the dimension of the data table in combination with the metadata of the data table; generating target metadata based on the subfields;

a second generation subunit, configured to determine a target field and a target enumeration value in the target field according to metadata of the data table, where the target field is selected from fields other than a field in the data table used for generating the chart; and generating target metadata according to the target field, the target enumeration value and the structure data of the chart.

6. The apparatus of claim 5, wherein the apparatus further comprises:

the mining unit is used for screening the sub data tables meeting preset conditions from the sub data tables and mining the data of the sub data tables meeting the preset conditions.

7. The device according to claim 6, characterized in that the mining unit specifically comprises:

the mining subunit generates a corresponding chart based on the sub-data table meeting the preset conditions, and displays the generated chart; and generates corresponding natural language description information according to the type of the preset condition met by the sub data table meeting the preset condition, and displays the generated natural language description information.