CN116756615A - Data analysis method, data analysis device, computer readable medium and electronic equipment - Google Patents

Publication number
CN116756615A
Authority
CN
China
Prior art keywords
data
target
dimension
determining
type
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310764862.0A
Other languages
Chinese (zh)
Inventor
柯珍梅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zitiao Network Technology Co Ltd
Original Assignee
Beijing Zitiao Network Technology Co Ltd
Application filed by Beijing Zitiao Network Technology Co Ltd
Priority to CN202310764862.0A
Publication of CN116756615A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Algebra (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Computational Mathematics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Complex Calculations (AREA)

Abstract

The present disclosure relates to a data analysis method, an apparatus, a computer readable medium and an electronic device. The method comprises: acquiring test data and a preset analysis index for a service system; processing the test data into two-class data under different data dimensions; determining the distribution difference degree between the classes of data in the two-class data, and determining the category interpretation degree of each class of data in the two-class data according to the preset analysis index; and determining a target data subgroup in the test data according to the distribution difference degree and the category interpretation degree. With this technical scheme, the different data dimensions of the test data can be analyzed automatically, so that the target data subgroup meeting the preset analysis index can be accurately determined from the different data dimensions.

Description

Data analysis method, data analysis device, computer readable medium and electronic equipment
Technical Field
The present disclosure relates to the field of computer technology, and in particular, to a data analysis method, apparatus, computer readable medium, and electronic device.
Background
In an actual business scenario, heterogeneity of treatment effect (HTE) analysis is typically performed on data of different dimensions based on test data, in order to locate subgroups (of people or things) that meet certain target indexes and thereby validate a specific business strategy for those subgroups.
In the related art, the dimensions of the test data are generally broken down manually, and HTE analysis is performed on the resulting dimensions to locate subgroups meeting certain target indexes. However, manual breakdown analysis is inefficient and time consuming, and it is difficult to find the best subgroups by exhaustively cross-combining different dimensions.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In a first aspect, the present disclosure provides a data analysis method, the method comprising:
acquiring test data and a preset analysis index for a service system, wherein the preset analysis index is used for representing the relation between target data expected to be obtained from the test data and a target index, the test data comprise experimental group data and control group data, the target index is obtained by dividing a first data index by a second data index, the first data index is obtained based on first difference data between the experimental group data and the control group data under a first sub-data index, and the second data index is obtained based on second difference data between the experimental group data and the control group data under a second sub-data index;
processing the test data into two-class data under different data dimensions;
determining the distribution difference degree between the classes of data in the two-class data, and determining the category interpretation degree of each class of data in the two-class data according to the preset analysis index, wherein the category interpretation degree is used for representing the data duty ratio, that is, the share within the test data, of the first difference data or the second difference data corresponding to each class of data;
and determining a target data subgroup in the test data according to the distribution difference degree and the category interpretation degree.
In a second aspect, the present disclosure provides a data analysis apparatus comprising:
an acquisition module, used for acquiring test data and a preset analysis index for a service system, wherein the preset analysis index is used for representing the relation between target data expected to be obtained from the test data and a target index, the test data comprise experimental group data and control group data, the target index is obtained by dividing a first data index by a second data index, the first data index is obtained based on first difference data between the experimental group data and the control group data under a first sub-data index, and the second data index is obtained based on second difference data between the experimental group data and the control group data under a second sub-data index;
a processing module, used for processing the test data into two-class data under different data dimensions;
a first determining module, used for determining the distribution difference degree between the classes of data in the two-class data, and determining the category interpretation degree of each class of data in the two-class data according to the preset analysis index, wherein the category interpretation degree is used for representing the data duty ratio, that is, the share within the test data, of the first difference data or the second difference data corresponding to each class of data;
and a second determining module, used for determining a target data subgroup in the test data according to the distribution difference degree and the category interpretation degree.
In a third aspect, the present disclosure provides a computer readable medium having stored thereon a computer program which when executed by a processing device performs the steps of the method of the first aspect.
In a fourth aspect, the present disclosure provides an electronic device comprising:
a storage device having a computer program stored thereon;
a processing device for executing the computer program in the storage device to carry out the steps of the method of the first aspect.
According to this technical scheme, the test data can be processed into two-class data under different data dimensions according to the target index, the distribution difference degree between the classes of data and the category interpretation degree of each class of data can then be determined from the two-class data, and the target data subgroup can be determined in the test data according to the distribution difference degree and the category interpretation degree. Because the target index is obtained by dividing the first data index by the second data index, it can represent the heterogeneity of treatment effect between the first data index and the second data index. Heterogeneity of treatment effect analysis can therefore be performed automatically on the test data across different data dimensions, so that the target data subgroup meeting the preset analysis index is obtained more accurately and the efficiency of data analysis is improved.
Additional features and advantages of the present disclosure will be set forth in the detailed description which follows.
Drawings
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. The same or similar reference numbers will be used throughout the drawings to refer to the same or like elements. It should be understood that the figures are schematic and that elements and components are not necessarily drawn to scale. In the drawings:
FIG. 1 is a flow chart of a method of data analysis provided in accordance with an exemplary embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a target data decision tree provided in accordance with an exemplary embodiment of the present disclosure;
FIG. 3 is a flow chart of another data analysis method provided in accordance with an exemplary embodiment of the present disclosure;
FIG. 4 is a block diagram of a data analysis device provided in accordance with an exemplary embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of an electronic device provided in accordance with an exemplary embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure have been shown in the accompanying drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but are provided to provide a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "including" and variations thereof as used herein are intended to be open-ended, i.e., "including, but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Related definitions of other terms will be given in the description below.
It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units.
It should be noted that the modifiers "a" and "a plurality" used in this disclosure are intended to be illustrative rather than limiting, and those of ordinary skill in the art will appreciate that they should be understood as "one or more" unless the context clearly indicates otherwise.
The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
It will be appreciated that prior to using the technical solutions disclosed in the embodiments of the present disclosure, the user should be informed and authorized of the type, usage range, usage scenario, etc. of the personal information related to the present disclosure in an appropriate manner according to the relevant legal regulations.
For example, in response to receiving an active request from a user, prompt information is sent to the user to explicitly inform the user that the requested operation will require obtaining and using the user's personal information. The user can thus autonomously choose, according to the prompt information, whether to provide personal information to software or hardware such as an electronic device, an application program, a server or a storage medium that executes the operations of the technical scheme of the present disclosure.
As an alternative but non-limiting implementation, in response to receiving an active request from a user, the manner in which the prompt information is sent to the user may be, for example, a popup, in which the prompt information may be presented in a text manner. In addition, a selection control for the user to select to provide personal information to the electronic device in a 'consent' or 'disagreement' manner can be carried in the popup window.
It will be appreciated that the above-described notification and user authorization process is merely illustrative and not limiting of the implementations of the present disclosure, and that other ways of satisfying relevant legal regulations may be applied to the implementations of the present disclosure.
Meanwhile, it can be understood that the data (including but not limited to the data itself, the acquisition or the use of the data) related to the technical scheme should conform to the requirements of the corresponding laws and regulations and related regulations.
As mentioned in the background, in an actual business scenario, heterogeneity of treatment effect (HTE) analysis is typically performed on data of different dimensions based on test data, in order to locate subgroups (of people or things) that meet certain target indexes and thereby validate a specific business strategy for those subgroups. Here, heterogeneity of treatment effect means that the influence of one index on another index may differ from individual to individual. For example, when one index is the payment amount for a product and the other is the frequency of use of the product, then as the payment amount increases, the frequency of use may drop sharply for user A but remain unchanged or drop only slightly for user B.
In the related art, the dimensions of the test data are generally broken down manually, and HTE analysis is performed on the resulting dimensions to locate subgroups meeting certain target indexes. However, manual breakdown analysis is inefficient and time consuming, and it is difficult to find the best subgroups by exhaustively cross-combining different dimensions.
In view of the above, embodiments of the present disclosure provide a data analysis method, apparatus, computer readable medium and electronic device, so as to solve the above technical problems.
Embodiments of the present disclosure are further explained below with reference to the drawings.
Fig. 1 is a flowchart illustrating a data analysis method according to an exemplary embodiment of the present disclosure, and referring to fig. 1, the method may include the steps of:
S101: acquiring test data and a preset analysis index for a service system, wherein the preset analysis index is used for representing the relation between target data expected to be obtained from the test data and a target index, the test data comprise experimental group data and control group data, the target index is obtained by dividing a first data index by a second data index, the first data index is obtained based on first difference data between the experimental group data and the control group data under a first sub-data index, and the second data index is obtained based on second difference data between the experimental group data and the control group data under a second sub-data index.
It should be appreciated that a service system typically needs to verify the feasibility of a particular business strategy based on test data before executing that strategy. That is, it is generally necessary to construct experimental group data and control group data, and to execute the business strategy on the experimental group data but not on the control group data, so that the feasibility of the strategy can be determined from the results of the two groups. The experimental group data and the control group data may be set according to actual conditions, which is not limited in any way by the embodiments of the present disclosure.
It should be further understood that the relationship between the target data and the target index may be determined according to an actual business scenario, which is not limited in any way by the embodiments of the present disclosure. In a possible embodiment, the relationship between the target data and the target index may be a positive correlation or a negative correlation.
It should be further understood that the target index, the first data index, the second data index, the first sub-data index, and the second sub-data index may also be determined according to an actual service scenario, which is not limited in any way by the embodiments of the present disclosure. In a possible embodiment, the target index may be a ratio of the amounts of change of two preset indexes between the experimental group data and the control group data. That is, the target index may be expressed as:
$$O = \frac{a}{b}$$
$$a = A_1 - A_2$$
$$b = B_1 - B_2$$
where $O$ represents the target index; $a$ represents the variation of the first preset index between the experimental group data and the control group data, namely the first difference data; $b$ represents the variation of the second preset index between the experimental group data and the control group data, namely the second difference data; $A_1$ represents the experimental group data corresponding to the first preset index; $A_2$ represents the control group data corresponding to the first preset index; $B_1$ represents the experimental group data corresponding to the second preset index; and $B_2$ represents the control group data corresponding to the second preset index.
For example, the target index may be the ratio of a virtual resource variation amount to an active duration loss amount. In that case, the first data index is the virtual resource variation amount, the first sub-data index is the virtual resource, the second data index is the active duration loss amount, and the second sub-data index is the active duration. The virtual resource variation amount is determined by the difference between the virtual resource in the experimental group data and the virtual resource in the control group data, and the active duration loss amount is determined by the difference between the active duration in the experimental group data and the active duration in the control group data.
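The difference-based target index can be illustrated with a minimal Python sketch. It assumes each index has already been aggregated to a single number per group; the function name `target_index_from_differences` and the sample figures are hypothetical and are not taken from the disclosure.

```python
def target_index_from_differences(a1, a2, b1, b2):
    """Return O = a / b, with a = A1 - A2 and b = B1 - B2.

    a1, b1: experimental group values of the first and second preset index.
    a2, b2: control group values of the same two indexes.
    """
    a = a1 - a2  # first difference data
    b = b1 - b2  # second difference data
    if b == 0:
        # Assumption: an undefined ratio is reported as an error.
        raise ZeroDivisionError("second difference data is zero; O is undefined")
    return a / b


# Hypothetical figures: virtual resource 120 vs 100, active duration 45 vs 50.
print(target_index_from_differences(120.0, 100.0, 45.0, 50.0))
```

With these made-up figures the virtual resource rises by 20 while the active duration falls by 5, so the index expresses the virtual resource gained per unit of active duration lost.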
In a possible implementation manner, the target index may also be the ratio of the relative changes of two preset indexes between the experimental group data and the control group data. That is, the target index may be expressed as:
$$O = \frac{a / A_2}{b / B_2}$$
$$a = A_1 - A_2$$
$$b = B_1 - B_2$$
where $O$ represents the target index; $a / A_2$ represents the variation ratio of the experimental group data relative to the control group data for the first preset index, namely the first data index; $b / B_2$ represents the variation ratio of the experimental group data relative to the control group data for the second preset index, namely the second data index; $A_1$ represents the experimental group data corresponding to the first preset index; $A_2$ represents the control group data corresponding to the first preset index; $B_1$ represents the experimental group data corresponding to the second preset index; $B_2$ represents the control group data corresponding to the second preset index; $a$ represents the variation of the experimental group data relative to the control group data for the first preset index, namely the first difference data; and $b$ represents the variation of the experimental group data relative to the control group data for the second preset index, namely the second difference data.
For example, the target index may be the ratio of a virtual resource change ratio to an active duration loss ratio. In that case, the first data index is the virtual resource change ratio, the first sub-data index is the virtual resource, the second data index is the active duration loss ratio, and the second sub-data index is the active duration. The virtual resource change ratio is determined by the difference (i.e., the first difference data) between the virtual resource in the experimental group data and the virtual resource in the control group data, together with the virtual resource in the control group data. The active duration loss ratio is determined by the difference (i.e., the second difference data) between the active duration in the experimental group data and the active duration in the control group data, together with the active duration in the control group data.
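The ratio-based variant can be sketched in the same way. This is only an illustration under the assumption that each index is a single aggregated number per group; `target_index_from_relative_changes` and the sample figures are hypothetical.

```python
def target_index_from_relative_changes(a1, a2, b1, b2):
    """Return O = (a / A2) / (b / B2), with a = A1 - A2 and b = B1 - B2."""
    if a2 == 0 or b2 == 0:
        # Assumption: a zero control value makes the relative change undefined.
        raise ZeroDivisionError("control group value is zero")
    first_data_index = (a1 - a2) / a2   # relative change of the first preset index
    second_data_index = (b1 - b2) / b2  # relative change of the second preset index
    if second_data_index == 0:
        raise ZeroDivisionError("second data index is zero; O is undefined")
    return first_data_index / second_data_index


# Hypothetical figures: virtual resource 120 vs 100, active duration 45 vs 50.
print(target_index_from_relative_changes(120.0, 100.0, 45.0, 50.0))
```

Because both the numerator and the denominator are relative changes, this form does not depend on the units of the two indexes, unlike the difference-based form.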
S102: processing the test data into two-class data under different data dimensions.
It should be appreciated that the test data may include a plurality of data dimensions, and that each data dimension may be used to characterize a feature of the test data or a factor that influences the test data. For example, the data dimensions corresponding to a piece of test data may include age, geographic location, and the like.
It should also be appreciated that each data dimension may contain multiple dimension elements, and that the number of dimension elements contained in different data dimensions may be the same or different. Dimension elements are used to characterize the element categories under each data dimension. For example, when the data dimension is age, its dimension elements may be different ages, such as 20 years old, 30 years old, 40 years old, 50 years old, and the like. To avoid analysis errors caused by data dimensions containing different numbers of dimension elements, the embodiments of the present disclosure perform two-class processing on the test data, which ensures that every data dimension of the test data contains the same number of dimension elements.
In a possible implementation, processing the test data into two-class data under different data dimensions may include:
for each data dimension corresponding to the test data, determining the type of the data dimension, and determining the target division information corresponding to the data dimension according to the type of the data dimension; and processing the test data into two-class data under the different data dimensions according to the target division information.
It should be appreciated that the data dimension can be divided into two types, a qualitative dimension and a quantitative dimension, depending on the data type of the dimension element in the data dimension. The data dimension in which the data type of the dimension element is character type or text type is qualitative dimension, such as geographic position. The data dimension in which the data type of the dimension element is numerical is a quantitative dimension, for example, age or the like. Thus, in a possible implementation, a data dimension can be judged to be either a qualitative dimension or a quantitative dimension by identifying the data type of the dimension element under that data dimension.
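The type check described above can be sketched as follows. This is a minimal illustration that assumes the dimension elements are available as a Python list; the function name `dimension_type` and the treatment of booleans as non-numeric are assumptions made for this sketch.

```python
from numbers import Number


def dimension_type(elements):
    """Classify a data dimension as 'quantitative' or 'qualitative'."""
    if not elements:
        raise ValueError("a data dimension needs at least one dimension element")
    # bool is a subclass of int in Python; it is treated as non-numeric here.
    numeric = all(
        isinstance(e, Number) and not isinstance(e, bool) for e in elements
    )
    return "quantitative" if numeric else "qualitative"


print(dimension_type([20, 30, 40, 50]))        # ages
print(dimension_type(["Beijing", "Shanghai"]))  # geographic locations
```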
After the type of the data dimension is determined, the corresponding target division information can be determined according to that type, and the two-class processing is then performed according to the target division information.
In a possible implementation manner, determining the target division information corresponding to the data dimension according to the type of the data dimension may include:
determining candidate division information corresponding to the data dimension according to the type of the data dimension; determining the distribution difference degree between the classes of data obtained by dividing the test data according to the candidate division information; and determining the target division information among the candidate division information according to the distribution difference degree between the classes of data obtained by dividing according to the candidate division information.
It should be understood that the candidate division information may be set according to actual circumstances, and the embodiments of the present disclosure do not impose any limitation on this. In one possible implementation, each dimension element in the data dimension serves as one piece of candidate division information. In another possible implementation, the dimension elements in the data dimension are screened based on a preset selection rule to obtain a plurality of target dimension elements, and each target dimension element then serves as one piece of candidate division information.
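One way to picture candidate division information is the sketch below, in which every distinct dimension element yields one candidate. The concrete split rules are assumptions made for illustration only: a quantitative dimension is split by "less than or equal to the element", and a qualitative dimension by "equal to the element". The names `candidate_divisions` and `split_two_classes` are hypothetical.

```python
def candidate_divisions(elements, kind):
    """Return one candidate division per usable dimension element."""
    distinct = sorted(set(elements))
    if kind == "quantitative":
        # The largest value is skipped: "<= max" would leave one class empty.
        return [("<=", value) for value in distinct[:-1]]
    return [("==", value) for value in distinct]


def split_two_classes(rows, dimension, division):
    """Split rows (dicts) into the two classes defined by one candidate."""
    operator, value = division

    def matches(row):
        if operator == "<=":
            return row[dimension] <= value
        return row[dimension] == value

    first = [row for row in rows if matches(row)]
    second = [row for row in rows if not matches(row)]
    return first, second


rows = [{"age": 20}, {"age": 30}, {"age": 40}, {"age": 30}]
ages = [row["age"] for row in rows]
for division in candidate_divisions(ages, "quantitative"):
    first, second = split_two_classes(rows, "age", division)
    print(division, len(first), len(second))
```

Whatever the split rule, every candidate produces exactly two classes, which is what keeps dimensions with different numbers of elements comparable.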
It should be further understood that the greater the distribution difference degree between the classes of data obtained by dividing according to the candidate division information, the better the classes are distinguished from each other, and the more interpretable the resulting data subgroup. Therefore, in the embodiments of the present disclosure, the distribution difference degrees obtained under the different pieces of candidate division information are compared, and the candidate division information that maximizes the distribution difference degree between the classes of data is determined as the target division information, which improves the accuracy with which the target data subgroup is analyzed.
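The selection rule amounts to taking the candidate with the largest distribution difference degree. A minimal sketch follows, assuming the scoring function is supplied by the caller; `select_target_division` and the scores in the example are hypothetical.

```python
def select_target_division(candidates, distribution_difference):
    """Return (division, score) with the largest distribution difference degree.

    distribution_difference is a caller-supplied function that maps one piece
    of candidate division information to its distribution difference degree.
    """
    scored = [(division, distribution_difference(division)) for division in candidates]
    if not scored:
        raise ValueError("no candidate division information")
    return max(scored, key=lambda pair: pair[1])


# Hypothetical, precomputed distribution difference degrees per candidate.
example_scores = {("<=", 20): 0.12, ("<=", 30): 0.40, ("<=", 40): 0.25}
best, score = select_target_division(example_scores, example_scores.get)
print(best, score)
```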
In a possible implementation manner, determining the distribution difference degree between each type of data after the test data is divided according to the candidate division information may include:
determining, after the test data are divided according to the candidate division information, the numerator duty ratio of each class of data, that is, the share of the first difference data corresponding to that class within the test data, and the denominator duty ratio of each class of data, that is, the share of the second difference data corresponding to that class within the test data; converting the numerator duty ratio and the denominator duty ratio according to their signs so that both are positive numbers, thereby obtaining a target numerator duty ratio and a target denominator duty ratio; and determining the distribution difference degree between the classes of data according to the target numerator duty ratio and the target denominator duty ratio.
It should be understood that the first difference data and the second difference data are both determined by the difference between the experimental group data and the control group data, but they correspond to different indexes, so their units of measure may differ, which makes it inconvenient to compare distribution difference degrees. Therefore, the embodiments of the present disclosure compute the numerator duty ratio of the first difference data in the test data and the denominator duty ratio of the second difference data in the test data, thereby converting the first difference data and the second difference data to the same scale so that distribution difference degrees can be compared.
It should be further understood that, because the first difference data and the second difference data are determined by the difference between the experimental group data and the control group data, the value of the first difference data and the value of the second difference data may each be positive or negative. Accordingly, the numerator duty ratio of the first difference data in the test data and the denominator duty ratio of the second difference data in the test data may each be positive or negative. To prevent differing signs of the numerator duty ratio and the denominator duty ratio from biasing the distribution difference degree determined from them, the embodiments of the present disclosure convert the numerator duty ratio and the denominator duty ratio so that both are positive numbers.
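The duty ratios can be sketched as shares of a total. The sketch below assumes that the first and second difference data of the whole test data equal the sums of the per-class values, which holds for additive indexes but is an assumption of this example; `duty_ratios` and the figures are hypothetical.

```python
def duty_ratios(class_differences):
    """Return (numerator duty ratio, denominator duty ratio) for each class.

    class_differences: list of (a_i, b_i) pairs, the first and second
    difference data of each class after the test data have been divided.
    """
    total_a = sum(a for a, _ in class_differences)
    total_b = sum(b for _, b in class_differences)
    if total_a == 0 or total_b == 0:
        # Assumption: a zero total makes the shares undefined.
        raise ValueError("total difference data is zero")
    return [(a / total_a, b / total_b) for a, b in class_differences]


# Hypothetical two-class split: the classes move in opposite directions on the
# first index, so one numerator duty ratio comes out negative.
for numerator, denominator in duty_ratios([(30.0, -2.0), (-10.0, -3.0)]):
    print(numerator, denominator)
```

Both shares are unit-free, which is what puts the two kinds of difference data on the same scale, and the example also shows how a negative duty ratio can arise.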
For example, it may be determined whether the positive and negative conditions of the numerator duty cycle and the denominator duty cycle are the same by calculating the product of the numerator duty cycle and the denominator duty cycle. If the product is positive, the positive and negative conditions of the numerator duty ratio and the denominator duty ratio are the same, that is, the numerator duty ratio and the denominator duty ratio may be positive, or the numerator duty ratio and the denominator duty ratio may be negative, at this time, the absolute value of the numerator duty ratio and the denominator duty ratio can be simultaneously calculated to ensure that the numerator duty ratio and the denominator duty ratio are positive, and then the distribution difference degree between each type of data is determined according to the numerator duty ratio and the denominator duty ratio after the absolute value is calculated.
If the product is negative, the numerator proportion and the denominator proportion have different signs. If absolute values were simply taken in this case and the distribution difference degree were determined from the results, two types of data that originally differ in distribution could be judged to have no distribution difference whenever the two absolute values happen to be equal, which would be an error. To overcome this technical problem, when the product of the numerator proportion and the denominator proportion is negative, it may first be determined which of the two proportions is positive and which is negative. Second, the positive proportion is converted by taking the difference between the positive proportion and the negative proportion, to obtain a first target proportion. Then, the negative proportion is converted by assigning it a positive value, to obtain a second target proportion; to keep the assignment from affecting the distribution difference degree, the assigned value should be as small as possible, for example 0.0001 or 0.00005. Finally, the distribution difference degree between each type of data is determined from the two converted target proportions.
For example, if the numerator proportion is 0.2 and the denominator proportion is -0.2, the numerator proportion is converted by taking the difference between the numerator proportion and the denominator proportion, so the target numerator proportion is 0.2 - (-0.2) = 0.4. The denominator proportion is converted by assigning it a small positive value, for example 0.0001, to obtain the target denominator proportion. Finally, the distribution difference degree is calculated from the target numerator proportion and the target denominator proportion; here it is 0.4 - 0.0001 = 0.3999, which is approximately 0.4, so the influence of the assigned value on the final distribution difference degree is negligible.
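The sign handling described above can be sketched as follows. This is a minimal illustration rather than the disclosed implementation: the function name `normalize_proportions` and the parameter `epsilon` are hypothetical, and treating a zero product like the same-sign case is an assumption, since the text does not address it.

```python
def normalize_proportions(numerator, denominator, epsilon=0.0001):
    """Return a non-negative (numerator, denominator) pair of proportions."""
    if numerator * denominator >= 0:
        # Same sign. A zero product is handled here too, which is an
        # assumption; the text only covers positive and negative products.
        return abs(numerator), abs(denominator)
    # Opposite signs: the positive proportion becomes (positive - negative),
    # and the negative proportion is replaced by a small positive value.
    if numerator > 0:
        return numerator - denominator, epsilon
    return epsilon, denominator - numerator


# Worked example from the text: 0.2 and -0.2.
target_numerator, target_denominator = normalize_proportions(0.2, -0.2)
print(target_numerator, target_denominator)
```

Keeping `epsilon` as a parameter makes it easy to try the other small value mentioned in the text, 0.00005.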
In a possible implementation manner, according to the type of the data dimension, determining the candidate partition information corresponding to the data dimension may include:
when the type of the data dimension is a quantitative dimension type, each dimension element in the data dimension is used as partition information to obtain candidate partition information corresponding to the data dimension.
For example, when the data dimension is age and its dimension elements are 20 years old, 21 years old and 22 years old, 20 years old may be taken as the first dividing point to obtain the first candidate partition information: less than or equal to 20 years old, and greater than 20 years old. Taking 21 years old as the second dividing point then gives the second candidate partition information: less than or equal to 21 years old, and greater than 21 years old. Finally, taking 22 years old as the third dividing point gives the third candidate partition information: less than or equal to 22 years old, and greater than 22 years old. In this way, a plurality of candidate partition information corresponding to the age data dimension is obtained.
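The enumeration of dividing points for a quantitative dimension can be sketched as follows. The function name `quantitative_candidates` and the tuple layout are hypothetical and chosen only for illustration.

```python
def quantitative_candidates(elements):
    """Use every distinct dimension element as a dividing point.

    Returns a list of (point, left, right) tuples, where left holds the
    elements <= point and right holds the elements > point.
    """
    ordered = sorted(set(elements))
    candidates = []
    for point in ordered:
        left = [e for e in ordered if e <= point]
        right = [e for e in ordered if e > point]
        candidates.append((point, left, right))
    return candidates


# Ages from the example in the text.
for point, left, right in quantitative_candidates([20, 21, 22]):
    print(f"<= {point}: {left} | > {point}: {right}")
```

As in the text, the largest element is also kept as a dividing point, even though its "greater than" side contains none of the listed elements.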
In a possible implementation manner, according to the type of the data dimension, determining the candidate partition information corresponding to the data dimension may include:
when the type of the data dimension is a qualitative dimension type, determining, according to a preset analysis index, the element interpretation degree in the test data of the data of each dimension element in the data dimension, and sorting the dimension elements in the data dimension according to the element interpretation degree to obtain sorted dimension elements, wherein the element interpretation degree is used for representing the data proportion, in the test data, of the first difference data or the second difference data corresponding to the dimension element; and determining partition information in turn according to every two adjacent dimension elements in the sorted dimension elements to obtain candidate partition information corresponding to the data dimension.
As described above, the preset analysis index characterizes the relationship between the target index and the target data expected to be obtained from the test data, and this relationship may be a positive correlation or a negative correlation. When the relationship is a positive correlation, the target data is the test data that makes the target index as high as possible. Because the target index is obtained by dividing the first data index by the second data index, a target data subgroup with a larger first data index value and a larger amount of data needs to be found from the test data in this case. Because the first data index is related to the first difference data, the element interpretation degree can then be determined by calculating the proportion of the first difference data corresponding to each dimension element in the first difference data corresponding to the test data. That is, the element interpretation degree can be determined by the numerator index proportion as follows:

$$EP_{ij} = \frac{a_{ij}}{S_1}$$

where $EP_{ij}$ represents the element interpretation degree of the j-th dimension element in data dimension i, $a_{ij}$ represents the first difference data corresponding to the j-th dimension element in data dimension i, and $S_1$ represents the first difference data corresponding to the test data.
For example, suppose the target index is the ratio of a virtual resource change ratio to an active duration loss ratio, the first difference data is the difference between the virtual resource in the experimental group data and the virtual resource in the control group data, and the second difference data is the difference between the active duration in the experimental group data and the active duration in the control group data. When the preset analysis index indicates that the target data expected to be obtained from the test data is positively correlated with the target index, the element interpretation degree of a dimension element can be obtained, according to the above formula, by dividing the virtual resource difference corresponding to that dimension element by the virtual resource difference corresponding to all of the test data.
When the relationship between the target data and the target index is a negative correlation, the target data is the test data that makes the target index as low as possible. Because the target index is obtained by dividing the first data index by the second data index, a target data subgroup with a larger second data index value and a larger amount of data needs to be found from the test data in this case. Because the second data index is related to the second difference data, the element interpretation degree can then be determined by calculating the proportion of the second difference data corresponding to each dimension element in the second difference data corresponding to the test data. That is, the element interpretation degree can be determined by the denominator index proportion as follows:

$$EP_{ij} = \frac{b_{ij}}{S_2}$$

where $EP_{ij}$ represents the element interpretation degree of the j-th dimension element in data dimension i, $b_{ij}$ represents the second difference data corresponding to the j-th dimension element in data dimension i, and $S_2$ represents the second difference data corresponding to the test data.
For example, suppose the target index is the ratio of a virtual resource change ratio to an active duration loss ratio, the first difference data is the difference between the virtual resource in the experimental group data and the virtual resource in the control group data, and the second difference data is the difference between the active duration in the experimental group data and the active duration in the control group data. When the preset analysis index indicates that the target data expected to be obtained from the test data is negatively correlated with the target index, the element interpretation degree of a dimension element can be obtained, according to the above formula, by dividing the active duration difference corresponding to that dimension element by the active duration difference corresponding to all of the test data.
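The two cases of the element interpretation degree can be sketched together as follows. The function name, the dictionary layout and the sample numbers are hypothetical. The sketch also assumes that the total difference for the test data equals the sum over the dimension elements, and that a larger interpretation degree ranks first; neither is stated explicitly in this passage.

```python
def element_interpretation(first_diff, second_diff, positive_correlation=True):
    """Share of the total difference attributable to each dimension element.

    first_diff / second_diff map each dimension element to its first /
    second difference data. The positive-correlation case uses the first
    difference data, the negative-correlation case uses the second.
    """
    values = first_diff if positive_correlation else second_diff
    total = sum(values.values())  # assumed to play the role of S_1 or S_2
    if total == 0:
        raise ValueError("total difference is zero; proportions are undefined")
    return {element: value / total for element, value in values.items()}


# Illustrative numbers only.
first_diff = {"A": 6.0, "B": 3.0, "C": 1.0}
second_diff = {"A": 1.0, "B": 1.0, "C": 2.0}

ep = element_interpretation(first_diff, second_diff, positive_correlation=True)
ranked = sorted(ep, key=ep.get, reverse=True)  # descending order is an assumption
print(ep, ranked)
```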
After the element interpretation degree in the test data of the data of each dimension element in the data dimension is determined, the dimension elements are sorted according to the element interpretation degree to obtain sorted dimension elements, and partition information is determined in turn according to every two adjacent dimension elements in the sorted dimension elements to obtain candidate partition information corresponding to the data dimension.
Illustratively, the data dimension is geographic location, and the sorted dimension elements are, in order: region A, region D, region B and region C. Determining partition information in turn according to every two adjacent dimension elements in the sorted dimension elements to obtain candidate partition information corresponding to the data dimension may then include:
First, taking region A as the first dividing point, the first candidate partition information is obtained as: {region A} and {region D, region B, region C}. Second, taking region D as the second dividing point, the second candidate partition information is obtained as: {region A, region D} and {region B, region C}. Then, taking region B as the third dividing point, the third candidate partition information is obtained as: {region A, region D, region B} and {region C}. In this way, partition information is determined between every two adjacent dimension elements in the sorted dimension elements, and a plurality of candidate partition information can be obtained.
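The region example can be reproduced with a short sketch that cuts an already sorted list of dimension elements at every position between two neighbours. The function name `qualitative_candidates` is hypothetical.

```python
def qualitative_candidates(sorted_elements):
    """Split an ordered list of dimension elements at every interior position.

    The input is assumed to be sorted by element interpretation degree
    already; n elements yield n - 1 candidate two-way partitions.
    """
    return [
        (sorted_elements[:k], sorted_elements[k:])
        for k in range(1, len(sorted_elements))
    ]


# Sorted regions from the example in the text.
for left, right in qualitative_candidates(["A", "D", "B", "C"]):
    print(left, right)
```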
In a possible embodiment, to avoid selecting sparse dimension elements on the basis of the element interpretation degree alone, the data analysis method may further include:
determining an index difference value corresponding to the dimension element according to a preset analysis index, the data proportion of the first difference data corresponding to the dimension element in the test data and the data proportion of the second difference data corresponding to the dimension element in the test data; and screening the dimension elements in the data dimension according to the index difference value and the element interpretation degree to obtain target dimension elements.
Accordingly, ordering the dimension elements in the data dimension according to the element interpretation degree may include:
and sorting the target dimension elements according to the element interpretation degree.
It should be understood that the index difference value refers to the difference between two data indexes, and can generally be determined by the ratio between the two data indexes. For example, where one data index is A and the other data index is B, the index difference value of data index A with respect to data index B may be denoted as A/B.
It should be further appreciated that the preset analysis index characterizes the relationship between the target index and the target data expected to be obtained from the test data, and this relationship may be a positive correlation or a negative correlation. When the relationship is a positive correlation, the target data is the test data that makes the target index as high as possible. Because the target index is obtained by dividing the first data index by the second data index, and the first data index is related to the first difference data, the index difference value corresponding to each dimension element can in this case be determined from the data proportion of the first difference data in the test data and the data proportion of the second difference data in the test data. That is, the index difference value may be determined by:
$$TGI_{ij} = \frac{q_{ij}}{p_{ij}} = \frac{a_{ij}/S_1}{b_{ij}/S_2}$$

where $TGI_{ij}$ represents the index difference value of the j-th dimension element in data dimension i, $q_{ij}$ represents the data proportion in the test data of the first difference data of the j-th dimension element in data dimension i, $p_{ij}$ represents the data proportion in the test data of the second difference data of the j-th dimension element in data dimension i, $b_{ij}$ represents the second difference data corresponding to the j-th dimension element in data dimension i, $a_{ij}$ represents the first difference data corresponding to the j-th dimension element in data dimension i, $S_1$ represents the first difference data corresponding to the test data, and $S_2$ represents the second difference data corresponding to the test data.
Similarly, when the relationship between the target data and the target index is a negative correlation, the index difference value corresponding to each dimension element can be determined by calculating the data duty ratio of the second difference data in the test data and the data duty ratio of the first difference data in the test data. Namely: the index difference value may be determined by:
$$TGI_{ij} = \frac{p_{ij}}{q_{ij}} = \frac{b_{ij}/S_2}{a_{ij}/S_1}$$

where $TGI_{ij}$ represents the index difference value of the j-th dimension element in data dimension i, $p_{ij}$ represents the data proportion in the test data of the second difference data of the j-th dimension element in data dimension i, $q_{ij}$ represents the data proportion in the test data of the first difference data of the j-th dimension element in data dimension i, $b_{ij}$ represents the second difference data corresponding to the j-th dimension element in data dimension i, $a_{ij}$ represents the first difference data corresponding to the j-th dimension element in data dimension i, $S_1$ represents the first difference data corresponding to the test data, and $S_2$ represents the second difference data corresponding to the test data.
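A minimal sketch of the index difference value follows. It assumes, in line with the ratio convention stated earlier for two data indexes, that the positive-correlation case divides the first-difference proportion by the second-difference proportion and that the negative-correlation case inverts that ratio. The function name and the sample numbers are hypothetical.

```python
def index_difference(a_ij, b_ij, s1, s2, positive_correlation=True):
    """Index difference value of one dimension element.

    q is the proportion of the first difference data in the test data and
    p is the proportion of the second difference data. Which proportion is
    the numerator in each case is an assumption of this sketch.
    """
    q = a_ij / s1
    p = b_ij / s2
    if positive_correlation:
        return q / p
    return p / q


# Illustrative numbers only: the element holds 20% of the first difference
# data but only 10% of the second difference data.
print(index_difference(20.0, 10.0, 100.0, 100.0, positive_correlation=True))
print(index_difference(20.0, 10.0, 100.0, 100.0, positive_correlation=False))
```

A value well above 1 suggests that the element contributes disproportionately to the index of interest, which is what the screening step can use alongside the element interpretation degree.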
S103: determining the distribution difference degree between each class of data in each two-class data, and determining the class interpretation degree of each class of data in each two-class data according to a preset analysis index, wherein the class interpretation degree is used for representing the data proportion, in the test data, of the first difference data or the second difference data corresponding to each class of data.
In a possible implementation manner, determining the distribution difference degree between each class of data in each two-class data may include:
determining the JS divergence between each class of data in each two-class data; and determining the distribution difference degree between each class of data in each two-class data according to the JS divergence.
It should be appreciated that the JS (Jensen-Shannon) divergence can be used to represent the difference between two probability distributions: the greater the JS divergence, the greater the difference between the two probability distributions, and the higher the corresponding degree of discrimination. Moreover, the JS divergence is symmetric, which can improve the interpretability of actual business data. For these reasons, the embodiment of the disclosure may determine the distribution difference degree between each class of data in each two-class data based on the JS divergence.
Illustratively, the degree of distribution difference between each class of data in each two-class data may be determined by:
$$JS_i = \frac{1}{2}\sum_{j=1}^{n} P_{ij}\log\frac{2P_{ij}}{P_{ij}+Q_{ij}} + \frac{1}{2}\sum_{j=1}^{n} Q_{ij}\log\frac{2Q_{ij}}{P_{ij}+Q_{ij}}, \qquad P_{ij} = \frac{A_{ij}}{S_3}, \qquad Q_{ij} = \frac{B_{ij}}{S_4}$$

where $JS_i$ represents the distribution difference degree between each class of data in the two-class data obtained by dividing on data dimension i, j indexes the dimension elements in data dimension i, n represents the number of dimension elements in data dimension i, $P_{ij}$ represents the data proportion in the test data of the control group data under the first sub-data index for the j-th dimension element in data dimension i, $Q_{ij}$ represents the data proportion in the test data of the control group data under the second sub-data index for the j-th dimension element in data dimension i, $A_{ij}$ represents the control group data under the first sub-data index for the j-th dimension element in data dimension i, $B_{ij}$ represents the control group data under the second sub-data index for the j-th dimension element in data dimension i, $S_3$ represents the control group data corresponding to the first sub-data index in the test data, and $S_4$ represents the control group data corresponding to the second sub-data index in the test data.
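For reference, the standard Jensen-Shannon divergence between two discrete distributions can be computed as in the sketch below. It uses the textbook definition with the natural logarithm; the logarithm base, the helper `proportions` and the sample numbers are assumptions of this sketch and are not taken from the disclosure.

```python
import math


def proportions(values):
    """Turn raw per-element values into proportions that sum to 1."""
    total = sum(values)
    return [v / total for v in values]


def js_divergence(p, q):
    """Jensen-Shannon divergence of two discrete distributions (natural log)."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for pj, qj in zip(p, q):
        mj = (pj + qj) / 2  # mixture distribution
        # Terms with a zero probability contribute nothing (0 * log 0 := 0).
        if pj > 0:
            total += 0.5 * pj * math.log(pj / mj)
        if qj > 0:
            total += 0.5 * qj * math.log(qj / mj)
    return total


# Illustrative numbers only: per-element values under two sub-data indexes.
p = proportions([30.0, 50.0, 20.0])
q = proportions([10.0, 40.0, 50.0])
print(js_divergence(p, q))
```

With the natural logarithm the result lies between 0 (identical distributions) and ln 2 (disjoint distributions), and swapping the two arguments does not change it.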
S104: determining a target data subgroup in the test data according to the distribution difference degree and the category interpretation degree.
In a possible implementation manner, determining the target data subgroup in the test data according to the distribution difference degree and the category interpretation degree may include:
Determining a target data dimension with the largest corresponding distribution difference degree and the corresponding category interpretation degree larger than or equal to a preset threshold value; taking a calculation result of the test data under the target index as a root node, and taking each type of data corresponding to the target data dimension as a child node of the root node to generate a target data decision tree; and determining a target data subgroup in the test data according to the target data decision tree.
It should be appreciated that the greater the distribution difference degree for a data dimension, the higher the degree of discrimination for that data dimension, and the more explanatory the target data subgroup determined from it. Therefore, the target data dimension with the largest distribution difference degree can be found according to the distribution difference degrees of the data dimensions, and the target data subgroup can be determined in the test data according to the target data dimension. Meanwhile, to avoid the target data subgroup determined from the data dimension containing too little data, the embodiment of the disclosure further introduces the category interpretation degree, and combines the distribution difference degree with the category interpretation degree to obtain a target data subgroup that meets the requirement. The preset threshold may be set according to the actual situation, which is not limited in any way by the embodiments of the present disclosure.
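The selection rule can be sketched as a filter followed by a maximum. The dictionary keys, the sample values and the use of a single interpretation value per candidate are hypothetical simplifications; the text does not specify how the interpretation degrees of the two classes are combined.

```python
def select_target_dimension(candidates, threshold):
    """Pick the candidate with the largest distribution difference degree
    among those whose interpretation degree meets the threshold.

    Each candidate is a dict with the hypothetical keys "dimension", "js"
    and "interpretation". Returns None when no candidate qualifies.
    """
    eligible = [c for c in candidates if c["interpretation"] >= threshold]
    if not eligible:
        return None
    return max(eligible, key=lambda c: c["js"])


# Illustrative numbers only.
candidates = [
    {"dimension": "age", "js": 0.30, "interpretation": 0.02},
    {"dimension": "region", "js": 0.20, "interpretation": 0.40},
    {"dimension": "device", "js": 0.10, "interpretation": 0.60},
]
print(select_target_dimension(candidates, threshold=0.05))
```

In this example "age" has the largest distribution difference degree but is excluded by the threshold, so "region" is chosen.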
It should be further understood that a decision tree represents the result of data classification in a tree structure and may include a root node and child nodes, where the root node may be used to characterize a feature or attribute of the data and a child node may be used to characterize a class under the corresponding root node. By constructing the decision tree from the features (or attributes) of the data and the corresponding classes, the data classification result can be seen intuitively. In view of this, in the embodiment of the present disclosure, the calculation result of the test data under the target index is taken as the root node, and each class of data corresponding to the target data dimension is taken as a child node of the root node to generate the target data decision tree, so that the target data subgroup can be obtained from the test data in a visualized manner.
In a possible implementation manner, taking a calculation result of the test data under the target index as a root node, and taking each type of data corresponding to the target data dimension as a child node of the root node, generating the target data decision tree may include:
taking a calculation result of the test data under the target index as a root node, taking each type of data corresponding to the target data dimension as a child node of the root node, and repeatedly executing the following steps to obtain a target data decision tree:
taking a child node as a parent node and the data corresponding to the child node as target data; processing the target data into target two-class data under different data dimensions; determining the target distribution difference degree between each class of data in each target two-class data; determining, according to the preset analysis index, the target class interpretation degree, in the target data, of each class of data in each target two-class data; determining a new target data dimension whose corresponding target distribution difference degree is the largest and whose corresponding target class interpretation degree is greater than or equal to the preset threshold value; and taking each class of data corresponding to the new target data dimension as a child node of the parent node, until a preset stop condition is reached.
It should be understood that the preset stop condition may be set according to the actual situation, and the embodiment of the present disclosure does not impose any limitation on this. In a possible embodiment, the preset stop condition may include at least one of the following: the set tree depth is reached, the interpretation degree of a node is smaller than a first set threshold, the distribution difference degree between the data of nodes at the same level is smaller than a second set threshold, and no further data dimension can be divided. The first set threshold is different from the second set threshold.
It should further be appreciated that the target data dimension contains different dimension elements, and each dimension element corresponds to different test data, which in turn may be related to a new different dimension element. Therefore, in order to intuitively embody the target data subgroup according to the target decision tree, the present embodiment may recursively execute the above process when generating the target data decision tree to obtain the target data decision tree.
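The recursive construction can be sketched as follows. To keep the example short, the choice of the split is delegated to a caller-supplied function; `build_tree`, `choose_split` and the demo chooser `split_on_first_varying_key` are hypothetical names, and the demo chooser is only a stand-in that does not compute distribution difference or interpretation degrees. Only two of the stop conditions are shown: the depth limit and the absence of any further split.

```python
def build_tree(rows, choose_split, max_depth, depth=0):
    """Recursively split rows into a tree of nested dicts.

    choose_split(rows) returns None when no acceptable split exists, or
    (dimension, [(label, subset), ...]) describing the child groups.
    """
    node = {"size": len(rows), "children": []}
    if depth >= max_depth:
        return node  # stop: set tree depth reached
    split = choose_split(rows)
    if split is None:
        return node  # stop: no data dimension can be divided further
    dimension, groups = split
    node["dimension"] = dimension
    for label, subset in groups:
        child = build_tree(subset, choose_split, max_depth, depth + 1)
        child["label"] = label
        node["children"].append(child)
    return node


def split_on_first_varying_key(rows):
    """Demo-only chooser: split on the first key with more than one value."""
    for key in ("region", "age"):
        values = sorted({row[key] for row in rows})
        if len(values) > 1:
            first = values[0]
            left = [r for r in rows if r[key] == first]
            right = [r for r in rows if r[key] != first]
            return key, [(f"{key} = {first}", left), (f"{key} != {first}", right)]
    return None


rows = [
    {"region": "A", "age": 20},
    {"region": "A", "age": 21},
    {"region": "B", "age": 20},
]
tree = build_tree(rows, split_on_first_varying_key, max_depth=2)
print(tree["dimension"], [child["label"] for child in tree["children"]])
```

Passing the chooser in as a parameter keeps the recursion separate from the scoring logic, so a chooser based on the distribution difference degree and the class interpretation degree could be substituted without changing `build_tree`.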
In a possible embodiment, in order to intuitively embody the determination condition of the target data subgroup, a corresponding value of the determination index may also be displayed at each child node of the target data decision tree. That is, the data analysis method may further include:
according to the data subgroup corresponding to each node in the target data decision tree, determining at least one of the following data indexes: the data proportion corresponding to the data subgroup, the distribution difference degree between the data subgroup and the other data subgroups corresponding to nodes at the same level, the calculation result of the data subgroup under the target index, the first data index value corresponding to the data subgroup, and the second data index value corresponding to the data subgroup; and outputting and displaying the target data decision tree, wherein each node in the target data decision tree is displayed in association with the at least one data index.
The data proportion corresponding to the data subgroup may be set according to the actual situation, and the embodiment of the disclosure does not limit it. In a possible embodiment, the data proportion corresponding to the data subgroup may include the data proportion of the first difference data in the test data, the data proportion of the second difference data in the test data, the data proportion in the test data of the control group data under the first sub-data index, the data proportion in the test data of the control group data under the second sub-data index, and so on.
For example, referring to fig. 2, a root node of the target data decision tree may display a calculation result T of the test data under the target index, and the target data dimension has a dimension element i1 and a dimension element i2, so that the root node may be divided into two child nodes, where one child node includes the test data corresponding to the dimension element i1, and is illustrated as a in fig. 2; the other child node contains test data corresponding to dimension element i2, illustrated in fig. 2 as b. In the above manner, the child node a may be used as a parent node, and the test data in the child node a may be further split into two new child nodes according to the new target data dimension, where one new child node includes the test data corresponding to the dimension element j1 in the new target data dimension, and fig. 2 is schematically shown as c; the other new child node contains test data corresponding to the dimension element j2 in the new target data dimension, illustrated as d in fig. 2.
Further, to show how the target data subgroup was determined, the values of the corresponding determination indexes may be displayed at each child node, for example: the value of the analysis index, the value of the class interpretation degree, the value of the distribution difference degree, the data proportion of the first difference data in the test data, the data proportion of the second difference data in the test data, the data proportion in the test data of the control group data under the first sub-data index, the data proportion in the test data of the control group data under the second sub-data index, the average effect difference of the control group data relative to the experimental group data under the first data index, the average effect difference of the control group data relative to the experimental group data under the second data index, and so on. For example, each child node in FIG. 2 is displayed with the following index values: target_indicator, js, observed_weight, control_weight, observed_convert, control_convert, observed_rate and control_rate. Here, target_indicator represents the target index value, js represents the distribution difference degree, observed_weight represents the data proportion in the test data of the control group data corresponding to the first sub-data index, control_weight represents the data proportion in the test data of the control group data corresponding to the second sub-data index, observed_convert represents the average effect difference of the control group data relative to the experimental group data under the first data index, control_convert represents the average effect difference of the control group data relative to the experimental group data under the second data index, observed_rate represents the data proportion of the first difference data in the test data, and control_rate represents the data proportion of the second difference data in the test data.
It should be understood that the specific numerical values of the various indicators in fig. 2 are for illustration only and are not intended to limit the present disclosure.
With any of the data analysis methods described above, test data in different data dimensions can be processed into two-class data according to the target index, so that the distribution difference degree between each class of data and the class interpretation degree can be determined from the two-class data, and the target data subgroup can then be determined in the test data according to the distribution difference degree and the class interpretation degree. Because the target index is obtained by dividing the first data index by the second data index, the target index can represent the difference processing effect between the first data index and the second data index. The difference processing effect of the test data can therefore be analyzed automatically from different data dimensions, a target data subgroup that meets the preset analysis index can be determined more accurately, and the data analysis efficiency is improved. Moreover, the decision tree can be output and displayed, so that the analysis process and the analysis result are visualized and convenient to view.
The data analysis method provided by the present disclosure is explained below by another exemplary embodiment.
Referring to fig. 3, the data analysis method may include:
Step 1: obtaining test data;
Test data are acquired and, to ensure the accuracy of the test, randomly divided into experimental group data and control group data. The corresponding first data index, second data index, first difference data and second difference data are determined according to the preset analysis index and the analysis dimensions.
Step 2: data cleaning;
It should be appreciated that the test data may contain some invalid test data. To prevent the invalid test data from affecting the accuracy of the test, the test data may be cleaned to remove it. How the test data is cleaned may be decided according to the actual situation. Illustratively, null values in the test data may be deleted, null values in the test data may be assigned a value, and so on, which is not limited in any way by the embodiments of the present disclosure. In addition, to make it easy to distinguish between different types of input indexes, input indexes of the same type may be named uniformly; the naming scheme may be set according to the actual situation, and the embodiment of the disclosure does not limit it. For example, the control group data corresponding to the first sub-data index may be named observed_value, the control group data corresponding to the second sub-data index may be named control_value, the first difference data may be named observed_value_molecular, and the second difference data may be named control_value_molecular.
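The two null-handling options mentioned above, deleting or assigning a value, can be sketched with plain dictionaries. The function name `clean_rows` and the parameter `fill_value` are hypothetical; the field names `observed_value` and `control_value` follow the naming example in the text.

```python
def clean_rows(rows, required_fields, fill_value=None):
    """Drop rows with null required fields, or fill the nulls instead.

    With fill_value=None, any row missing a required field is removed.
    Otherwise the missing fields are set to fill_value and the row is kept.
    """
    cleaned = []
    for row in rows:
        missing = [f for f in required_fields if row.get(f) is None]
        if missing and fill_value is None:
            continue  # delete the row that contains null values
        fixed = dict(row)  # copy so the input rows are left untouched
        for field in missing:
            fixed[field] = fill_value
        cleaned.append(fixed)
    return cleaned


# Illustrative rows only.
rows = [
    {"observed_value": 1.0, "control_value": 2.0},
    {"observed_value": None, "control_value": 3.0},
]
fields = ["observed_value", "control_value"]
print(clean_rows(rows, fields))
print(clean_rows(rows, fields, fill_value=0.0))
```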
Step 3: configuring decision tree parameters;
The decision tree parameters may include the maximum depth of the decision tree, the stop condition, the analysis index, and the like, and the values of the decision tree parameters may be set according to the actual situation, which is not limited in the embodiments of the present disclosure. In a possible implementation, the maximum depth of the decision tree may be set to 4, the stop condition may be set to an interpretation degree equal to 0.05, and the preset analysis index may be set so that the target data and the target index are in a positive correlation.
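The example parameter values can be gathered into a small configuration object, as in the sketch below. The class name `TreeConfig`, its field names and the helper `should_stop` are hypothetical. Reading the 0.05 value as a lower bound on the interpretation degree, so that splitting stops once a node falls below it, is an assumption based on the stop conditions listed earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TreeConfig:
    """Decision tree parameters, defaulting to the example values in the text."""
    max_depth: int = 4
    min_interpretation: float = 0.05  # assumed to act as a lower bound
    positive_correlation: bool = True


def should_stop(config, depth, interpretation):
    """Stop when the depth limit is reached or the node explains too little."""
    return depth >= config.max_depth or interpretation < config.min_interpretation


config = TreeConfig()
print(config)
print(should_stop(config, depth=2, interpretation=0.30))
print(should_stop(config, depth=4, interpretation=0.30))
```

Making the dataclass frozen keeps a configuration from being changed by accident while a tree is being built.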
Step 4: classifying;
It should be appreciated that any piece of test data may include multiple data dimensions, each data dimension may contain multiple dimension elements, and the number of dimension elements contained in different data dimensions may be the same or different. To avoid analysis errors caused by different data dimensions containing different numbers of dimension elements, the test data can be subjected to two-class processing so that the number of dimension elements in each data dimension is consistent. In addition, according to the data type of the data elements in a data dimension, data dimensions can be divided into two types, qualitative dimensions and quantitative dimensions. When the two-class processing is performed, the type of the data dimension can therefore be judged first, and a different classification processing mode can then be adopted for each data dimension type. For the classification modes used for the different data dimension types, reference may be made to the related description above, which is not repeated in the embodiments of the present disclosure.
After the test data in different data dimensions are processed into the two-class data, the distribution difference degree and the class interpretation degree between each class of data can be determined according to the two-class data, and then the target data subgroup can be determined in the test data according to the distribution difference degree and the class interpretation degree between each class of data.
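For illustration, a two-class split of a quantitative dimension, together with the share of difference data that falls into each class (the quantity the disclosure calls the category interpretation degree), might look like the sketch below. The field name diff_first, the "less than or equal to threshold" split rule, and both function names are assumptions of this sketch.

```python
from typing import Dict, List, Tuple

Row = Dict[str, float]


def split_quantitative(rows: List[Row], dimension: str, threshold: float) -> Tuple[List[Row], List[Row]]:
    """Split rows into two classes on a quantitative dimension (assumed rule: <= vs >)."""
    left = [row for row in rows if row[dimension] <= threshold]
    right = [row for row in rows if row[dimension] > threshold]
    return left, right


def share_of_difference(part: List[Row], whole: List[Row], field: str) -> float:
    """Proportion of the difference data in `whole` that is contributed by `part`."""
    total = sum(row[field] for row in whole)
    if total == 0:
        return 0.0  # avoid division by zero when there is no overall difference
    return sum(row[field] for row in part) / total
```

Here share_of_difference(left, rows, "diff_first") would give the share of the first difference data attributable to one class.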
Step 5: constructing a decision tree;
It should be understood that a decision tree represents the result of data classification in a tree structure, so that the classification result can be seen intuitively from the decision tree. Therefore, in order to intuitively display and determine the target data subgroup in the test data, the embodiments of the present disclosure may further use the calculation result of the test data under the target index as a root node, use each class of data corresponding to the target data dimension as a child node of the root node, generate a target data decision tree based on the decision tree parameters configured in step 3, and output the target data decision tree.
The specific embodiments of the above steps are illustrated in detail above, and will not be repeated here. It should be further understood that for the purposes of simplicity of explanation of the above method embodiments, all of them are depicted as a series of acts in combination, but it should be appreciated by those skilled in the art that the present disclosure is not limited by the order of acts described above. Further, it should also be appreciated by those skilled in the art that the embodiments described above are preferred embodiments and that the steps involved are not necessarily required by the present disclosure.
With the above method, the different data dimensions of the test data can be analyzed automatically, so that a target data subgroup meeting the preset analysis index can be accurately identified from the different data dimensions. Moreover, a decision tree can be output to visualize the analysis process and the analysis result, making both convenient to view.
Based on the same concept, the embodiments of the present disclosure also provide a data analysis apparatus, as shown in fig. 4, the data analysis apparatus 400 may include:
the obtaining module 401 is configured to obtain test data and a preset analysis index for a service system, where the preset analysis index is used to characterize a relationship between target data expected to be obtained from the test data and a target index, the test data includes experimental group data and control group data, the target index is obtained by dividing a first data index by a second data index, the first data index is obtained based on first difference data of the experimental group data and the control group data under a first sub-data index, and the second data index is obtained based on second difference data of the experimental group data and the control group data under a second sub-data index;
a processing module 402, configured to process the test data into two-class data in different data dimensions;
a first determining module 403, configured to determine a distribution difference degree between the classes of data in each set of two-class data, and to determine, according to the preset analysis index, a category interpretation degree of each class of data in each set of two-class data, where the category interpretation degree is used to characterize the data proportion, in the test data, of the first difference data or the second difference data corresponding to each class of data;
a second determining module 404, configured to determine a target data subgroup in the test data according to the distribution difference degree and the category interpretation degree.
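The target index handled by the obtaining module 401 is a ratio of two difference-based indexes. A minimal sketch is given below, assuming that each data index is simply the experimental-group total minus the control-group total under the corresponding sub-data index; the disclosure only says each index is "obtained based on" the difference data, so this aggregation and the function names are assumptions.

```python
from typing import Optional, Sequence


def difference(experimental: Sequence[float], control: Sequence[float]) -> float:
    """Assumed difference data: experimental-group total minus control-group total."""
    return sum(experimental) - sum(control)


def target_index(
    exp_first: Sequence[float],
    ctl_first: Sequence[float],
    exp_second: Sequence[float],
    ctl_second: Sequence[float],
) -> Optional[float]:
    """First data index divided by second data index; None if the denominator is zero."""
    numerator = difference(exp_first, ctl_first)      # first data index
    denominator = difference(exp_second, ctl_second)  # second data index
    if denominator == 0:
        return None
    return numerator / denominator
```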
In a possible implementation, the processing module 402 may include:
a first determining sub-module, configured to determine, for each data dimension corresponding to the test data, the type of the data dimension, and to determine target division information corresponding to the data dimension according to the type of the data dimension;
and a processing sub-module, configured to process the test data into two-class data in different data dimensions according to the target division information.
In a possible embodiment, the first determining sub-module may include:
the first determining unit is used for determining candidate partition information corresponding to the data dimension according to the type of the data dimension;
the second determining unit is used for determining the distribution difference degree between each type of data after the test data are divided according to the candidate division information;
and a third determining unit configured to determine target division information among the candidate division information according to a degree of distribution difference between each type of data after division according to the candidate division information.
In a possible embodiment, the first determining unit may include:
and the first determination subunit is used for respectively taking each dimension element in the data dimension as partition information when the type of the data dimension is a quantitative dimension type, and obtaining candidate partition information corresponding to the data dimension.
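For a quantitative dimension, taking each dimension element as partition information could be sketched as below. Treating each distinct value as a "less than or equal to" threshold, and discarding a threshold that would leave one class empty, are assumptions of this sketch rather than requirements stated in the disclosure.

```python
from typing import Iterable, List


def quantitative_candidates(values: Iterable[float]) -> List[float]:
    """Each distinct dimension element becomes one candidate partition point."""
    return sorted(set(values))


def usable_candidates(values: Iterable[float]) -> List[float]:
    """Keep only thresholds t for which both classes (<= t and > t) are non-empty."""
    vals = list(values)
    return [t for t in quantitative_candidates(vals) if any(v > t for v in vals)]
```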
In a possible embodiment, the first determining unit may include:
the sorting subunit is configured to determine, according to the preset analysis index, an element interpretation degree of data of each dimension element in the data dimension in the test data when the type of the data dimension is a qualitative dimension type, and sort the dimension elements in the data dimension according to the element interpretation degree, so as to obtain a sorted dimension element, where the element interpretation degree is used to characterize a data ratio of the first difference data or the second difference data corresponding to the dimension element in the test data;
and a second determining subunit, configured to determine partition information from every two dimension elements of the sorted dimension elements in turn, to obtain candidate partition information corresponding to the data dimension.
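For a qualitative dimension, one possible reading of the two subunits above is sketched here: compute each element's share of the difference data, sort the elements by that share, and place one cut between every two adjacent elements of the sorted list. The descending sort order, the "cut between neighbours" interpretation, and the field and function names are all assumptions of this sketch.

```python
from typing import Dict, List, Tuple


def element_interpretation(rows: List[dict], dimension: str, field: str) -> Dict[str, float]:
    """Share of the difference data in `field` contributed by each dimension element."""
    sums: Dict[str, float] = {}
    for row in rows:
        sums[row[dimension]] = sums.get(row[dimension], 0.0) + row[field]
    total = sum(sums.values())
    if total == 0:
        return {element: 0.0 for element in sums}
    return {element: value / total for element, value in sums.items()}


def qualitative_candidates(shares: Dict[str, float]) -> List[Tuple[List[str], List[str]]]:
    """Sort elements by share (descending) and cut between each adjacent pair."""
    ordered = sorted(shares, key=lambda element: shares[element], reverse=True)
    return [(ordered[:i], ordered[i:]) for i in range(1, len(ordered))]
```

Each returned pair is one candidate two-class division: the elements in the first list form one class and the remaining elements form the other.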
In a possible embodiment, the data analysis device 400 may further include:
a third determining module, configured to determine an index difference value corresponding to the dimension element according to the preset analysis index, the data proportion, in the test data, of the first difference data corresponding to the dimension element, and the data proportion, in the test data, of the second difference data corresponding to the dimension element;
the screening module is used for screening the dimension elements in the data dimension according to the index difference value and the element interpretation degree to obtain target dimension elements;
accordingly, the ordering subunit may be configured to order the target dimension elements according to the element interpretation degree.
In a possible embodiment, the second determining unit may include:
a third determining subunit, configured to determine, after the test data is divided according to the candidate division information, a numerator proportion, in the test data, of the first difference data corresponding to each class of data, and a denominator proportion, in the test data, of the second difference data corresponding to each class of data;
a conversion subunit, configured to convert the numerator proportion and the denominator proportion according to their signs so that both are positive numbers, to obtain a target numerator proportion and a target denominator proportion;
and a fourth determining subunit, configured to determine the distribution difference degree between the classes of data according to the target numerator proportion and the target denominator proportion.
In a possible implementation manner, the first determining module 403 may include:
a second determining sub-module, configured to determine the Jensen-Shannon (JS) divergence between the classes of data in each set of two-class data;
and a third determining sub-module, configured to determine the distribution difference degree between the classes of data in each set of two-class data according to the JS divergence.
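For reference, the JS divergence between two discrete distributions can be computed as in the sketch below. It uses the standard definition with base-2 logarithms, so the result lies between 0 and 1. How the two distributions are built from the classes of data is not shown here; the inputs are simply assumed to be non-negative lists of equal length that each sum to 1.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence: average KL divergence of p and q from their mixture."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)
```

Because the mixture is positive wherever either input is positive, the logarithm is always well defined, which is one reason a symmetric measure of this kind is convenient for comparing the classes.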
In a possible implementation manner, the second determining module 404 may include:
a fourth determining sub-module, configured to determine a target data dimension whose corresponding distribution difference degree is the largest and whose corresponding category interpretation degree is greater than or equal to a preset threshold;
the decision tree generation sub-module is used for taking a calculation result of the test data under the target index as a root node, and taking each type of data corresponding to the target data dimension as a sub-node of the root node to generate a target data decision tree;
and a fifth determining sub-module, configured to determine a target data subgroup in the test data according to the target data decision tree.
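The selection performed by the fourth determining sub-module might be sketched as follows. The sketch assumes the threshold is applied first and the largest distribution difference degree is then taken among the dimensions that pass; the function name and the input layout (one pair of scores per dimension) are likewise assumptions.

```python
from typing import Dict, Optional, Tuple


def pick_target_dimension(
    scores: Dict[str, Tuple[float, float]], threshold: float
) -> Optional[str]:
    """Return the dimension with the largest distribution difference degree
    among those whose category interpretation degree is >= threshold.

    `scores` maps dimension name -> (distribution_difference, category_interpretation).
    Returns None when no dimension meets the threshold.
    """
    eligible = {dim: pair for dim, pair in scores.items() if pair[1] >= threshold}
    if not eligible:
        return None
    return max(eligible, key=lambda dim: eligible[dim][0])
```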
In a possible implementation, the decision tree generation sub-module may include:
a decision tree generating unit, configured to take the calculation result of the test data under the target index as a root node, take each class of data corresponding to the target data dimension as a child node of the root node, and repeatedly execute the following steps until a preset stop condition is reached, thereby obtaining a target data decision tree: taking the child node as a parent node and the data corresponding to the child node as target data; processing the target data into target two-class data in different data dimensions; determining a target distribution difference degree between the classes of data in each set of target two-class data; determining, according to the preset analysis index, a target category interpretation degree of each class of data in the target data; determining a new target data dimension whose corresponding target distribution difference degree is the largest and whose corresponding target category interpretation degree is greater than or equal to a preset threshold; and taking each class of data corresponding to the new target data dimension as a child node of the parent node.
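The repeated expansion described for the decision tree generating unit can be sketched as a recursion. In this sketch the scoring logic is abstracted behind a caller-supplied choose_split function (hypothetical), and the stop condition is reduced to a maximum depth plus "no qualifying dimension"; the Node class and its fields are assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Rows = List[dict]
# A splitter returns (dimension, class_one_rows, class_two_rows),
# or None when no data dimension qualifies for a further split.
Splitter = Callable[[Rows], Optional[Tuple[str, Rows, Rows]]]


@dataclass
class Node:
    rows: Rows                      # data subgroup held by this node
    depth: int                      # 0 for the root node
    split_on: Optional[str] = None  # target data dimension chosen at this node
    children: List["Node"] = field(default_factory=list)


def grow(rows: Rows, choose_split: Splitter, max_depth: int, depth: int = 0) -> Node:
    """Recursively turn each class of data into a child node until a stop condition."""
    node = Node(rows=rows, depth=depth)
    if depth >= max_depth:
        return node  # stop condition: depth limit reached
    choice = choose_split(rows)
    if choice is None:
        return node  # stop condition: no dimension qualifies
    dimension, first, second = choice
    node.split_on = dimension
    node.children = [grow(part, choose_split, max_depth, depth + 1) for part in (first, second)]
    return node
```

Keeping the split selection outside grow means the same recursion can be reused with whatever combination of distribution difference degree and category interpretation degree an implementation adopts.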
In a possible embodiment, the data analysis device 400 may further include:
the fourth determining module is configured to determine, according to the data subgroup corresponding to each node in the target data decision tree, at least one of the following data indexes: the data ratio corresponding to the data subgroup, the distribution difference degree between the data subgroup and other data subgroups corresponding to the same-level nodes, the calculation result of the data subgroup under the target index, the first data index value corresponding to the data subgroup and the second data index value corresponding to the data subgroup;
And the output module is used for outputting and displaying the target data decision tree, wherein each node in the target data decision tree is associated with and displayed with the at least one data index.
Based on the same conception, the embodiments of the present disclosure also provide a computer readable medium having stored thereon a computer program which, when executed by a processing device, implements the steps of the above-described data analysis method.
Based on the same concept, the embodiments of the present disclosure also provide an electronic device including:
a storage device having a computer program stored thereon;
processing means for executing said computer program in said storage means to implement the steps of the data analysis method described above.
Referring now to fig. 5, a schematic diagram of an electronic device 500 suitable for use in implementing embodiments of the present disclosure is shown. The terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and the like, and stationary terminals such as digital TVs, desktop computers, and the like. The electronic device shown in fig. 5 is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments.
As shown in fig. 5, the electronic device 500 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 501, which may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 502 or a program loaded from a storage means 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data required for the operation of the electronic apparatus 500 are also stored. The processing device 501, the ROM 502, and the RAM 503 are connected to each other via a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
In general, the following devices may be connected to the I/O interface 505: input devices 506 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 507 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 508 including, for example, magnetic tape, hard disk, etc.; and communication means 509. The communication means 509 may allow the electronic device 500 to communicate with other devices wirelessly or by wire to exchange data. While fig. 5 shows an electronic device 500 having various means, it is to be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a non-transitory computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 509, or from the storage means 508, or from the ROM 502. The above-described functions defined in the methods of the embodiments of the present disclosure are performed when the computer program is executed by the processing device 501.
It should be noted that the computer readable medium described in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. 
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
In some implementations, communication may use any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.
The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire test data and a preset analysis index for a service system, wherein the preset analysis index is used to characterize a relationship between target data expected to be obtained from the test data and a target index, the test data comprises experimental group data and control group data, the target index is obtained by dividing a first data index by a second data index, the first data index is obtained based on first difference data of the experimental group data and the control group data under a first sub-data index, and the second data index is obtained based on second difference data of the experimental group data and the control group data under a second sub-data index; process the test data into two-class data in different data dimensions; determine a distribution difference degree between the classes of data in each set of two-class data, and determine, according to the preset analysis index, a category interpretation degree of each class of data in each set of two-class data, wherein the category interpretation degree is used to characterize the data proportion, in the test data, of the first difference data or the second difference data corresponding to each class of data; and determine a target data subgroup in the test data according to the distribution difference degree and the category interpretation degree.
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, including, but not limited to, object oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented in software or hardware. The name of a module does not in some cases define the module itself.
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing description is only of the preferred embodiments of the present disclosure and an explanation of the technical principles employed. It will be appreciated by persons skilled in the art that the scope of the disclosure referred to herein is not limited to technical solutions formed by the specific combinations of the features described above, but also covers other technical solutions formed by any combination of the features described above or their equivalents without departing from the spirit of the disclosure, for example, technical solutions formed by mutually substituting the above features with technical features having similar functions disclosed in (but not limited to) the present disclosure.
Moreover, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims. The specific manner in which the various modules perform the operations in the apparatus of the above embodiments have been described in detail in connection with the embodiments of the method, and will not be described in detail herein.

Claims (14)

1. A method of data analysis, the method comprising:
acquiring test data and a preset analysis index for a service system, wherein the preset analysis index is used to characterize a relationship between target data expected to be obtained from the test data and a target index, the test data comprises experimental group data and control group data, the target index is obtained by dividing a first data index by a second data index, the first data index is obtained based on first difference data of the experimental group data and the control group data under a first sub-data index, and the second data index is obtained based on second difference data of the experimental group data and the control group data under a second sub-data index;
processing the test data into two-class data in different data dimensions;
determining a distribution difference degree between the classes of data in each set of two-class data, and determining, according to the preset analysis index, a category interpretation degree of each class of data in each set of two-class data, wherein the category interpretation degree is used to characterize the data proportion, in the test data, of the first difference data or the second difference data corresponding to each class of data;
and determining a target data subgroup in the test data according to the distribution difference degree and the category interpretation degree.
2. The method of claim 1, wherein the processing the test data into two-class data in different data dimensions comprises:
determining, for each data dimension corresponding to the test data, the type of the data dimension, and determining target division information corresponding to the data dimension according to the type of the data dimension;
and processing the test data into two-class data in different data dimensions according to the target division information.
3. The method according to claim 2, wherein determining the target division information corresponding to the data dimension according to the type of the data dimension includes:
According to the type of the data dimension, determining candidate partition information corresponding to the data dimension;
determining the distribution difference degree between each type of data after the test data are divided according to the candidate division information;
and determining target division information in the candidate division information according to the distribution difference degree between each type of data after division according to the candidate division information.
4. A method according to claim 3, wherein said determining candidate partition information corresponding to said data dimension according to the type of said data dimension comprises:
when the type of the data dimension is a quantitative dimension type, each dimension element in the data dimension is used as partition information to obtain candidate partition information corresponding to the data dimension.
5. A method according to claim 3, wherein said determining candidate partition information corresponding to said data dimension according to the type of said data dimension comprises:
when the type of the data dimension is a qualitative dimension type, determining an element interpretation degree of data of each dimension element in the data dimension in the test data according to the preset analysis index, and sorting the dimension elements in the data dimension according to the element interpretation degree to obtain sorted dimension elements, wherein the element interpretation degree is used for representing a data ratio of the first difference value data or the second difference value data corresponding to the dimension elements in the test data;
and determining partition information from every two dimension elements of the sorted dimension elements in turn, to obtain candidate partition information corresponding to the data dimension.
6. The method of claim 5, wherein the method further comprises:
determining an index difference value corresponding to the dimension element according to the preset analysis index, the data proportion of the first difference data corresponding to the dimension element in the test data and the data proportion of the second difference data corresponding to the dimension element in the test data;
screening the dimension elements in the data dimension according to the index difference value and the element interpretation degree to obtain target dimension elements;
the sorting the dimension elements in the data dimension according to the element interpretation degree comprises:
sorting the target dimension elements according to the element interpretation degree.
7. The method according to any one of claims 3-6, wherein determining a distribution difference between each type of data after the test data is divided according to the candidate division information includes:
determining, after the test data is divided according to the candidate division information, a numerator proportion, in the test data, of the first difference data corresponding to each class of data, and a denominator proportion, in the test data, of the second difference data corresponding to each class of data;
converting the numerator proportion and the denominator proportion according to their signs so that both are positive numbers, to obtain a target numerator proportion and a target denominator proportion;
and determining the distribution difference degree between the classes of data according to the target numerator proportion and the target denominator proportion.
8. The method of any one of claims 1-6, wherein said determining a degree of distribution difference between each of said two classes of data comprises:
determining the Jensen-Shannon (JS) divergence between the classes of data in each set of two-class data;
and determining the distribution difference degree between the classes of data in each set of two-class data according to the JS divergence.
9. The method of any of claims 1-6, wherein said determining a target data subgroup in the test data based on the distribution variability and the class interpretation, comprises:
determining a target data dimension of which the corresponding distribution difference degree is the largest and the corresponding category interpretation degree is greater than or equal to a preset threshold;
taking a calculation result of the test data under the target index as a root node, and taking each type of data corresponding to the target data dimension as a child node of the root node to generate a target data decision tree;
And determining a target data subgroup in the test data according to the target data decision tree.
10. The method of claim 9, wherein generating the target data decision tree using the calculation result of the test data under the target index as a root node and each type of data corresponding to the target data dimension as a child node of the root node comprises:
taking a calculation result of the test data under the target index as a root node, taking each type of data corresponding to the target data dimension as a child node of the root node, and repeatedly executing the following steps to obtain a target data decision tree:
using the child node as a parent node and the data corresponding to the child node as target data; processing the target data into target two-class data under different data dimensions; determining a target distribution difference degree between each type of data in each target two-class data; determining a target category interpretation degree of each type of data in the target data according to the preset analysis index; determining a new target data dimension whose corresponding target distribution difference degree is the largest and whose corresponding target category interpretation degree is greater than or equal to the preset threshold; and using each type of data corresponding to the new target data dimension as a child node of the parent node, until a preset stop condition is reached.
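The repeated step in this claim amounts to growing a tree recursively. The following sketch shows one possible shape of that loop under several assumptions that are not taken from the claim: the stop condition is a maximum depth, a dimension already used on a path is not reused below it, and the scoring function is supplied by the caller. `Node`, `build_tree` and `toy_score` are hypothetical names, and `toy_score` is only a placeholder for the real distribution difference and category interpretation computations.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    rows: list
    children: list = field(default_factory=list)


def build_tree(node, dimensions, score, threshold, max_depth, depth=0):
    """Split `node` on the best eligible dimension, then recurse into each child."""
    if depth >= max_depth:  # assumed stop condition
        return node
    best = None
    for dim in dimensions:
        difference, interpretation, parts = score(node.rows, dim)
        if interpretation >= threshold and (best is None or difference > best[0]):
            best = (difference, dim, parts)
    if best is None:  # no dimension qualifies: this node stays a leaf
        return node
    _, dim, parts = best
    remaining = [d for d in dimensions if d != dim]
    for value, rows in parts.items():
        child = Node(f"{dim}={value}", rows)
        node.children.append(
            build_tree(child, remaining, score, threshold, max_depth, depth + 1)
        )
    return node


def toy_score(rows, dim):
    """Placeholder scorer: group rows by value; only two-class splits are eligible."""
    parts = {}
    for row in rows:
        parts.setdefault(row[dim], []).append(row)
    if len(parts) != 2:
        return 0.0, 0.0, parts
    a, b = (len(group) for group in parts.values())
    return abs(a - b) / len(rows), 1.0, parts


rows = [
    {"region": "north", "platform": "ios"},
    {"region": "north", "platform": "android"},
    {"region": "north", "platform": "ios"},
    {"region": "south", "platform": "android"},
]
root = build_tree(Node("all", rows), ["region", "platform"], toy_score,
                  threshold=0.5, max_depth=2)
```

In this toy run the root splits on `region`, the `north` child splits again on `platform`, and the `south` child remains a leaf because its data no longer forms two classes.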
11. The method according to claim 9, wherein the method further comprises:
according to the data subgroup corresponding to each node in the target data decision tree, determining at least one of the following data indexes: the data ratio corresponding to the data subgroup, the distribution difference degree between the data subgroup and other data subgroups corresponding to the same-level nodes, the calculation result of the data subgroup under the target index, the first data index value corresponding to the data subgroup and the second data index value corresponding to the data subgroup;
and outputting and displaying the target data decision tree, wherein each node in the target data decision tree is associated and displayed with the at least one data index.
12. A data analysis device, comprising:
an acquisition module, configured to acquire test data and a preset analysis index for a service system, wherein the preset analysis index is used for representing the relation between target data expected to be obtained from the test data and a target index, the test data comprise experimental group data and comparison group data, the target index is obtained by dividing a first data index by a second data index, the first data index is obtained based on first difference data of the experimental group data and the comparison group data under a first sub-data index, and the second data index is obtained based on second difference data of the experimental group data and the comparison group data under a second sub-data index;
a processing module, configured to process the test data into two-class data under different data dimensions;
a first determining module, configured to determine the distribution difference degree between each type of data in each two-class data, and to determine the category interpretation degree of each type of data in each two-class data according to the preset analysis index, wherein the category interpretation degree is used for representing the data proportion, in the test data, of the first difference data or the second difference data corresponding to each type of data;
and a second determining module, configured to determine a target data subgroup in the test data according to the distribution difference degree and the category interpretation degree.
13. A computer readable medium on which a computer program is stored, characterized in that the program, when being executed by a processing device, carries out the steps of the method according to any one of claims 1-11.
14. An electronic device, comprising:
a storage device having a computer program stored thereon;
processing means for executing said computer program in said storage means to carry out the steps of the method according to any one of claims 1-11.
CN202310764862.0A 2023-06-26 2023-06-26 Data analysis method, data analysis device, computer readable medium and electronic equipment Pending CN116756615A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310764862.0A CN116756615A (en) 2023-06-26 2023-06-26 Data analysis method, data analysis device, computer readable medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310764862.0A CN116756615A (en) 2023-06-26 2023-06-26 Data analysis method, data analysis device, computer readable medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN116756615A true CN116756615A (en) 2023-09-15

Family

ID=87947585

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310764862.0A Pending CN116756615A (en) 2023-06-26 2023-06-26 Data analysis method, data analysis device, computer readable medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN116756615A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117690499A (en) * 2023-12-08 2024-03-12 苏州腾迈医药科技有限公司 Molecular test prediction processing method and device

Similar Documents

Publication Publication Date Title
CN114422267B (en) Flow detection method, device, equipment and medium
CN110390493B (en) Task management method and device, storage medium and electronic equipment
CN115757400B (en) Data table processing method, device, electronic equipment and computer readable medium
CN112836128A (en) Information recommendation method, device, equipment and storage medium
CN116756615A (en) Data analysis method, data analysis device, computer readable medium and electronic equipment
CN112182317A (en) Index weight determination method and device, electronic equipment and medium
CN116796233A (en) Data analysis method, data analysis device, computer readable medium and electronic equipment
CN113392018B (en) Traffic distribution method and device, storage medium and electronic equipment
CN116756616A (en) Data processing method, device, computer readable medium and electronic equipment
CN115357350A (en) Task configuration method and device, electronic equipment and computer readable medium
CN113033707B (en) Video classification method and device, readable medium and electronic equipment
CN113220281A (en) Information generation method and device, terminal equipment and storage medium
CN116483891A (en) Information prediction method, device, equipment and storage medium
CN116245595A (en) Method, apparatus, electronic device and computer readable medium for transporting supply end article
CN112669816B (en) Model training method, voice recognition method, device, medium and equipment
CN116541421B (en) Address query information generation method and device, electronic equipment and computer medium
CN112463573A (en) Method, device, terminal and storage medium for testing application
CN116862118B (en) Carbon emission information generation method, device, electronic equipment and computer readable medium
CN117591048B (en) Task information processing method, device, electronic equipment and computer readable medium
CN117132245B (en) Method, device, equipment and readable medium for reorganizing online article acquisition business process
CN113077352B (en) Insurance service article recommending method based on user information and insurance related information
CN111400322B (en) Method, apparatus, electronic device and medium for storing data
CN118175056A (en) Communication network data checking method and device, electronic equipment and storage medium
CN116628045A (en) Task updating method, device, medium and electronic equipment
CN116010814A (en) Data set manufacturing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination