CN111522797B - Method and device for constructing business model based on business database - Google Patents

Publication number
CN111522797B
Authority
CN
China
Prior art keywords
feature
data table
data
service
service data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010343388.0A
Other languages
Chinese (zh)
Other versions
CN111522797A (en
Inventor
周庆岳
张卓
薛菲
蒋宛静
钱江
李楠
曹睿
谢文韬
黄柏
戴春博
赵坤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN202010343388.0A priority Critical patent/CN111522797B/en
Publication of CN111522797A publication Critical patent/CN111522797A/en
Application granted granted Critical
Publication of CN111522797B publication Critical patent/CN111522797B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Abstract

Embodiments of the present description provide a method and apparatus for building a business model based on a business database. In the method, a first service data table having an association with a service sample instance set is determined from service data tables of a service data pool. In addition, feature exploration is performed based on the first business data table to obtain a model feature set. Then, an automatic machine learning method is used to construct a target business model from the model feature set.

Description

Method and device for constructing business model based on business database
Technical Field
Embodiments of the present disclosure relate generally to the field of artificial intelligence, and more particularly, to a method and apparatus for building business models based on business databases.
Background
With the development of artificial intelligence technology, business models based on machine learning models are applied in more and more fields. In the machine learning field, business data and model features determine the upper limit of a machine learning model, and improvements in model architecture and model algorithms merely approach this upper limit. When a business model is built for a given business scenario, the process generally relies on the model-building experience of the model designer: the model feature set used by the business model is defined by studying the business scenario, and corresponding business sample data is collected for model training. A model feature set determined in this way may contain some model features that are not suitable for the business model, and using these features as model input may reduce the model effect of the business model. Alternatively, the model feature set may omit the model features most suitable for the business model, so that the model effect of the resulting business model is poor.
Disclosure of Invention
In view of the foregoing, embodiments of the present specification provide a method and apparatus for building a business model based on a business database. With the method and apparatus, a service data table having an association with the service sample instance set of the business model is searched for in a massive business database, service features are extracted from the service data table that is found, and feature exploration is performed on the extracted service features, so that the model feature set most suitable for the business model can be obtained from the massive business database. The business model is then constructed automatically from the model feature set by using an automatic machine learning algorithm, which can improve the model effect of the resulting business model and reduce labor cost.
According to an aspect of embodiments of the present specification, there is provided a method for building a business model based on a business database, the business database comprising a business data pool comprising at least one business data table, each business data table comprising at least one business data column, the method comprising: determining a first service data table with relevance with the service sample instance set from the service data tables of the service data pool; performing feature exploration based on the first business data table to obtain a model feature set; and constructing a target business model according to the model feature set by using an automatic machine learning method.
Optionally, in one example of the above aspect, determining a first service data table having an association with the service sample instance set from the service data tables of the service data pool may include: calculating Cartesian products between each service data table in the service data pool and the service sample instance set; determining the ratio between the data number of the service sample instance set and the data number of the Cartesian product of each service data table as the association degree between each service data table and the service sample instance set; and determining the service data table with the association degree larger than a first threshold value as the first service data table.
Optionally, in one example of the above aspect, determining a first service data table having an association with the service sample instance set from the service data tables of the service data pool may include: generating a bloom filter using the set of business sample instances; using bloom filters to perform data filtering on each service data table in the service data pool; determining the association degree between each service data table in the service data pool and the service sample instance set based on the data filtering result of each service data table; and determining the service data table with the association degree larger than a first threshold value as the first service data table.
Optionally, in one example of the above aspect, determining the association degree between each service data table in the service data pool and the service sample instance set based on the data filtering result of each service data table may include: for each service data table, determining the ratio of the number of data items after bloom filter filtering to the number of data items before bloom filter filtering as the association degree between the service data table and the service sample instance set.
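The bloom filter based association degree described above can be illustrated with a minimal sketch. The sketch assumes that each row of the service sample instance set and of a service data table is represented by a single join key (for example a user identifier); the text does not state which field is placed in the filter. The class `BloomFilter`, the function `association_degree`, and the parameters `num_bits` and `num_hashes` are hypothetical names chosen for illustration only.

```python
import hashlib


class BloomFilter:
    """A small Bloom filter that uses a Python integer as its bit array."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        # May report false positives, never false negatives.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def association_degree(table_keys, sample_keys):
    """Ratio of rows kept by the filter to rows before filtering."""
    bloom = BloomFilter()
    for key in sample_keys:
        bloom.add(key)
    if not table_keys:
        return 0.0
    kept = sum(1 for key in table_keys if key in bloom)
    return kept / len(table_keys)


# Hypothetical keys: two of the four table rows belong to the sample set.
sample_keys = ["u1", "u2", "u3"]
table_keys = ["u1", "u2", "u9", "u10"]
degree = association_degree(table_keys, sample_keys)
```

Because a bloom filter can return false positives, the ratio computed this way is an upper-biased estimate of the true overlap; a table would then be kept as a first service data table when the ratio exceeds the first threshold.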
Optionally, in one example of the above aspect, before using the bloom filter to perform data filtering on each service data table in the service data pool, the method may further include: performing data sampling on each service data table. In this case, using the bloom filter to perform data filtering on each service data table in the service data pool may include: using the bloom filter to perform data filtering on the data samples of each service data table.
Optionally, in one example of the above aspect, performing data sampling on each service data table may include: drawing a predetermined number of data samples from each service data table. In this case, determining the association degree between each service data table in the service data pool and the service sample instance set based on the data filtering result of each service data table may include: for each service data table, determining, for each data sample, the ratio of the number of data items after bloom filter filtering to the number of data items before bloom filter filtering as an association probability between the service data table corresponding to the data sample and the service sample instance set; determining a confidence interval of the obtained association probabilities at a given confidence level; and determining the lower limit of the confidence interval as the association degree between the service data table and the service sample instance set.
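The confidence interval step can be sketched as follows. The text does not say which interval construction is used, so this sketch assumes a two-sided normal-approximation interval on the mean of the per-sample ratios; the function name `lower_confidence_bound` and the example ratios are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def lower_confidence_bound(ratios, confidence=0.95):
    """Lower limit of a normal-approximation interval on the mean ratio."""
    if len(ratios) < 2:
        raise ValueError("need at least two sampled ratios")
    # z-value for a two-sided interval at the requested confidence level.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    center = mean(ratios)
    half_width = z * stdev(ratios) / len(ratios) ** 0.5
    # A ratio cannot be negative, so clamp the bound at zero.
    return max(0.0, center - half_width)


# Hypothetical association probabilities from five data samples of one table.
ratios = [0.62, 0.58, 0.65, 0.60, 0.61]
degree = lower_confidence_bound(ratios)
```

Taking the lower limit rather than the mean is a conservative choice: a table whose sampled ratios vary widely receives a lower association degree than a table with the same mean but stable ratios.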
Optionally, in one example of the above aspect, before determining the first service data table having an association with the service sample instance set from the service data tables of the service data pool, the method may further include: performing data exploration on each service data table in the service data pool to obtain meta information of each service data column in each service data table.
Optionally, in one example of the above aspect, the meta information may include at least one of the following information: statistical information, social attribute information, data type information, and information similarity.
Optionally, in one example of the above aspect, before determining the first service data table having an association with the service sample instance set from the service data tables of the service data pool, the method may further include: performing data reduction processing on each service data table based on the information similarity of each service data column.
Optionally, in one example of the above aspect, each feature has a feature weight, and performing feature exploration based on the first service data table to obtain a model feature set may include: the following iterative process is executed until the iteration end condition is satisfied: randomly selecting seed features of a current iterative process from a set of previous features based on feature weights, the set of previous features including original features and all derived features generated during a previous iterative process, the original features being determined during a first iteration based on the first business data table; performing feature derivatization on the seed features using feature derivatization rules to obtain derivatized features; and carrying out feature evaluation on the seed features and the derivative features to obtain feature ordering results of the features, determining the model feature set from the seed features and the derivative features according to the feature ordering results when the iteration ending condition is met, and adjusting feature weights of corresponding features in the previous feature set and the derivative features according to the feature ordering results when the iteration ending condition is not met.
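The iterative process above can be sketched loosely as follows. Many details are assumptions of this sketch rather than statements of the text: features are represented by name strings, the iteration end condition is a fixed number of iterations, new derived features start with weight 1.0, and the weight adjustment simply rewards the top-ranked features. The function `explore_features` and the callables `derive` and `evaluate` are hypothetical placeholders for the feature derivation rules and the feature evaluation.

```python
import random


def explore_features(original, derive, evaluate, max_iters=3, num_seeds=2,
                     top_k=3, seed=0):
    """Pick weighted seeds, derive new features, rank them, and reweight."""
    rng = random.Random(seed)
    weights = {name: 1.0 for name in original}  # the previous feature set
    ranked = list(original)
    for iteration in range(max_iters):
        names = list(weights)
        # Weighted random draw of seed features, then drop repeated draws.
        drawn = rng.choices(names, weights=[weights[n] for n in names],
                            k=num_seeds)
        seeds = list(dict.fromkeys(drawn))
        # Apply the derivation rules and keep only features not seen before.
        derived = list(dict.fromkeys(
            d for s in seeds for d in derive(s) if d not in weights))
        # Feature evaluation: order seeds and derived features by score.
        ranked = sorted(seeds + derived, key=evaluate, reverse=True)
        if iteration == max_iters - 1:
            break  # end condition met: choose the model feature set
        for name in derived:
            weights[name] = 1.0
        # Raise the weights of top-ranked features for later iterations.
        for rank, name in enumerate(ranked[:top_k]):
            weights[name] += 1.0 / (rank + 1)
    return ranked[:top_k]


def derive(name):
    # Hypothetical derivation rules: two transformations per seed feature.
    return [f"log({name})", f"sq({name})"]


def evaluate(name):
    # Placeholder score; a real evaluation would measure predictive value.
    return len(name)


model_features = explore_features(["age", "income"], derive, evaluate)
```

The weighted draw means that features which ranked well in earlier iterations are more likely to be chosen as seeds again, which concentrates derivation effort on promising branches while still leaving a chance for other features to be explored.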
Optionally, in one example of the above aspect, the feature ordering result may include a ranking value, each feature in the previous feature set forms a blood-edge relationship tree based on feature-derived blood-edge relationships, and adjusting feature weights of corresponding features in the previous feature set according to the feature ordering result may include: selecting a preset number of features which are ranked in front from the feature set of the current iterative process based on the ranking value as weight adjustment features; determining a weight adjustment value of each weight adjustment feature based on the ranking value of each weight adjustment feature; and adjusting the feature weights of the weight adjustment features and all the upstream node features according to the weight adjustment values of the weight adjustment features and the number of times of weight update of the upstream node features in the blood relationship tree.
Optionally, in one example of the above aspect, the feature derived rule may include at least one of the following rules: feature transformation, feature combination, and feature evolution.
Optionally, in one example of the above aspect, the iteration end condition may include: reaching a predetermined number of iterations; or the number of features meeting the requirements of the desired feature index reaches the desired number.
According to another aspect of embodiments of the present specification, there is provided an apparatus for building a business model based on a business database, the business database comprising a business data pool comprising at least one business data table, each business data table comprising at least one business data column, the apparatus comprising: an association data table determining unit for determining a first service data table having an association with the service sample instance set from the service data tables of the service data pool; the feature exploration unit is used for carrying out feature exploration based on the first business data table so as to obtain a model feature set; and a model construction unit for constructing a target business model by using an automatic machine learning method according to the model feature set.
Alternatively, in one example of the above aspect, the association data table determining unit may include: a Cartesian product calculation module for calculating Cartesian products between each service data table in the service data pool and the service sample instance set; the association degree determining module is used for determining the ratio between the data number of the service sample instance set and the data number of the Cartesian product of each service data table as the association degree between each service data table and the service sample instance set; and the association data table determining module is used for determining the service data table with the association degree larger than a first threshold value as the first service data table.
Alternatively, in one example of the above aspect, the association data table determining unit may include: a filter generation module that generates a bloom filter using the set of business sample instances; a data filtering module, which uses a bloom filter to perform data filtering on each service data table in the service data pool; the association degree determining module is used for determining association degrees between each service data table in the service data pool and the service sample instance set based on data filtering results of each service data table; and the association data table determining module is used for determining the service data table with the association degree larger than a first threshold value as the first service data table.
Optionally, in one example of the above aspect, the association determining module determines, for each service data table, a ratio of a number of data after bloom filter filtering to a number of data before bloom filter filtering as an association between the service data table and the service sample instance set.
Optionally, in one example of the above aspect, the association data table determining unit may further include: and the data sampling module is used for respectively carrying out data sampling on each service data table before using a bloom filter to carry out data filtering on each service data table in the service data pool, wherein the data filtering module is used for carrying out data filtering on the data samples of each service data table.
Optionally, in one example of the above aspect, the data sampling module performs a predetermined number of data samples on each service data table, and for each service data table, the association degree determining module determines, as an association probability between the service data table corresponding to the data sample and the service sample instance set, a ratio of the number of data after being filtered by the bloom filter to the number of data before being filtered by the bloom filter; determining a confidence interval of the obtained association probability under a given confidence range; and determining the lower limit of the confidence interval as the association degree between the service data table and the service sample instance set.
Optionally, in one example of the above aspect, the apparatus may further include: and the data exploration unit is used for carrying out data exploration on each service data table in the service data pool before determining a first service data table with relevance with the service sample instance set from the service data tables in the service data pool so as to obtain meta information of each service data column in each service data table.
Optionally, in one example of the above aspect, the meta information includes at least one of the following information: statistical information, social attribute information, data type information, and information similarity, the apparatus may further include: and the data simplifying unit is used for carrying out data simplifying processing on each service data table based on the information similarity of each service data column.
Optionally, in one example of the above aspect, each feature has a feature weight, and the feature exploration unit may include: a feature selection module that randomly selects seed features of a current iterative process from a set of previous features based on feature weights, the set of previous features including original features and all derived features generated in the previous iterative process, the original features determined based on the first business data table; a feature derivation module for deriving features of the seed features using feature derivation rules to obtain derived features; the feature evaluation module is used for performing feature evaluation on the seed features and the derivative features to obtain feature ordering results of the features; the feature weight adjustment module is used for adjusting the feature weight of the corresponding feature in the previous feature set and the current derivative feature according to the feature sequencing result when the iteration ending condition is not met; and the model feature set determining module is used for determining the model feature set from the seed features and the derivative features according to the feature sequencing result when the iteration ending condition is met, and the feature selecting module, the feature derivative module, the feature evaluating module and the feature weight adjusting module are used for performing iteration until the iteration ending condition is met.
Optionally, in one example of the above aspect, the feature ordering result includes a ranking value, each feature in the preceding feature set forms a blood-edge relationship tree based on feature-derived blood-edge relationships, and the feature weight adjustment module: selecting a preset number of features which are ranked in front from the feature set of the current iterative process based on the ranking value as weight adjustment features; determining a weight adjustment value of each weight adjustment feature based on the ranking value of each weight adjustment feature; and adjusting the feature weights of the weight adjustment features and all the upstream node features according to the weight adjustment values of the weight adjustment features and the number of times of weight update of the upstream node features in the blood relationship tree.
According to another aspect of embodiments of the present specification, there is provided an electronic device including: at least one processor, and a memory coupled to the at least one processor, the memory storing instructions that, when executed by the at least one processor, cause the at least one processor to perform the method for building a business model based on a business database as described above.
According to another aspect of embodiments of the present description, there is provided a machine-readable storage medium storing executable instructions that, when executed, cause the machine to perform a method for building a business model based on a business database as described above.
Drawings
A further understanding of the nature and advantages of the present description may be realized by reference to the following drawings. In the drawings, similar components or features may have the same reference numerals.
Fig. 1 shows a flowchart of one example of a method for building a business model based on a business database according to an embodiment of the present description.
Fig. 2 shows an example schematic diagram of a business database according to an embodiment of the present description.
Fig. 3 shows a flowchart of a statistical information-based data type determining method according to an embodiment of the present specification.
Fig. 4 illustrates a flowchart of one example of a method for determining a first business data table from a business data pool that has an association with a set of business sample instances according to an embodiment of the present disclosure.
Fig. 5 shows a flowchart of another example of a method for determining a first business data table having an association with a business sample instance set from a business data pool according to an embodiment of the present specification.
Fig. 6 shows a flowchart of one example of a process for determining the degree of association between a service data table and a set of service sample instances based on a bloom filter implementation in accordance with an embodiment of the present disclosure.
Fig. 7 shows a flowchart of one example of a method for feature exploration based on a business data table according to an embodiment of the present description.
Fig. 8 shows an example schematic diagram of a blood relationship tree according to an embodiment of the present description.
Fig. 9 shows an example flowchart of a feature evaluation process according to an embodiment of the present specification.
Fig. 10 shows an example flowchart of a feature weight adjustment process according to an embodiment of the present description.
Fig. 11 shows an example schematic diagram of automatic machine learning according to an embodiment of the present description.
Fig. 12 shows a block diagram of an apparatus for building a business model based on a business database according to an embodiment of the present specification.
Fig. 13 shows a block diagram of one example of the association data table determining unit according to the embodiment of the present specification.
Fig. 14 shows a block diagram of another example of the association data table determining unit according to the embodiment of the present specification.
Fig. 15 shows a block diagram of one example of a feature exploration unit according to an embodiment of the present specification.
Fig. 16 shows a schematic diagram of an electronic device for building a business model based on a business database according to an embodiment of the present description.
Detailed Description
The subject matter described herein will now be discussed with reference to example embodiments. It should be appreciated that these embodiments are discussed only to enable a person skilled in the art to better understand and thereby practice the subject matter described herein, and are not limiting of the scope, applicability, or examples set forth in the claims. Changes may be made in the function and arrangement of elements discussed without departing from the scope of the disclosure as set forth in the specification. Various examples may omit, replace, or add various procedures or components as desired. For example, the described methods may be performed in a different order than described, and various steps may be added, omitted, or combined. In addition, features described with respect to some examples may be combined in other examples as well.
As used herein, the term "comprising" and variations thereof are open-ended terms, meaning "including, but not limited to". The term "based on" means "based at least in part on". The terms "one embodiment" and "an embodiment" mean "at least one embodiment". The term "another embodiment" means "at least one other embodiment". The terms "first," "second," and the like, may refer to different or the same object. Other definitions, whether explicit or implicit, may be included below. Unless the context clearly indicates otherwise, the definition of a term is consistent throughout this specification.
In the machine learning field, business data and model features determine the upper limit of a machine learning model, and improvements in model architecture and model algorithms merely approach this upper limit. When a business model is built for a given business scenario, the process generally relies on the model-building experience of the model designer: the model feature set used by the business model is defined by studying the business scenario, and corresponding business sample data is collected for model training. A model feature set determined in this way may contain some model features that are not suitable for the business model, or may omit the model features most suitable for the business model, resulting in poor model performance of the resulting business model.
In order to solve the above-described problems, embodiments of the present specification provide a scheme for building a business model based on a business database. In this scheme, a service data table having an association with the service sample instance set of the business model is determined from a massive business database, service features are extracted from the determined service data table, and feature exploration is performed on the extracted service features, so that the model feature set most suitable for the business model can be obtained automatically from the massive business database. The business model is then constructed automatically from the model feature set by using an automatic machine learning algorithm, which can improve the model effect of the resulting business model while reducing labor cost.
In this specification, the term "business sample instance" may refer to a sample instance having feature data for some or all of the model feature dimensions in the model feature set required for a business model. The model feature set may be, for example, planned in advance at the time of model design, or determined in other suitable manners. For example, assume that the model feature set required for the business model includes features 1 through 6. The service sample instance may include feature data corresponding to features 1 to 3, and may also include feature data corresponding to a part of features 1 to 3. For example, assuming that the business model requires the age, gender, and city of residence of the user, the business sample instance may include feature data related to the age, gender, and city of residence, or may include feature data related to any one or two of the age, gender, and city of residence, such as feature data related to the city of residence. The sample instance set may be actively created by the model builder or locally collected.
A method and apparatus for constructing a business model based on a business database according to embodiments of the present specification will be described below with reference to the accompanying drawings.
Fig. 1 shows a flowchart of one example of a method for building a business model based on a business database according to an embodiment of the present description. In this specification, the terms "business database" and "business data pool" are used interchangeably.
As shown in Fig. 1, at block 110, data exploration is performed on the service data tables in the service data pool to obtain meta information of each service data column in each service data table. In embodiments of the present description, the service data pool may be service data collected by respective service systems, such as user transaction data, user identity information data, user customer service data, and the like. The service data pool is composed of a plurality of service data tables. Each service data table is composed of a plurality of service data columns. A service data column describes one dimension of an entity, for example the age or annual income of a user.
Fig. 2 shows an example schematic diagram of a business database 200 according to an embodiment of the present description. As shown in fig. 2, the service database 200 includes service data tables 210, 220, and 230. Each business data table may include a plurality of business data columns, for example, business data columns 211 with names of persons, ages, sexes, and cities, respectively.
The meta information may include statistical information, social attribute information, data type information, and information similarity. The statistical information may include, for example, variance, mean, null rate, etc. In the present specification, the null rate means a ratio of the number of data lines whose element value is zero in one traffic data column to the total number of data lines in the traffic data column. The social attribute information may include, for example, a document number, a cell phone number, a mail address, and the like.
The data type information may include continuous data and discrete data. If the value range of the data in a service data column cannot be enumerated and is a continuous interval on the number axis, the data of the service data column is continuous data. If the value range of the data in a service data column can be enumerated, so that the entire value range is a discrete set, the data of the service data column is discrete data, such as year, age, etc.
The information similarity indicates how similar the information in different service data columns is. In this specification, Jaccard similarity is used to measure the information similarity between two data columns. If the two data columns are regarded as two sets, the Jaccard similarity is the ratio of the number of elements in the intersection of the two sets to the number of elements in their union. Considering that the data sets are usually relatively large, a MinHash algorithm may also be used to estimate the Jaccard similarity.
In the embodiments of the present description, data exploration for the service data table may be accomplished by statistical analysis after data sampling.
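The two similarity computations mentioned above can be sketched as follows. The exact Jaccard similarity is computed on sets of distinct column values, and the MinHash estimate keeps, for each of a number of seeded hash functions, the minimum hash value over the column. The function names, the use of SHA-256 as the hash family, and the signature length of 128 are assumptions made for illustration.

```python
import hashlib


def jaccard(col_a, col_b):
    """Exact Jaccard similarity: intersection size over union size."""
    a, b = set(col_a), set(col_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _hash(value, seed):
    # One member of a seeded hash family built on SHA-256.
    digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(column, num_hashes=128):
    """For each seeded hash function, keep the minimum hash over the column."""
    values = set(column)  # must be non-empty
    return [min(_hash(v, seed) for v in values) for seed in range(num_hashes)]


def minhash_similarity(sig_a, sig_b):
    """The fraction of matching positions estimates the Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


# Hypothetical city columns sharing three of five distinct values.
col_a = ["Beijing", "Shanghai", "Hangzhou", "Shenzhen"]
col_b = ["Hangzhou", "Shenzhen", "Chengdu", "Beijing"]
exact = jaccard(col_a, col_b)
sig_a = minhash_signature(col_a)
sig_b = minhash_signature(col_b)
estimate = minhash_similarity(sig_a, sig_b)
```

The signature has a fixed length regardless of column size, so signatures can be computed once per column and compared pairwise without revisiting the underlying data, which is the reason MinHash suits large data sets.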
Fig. 3 shows a flowchart of a statistical information-based data type determining method according to an embodiment of the present specification.
As shown in fig. 3, at block 310, it is determined whether the data of the service data column is string type data. If it is string type data, then at block 370, it is determined that the data of the service data column is discrete data.
If it is not string type data, at block 320, the data of the service data column is sampled. At block 330, statistics of the sampled data are calculated. The statistics may include: the number of data set rows Count, the number of elements of the value range set distinct_count, the maximum-value row count max_val_count, and the minimum-value row count min_val_count. Count is the number of data rows in the service data column. distinct_count is the number of distinct element values in the service data column. For example, a data set of 100 rows may contain 50 distinct elements, in which case some rows share the same element value. max_val_count is the number of data rows having the maximum element value. min_val_count is the number of data rows having the minimum element value.
At block 340, a repetition index R = (Count - max_val_count - min_val_count) / (distinct_count - 2) is determined based on the statistics of the sampled data. The repetition index R indicates the degree of repetition of elements in the service data column.
At block 350, a determination is made as to whether the repetition index R is greater than N. Here, N is a parameter set in advance, and can be adjusted according to the actual scenario. If R is greater than N, the data is determined to be discrete data. If R is not greater than N, the data is determined to be continuous data.
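The decision flow of fig. 3 can be sketched roughly as below. This is a minimal sketch under stated assumptions: the function name `infer_data_type`, the default threshold `n_threshold=5.0`, and the handling of columns with two or fewer distinct values (where the denominator of R would be zero or negative) are not specified in the text and are chosen here for illustration. Sampling (block 320) is left to the caller.

```python
from collections import Counter


def infer_data_type(values, n_threshold=5.0):
    """Classify a (sampled) column as 'discrete' or 'continuous'."""
    # Block 310/370: string data is treated as discrete.
    if any(isinstance(v, str) for v in values):
        return "discrete"

    # Block 330: statistics of the sampled data.
    counts = Counter(values)
    count = len(values)                   # Count
    distinct_count = len(counts)          # distinct_count
    if distinct_count <= 2:
        # Assumption: R is undefined here; so few distinct values
        # are treated as discrete.
        return "discrete"
    max_val_count = counts[max(counts)]   # rows holding the maximum value
    min_val_count = counts[min(counts)]   # rows holding the minimum value

    # Block 340: repetition index R.
    r = (count - max_val_count - min_val_count) / (distinct_count - 2)

    # Block 350: compare R with the preset parameter N.
    return "discrete" if r > n_threshold else "continuous"
```

For instance, a column that repeats four values ten times each gives R = (40 - 10 - 10) / (4 - 2) = 10, which exceeds the assumed N = 5, while a column of 40 distinct measurements gives R = 1.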
It is noted that in the example shown in fig. 3, the data in the service data column is sampled at block 320. Data sampling reduces the amount of data processed during data type determination and thus improves its efficiency, which makes the method particularly suitable for massive service databases. In other examples of the present specification, the operation of block 320 may be omitted, for example when the amount of data in the service data column is small or when the execution subject of the data type determination process has sufficient computing power or computing resources.
Returning to fig. 1, after data exploration of the service data tables in the service data pool as above, at block 120, a first service data table having an association with the service sample instance set is determined from the service data pool.
Fig. 4 illustrates a flowchart of one example of a method for determining a first business data table from a business data pool that has an association with a set of business sample instances according to an embodiment of the present disclosure.
As shown in FIG. 4, at block 410, cartesian products between each business data table in the business data pool and the business sample instance set are calculated. The calculation of the Cartesian product between the service data table and the service sample instance set may be implemented using any suitable calculation method known in the art and will not be described in detail herein.
At block 420, the ratio between the number of data items in the business sample instance set and the number of data items in the Cartesian product of each business data table is determined as the degree of association between that business data table and the business sample instance set. For example, assuming that the number of data items in the business sample instance set is $M$ and the numbers of data items in the Cartesian products of the respective business data tables are $Q_1, Q_2, \ldots, Q_n$, then $M/Q_1, M/Q_2, \ldots, M/Q_n$ are determined as the degrees of association of the respective business data tables.
At block 430, a business data table having a degree of association greater than a first threshold is determined as a first business data table. Here, the first threshold may be predetermined, for example, set to an empirical value.
Fig. 5 shows a flowchart of another example of a method for determining a first business data table having an association with a business sample instance set from a business data pool according to an embodiment of the present specification.
As shown in FIG. 5, at block 510, a bloom filter is generated using a set of business sample instances.
At block 520, the bloom filter is used to filter the data of each service data table in the service data pool.
At block 530, a degree of association between each business data table in the business data pool and the business sample instance set is determined based on the data filtering results of each business data table.
At block 540, a determination is made as to whether the determined degree of association is greater than a first threshold.
If the degree of association is greater than the first threshold, then at block 550, the service data table is determined to be a first service data table. If the degree of association is not greater than the first threshold, then at block 560 the service data table is determined to not be the first service data table.
In one example of the present specification, determining the degree of association between each service data table in the service data pool and the service sample instance set based on the data filtering result of each service data table may include: for each service data table, calculating the ratio of the number of data items in the service data table after filtering by the bloom filter to the number of data items in the service data table before filtering, and taking this ratio as the degree of association between the service data table and the service sample instance set.
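A rough sketch of the bloom-filter-based degree of association follows. The `BloomFilter` class, its sizing parameters, the SHA-256-based hashing, and the idea that rows are matched on a single key (for example a user identifier) are all assumptions made for illustration; the specification does not fix any of them.

```python
import hashlib


class BloomFilter:
    """Minimal bloom filter; sizes and hashing scheme are illustrative."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        data = repr(key).encode("utf-8")
        for i in range(self.num_hashes):
            digest = hashlib.sha256(i.to_bytes(2, "little") + data).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def association_degree(table_keys, sample_keys, **bloom_kwargs):
    """Ratio of table rows that pass the filter built from the sample set."""
    bloom = BloomFilter(**bloom_kwargs)
    for key in sample_keys:          # block 510: build the filter
        bloom.add(key)
    table_keys = list(table_keys)
    if not table_keys:
        return 0.0                   # assumption: an empty table has no association
    kept = sum(1 for key in table_keys if key in bloom)   # block 520: filter
    return kept / len(table_keys)    # block 530: after / before
```

Because a bloom filter can return false positives but never false negatives, the ratio computed this way can slightly overestimate, but never underestimate, the true overlap.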
In another example of the present specification, before the bloom filter is used to filter the data of each service data table in the service data pool, the method may further include performing data sampling on each service data table. Accordingly, the bloom filter is used to filter the data samples of the respective service data tables.
Further, in another example of the present specification, data sampling may be performed a predetermined number of times on each service data table. Accordingly, the association degree determination method shown in fig. 6 may be employed to determine the degree of association between the service data table and the service sample instance set.
Fig. 6 shows a flowchart of one example of a process for determining the degree of association between a service data table and a set of service sample instances based on a bloom filter implementation in accordance with an embodiment of the present disclosure.
As shown in FIG. 6, the operations of blocks 610 through 640 are performed in a loop until a predetermined number of data samplings, for example K, have been completed on the service data table. Specifically, in each cycle, at block 610, a data sample is taken from each service data table. At block 620, each data sample is filtered using the bloom filter. Next, at block 630, for each data sample, the ratio of the number of data items after filtering by the bloom filter to the number of data items before filtering is determined as an association probability between the service data table corresponding to the data sample and the service sample instance set.
If the K data samplings have been completed, i.e., the determination at block 640 is positive, then at block 650, a confidence interval of the obtained association probabilities at a given confidence range is determined for each service data table.
For example, assume that the association probability of the i-th data sample is $P_i$, so that K values $P_1, P_2, P_3, \ldots, P_K$ are obtained after K samplings. The mean $\bar{x}$ and the standard deviation $s$ of this set of association probabilities are calculated. Assuming that the set of data obeys the t distribution, the t score for a confidence range of 0.9 is calculated and denoted $t$, where the degree of freedom is $df = K - 1$. Subsequently, the confidence interval is calculated as

$$\left(\bar{x} - t\,\frac{s}{\sqrt{K}},\ \bar{x} + t\,\frac{s}{\sqrt{K}}\right).$$

At block 660, for each business data table, the lower bound of the calculated confidence interval is determined as the degree of association between the business data table and the business sample instance set, e.g., $\bar{x} - t\,s/\sqrt{K}$ as described above.
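The lower-bound calculation of block 660 might be sketched as follows, assuming the standard two-sided Student's t interval (mean minus t times the sample standard deviation over the square root of K). The function name is hypothetical, and the t score is passed in by the caller rather than computed, so that the sketch needs only the Python standard library.

```python
import math
import statistics


def association_lower_bound(probabilities, t_score):
    """Lower bound of the t confidence interval over K association probabilities.

    t_score is the t score for the chosen confidence range with
    df = K - 1 degrees of freedom (supplied by the caller).
    """
    k = len(probabilities)
    if k < 2:
        raise ValueError("at least two samples are needed for a standard deviation")
    mean = statistics.fmean(probabilities)
    s = statistics.stdev(probabilities)   # sample standard deviation (K - 1 divisor)
    return mean - t_score * s / math.sqrt(k)
```

For K = 5 samplings and a two-sided confidence range of 0.9, the t score with 4 degrees of freedom is approximately 2.132. Using the lower bound rather than the mean penalizes tables whose association probabilities vary widely between samplings.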
Returning to FIG. 1, after determining a first business data table having an association with a business sample instance set as described above, feature exploration is performed based on the first business data table to obtain a model feature set at block 130.
The feature exploration process is a continuously iterated process. It extracts features from the first business data table, applies different derivation rules to the extracted features to obtain various derived features, and performs feature evaluation on the derived features and the extracted features, thereby obtaining a model feature set.
Fig. 7 shows a flowchart of one example of a method for feature exploration based on a business data table according to an embodiment of the present description.
As shown in fig. 7, the operations of blocks 710 through 760 are iteratively performed until an iteration end condition is satisfied. In an embodiment of the present specification, the iteration end condition may include: reaching a predetermined number of iterations; or the number of features meeting the requirements of the desired feature index reaches the desired number. Here, the desired feature index requirement is met, for example, by the feature evaluation score being greater than a predetermined threshold.
Specifically, during each iteration, at block 710, the seed features of the current iteration are randomly selected from the previous feature set based on feature weights. Here, the previous feature set includes the original features and all derived features generated during the previous iterations. For example, for the second iteration, the previous iterations consist of the first iteration; for the third iteration, they consist of the first and second iterations; and so on. The original features are determined based on the first service data table during the first iteration. For example, the column name information (or dimension information) of each service data column may be extracted from the first service data table as the original features. In the first iteration, the seed features are selected from the original features based on feature weights. From the second iteration onward, the seed features are randomly selected from the previous feature set based on feature weights. In one example of the present specification, the initial feature weight of each original feature may be set to a predetermined weight value w, for example, w=0. In subsequent iterations, the feature weights may be adjusted using the weight adjustment values of the weight adjustment features obtained in each iteration. The specific feature weight adjustment process is described in detail later.
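Weight-based random selection of seed features could be sketched as below. The function name, the number of seeds drawn, and in particular the fallback to uniform selection when every weight is zero (as with the initial w=0) are assumptions of this sketch; the specification does not say how the all-zero case is handled.

```python
import random


def select_seed_features(weights, num_seeds, rng=None):
    """Draw seed features with probability proportional to feature weight.

    weights: mapping of feature name -> feature weight.
    """
    rng = rng or random.Random()
    features = list(weights)
    if not features:
        return []
    w = [max(weights[f], 0.0) for f in features]
    if sum(w) == 0:
        # Assumption: with all-zero weights, fall back to a uniform draw.
        w = [1.0] * len(features)
    picked = rng.choices(features, weights=w, k=num_seeds)
    # Drawing is with replacement, so drop duplicates while keeping order.
    return list(dict.fromkeys(picked))
```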
At block 720, feature derivation is performed on the seed features using feature derivation rules to obtain derived features. In embodiments of the present description, feature derivation may be performed using at least one of the following derivation rules: feature transformation, feature combination, and feature evolution.
Feature transformations may include feature aggregation and feature discretization. Feature aggregation may be, for example, an aggregation calculation of data over a specified period of time, such as: a person's consumption amount within the last 2 hours. Feature discretization may be, for example, the mapping of continuous data to a discrete set based on a certain mapping rule.
Feature combination may include feature numerical operations and feature crossing. A feature numerical operation performs a mathematical operation on a plurality of features, for example: feature C = feature A + feature B. Feature crossing produces a new feature by crossing a plurality of discrete features. For example, feature A is gender, with the value range set (male, female), and feature B is age group, with the value range set (infant, teenager, young, middle-aged, elderly). The crossed feature AB then has the value range set (male infant, male teenager, male young, male middle-aged, male elderly, female infant, female teenager, female young, female middle-aged, female elderly).
Feature evolution means changing a parameter of a derivation rule within a certain range on the basis of an existing derived feature. For example, feature A is a person's consumption amount within the last 2 hours, and after evolution it becomes feature B, the person's consumption amount within the last 6 hours. Feature B is then a new feature evolved from feature A.
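The derivation rules described above can be illustrated with a few small functions. This is a sketch only: the function names, the age boundaries used for discretization, and the representation of consumption events as (hour, amount) pairs are assumptions, not part of the specification.

```python
from itertools import product


def windowed_sum(events, now, window_hours):
    """Feature aggregation: total amount within the last window_hours.

    events: iterable of (timestamp_in_hours, amount) pairs (assumed layout).
    Changing window_hours (e.g. 2 -> 6) corresponds to feature evolution.
    """
    return sum(amount for ts, amount in events if now - window_hours <= ts <= now)


def discretize(value, boundaries, labels):
    """Feature discretization: map a continuous value to a labelled bucket.

    boundaries must be ascending and labels must have len(boundaries) + 1 entries.
    """
    for bound, label in zip(boundaries, labels):
        if value < bound:
            return label
    return labels[-1]


def cross_values(values_a, values_b):
    """Feature crossing: value range set of the crossed feature AB."""
    return [f"{a} {b}" for a, b in product(values_a, values_b)]
```

For example, crossing (male, female) with the five age groups yields ten values, and evaluating `windowed_sum` with a window of 2 and then 6 hours gives feature A and its evolved feature B.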
After multiple rounds of feature derivation, a blood relationship (lineage) graph of the features can be obtained. The blood relationship graph shows from which upstream node features each derived feature is derived. FIG. 8 shows an example schematic diagram of a blood relationship tree according to an embodiment of the present description.
As shown in fig. 8, feature A derives feature B, feature C, and feature D, so feature A is an upstream node of feature B, feature C, and feature D. Feature B derives feature E and feature F, and feature C derives feature G and feature H, so feature B is the upstream node of feature E and feature F, and feature C is the upstream node of feature G and feature H. Feature D and feature H derive feature I, so feature D and feature H are upstream nodes of feature I.
Furthermore, as can be seen from fig. 8, the upstream nodes of feature I include feature A, feature C, feature D, and feature H. The upstream nodes of feature E and feature F are feature A and feature B. The upstream nodes of feature G and feature H are feature A and feature C.
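The upstream-node lookup can be sketched by storing, for each derived feature, the features it was directly derived from, and walking those links transitively. The dictionary layout and the function name are assumptions for illustration; the example data reproduces the relationships described for fig. 8.

```python
def upstream_nodes(parents, feature):
    """Return every upstream node (direct or indirect) of a feature.

    parents: mapping of feature -> list of features it was derived from.
    """
    seen = set()
    stack = list(parents.get(feature, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


# Direct derivation links from the fig. 8 example.
FIG8_PARENTS = {
    "B": ["A"], "C": ["A"], "D": ["A"],
    "E": ["B"], "F": ["B"],
    "G": ["C"], "H": ["C"],
    "I": ["D", "H"],
}
```

With this data, `upstream_nodes(FIG8_PARENTS, "I")` yields features A, C, D, and H, matching the description above.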
Returning to fig. 7, at block 730, the seed features and the derived features form a candidate feature set. At block 740, feature evaluation is performed on each candidate feature in the candidate feature set, resulting in a feature ordering result for each candidate feature. In the embodiments of the present specification, various suitable feature evaluation methods may be employed for feature evaluation.
Fig. 9 shows an example flowchart of a feature evaluation process according to an embodiment of the present specification.
As shown in fig. 9, at block 910, the candidate features in the candidate feature set are scored using a plurality of different feature evaluation modes, respectively. At block 920, the ranking value (i.e., rank) of each candidate feature under each evaluation mode is determined based on the score of that candidate feature under that evaluation mode. At block 930, the ranking values of the same feature under the different evaluation modes are added to obtain the final ranking value of each candidate feature. For example, assuming that feature A is ranked 1st under the first evaluation mode, 2nd under the second evaluation mode, and 6th under the third evaluation mode, the final ranking value of feature A is 1+2+6=9, whereby feature evaluation is completed. In another example, an average or weighted average of the ranking values of a feature under the respective evaluation modes may also be used as the final ranking value of the feature. In the embodiments of the present description, the smaller the ranking value, the more important the candidate feature. In embodiments of the present description, the feature evaluation process may be implemented using an ensemble ranking algorithm.
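The rank-summing step of fig. 9 might look like the sketch below. It assumes that a higher score is better under every evaluation mode and that ties are broken by input order; neither point is stated in the specification, and the function name is hypothetical.

```python
def aggregate_rankings(scores_by_mode):
    """Sum each feature's rank across evaluation modes (smaller is better).

    scores_by_mode: mapping of mode name -> {feature: score}.
    """
    totals = {}
    for scores in scores_by_mode.values():
        # Block 920: rank 1 goes to the highest score under this mode.
        ordered = sorted(scores, key=scores.get, reverse=True)
        for rank, feature in enumerate(ordered, start=1):
            # Block 930: add up the ranks of the same feature.
            totals[feature] = totals.get(feature, 0) + rank
    return totals
```

Summing ranks instead of raw scores avoids having to normalize scores that live on different scales under different evaluation modes.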
In addition, for each candidate feature, the scores of the candidate feature in each evaluation mode can be integrated, so that the evaluation score of the candidate feature is obtained. For example, for candidate feature a, the evaluation score in the first evaluation mode is S1, the evaluation score in the second evaluation mode is S2, and the evaluation score in the third evaluation mode is S3, and then the average of the evaluation scores S1, S2, and S3 may be used to determine the final evaluation score for feature a. Alternatively, the evaluation scores S1, S2, and S3 may be weighted averaged to determine the final evaluation score for feature A. Alternatively, score integration may be performed using any suitable integration means. For a feature, if the resulting final evaluation score is greater than a predetermined threshold, the feature is deemed to meet the desired feature index requirement.
At block 750, a determination is made as to whether an iteration end condition is satisfied. For example, it is determined whether a predetermined number of iterations is reached, or after feature evaluation, the number of features meeting the desired feature index requirement reaches an expected number.
If the iteration end condition is met, at block 770, a model feature set is determined from the candidate feature sets (seed features and derivative features) in the current iteration process based on the feature ordering result. For example, top M candidate features may be selected from the candidate feature set in the current iteration process based on the ranking value, where M may be a value set by the model builder according to the actual situation of the service model.
If the iteration end condition is not met, at block 760, feature weights for corresponding features in the previous feature set and the current derived feature are adjusted according to the feature ordering result. Returning then to block 710, seed features in the next iteration process are randomly selected based on the adjusted feature weights and the next iteration process is performed.
Fig. 10 shows an example flowchart of a feature weight adjustment process according to an embodiment of the present description.
As shown in fig. 10, at block 1010, a predetermined number of features that are ranked first are selected from the feature set of the current iterative process as weight adjustment features based on the ranking values. For example, top K candidate features may be selected from the feature set of the current iterative process as the weight adjustment features based on the ranking value.
At block 1020, weight adjustment values for the respective weight adjustment features are determined based on the ranking values of the respective weight adjustment features. For example, the reciprocal of the ranking value of each weight adjustment feature may be determined as the weight adjustment value of that weight adjustment feature. For example, assuming that the ranking value of the weight adjustment feature is s, 1/s is determined as the weight adjustment value of the weight adjustment feature.
At block 1030, the feature weights of each weight adjustment feature and all of its upstream node features are adjusted based on the weight adjustment value of that weight adjustment feature and the number of weight updates of the respective upstream node features in the blood relationship tree. For example, assuming that the ranking value of the weight adjustment feature F is $s$, the weight adjustment value of the weight adjustment feature F is $\Delta w = 1/s$. In one example, the weight adjustment value $1/s$ of feature F may be added to the feature weights of its upstream nodes. The respective features are then weight-attenuated according to their numbers of weight updates. For example, in one example, the feature weights of the weight adjustment feature F and all of its upstream node features follow the formula $W_{new} = W_{old}/(n_{old}+1) + 1/s$, where $W_{new}$ is the adjusted feature weight of the feature, $W_{old}$ is the feature weight before adjustment, namely the feature weight in the previous iteration, and $n_{old}$ is the current number of weight updates. Here, for the weight adjustment feature F itself, if it is a derived feature newly derived in the current iteration, $W_{old} = 0$. If it was not newly derived in the current iteration, $W_{old}$ is the feature weight that feature F had in the previous iteration.
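The weight adjustment of fig. 10 can be sketched as follows, applying W_new = W_old / (n_old + 1) + 1/s to each selected feature and to all of its upstream nodes. The function names, the dictionary-based bookkeeping, and the choice to increment the update count once per adjustment are assumptions of this sketch.

```python
def upstream_nodes(parents, feature):
    """All direct and indirect upstream nodes of a feature."""
    seen, stack = set(), list(parents.get(feature, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def adjust_weights(weights, update_counts, parents, ranking, top_k):
    """Update feature weights in place from one iteration's ranking values.

    weights:       feature -> current weight (missing means newly derived, 0)
    update_counts: feature -> number of weight updates so far
    parents:       feature -> features it was derived from
    ranking:       feature -> ranking value s (smaller is better)
    """
    # Block 1010: the top_k best-ranked features become weight adjustment features.
    chosen = sorted(ranking, key=ranking.get)[:top_k]
    for feature in chosen:
        delta = 1.0 / ranking[feature]            # block 1020: 1/s
        # Block 1030: adjust the feature and all of its upstream nodes.
        for target in {feature} | upstream_nodes(parents, feature):
            w_old = weights.get(target, 0.0)
            n_old = update_counts.get(target, 0)
            weights[target] = w_old / (n_old + 1) + delta
            update_counts[target] = n_old + 1     # assumption: one count per update
    return weights
```

Dividing the old weight by the update count plus one attenuates features that have already been rewarded many times, which leaves room for other branches of the tree to be selected as seeds.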
Returning to FIG. 1, after the model feature set is obtained as above, at block 140, an automated machine learning method is used to construct a target business model from the model feature set. For example, as shown in fig. 11, the model feature set and the business data are provided to an automatic machine learning model (AutoML) to automatically generate a business model. AutoML defines a search space that includes a range of model parameters. AutoML searches over models and model parameters using search optimization algorithms, and generates the final business model after automatic parameter optimization. The model feature set and the business model are evaluated and then serve as the final output of the system. Here, the business data may be the business data in the first business data table, business data local to AutoML, or business data from other data holders.
A method for constructing a business model based on a business database according to an embodiment of the present specification is described above with reference to fig. 1 to 11.
With the business model construction method shown in fig. 1, a service data table having an association with the service sample instance set of the business model is found in a massive service database, service features are extracted from the found service data table, and feature exploration is performed on the extracted service features, so that the model feature set most suitable for the business model can be obtained from the massive service database. The business model can then be constructed automatically by an automatic machine learning algorithm based on the model feature set and the service data in the corresponding service data table, which improves the model effect of the resulting business model and reduces labor cost.
Further, with the bloom filter-based correlation determination method shown in fig. 5, the amount of computation in the correlation determination process can be reduced, thereby reducing computation time.
Further, with the association degree determination method shown in fig. 6, performing data sampling on the service data table reduces the amount of data processed in the association degree determination process. In addition, by sampling the service data table multiple times, obtaining the corresponding association probabilities from the multiple sampling results, and using the lower limit of their confidence interval as the degree of association between the service data table and the service sample instance set, the accuracy of the degree of association can be improved.
In addition, with the feature exploration method shown in fig. 7, the feature weights of the corresponding features in the previous feature set and of the current derived features are adjusted according to the feature ordering result after each iteration. This affects the seed feature selection in the next iteration, so that new features are explored in the next iteration and the final result does not converge on a single blood relationship tree.
Furthermore, modifications may also be made to the embodiment shown in fig. 1. For example, in a modified embodiment, the operations of block 110 may not be required if the meta information of the service data columns of the respective service data tables in the service database is complete. Furthermore, a modified embodiment may further include: before the first service data table having an association with the service sample instance set is determined from the service data tables of the service data pool, performing data reduction processing on each service data table based on the information similarity of the service data columns. In this way, the amount of data processed in the association determination process can be further reduced.
Fig. 12 shows a block diagram of an apparatus (hereinafter referred to as model building apparatus) 1200 for building a business model based on a business database according to an embodiment of the present specification. As shown in fig. 12, the model construction apparatus 1200 may include a data exploration unit 1210, an association data table determination unit 1220, a feature exploration unit 1230, and a model construction unit 1240.
The data exploration unit 1210 is configured to perform data exploration on each service data table in the service data pool before a first service data table having an association with the service sample instance set is determined from the service data tables in the service data pool, so as to obtain meta information of each service data column in each service data table. The operation of the data exploration unit 1210 may refer to the operation of the block 110 described above with reference to fig. 1.
The association data table determining unit 1220 is configured to determine a first service data table having an association with the service sample instance set from the service data tables of the service data pool. The operation of the association data table determining unit 1220 may refer to the operation of the block 120 described above with reference to fig. 1.
Fig. 13 shows a block diagram of one example of the association data table determining unit 1300 according to the embodiment of the present specification. As shown in fig. 13, the association data table determining unit 1300 includes a cartesian product calculating module 1310, an association degree determining module 1320, and an association data table determining module 1330.
The cartesian product calculation module 1310 is configured to calculate a cartesian product between each of the service data tables and the service sample instance set in the service data pool. The operation of the Cartesian product calculation module 1310 may refer to the operation of block 410 described above with reference to FIG. 4.
The association determination module 1320 is configured to determine a ratio between the number of data of the set of service sample instances and the number of data of the cartesian product of each service data table as an association between each service data table and the set of service sample instances. The operation of the association determination module 1320 may refer to the operation of block 420 described above with reference to fig. 4.
The association data table determining module 1330 is configured to determine a service data table having an association degree greater than a first threshold as a first service data table. The operation of the association data table determining module 1330 may refer to the operation of block 430 described above with reference to fig. 4.
Fig. 14 shows a block diagram of another example of the association data table determining unit 1400 according to an embodiment of the present specification. As shown in fig. 14, the association data table determining unit 1400 includes a filter generating module 1410, a data filtering module 1420, an association degree determining module 1430, and an association data table determining module 1440.
The filter generation module 1410 is configured to generate a bloom filter using the set of business sample instances. The operation of the filter generation module 1410 may refer to the operation of block 510 described above with reference to fig. 5.
The data filtering module 1420 is configured to use bloom filters to data filter individual ones of the traffic data tables in the traffic data pool. The operation of the data filtering module 1420 may refer to the operation of block 520 described above with reference to fig. 5.
The association determination module 1430 is configured to determine an association between each of the service data tables in the service data pool and the service sample instance set based on the data filtering results of each of the service data tables. The operation of the association determination module 1430 may refer to the operation of block 530 described above with reference to fig. 5.
The association data table determining module 1440 is configured to determine a service data table having an association degree greater than a first threshold as a first service data table. The operation of the association data table determination module 1440 may refer to the operation of block 550 described above with reference to fig. 5.
Optionally, in one example, for each business data table, the association determination module 1430 determines a ratio of the number of data after bloom filter filtering to the number of data before bloom filter filtering as the association between the business data table and the set of business sample instances.
Optionally, in another example, the association data table determining unit 1400 may further include a data sampling module (not shown). The data sampling module is configured to sample the data of each service data table before the bloom filter is used to filter the data of each service data table in the service data pool. Accordingly, the data filtering module 1420 uses the bloom filter to filter the data samples of the respective service data tables.
Optionally, in one example, the data sampling module may perform data sampling a predetermined number of times, e.g., K times, on each service data table. Accordingly, for each service data table, the association degree determining module 1430 may determine, for each data sample, the ratio of the number of data items after filtering by the bloom filter to the number of data items before filtering as the association probability between the service data table corresponding to the data sample and the service sample instance set; determine a confidence interval of the obtained association probabilities at a given confidence range; and determine the lower limit of the confidence interval as the association degree between the service data table and the service sample instance set.
Returning to fig. 12, after the first service data table is obtained as above, the feature exploration unit 1230 performs feature exploration based on the first service data table to obtain a model feature set. The operation of the feature exploration unit 1230 may refer to the operation of block 130 described above with reference to fig. 1.
Fig. 15 shows a block diagram of one example of a feature exploration unit 1500 according to an embodiment of the present description. In the example shown in fig. 15, the individual features have feature weights.
As shown in fig. 15, the feature exploration unit 1500 includes a feature selection module 1510, a feature derivation module 1520, a feature evaluation module 1530, a feature weight adjustment module 1540, and a model feature set determination module 1550. The feature selection module 1510, the feature derivation module 1520, the feature evaluation module 1530, and the feature weight adjustment module 1540 iteratively perform operations until an iteration end condition is satisfied.
Specifically, during each iteration, the feature selection module 1510 randomly selects seed features of the current iteration from a set of previous features based on feature weights, the set of previous features including original features and all derived features generated during the previous iteration, the original features determined based on the first business data table. The operation of the feature selection module 1510 may refer to the operation of block 710 described above with reference to fig. 7.
The feature derivation module 1520 is configured to perform feature derivation on the seed features using feature derivation rules to obtain derived features. The operation of the feature derivation module 1520 may refer to the operation of block 720 described above with reference to fig. 7.
The feature evaluation module 1530 is configured to perform feature evaluation on the seed feature and the derivative feature to obtain a feature ranking result of each feature. The operation of the feature evaluation module 1530 may refer to the operation of block 740 described above with reference to fig. 7.
The feature weight adjustment module 1540 is configured to adjust feature weights of corresponding features in the previous feature set and the current derived feature according to the feature ordering result when the iteration end condition is not satisfied. The operation of the feature weight adjustment module 1540 may refer to the operation of block 760 described above with reference to fig. 7.
Optionally, in one example, the feature ordering result includes a ranking value, and the features in the previous feature set form a blood-lineage relationship tree based on the blood-lineage relationships of feature derivation. Accordingly, the feature weight adjustment module 1540 may: select, based on the ranking value, a predetermined number of top-ranked features from the feature set of the current iteration as weight adjustment features; determine a weight adjustment value for each weight adjustment feature based on its ranking value; and adjust the feature weights of each weight adjustment feature and of all its upstream node features according to the weight adjustment value of that weight adjustment feature and the number of weight updates of each upstream node feature in the blood-lineage relationship tree.
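A minimal sketch of such a weight adjustment is given below. The concrete formulas are assumptions for illustration: the weight adjustment value is taken as 1/rank, and the adjustment passed to an upstream node is divided by one plus the number of times that node has already been updated. The function name `adjust_weights` and the `parents` mapping (derived feature to the features it was derived from) are hypothetical.

```python
def adjust_weights(weights, parents, update_counts, ranking, top_n=2):
    """Boost the top-ranked features and, with damping, all their upstream nodes.

    ranking: feature names ordered best first.
    parents: derived feature -> list of features it was derived from.
    """
    for rank, feature in enumerate(ranking[:top_n], start=1):
        delta = 1.0 / rank                # assumed rule: better rank, larger boost
        weights[feature] = weights.get(feature, 1.0) + delta
        # Visit every upstream node of this feature exactly once.
        seen, stack = set(), list(parents.get(feature, []))
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            count = update_counts.get(node, 0)
            # Assumed damping: nodes that were updated often receive less.
            weights[node] = weights.get(node, 1.0) + delta / (1 + count)
            update_counts[node] = count + 1
            stack.extend(parents.get(node, []))
    return weights
```

Damping by the update count keeps a frequently used original feature from accumulating weight without bound merely because many of its descendants rank well.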
The model feature set determination module 1550 is configured to determine a model feature set from the seed feature and the current derivative feature based on the feature ordering result when the iteration end condition is satisfied. The operation of the model feature set determination module 1550 may refer to the operation of block 770 described above with reference to fig. 7.
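The cooperation of the five modules can be sketched as a single loop. Everything concrete in the sketch is an assumption: squaring stands in for the feature derivation rules, absolute Pearson correlation with the label stands in for feature evaluation, a fixed iteration count stands in for the iteration end condition, and the 1/rank weight update is a placeholder. The names `abs_pearson` and `explore_features` are hypothetical.

```python
import random


def abs_pearson(xs, ys):
    """Absolute Pearson correlation; 0.0 when either column is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return abs(cov) / (var_x * var_y) ** 0.5


def explore_features(original, labels, max_iterations=4,
                     seeds_per_round=2, keep=3, seed=0):
    rng = random.Random(seed)
    features = dict(original)                    # previous feature set
    weights = {name: 1.0 for name in features}   # every feature starts equal
    for iteration in range(1, max_iterations + 1):
        # Feature selection: weighted random choice of seed features.
        names = list(features)
        picks = rng.choices(names, weights=[weights[n] for n in names],
                            k=seeds_per_round)
        seeds = sorted(set(picks))
        # Feature derivation: one assumed rule, squaring each seed feature.
        derived = {f"sq({s})": [v * v for v in features[s]] for s in seeds}
        # Feature evaluation: rank seed and derived features together.
        candidates = {s: features[s] for s in seeds}
        candidates.update(derived)
        ranked = sorted(candidates,
                        key=lambda n: abs_pearson(candidates[n], labels),
                        reverse=True)
        if iteration == max_iterations:          # iteration end condition met
            return ranked[:keep]                 # model feature set
        # Weight adjustment: better-ranked features become likelier seeds.
        for rank, name in enumerate(ranked, start=1):
            weights[name] = weights.get(name, 1.0) + 1.0 / rank
        features.update(derived)
    return []
```

For example, `explore_features({"x": [1, 2, 3, 4, 5], "noise": [3, 1, 4, 1, 5]}, [1, 4, 9, 16, 25])` returns up to three feature names chosen from the seed and derived features of the final iteration.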
Returning to fig. 12, after the model feature set is obtained as above, the model construction unit 1240 constructs a target business model from the model feature set using an automatic machine learning method. The operation of the model construction unit 1240 may refer to the operation of block 140 described above with reference to fig. 1.
Further optionally, in one example, the meta information may include at least one of the following information: statistical information, social attribute information, data type information, and information similarity. The model building apparatus 1200 may further comprise a data reduction unit (not shown). The data reduction unit is configured to perform data reduction processing on each service data table based on the information similarity of each service data column.
In another example of the present specification, the model construction apparatus 1200 may not include the data exploration unit 110.
The model building method and the model building apparatus according to the embodiments of the present specification are described above with reference to fig. 1 to 15. The above model building means may be implemented in hardware, or in software, or in a combination of hardware and software.
Fig. 16 shows a schematic diagram of an electronic device for building a business model based on a business database according to an embodiment of the present description. As shown in fig. 16, the electronic device 1600 may include at least one processor 1610, a storage (e.g., a non-volatile memory) 1620, a memory 1630, and a communication interface 1640, which are connected together via a bus 1660. The at least one processor 1610 executes at least one computer-readable instruction stored or encoded in the memory (i.e., the elements described above that are implemented in software).
In one embodiment, computer-executable instructions are stored in memory that, when executed, cause the at least one processor 1610 to: determining a first service data table with relevance with the service sample instance set from the service data tables of the service data pool; performing feature exploration based on the first business data table to obtain a model feature set; and constructing the target business model according to the model feature set by using an automatic machine learning method.
It should be appreciated that the computer-executable instructions stored in the memory, when executed, cause the at least one processor 1610 to perform the various operations and functions described above in connection with fig. 1-15 in various embodiments of the present description.
According to one embodiment, a program product such as a machine-readable medium (e.g., a non-transitory machine-readable medium) is provided. The machine-readable medium may have instructions (i.e., elements described above implemented in software) that, when executed by a machine, cause the machine to perform the various operations and functions described above in connection with fig. 1-15 in various embodiments of the specification. In particular, a system or apparatus provided with a readable storage medium having stored thereon software program code implementing the functions of any of the above embodiments may be provided, and a computer or processor of the system or apparatus may be caused to read out and execute instructions stored in the readable storage medium.
In this case, the program code itself read from the readable medium may implement the functions of any of the above-described embodiments, and thus the machine-readable code and the readable storage medium storing the machine-readable code form part of the present invention.
Examples of readable storage media include floppy disks, hard disks, magneto-optical disks, optical disks (e.g., CD-ROMs, CD-Rs, CD-RWs, DVD-ROMs, DVD-RAMs, DVD-RWs), magnetic tapes, nonvolatile memory cards, and ROMs. Alternatively, the program code may be downloaded from a server computer or a cloud via a communication network.
It will be appreciated by those skilled in the art that various changes and modifications can be made to the embodiments disclosed above without departing from the spirit of the invention. Accordingly, the scope of the invention should be limited only by the attached claims.
It should be noted that not all the steps and units in the above flowcharts and the system configuration diagrams are necessary, and some steps or units may be omitted according to actual needs. The order of execution of the steps is not fixed and may be determined as desired. The apparatus structures described in the above embodiments may be physical structures or logical structures, that is, some units may be implemented by the same physical entity, or some units may be implemented by multiple physical entities, or may be implemented jointly by some components in multiple independent devices.
In the above embodiments, the hardware units or modules may be implemented mechanically or electrically. For example, a hardware unit, module or processor may include permanently dedicated circuitry or logic (e.g., a dedicated processor, FPGA or ASIC) to perform the corresponding operations. The hardware unit or processor may also include programmable logic or circuitry (e.g., a general purpose processor or other programmable processor) that may be temporarily configured by software to perform the corresponding operations. The particular implementation (mechanical, or dedicated permanent, or temporarily set) may be determined based on cost and time considerations.
The detailed description set forth above in connection with the appended drawings describes exemplary embodiments, but does not represent all embodiments that may be implemented or fall within the scope of the claims. The term "exemplary" used throughout this specification means "serving as an example, instance, or illustration," and does not mean "preferred" or "advantageous over other embodiments." The detailed description includes specific details for the purpose of providing an understanding of the described technology. However, the techniques may be practiced without these specific details. In some instances, well-known structures and devices are shown in block diagram form in order to avoid obscuring the concepts of the described embodiments.
The previous description of the disclosure is provided to enable any person skilled in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the scope of the disclosure. Thus, the disclosure is not limited to the examples and designs described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (25)

1. A method for building a business model based on a business database, the business database comprising a business data pool comprising at least one business data table, each business data table comprising at least one business data column, the method comprising:
determining a first service data table with relevance with a service sample instance set of a target service model from service data tables of a service data pool;
performing feature exploration based on the first business data table to obtain a model feature set; and
constructing the target business model from the model feature set using an automatic machine learning method.
2. The method of claim 1, wherein determining a first business data table having an association with a set of business sample instances of the target business model from the business data tables of the business data pool comprises:
calculating Cartesian products between each service data table in the service data pool and the service sample instance set;
determining the ratio between the data number of the service sample instance set and the data number of the Cartesian product of each service data table as the association degree between each service data table and the service sample instance set; and
determining the service data table with the association degree larger than a first threshold value as the first service data table.
3. The method of claim 1, wherein determining a first business data table having an association with a set of business sample instances of the target business model from the business data tables of the business data pool comprises:
generating a bloom filter using the set of business sample instances;
using bloom filters to perform data filtering on each service data table in the service data pool;
determining the association degree between each service data table in the service data pool and the service sample instance set based on the data filtering result of each service data table; and
determining the service data table with the association degree larger than a first threshold value as the first service data table.
4. The method of claim 3, wherein determining a degree of association between each business data table in the business data pool and the set of business sample instances based on data filtering results of each business data table comprises:
for each service data table, determining the ratio of the number of data items after bloom filter filtering to the number of data items before bloom filter filtering as the association degree between the service data table and the service sample instance set.
5. The method of claim 3, wherein prior to using the bloom filter to perform data filtering on each service data table in the service data pool, the method further comprises:
performing data sampling on each service data table respectively, and
using the bloom filter to perform data filtering on each service data table in the service data pool comprises:
using the bloom filter to perform data filtering on the data samples of each service data table.
6. The method of claim 5, wherein the data sampling of each service data table comprises:
performing a predetermined number of data samplings on each service data table, and
determining the association degree between each service data table in the service data pool and the service sample instance set based on the data filtering result of each service data table comprises:
for each service data table,
determining, for each data sampling, the ratio of the number of data items after bloom filter filtering to the number of data items before bloom filter filtering as an association probability between the service data table corresponding to the data sampling and the service sample instance set;
determining a confidence interval of the obtained association probabilities at a given confidence level; and
determining the lower limit of the confidence interval as the association degree between the service data table and the service sample instance set.
7. The method of claim 1, wherein prior to determining a first business data table having an association with a set of business sample instances of the target business model from the business data tables of the business data pool, the method further comprises:
performing data exploration on each service data table in the service data pool to obtain meta information of each service data column in each service data table.
8. The method of claim 7, wherein the meta information includes at least one of: statistical information, social attribute information, data type information, and information similarity.
9. The method of claim 8, wherein prior to determining a first service data table having an association with a service sample instance set from the service data tables of the service data pool, the method further comprises:
performing data reduction processing on each service data table based on the information similarity of each service data column.
10. The method of claim 1, wherein each feature has a feature weight, and performing feature exploration based on the first business data table to obtain a model feature set comprises:
executing the following iterative process until the iteration end condition is satisfied:
randomly selecting seed features of a current iteration process from a previous feature set based on feature weights, the previous feature set including original features and all derived features generated in the previous iteration process, the original features being determined based on the first business data table in a first iteration process;
performing feature derivation on the seed features using feature derivation rules to obtain derived features;
performing feature evaluation on the seed features and the derived features to obtain a feature ordering result of the features;
determining the model feature set from the seed features and the derived features according to the feature ordering result when the iteration end condition is satisfied; and
adjusting the feature weights of the corresponding features in the previous feature set and of the current derived features according to the feature ordering result when the iteration end condition is not satisfied.
11. The method of claim 10, wherein the feature ordering result comprises a ranking value, the features in the previous feature set form a blood-lineage relationship tree based on the blood-lineage relationships of feature derivation, and adjusting the feature weights of the corresponding features in the previous feature set and of the current derived features according to the feature ordering result comprises:
selecting, based on the ranking value, a predetermined number of top-ranked features from the feature set of the current iterative process as weight adjustment features;
determining a weight adjustment value of each weight adjustment feature based on the ranking value of each weight adjustment feature; and
adjusting the feature weights of the weight adjustment features and of all their upstream node features according to the weight adjustment values of the weight adjustment features and the number of weight updates of each upstream node feature in the blood-lineage relationship tree.
12. The method of claim 10, wherein the feature derivation rules comprise at least one of the following rules: feature transformation, feature combination, and feature evolution.
13. The method of claim 10, wherein the iteration end condition comprises:
a predetermined number of iterations being reached; or
the number of features meeting a desired feature index requirement reaching a desired number.
14. An apparatus for building a business model based on a business database, the business database comprising a business data pool, the business data pool comprising at least one business data table, each business data table comprising at least one business data column, the apparatus comprising:
an association data table determining unit for determining a first service data table having an association with the service sample instance set of the target service model from the service data tables of the service data pool;
a feature exploration unit for performing feature exploration based on the first business data table to obtain a model feature set; and
a model construction unit for constructing the target business model from the model feature set using an automatic machine learning method.
15. The apparatus of claim 14, wherein the association data table determining unit comprises:
a Cartesian product calculation module for calculating Cartesian products between each service data table in the service data pool and the service sample instance set;
an association degree determining module for determining the ratio between the data number of the service sample instance set and the data number of the Cartesian product of each service data table as the association degree between each service data table and the service sample instance set; and
an association data table determining module for determining the service data table with the association degree larger than a first threshold value as the first service data table.
16. The apparatus of claim 14, wherein the association data table determining unit comprises:
a filter generation module for generating a bloom filter using the set of business sample instances;
a data filtering module for using the bloom filter to perform data filtering on each service data table in the service data pool;
an association degree determining module for determining the association degree between each service data table in the service data pool and the service sample instance set based on the data filtering result of each service data table; and
an association data table determining module for determining the service data table with the association degree larger than a first threshold value as the first service data table.
17. The apparatus of claim 16, wherein the association degree determining module determines, for each business data table, the ratio of the number of data items after bloom filter filtering to the number of data items before bloom filter filtering as the association degree between the business data table and the set of business sample instances.
18. The apparatus of claim 16, wherein the association data table determining unit further comprises:
a data sampling module for performing data sampling on each service data table respectively before the bloom filter is used to perform data filtering on each service data table in the service data pool,
wherein the data filtering module uses the bloom filter to perform data filtering on the data samples of each service data table.
19. The apparatus of claim 18, wherein the data sampling module performs a predetermined number of data samplings on each service data table, and
for each service data table, the association degree determining module: determines, for each data sampling, the ratio of the number of data items after bloom filter filtering to the number of data items before bloom filter filtering as an association probability between the service data table corresponding to the data sampling and the service sample instance set; determines a confidence interval of the obtained association probabilities at a given confidence level; and determines the lower limit of the confidence interval as the association degree between the service data table and the service sample instance set.
20. The apparatus of claim 14, further comprising:
a data exploration unit for performing data exploration on each service data table in the service data pool, before the first service data table having an association with the service sample instance set is determined from the service data tables of the service data pool, to obtain meta information of each service data column in each service data table.
21. The apparatus of claim 20, wherein the meta information comprises at least one of: statistical information, social attribute information, data type information, and information similarity, the apparatus further comprising:
a data reduction unit for performing data reduction processing on each service data table based on the information similarity of each service data column.
22. The apparatus of claim 14, wherein each feature has a feature weight, the feature exploration unit comprising:
a feature selection module for randomly selecting seed features of a current iterative process from a previous feature set based on feature weights, the previous feature set including original features and all derived features generated in the previous iterative process, the original features being determined based on the first business data table;
a feature derivation module for performing feature derivation on the seed features using feature derivation rules to obtain derived features;
a feature evaluation module for performing feature evaluation on the seed features and the derived features to obtain a feature ordering result of the features;
a feature weight adjustment module for adjusting the feature weights of the corresponding features in the previous feature set and of the current derived features according to the feature ordering result when the iteration end condition is not satisfied; and
a model feature set determination module for determining the model feature set from the seed features and the derived features according to the feature ordering result when the iteration end condition is satisfied,
wherein the feature selection module, the feature derivation module, the feature evaluation module, and the feature weight adjustment module iteratively perform operations until the iteration end condition is satisfied.
23. The apparatus of claim 22, wherein the feature ordering result comprises a ranking value, the features in the previous feature set form a blood-lineage relationship tree based on the blood-lineage relationships of feature derivation, and the feature weight adjustment module is configured for:
selecting, based on the ranking value, a predetermined number of top-ranked features from the feature set of the current iterative process as weight adjustment features;
determining a weight adjustment value of each weight adjustment feature based on the ranking value of each weight adjustment feature; and
adjusting the feature weights of the weight adjustment features and of all their upstream node features according to the weight adjustment values of the weight adjustment features and the number of weight updates of each upstream node feature in the blood-lineage relationship tree.
24. An electronic device, comprising:
at least one processor, and
a memory coupled to the at least one processor, the memory storing instructions that, when executed by the at least one processor, cause the at least one processor to perform the method of any of claims 1 to 13.
25. A machine readable storage medium storing executable instructions that when executed cause the machine to perform the method of any one of claims 1 to 13.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010343388.0A CN111522797B (en) 2020-04-27 2020-04-27 Method and device for constructing business model based on business database

Publications (2)

Publication Number Publication Date
CN111522797A CN111522797A (en) 2020-08-11
CN111522797B true CN111522797B (en) 2023-06-02

Family

ID=71902769

Country Status (1)

Country Link
CN (1) CN111522797B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105260835A (en) * 2015-10-13 2016-01-20 北京凯行同创科技有限公司 Method for modeling, analyzing, and self-optimizing multi-source business big data
CN109426438A (en) * 2017-08-31 2019-03-05 中国移动通信集团广东有限公司 Real-time big data mirrored storage method and device
CN110008977A (en) * 2018-12-05 2019-07-12 阿里巴巴集团控股有限公司 Clustering Model construction method and device
CN110162566A (en) * 2019-04-15 2019-08-23 平安普惠企业管理有限公司 Association analysis method, device, computer equipment and the storage medium of business datum
CN110334720A (en) * 2018-03-30 2019-10-15 百度在线网络技术(北京)有限公司 Feature extracting method, device, server and the storage medium of business datum

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11663517B2 (en) * 2017-11-03 2023-05-30 Salesforce, Inc. Automatic machine learning model generation
US11663067B2 (en) * 2017-12-15 2023-05-30 International Business Machines Corporation Computerized high-speed anomaly detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant