CN117313824A - Fusion method, device and equipment of data assets

Fusion method, device and equipment of data assets

Info

Publication number
CN117313824A
Authority
CN
China
Prior art keywords
data, data asset, loss function, assets, data assets
Prior art date
Legal status
Pending
Application number
CN202311220864.XA
Other languages
Chinese (zh)
Inventor
周璟
王宝坤
刘京
孟昌华
金宏
王维强
Current Assignee
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd
Priority to CN202311220864.XA
Publication of CN117313824A
Legal status: Pending


Classifications

    • G06N3/09 Supervised learning
    • G06N3/042 Knowledge-based neural networks; logical representations of neural networks
    • G06N3/0495 Quantised networks; sparse networks; compressed networks
    • G06N3/0499 Feedforward networks
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G06Q20/382 Payment protocols insuring higher security of transaction
    • G06Q20/4016 Transaction verification involving fraud or risk level assessment in transaction processing

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Accounting & Taxation (AREA)
  • Computer Security & Cryptography (AREA)
  • Finance (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The embodiments of the specification disclose a method, an apparatus and a device for fusing data assets. The method includes: obtaining a data asset set composed of data assets from one or more different data sources, the data asset set including a first data asset subset composed of the data assets that carry tag information; performing supervised model training on a target model based on the data assets carrying tag information in the first data asset subset and a preset loss function to obtain a trained target model, where the loss function is constructed from one or more of an invariant risk minimization loss function, a distributionally robust optimization loss function and an ensemble distillation learning loss function; inputting the data assets in the data asset set into the trained target model respectively to obtain corresponding prediction results; and determining a fusion data asset corresponding to the data asset set based on the obtained prediction results, and providing the fusion data asset to other scenes for data application.

Description

Fusion method, device and equipment of data assets
Technical Field
The present document relates to the field of computer technologies, and in particular, to a method, an apparatus, and a device for fusing data assets.
Background
As people pay more attention to their own private data, risk prevention and control has become an essential part of business processing. In the context of ESG (Environmental, Social and Governance), the purpose of green risk control is to prevent the waste of resources and promote the sustainable development of the economy, society and the environment. When users carry out transactions through mobile phones, the transaction platform is often required to perform risk analysis on each transaction event. In general, the risks involved in a transaction event may include theft, fraud, illegal financial transactions and the like. In actual business, one transaction often corresponds to different scenes, and the business logic and data distribution behind the scenes differ greatly.
The traditional data modeling mode is cumbersome: different models need to be established for different scenes for separate prevention and control, which leads to repeated construction and repeated application and thus wastes storage and computing resources. The green risk control advocated by ESG provides a different idea. One solution under the green risk control concept is multi-source risk data asset fusion, that is, integrating risk data from different sources to obtain a more comprehensive and accurate data asset, thereby improving the precision and efficiency of risk assessment and decision making, preventing resource waste and achieving the goal of green risk control. Therefore, a better multi-source data asset fusion scheme needs to be provided, which can prevent drift of the conditional probability distribution and the marginal probability distribution, make the data asset more stable, avoid the model's dependence on a single data asset, and improve the generalization capability of the trained model.
Disclosure of Invention
The purpose of the embodiments of the specification is to provide a better multi-source data asset fusion scheme, which can prevent drift of the conditional probability distribution and the marginal probability distribution, make the data asset more stable, avoid the model's dependence on a single data asset, and improve the generalization capability of the trained model.
In order to achieve the above purpose, the embodiments of the specification are implemented as follows:
the embodiments of the specification provide a fusion method of data assets, which includes: obtaining a data asset set composed of data assets from one or more different data sources, the data asset set including a first data asset subset composed of the data assets that carry tag information; performing supervised model training on a target model based on the data assets carrying tag information in the first data asset subset and a preset loss function to obtain a trained target model, where the loss function is constructed from one or more of an invariant risk minimization loss function, a distributionally robust optimization loss function and an ensemble distillation learning loss function; inputting the data assets in the data asset set into the trained target model respectively, so as to predict the data assets in the data asset set through the trained target model and obtain corresponding prediction results; and determining a fusion data asset corresponding to the data asset set based on the obtained prediction results, and providing the fusion data asset to scenes other than the application scene corresponding to the target model for data application.
The embodiments of the specification provide a fusion apparatus of data assets, which includes: a data set acquisition module that obtains a data asset set composed of data assets from one or more different data sources, the data asset set including a first data asset subset composed of the data assets that carry tag information; a model training module that performs supervised model training on a target model based on the data assets carrying tag information in the first data asset subset and a preset loss function to obtain a trained target model, where the loss function is constructed from one or more of an invariant risk minimization loss function, a distributionally robust optimization loss function and an ensemble distillation learning loss function; a prediction module that inputs the data assets in the data asset set into the trained target model respectively, so as to predict the data assets in the data asset set through the trained target model and obtain corresponding prediction results; and a fusion module that determines a fusion data asset corresponding to the data asset set based on the obtained prediction results, and provides the fusion data asset to scenes other than the application scene corresponding to the target model for data application.
The embodiments of the specification provide a fusion device of data assets, which includes: a processor; and a memory arranged to store computer-executable instructions that, when executed, cause the processor to: obtain a data asset set composed of data assets from one or more different data sources, the data asset set including a first data asset subset composed of the data assets that carry tag information; perform supervised model training on a target model based on the data assets carrying tag information in the first data asset subset and a preset loss function to obtain a trained target model, where the loss function is constructed from one or more of an invariant risk minimization loss function, a distributionally robust optimization loss function and an ensemble distillation learning loss function; input the data assets in the data asset set into the trained target model respectively, so as to predict the data assets in the data asset set through the trained target model and obtain corresponding prediction results; and determine a fusion data asset corresponding to the data asset set based on the obtained prediction results, and provide the fusion data asset to scenes other than the application scene corresponding to the target model for data application.
The present specification also provides a storage medium for storing computer-executable instructions that, when executed by a processor, implement the following process: obtaining a data asset set composed of data assets from one or more different data sources, the data asset set including a first data asset subset composed of the data assets that carry tag information; performing supervised model training on a target model based on the data assets carrying tag information in the first data asset subset and a preset loss function to obtain a trained target model, where the loss function is constructed from one or more of an invariant risk minimization loss function, a distributionally robust optimization loss function and an ensemble distillation learning loss function; inputting the data assets in the data asset set into the trained target model respectively, so as to predict the data assets in the data asset set through the trained target model and obtain corresponding prediction results; and determining a fusion data asset corresponding to the data asset set based on the obtained prediction results, and providing the fusion data asset to scenes other than the application scene corresponding to the target model for data application.
Drawings
For a clearer description of the embodiments of the present specification or of the solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some of the embodiments of the present specification, and other drawings may be obtained from these drawings by a person skilled in the art without inventive effort.
FIG. 1A is a schematic diagram of domain adaptation (DA) according to the present specification;
FIG. 1B is a schematic diagram of domain generalization (DG) according to the present specification;
FIG. 1C is a schematic diagram of another case of domain generalization (DG) according to the present specification;
FIG. 2 is a diagram of one embodiment of a fusion method of data assets according to the present disclosure;
FIG. 3 is a schematic illustration of data asset alignment according to the present disclosure;
FIG. 4 is another embodiment of a fusion method of data assets of the present specification;
FIG. 5 is a diagram of yet another embodiment of a method of fusing data assets in the present specification;
FIG. 6 is a schematic diagram of an expert module according to the present disclosure;
FIG. 7 is a diagram of yet another embodiment of a method of fusing data assets in the present specification;
FIG. 8 is a fusion apparatus embodiment of a data asset of the present description;
FIG. 9 is a diagram of an embodiment of a fusion device for data assets in the present specification.
Detailed Description
The embodiment of the specification provides a fusion method, device and equipment of data assets.
In order to make the technical solutions in the present specification better understood by those skilled in the art, the technical solutions in the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification, and it is obvious that the described embodiments are only some embodiments of the present specification, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are intended to be within the scope of the present disclosure.
The embodiments of the specification provide a new multi-source data asset fusion solution that can be used for green risk control, so as to control model development cost as well as model deployment and management cost. In the context of ESG (Environmental, Social and Governance), the purpose of green risk control is to prevent the waste of resources and promote the sustainable development of the economy, society and the environment. ESG is a mechanism for comprehensively assessing an enterprise, aimed at measuring its performance in terms of environment, society and corporate governance, thereby providing information about its sustainability and risk. When users carry out transactions through mobile phones, the transaction platform is often required to perform risk analysis on each transaction event. Typically, the risks involved in a transaction event may include theft, fraud, illegal financial transactions and the like. In actual business, one transaction often corresponds to different scenes, and the business logic and data distribution behind the scenes differ greatly.
The traditional data modeling mode is cumbersome: different models need to be established for different scenes for separate prevention and control, which leads to repeated construction and repeated application and thus wastes storage and computing resources. For example, due to changes in the risk situation, a risk operation platform tends to keep building on top of the existing system; in practice, a single risk may involve multiple data assets with different semantics, and these data assets are called separately during use, which wastes computing resources. The green risk control advocated by ESG provides a different idea. One solution under the green risk control concept is multi-source risk data asset fusion, that is, integrating risk data from different sources to obtain a more comprehensive and accurate data asset, thereby improving the precision and efficiency of risk assessment and decision making. Using one more generalized data asset to replace the working mode in which the data assets originally scattered in each scene are prevented and controlled separately can prevent resource waste and achieve the purpose of green risk control.
The following situations may be encountered during the data asset fusion process. Coexistence of accuracy levels: high-accuracy data assets and low-accuracy data assets coexist. Inconsistent dimensions: for example, list-type data assets, rating-type data assets and score-type data assets have very different dimensions. Conflicting results: the same subject dimension may return inconsistent results in different data assets expressing the same semantics. Semantic overlap: one data asset may contain logic A and logic B while another contains logic B and logic C. Generalization performance: the data assets may originate from different scenes, and the output data asset should keep high generalization in all known scenes and even in potential unknown scenes; in resource risk identification there are many scenes, and the marginal probability distribution and the conditional probability distribution of the data assets differ among them. Domain generalization (DG) studies how to learn a model with strong generalization ability from several data set domains with different data distributions, so as to obtain a better effect on unknown test sets. The biggest difference between domain generalization and domain adaptation (DA) is that, as shown in FIG. 1A, DA has access to data in both the source domain (i.e., the region represented by S) and the target domain (i.e., the region represented by T) during model training, while in DG, as shown in FIG. 1B, only the data in the several source domains used for model training (i.e., the data in S1, S2 and S3) is available, and the test data (i.e., the regions represented by T1, T2 and T3) cannot be accessed. In practical applications, because the marginal probability distribution generally cannot be perceived and the source domain data may overlap with the test data (such as the hatched overlapping area in FIG. 1C), this case may be defined as a mixed DG problem. In mixed DG, drift of the conditional probability distribution or the marginal probability distribution may occur, or the stability of the data assets may differ, so that a dependence on a single data asset is produced and the generalization ability of the trained model is weak. Aiming at the mixed DG situation in the multi-source data asset fusion process, the embodiments of the specification solve the problem of conditional probability distribution drift through invariant risk minimization (IRM), solve the problem of marginal probability distribution drift through distributionally robust optimization (DRO), and solve the stability problem through ensemble distillation learning, thereby reducing the dependence on a single asset, learning a model with strong generalization ability from multiple data sets with different data distributions, and enabling the fused asset to perform well even when facing unknown data distributions. The specific processing can be seen in the details of the following embodiments.
As shown in fig. 2, the embodiment of the present disclosure provides a data asset fusion method, where an execution subject of the method may be a terminal device or a server, where the terminal device may be a mobile terminal device such as a mobile phone, a tablet computer, or a computer device such as a notebook computer or a desktop computer, or may be an IoT device (specifically, such as a smart watch, an in-vehicle device, or the like), and the server may be a separate server, or may be a server cluster formed by multiple servers, and the server may be a background server such as a financial service or an online shopping service, or may be a background server of an application program, or the like. In this embodiment, the execution subject is taken as a server for example for detailed description, and for the case that the execution subject is a terminal device, the following processing of the case of the server may be referred to, and will not be described herein. The method specifically comprises the following steps:
in step S202, a set of data assets from one or more disparate data sources is acquired, the set of data assets including a first subset of data assets that carry tag information.
Here, a data asset may refer to data, recorded physically or electronically, that is owned or controlled by an individual or an organization and can bring future benefits to the individual or organization; a data asset is valuable, quantifiable and readable data in cyberspace over which data rights (such as exploration rights, use rights and ownership) are held. The tag information may include various types, such as default, fraud, illegal financial transactions and the like, and may be set according to the service scenario, which is not limited in the embodiments of the specification.
In implementation, current schemes can only fuse data assets in a single manner, have poor domain generalization, and contain no additional design for improving domain generalization, so the fused data assets usually perform poorly when facing unknown data distributions. In actual business, the specific data distribution of a known scene can be perceived during model training, but for unknown scenes and external scenes there is no channel to perceive them before the demand arrives, so the asset fusion scheme is required to have the domain generalization characteristic. Moreover, in many scenes the marginal probability distribution and the conditional probability distribution of the data assets drift.
Data assets from the one or more different data sources may be acquired in a number of different ways. For example, transaction data generated during a historical period may be acquired from a financial institution for online transactions, from a bank, or from the operator of an instant messaging application, and a data asset set may be constructed from the acquired data assets. The data assets in different data sources may belong to different domains, and the data assets in the same data source may also belong to different domains. The distinction between different domains may be small: for example, the same camera used with different parameters may be defined as two different domains. The distinction may also be large: for example, adults and children may be defined as two different domains, or the data generated by users paying through mobile phones and the data generated by users paying through bank cards may be defined as two different domains.
In addition, some data assets in the data asset set may carry tag information while the remaining data assets do not, or every data asset in the data asset set may carry tag information; this may be set according to the actual situation, and the embodiments of the specification are not limited thereto. The data assets carrying tag information may be combined into the first data asset subset. For example, if the data asset set contains 1,000,000 data assets of which 100,000 carry tag information, those 100,000 data assets may be combined into the first data asset subset.
It should be noted that, in the process of acquiring the data asset set, a driving table may be preset. The driving table may refer to a data table containing various features (such as personal information, financial information, behavior data and the like) through which the corresponding data asset set can be driven. The driving table does not necessarily need to be built over the total number of data assets in the data asset set; for example, if the data assets carrying tag information account for only 1% of the total number of data assets in the data asset set, the division of driving events may be performed on that 1% of the data set.
In step S204, supervised model training is performed on the target model based on the data assets carrying tag information in the first data asset subset and a preset loss function, to obtain a trained target model, where the loss function is constructed from one or more of an invariant risk minimization loss function, a distributionally robust optimization loss function and an ensemble distillation learning loss function.
The target model may be of various types: for example, a risk prevention and control model in online financial services, an information recommendation model, a biometric recognition model (such as a facial recognition model or a fingerprint recognition model), and so on, which may be set according to the actual situation; the embodiments of the specification are not limited thereto. In addition, the target model may be constructed through a number of different algorithms, for example a neural network algorithm or a classification algorithm, which may also be set according to the actual situation.
In practical applications, because the marginal probability distribution generally cannot be perceived and the source domain data may overlap with the test data, a mixed DG problem can be defined on this basis: the data X available to the target scene is actually a subset of the data asset set; the characteristics of the marginal probability distribution of the target scene are consistent with those of the subset of the main platform; and the conditional probability distribution of the target scene is inconsistent with that of the subset of the main platform. Robustness with respect to the marginal probability distribution: the data assets actually applied in the target scene are a subset of the original domain, and robustness should be guaranteed for any way of splitting subsets. With respect to the conditional probability distribution, the tag information of the target scene can be solved through the difference between the distribution of the input data and that of the original domain.
Based on the above mixed DG problem, model training may be performed on the target model using the asset data carrying tag information. Specifically, the loss function required for model training may be preset according to the actual situation. When setting the loss function, the generalization of the data asset set may be taken as the goal and a corresponding loss function may be set. For example, an invariant risk minimization (IRM) loss function may be set as the above loss function. The IRM loss function is a loss function for solving the risk minimization problem in adaptive and transfer learning: it models the distribution of data assets within and between domains and makes the classifier carry the same (invariant) risk for the domains contained in the data asset set, thereby achieving a high level of generalization ability between different domains. By minimizing the invariant risk loss function, the differences between different data sets can be minimized, that is, the target model obtains better generalization ability by minimizing the invariant risk. One of the objectives of the invariant risk minimization loss function is to minimize the invariant risk, which means that the target model behaves relatively stably under different distributions rather than performing well only for a particular data distribution. It is assumed that data assets in different environments can be accessed, that the data distributions of the data assets in different environments are different but potentially causal, and that the causal relationship between the variables of interest and some observable characteristics does not change with changes in the environment.
The invariant risk minimization loss function may be defined as follows, where Φ: X → H is a nonlinear representation layer and ω: H → Y is a linear classification layer:

min over Φ: X → H and ω: H → Y of  Σ_{e ∈ ε_tr} R^e(ω ∘ Φ),  subject to  ω ∈ argmin_{ω̄: H → Y} R^e(ω̄ ∘ Φ)  for all e ∈ ε_tr    (1)

Here Φ denotes the predictor (representation), ω denotes the classifier parameter, e denotes an environment, R^e denotes the loss function in environment e, and ε_tr denotes the set of environments contained in the data asset set. The invariant risk minimization loss function can be constructed through the above expression, or the expression can be reasonably transformed to obtain a new expression as the invariant risk minimization loss function; this may be set according to the actual situation, and the embodiments of the specification are not limited thereto.
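For illustration only (not part of the patent), a minimal PyTorch sketch of the penalty form of this objective follows; the binary-classification loss, the model interface and the dummy-scale trick are assumptions borrowed from the commonly used IRMv1 formulation, and all names are placeholders.

```python
# Sketch, assuming a binary task: the squared gradient of each environment's risk
# with respect to a fixed dummy scale on the logits measures how far a shared
# classifier is from being simultaneously optimal across environments.
import torch
import torch.nn.functional as F

def irm_penalty(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """labels: float tensor of 0./1. values with the same shape as logits."""
    scale = torch.ones(1, requires_grad=True, device=logits.device)
    risk = F.binary_cross_entropy_with_logits(logits * scale, labels)
    grad = torch.autograd.grad(risk, scale, create_graph=True)[0]
    return (grad ** 2).sum()

def irm_objective(model, env_batches, lam: float = 1.0) -> torch.Tensor:
    """Sum of per-environment risks plus lambda times the IRM penalty."""
    total_risk, total_penalty = 0.0, 0.0
    for x, y in env_batches:                       # one (features, labels) batch per environment e
        logits = model(x).squeeze(-1)
        total_risk = total_risk + F.binary_cross_entropy_with_logits(logits, y)
        total_penalty = total_penalty + irm_penalty(logits, y)
    return total_risk + lam * total_penalty
```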
As another example, a distributionally robust optimization (DRO) loss function may also be set as the above loss function. The DRO loss function is an optimization loss function for handling the problem of data distribution uncertainty; its basic idea is to consider the effect of data distribution uncertainty on model performance when optimizing the objective function. Specifically, the goal of DRO is to learn the model under a worst-case distribution scenario so that it can generalize well to the test data, minimizing the prediction error of the model under uncertain data distributions.
DRO can be defined as follows, learning the model under a worst-case distribution scenario, where k denotes the current domain and R_k(θ) denotes the objective loss function of the current domain. The problem DRO addresses is representation disparity, that is, the phenomenon that a small sub-population contributes overly low or overly high loss information:

min over θ ∈ Θ of  sup_{P ∈ 𝒫}  E_{Z ∼ P}[ ℓ(θ; Z) ]    (2)

Here P is the probability distribution (ranging over an uncertainty set 𝒫), Z is the random variable, and θ is the model parameter. The distributionally robust optimization loss function can be constructed through the above expression, or the expression can be reasonably transformed to obtain a new expression as the distributionally robust optimization loss function; this may be set according to the actual situation, and the embodiments of the specification are not limited thereto.
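For illustration, a minimal sketch of the worst-case-group reading of this objective, assuming one batch per domain and a binary-classification loss; the names and the loss choice are not taken from the patent.

```python
# Sketch: compute each domain's average loss R_k(theta) and optimize the largest one,
# i.e. min over theta of max over k of R_k(theta).
import torch
import torch.nn.functional as F

def worst_group_risk(model, group_batches):
    """group_batches: list of (features, labels) pairs, one entry per domain k."""
    group_risks = []
    for x, y in group_batches:
        logits = model(x).squeeze(-1)
        group_risks.append(F.binary_cross_entropy_with_logits(logits, y))  # R_k(theta)
    return torch.stack(group_risks).max()
```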
As another example, an ensemble distillation learning loss function may be set as the above loss function. The ensemble distillation learning loss function is constructed by adopting an ensemble learning approach, introducing the concepts of experts and non-experts over the domains, and performing distillation learning on the non-experts using the expert results of each domain. The ensemble distillation learning loss function can make the data assets more stable and reduce the dependence on any single data asset.
In practical applications, one of the invariant risk minimization loss function, the distributionally robust optimization loss function and the ensemble distillation learning loss function may be selected as the loss function for model training of the target model. Alternatively, two of the three loss functions may be selected and fused into one loss function that is used for model training of the target model, or all three loss functions may be fused and the fused loss function used for model training, and so on, which may be set according to the actual situation. In view of the objective that the ensemble distillation learning loss function can achieve, when selecting from the three loss functions, the selected loss functions may include at least the ensemble distillation learning loss function.
The data assets carrying tag information in the first data asset subset can be input into the target model to obtain a corresponding output result. Based on the output result and the tag information corresponding to the data asset, the corresponding loss information can be calculated through the determined loss function, and the model parameters of the target model can be adjusted based on the obtained loss information. Another data asset is then input into the adjusted target model to obtain a corresponding output result, the corresponding loss information is calculated through the determined loss function based on that output result and the tag information corresponding to that data asset, and the model parameters are adjusted again. This process is repeated until the loss function converges, and the trained target model is obtained.
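The iterative adjustment described above can be pictured with a rough training-loop sketch; the optimizer, learning rate and the signature of loss_fn (which would wrap whichever combination of the IRM, DRO and ensemble-distillation terms is chosen) are assumptions for illustration only.

```python
# Sketch of the supervised training loop on the first data asset subset.
import torch

def train_target_model(model, labeled_loader, loss_fn, epochs: int = 10, lr: float = 1e-3):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for features, labels in labeled_loader:      # data assets carrying tag information
            optimizer.zero_grad()
            loss = loss_fn(model, features, labels)  # e.g. weighted IRM + DRO + distillation terms
            loss.backward()
            optimizer.step()                         # adjust the model parameters
    return model                                     # trained target model once the loss converges
```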
In step S206, the data assets in the data asset set are respectively input into the trained target model, so as to predict the data assets in the data asset set through the trained target model and obtain corresponding prediction results.
In step S208, a fusion data asset corresponding to the data asset set is determined based on the obtained prediction result, and the fusion data asset is provided to other scenes except the application scene corresponding to the target model for data application.
In implementation, the tag information of each data asset in the data asset set can be predicted through the trained target model to obtain a corresponding prediction result, and the obtained prediction results can be matched with the corresponding data assets, so that each data asset in the data asset set and its corresponding prediction result are obtained. The fusion data asset corresponding to the data asset set can then be constructed based on each data asset in the data asset set and its corresponding prediction result. Because the fusion data asset has better generalization, it can be provided to scenes other than the application scene corresponding to the target model for data application; for example, the fusion data asset can be used as sample data to perform model training on a specified model in another scene to obtain a trained specified model, which may be set according to the actual situation, and the embodiments of the specification are not limited thereto.
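A hedged sketch of steps S206 and S208 follows: every data asset in the set is scored by the trained target model and the prediction is attached to form the fusion data asset. The column names, the sigmoid output and the pandas layout are assumptions, not details from the patent.

```python
# Sketch: score all data assets with the trained model and attach the prediction.
import pandas as pd
import torch

def build_fusion_asset(model, asset_df: pd.DataFrame, feature_cols: list) -> pd.DataFrame:
    model.eval()
    with torch.no_grad():
        features = torch.tensor(asset_df[feature_cols].to_numpy(), dtype=torch.float32)
        scores = torch.sigmoid(model(features).squeeze(-1)).numpy()
    fused = asset_df.copy()
    fused["fused_risk_score"] = scores   # prediction result matched to each data asset
    return fused                          # fusion data asset, provided to other scenes
```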
The embodiments of the specification provide a fusion method of data assets. A data asset set composed of data assets from one or more different data sources is obtained, the data asset set including a first data asset subset composed of the data assets carrying tag information. Supervised model training is then performed on a target model based on the data assets carrying tag information in the first data asset subset and a preset loss function to obtain a trained target model, where the loss function is constructed from one or more of an invariant risk minimization loss function, a distributionally robust optimization loss function and an ensemble distillation learning loss function. The data assets in the data asset set can then be respectively input into the trained target model, so as to predict them through the trained target model and obtain corresponding prediction results. Finally, the fusion data asset corresponding to the data asset set can be determined based on the obtained prediction results and provided to scenes other than the application scene corresponding to the target model for data application. In this way, for the mixed DG problem in the multi-source data asset fusion process, the invariant risk minimization loss function prevents drift of the conditional probability distribution, the distributionally robust optimization loss function prevents drift of the marginal probability distribution, and the ensemble distillation learning loss function makes the data assets more stable and reduces the dependence on a single data asset, so that a model with strong generalization ability is learned from multiple data sets with different data distributions and better effects are obtained on unknown test sets. In addition, the fused data asset also performs better when facing unknown data distributions.
In practical applications, after the above step S202, the data assets in the acquired data asset set may further be aligned. There are many possible ways to align the data assets; five optional processing modes are provided below, namely mode one to mode five.
Mode one: aligning the data assets of different domains in the data asset set through the query hit rate, to obtain an aligned data asset set.
In implementation, as shown in FIG. 3, the data assets of different domains in the constructed data asset set are aligned according to the query hit rate. For example, the data asset set contains a model score (a continuous value in 0-1), a policy rating (a discrete value in 1-5) and a list (a single value); the dimensions of the three differ greatly, and one alignment way that maps the discrete values to continuous values is to align according to the query hit rate.
The alignment processing of the data assets of different domains in the data asset set proceeds as follows, where Input denotes the input data and Output denotes the output data.
Mode two: and respectively carrying out standardization processing on the data assets in different domains in the data asset set through a preset standardization algorithm, and constructing an aligned data asset set based on the standardized data assets, wherein the standardization algorithm comprises a z-score standardization algorithm or a min-max standardization algorithm.
In implementation, data assets in different domains, such as model scores, policy ratings and lists, can be standardized so that they fall in the same dimensional range, and the aligned data asset set can then be constructed from the standardized data assets (a minimal code sketch of this standardization is given after mode five below).
Mode three: mapping data assets belonging to discrete values in the data asset set to first data assets belonging to continuous values, and constructing an aligned data asset set based on the first data assets and the data assets belonging to continuous values in the data asset set.
In implementation, for example, the value of the policy rating belongs to 1-5 discrete values, the discrete values of the policy rating can be mapped to ordered continuous values, for example, the policy rating 1-5 is mapped to 1.0-5.0, so that continuity is achieved, and alignment processing can be performed on the data assets in different domains such as model scores and lists, and an aligned data asset set is obtained.
Mode four: mapping data assets belonging to the continuous type numerical value in the data asset set to second data assets belonging to the discrete type numerical value, and constructing an aligned data asset set based on the second data assets and the data assets belonging to the discrete type numerical value in the data asset set.
Mode five: and setting corresponding weights for each data asset according to importance information corresponding to the data assets in different domains in the data asset set, and carrying out alignment processing on the data assets in different domains in the data asset set based on the set weights to obtain an aligned data asset set.
Through the above alignment processing, the data assets of different domains in the data asset set become comparable in the same dimension, so that statistical analysis, modeling or model evaluation can be performed better. When large differences exist between the dimensions of the data assets of different domains, the alignment processing eliminates the dimensional differences, making comparison and comprehensive analysis between the data assets of different domains easier. The alignment processing also ensures that the ranges and distributions of the data assets in different domains are on the same scale, avoiding model bias or instability caused by dimensional differences.
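As referenced under mode two above, a minimal sketch of the standardization-based alignment, assuming numeric arrays for a model score, a policy rating and a list indicator; the example values are illustrative only.

```python
# Sketch: z-score or min-max standardization so that data assets of different
# dimensions land on a comparable scale.
import numpy as np

def z_score(values: np.ndarray) -> np.ndarray:
    return (values - values.mean()) / (values.std() + 1e-12)

def min_max(values: np.ndarray) -> np.ndarray:
    span = values.max() - values.min()
    return (values - values.min()) / (span + 1e-12)

# Example: align a 0-1 model score, a 1-5 policy rating and a 0/1 list hit.
model_score = np.array([0.12, 0.85, 0.43])
policy_rating = np.array([1.0, 5.0, 3.0])
list_hit = np.array([0.0, 1.0, 0.0])
aligned = np.stack([min_max(model_score), min_max(policy_rating), min_max(list_hit)], axis=1)
```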
In addition, as shown in fig. 4, after the data asset set is acquired through the above-described processing of step S202, admission evaluation may be performed on the data assets included in the data asset set, and specifically, see the following processing of step S210 to step S216.
In step S210, admission evaluation is performed on the data assets in the data asset set based on a preset data asset evaluation policy and the tag information of the data assets contained in the data asset set, to obtain an admission evaluation result for each data asset. The data asset evaluation policy is constructed from one or more of the information value (IV), the area under the curve (AUC), KS statistics, the F1 score, the population stability index (PSI), whether the production period of the data asset is delayed, and the non-empty rate of the data asset on preset features corresponding to the data asset.
The information value IV indicates the degree to which a feature contributes to predicting the target, that is, the predictive power of the feature; in general, the higher the IV, the stronger the predictive power of the feature and the higher its information contribution. The area under the curve AUC is the area enclosed by the ROC curve and the coordinate axis; the AUC ranges between 0.5 and 1, and the closer the AUC is to 1, the more reliable the evaluated data asset is. The KS statistic is used to evaluate the ability of a data asset to discriminate good cases from bad cases, and is calculated as the gap between the cumulative percentage of bad cases and the cumulative percentage of good cases. The F1 score, also known as the balanced F score, is defined as the harmonic mean of precision and recall.
In implementation, during the data asset admission evaluation process, a preset data asset evaluation policy may be used to evaluate the quality of the data assets; for example, the data asset evaluation policy is constructed from one or more of the information value IV, the area under the curve AUC, KS statistics, the F1 score, the population stability index PSI, whether the production period of the data asset is delayed, the non-empty rate of the data asset on preset features corresponding to the data asset, and so on. These evaluation indexes can be calculated based on different tag information, so as to evaluate the performance of the data asset on different tag information.
The specific evaluation process can be performed according to the following steps: construct the driving table and the labels; for each data asset, input its data as features into the model or rule corresponding to the evaluation indexes (the information value IV, the area under the curve AUC, KS statistics, the F1 score, the population stability index PSI, whether the production period of the data asset is delayed, the non-empty rate on preset features, and so on) to obtain a prediction result or evaluation score; and, according to the prediction result or evaluation score and in combination with the actual tag information, calculate the corresponding evaluation index values (i.e., the values of IV, AUC, KS statistics, the F1 score and so on), thereby obtaining the admission evaluation result of each data asset.
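For illustration, a sketch of how some of the listed admission indexes (AUC, the KS statistic and a simple PSI) could be computed for one data asset against the labels of the driving table; the bin count, thresholds and the admit rule are assumptions, not values from the patent.

```python
# Sketch: admission indexes for one data asset treated as a scored feature.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def ks_statistic(y_true: np.ndarray, score: np.ndarray) -> float:
    fpr, tpr, _ = roc_curve(y_true, score)
    return float(np.max(tpr - fpr))          # max gap between cumulative bad and good rates

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    edges = np.unique(np.quantile(expected, np.linspace(0, 1, bins + 1)))
    e_hist, _ = np.histogram(expected, bins=edges)
    a_hist, _ = np.histogram(actual, bins=edges)
    e_pct = np.clip(e_hist / max(e_hist.sum(), 1), 1e-6, None)
    a_pct = np.clip(a_hist / max(a_hist.sum(), 1), 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

def admit(y_true, score, baseline_score, auc_min=0.55, ks_min=0.1, psi_max=0.25) -> bool:
    """baseline_score: the same asset's scores from a reference period (assumed)."""
    return (roc_auc_score(y_true, score) >= auc_min
            and ks_statistic(y_true, score) >= ks_min
            and psi(baseline_score, score) <= psi_max)
```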
In step S212, based on the admission evaluation result of each data asset, deleting the data assets whose admission evaluation result exceeds the preset threshold value in the data asset set, and obtaining the deleted data asset set.
The quality of the data assets can be evaluated by comprehensively considering the data asset evaluation strategies on different tag information, and screening is carried out according to a set preset threshold value, so that the data assets in the data asset set can be ensured to have certain prediction capability or evaluation value, and the effect of a subsequent modeling or decision process is improved.
Based on the processing of the above-described step S210 and step S212, the above-described step S204 can be realized by the processing of the following step S214.
In step S214, supervised model training is performed on the target model based on the data assets carrying tag information in the deleted data asset set and a preset loss function, so as to obtain a trained target model.
The specific process of step S214 may be referred to the specific content of step S204, and will not be described herein.
Based on the processing of the above-described step S210 and step S212, the above-described step S206 can be realized by the processing of the following step S216.
In step S216, the data assets in the deleted data asset set are respectively input into the trained target models, so as to predict the data assets in the deleted data asset set through the trained target models, and obtain corresponding prediction results.
The specific process of step S216 may refer to the specific content of step S206, which is not described herein.
In addition, as shown in fig. 5, the above-mentioned process of step S204 or step S214 may be varied, and the following provides an alternative process, and in particular, see the following processes of step S218 and step S220.
In step S218, the fusion of the data assets in the data asset set or in the deleted data asset set is evaluated based on a preset asset fusion evaluation policy to obtain a corresponding evaluation result. The asset fusion evaluation policy is constructed from a comprehensive effect evaluation sub-policy and/or a key segment effect evaluation sub-policy, and the comprehensive effect evaluation sub-policy is constructed from one or more of the information value IV, the area under the curve AUC, KS statistics and the F1 score.
The comprehensive effect evaluation sub-policy can be used to judge whether the fused data asset brings a certain improvement in the comprehensive effect. Take a fraud scoring model as an example: such a model can be constructed and trained using multiple data assets each from a single data source, such as personal basic information, transaction information and behavior data. If a fraud scoring model constructed and trained using the data assets of a single data source performs only moderately on the evaluation indexes in the comprehensive effect evaluation sub-policy (such as the information value IV, the area under the curve AUC, KS statistics and the F1 score), that is, in the middle of the performance range rather than at the high or low end, while the fraud scoring model constructed and trained after fusing the data assets of these data sources shows a significantly improved effect on the evaluation indexes, the fusion of the data assets can be considered effective, which indicates that the credit risk of an individual can be predicted and evaluated more accurately by fusing data assets from multiple data sources. For the key segment effect evaluation sub-policy, different data sources may have different influences on individuals in different segments, so this sub-policy can be used to judge whether the effect of the data asset on key segments is improved. In the anti-fraud field in particular, the risk can be judged by integrating the data assets of different segments using the asset data of multiple single data sources (such as an identification model part, a trusted model part, a transaction record part, a communication record part or a behavior record part). Some data assets perform well on high segments but are missing on middle and low segments, while others perform well on low segments but are missing on middle and high segments. By fusing the data assets of multiple data sources, the risk characteristics of individuals in different segments can be grasped more comprehensively, thereby improving the accuracy of prediction and decision making.
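A small sketch of the key-segment idea above: compare the AUC of a score on the overall population and within low, middle and high segments before and after fusion; the segment boundaries and column handling are assumptions.

```python
# Sketch: per-segment AUC of a risk score, usable for a single-source score and a fused score.
import numpy as np
from sklearn.metrics import roc_auc_score

def segment_auc(y_true: np.ndarray, score: np.ndarray, low: float = 0.3, high: float = 0.7) -> dict:
    results = {"overall": roc_auc_score(y_true, score)}
    segments = {"low": score < low,
                "mid": (score >= low) & (score < high),
                "high": score >= high}
    for name, mask in segments.items():
        if mask.sum() > 1 and len(np.unique(y_true[mask])) == 2:   # need both classes present
            results[name] = roc_auc_score(y_true[mask], score[mask])
    return results
```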
In step S220, if the evaluation result indicates that the fusion processing of the data asset set or the data asset in the deleted data asset set is effective, the supervised model training is performed on the target model based on the data asset carrying the tag information and the preset loss function, so as to obtain a trained target model.
Wherein the evaluation result indicates that it is effective to perform fusion processing on the data assets in the data asset set or the deleted data asset set, that is, the corresponding effect after fusion of the data assets is better than the effect obtained by using the data assets of the single data source.
In practical applications, the constructed loss function may include the invariant risk minimization loss function. The invariant risk minimization loss function is determined by minimizing the average risk over the environments corresponding to the data asset set while requiring a preset parameter to reach its optimal value in every environment at the same time. The minimization of the average risk over the environments corresponding to the data asset set is determined through a preset predictor in the nonlinear representation layer of the target model, the preset parameter is set in the linear classification layer of the target model, and whether the preset parameter reaches the optimal value in every environment at the same time is determined through the gradient of the output result of the predictor with respect to the preset parameter in the invariant risk minimization loss function.
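For illustration, the Φ/ω structure described above can be sketched as a small module with a nonlinear representation layer followed by a linear classification layer; the layer sizes are arbitrary assumptions.

```python
# Sketch of the target model structure implied above: nonlinear Phi, linear omega.
import torch.nn as nn

class TargetModel(nn.Module):
    def __init__(self, in_dim: int, hidden: int = 128, n_classes: int = 2):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU())   # nonlinear representation layer
        self.omega = nn.Linear(hidden, n_classes)                        # linear classification layer

    def forward(self, x):
        return self.omega(self.phi(x))
```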
In practice, based on the above formula (1), the following formula (3) can be obtained by transforming it:

min over Φ of  Σ_{e ∈ ε_tr} [ R^e(Φ) + λ · || ∇_{ω | ω = 1.0} R^e(ω · Φ) ||² ]    (3)
The first term is ERM (Empirical Risk Minimization), i.e., minimizing the average risk over all environments, which can be achieved using one predictor Φ. The goal of ERM is to minimize the average loss over the data asset set, i.e., to enable the target model to better fit the training data by minimizing the empirical risk. However, this approach is susceptible to data asset bias and noise, resulting in poor performance on new data assets; by contrast, IRM focuses more on maintaining the stability and consistency of the target model. One of the goals of IRM is to minimize the differences between different data asset sets, i.e., to give the target model better generalization by minimizing the invariant risk. IRM can obtain consistent results across different data asset sets and avoid the effects of data asset bias and noise on the target model.
The second term is the content added by IRM: it encourages Φ to be a representation for which the same classifier is simultaneously optimal in all environments e. In a specific implementation, this term uses a gradient calculation to determine whether the classifier is locally optimal given Φ, i.e., whether the local ω can still be improved.
The goal of the invariant risk minimization loss function is to learn an ω that is simultaneously optimal in every environment. If ω is optimal in a given environment, the gradient of the invariant risk minimization loss with respect to it is 0; if the gradient is not zero, a penalty term is applied.
Besides, IRM also has the following variants. One variant penalizes the spread of the risks across environments: if Φ(x) extracts invariant features, then P^e(Y | Φ(x)) is the same for every environment e; in equation (5), Var denotes the variance and λ is a parameter. Another variant fits an environment-specific classifier in each environment e and then measures the difference between the loss information of the two classifiers.
In practical applications, the constructed loss function may include the distributionally robust optimization loss function. The distributionally robust optimization loss function is determined by minimizing the expected loss of the domain with the largest expected loss, and the data assets of the domain with the largest expected loss are determined through the objective loss function corresponding to each domain in the data asset set, the probability distribution, the random variable and the model parameters.
In implementation, based on formula (2) above, each data asset in the data asset set has a corresponding grouping. However, grouping according to the prior of business experience depends on expert experience; although artificial grouping can make the model more robust, the prior probability may still deviate considerably from the actual application. Therefore, in the regularization part of DRO, two assumptions are considered:
the grouping information is known in the training process: it is assumed that the grouping information is known from expert experience and "seen" for worst case groupings based on the notion of GroupDRO and related work. In particular, when the grouping information is known, i.e., the data asset contains (x, y, z) triples, but the domain (i.e., the domain) is not assumed to be observed at the time of testing, so the target model cannot directly use the domain (i.e., the domain), instead, one group-robust target model is learned, minimizing the experience worst group risk.
Where Θ represents the range of values of the model parameters and θ represents the target model. The grouping information is unknown during training: the artificially supplied expert experience is discarded and a potential corresponding distribution is generated from the existing distribution. Specifically, D_{χ²}(P‖Q) is taken as the χ²-divergence between the distribution P and the distribution Q; further considering the case where the specific grouping is unknown, the following equation (8) defines a chi-square ball around the distribution P.
Wherein r is a parameter. At this time, the objective of the optimization is
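Equation (8) and the optimization objective are not reproduced in this text; under the standard χ²-constrained DRO formulation they would read as follows (this is a reconstruction consistent with the surrounding description, not the original formulas):

```latex
% Assumed reconstruction: chi-square ball of radius r around P, and the
% corresponding worst-case (distributionally robust) training objective.
\mathcal{B}_{\chi^2}(P, r) = \{\, Q \;:\; D_{\chi^2}(Q \,\|\, P) \le r \,\},
\qquad
\min_{\theta \in \Theta} \;\; \sup_{Q \in \mathcal{B}_{\chi^2}(P, r)}
\mathbb{E}_{(x,y) \sim Q}\big[\ell(\theta; (x, y))\big]
```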
In practical application, the constructed loss function may include an integrated distillation learning loss function. The integrated distillation learning loss function is constructed from the number of domains contained in the data asset set, the number of domains other than the current domain, the information of the current domain, the data distribution information corresponding to each domain contained in the data asset set, the value output by the expert module of each such domain, and the input features of each such domain, where each expert module comprises a multi-layer fully connected network.
In an implementation, each domain has its own expert module whose model parameters are not shared with other domains; each expert module is an expert in its own domain and a non-expert outside it. The specific structure of an expert module is a multi-layer fully connected network for its domain, as shown in fig. 6. The final outcome is the set of scores produced by the experts of the respective domains.
When the data asset is constructed, one domain at a time is chosen as the provider of pseudo-label information, and the values output on that domain by the expert modules of the other domains are calculated. Here K represents the number of domains contained in the data asset set, K-1 represents the number of domains other than the current domain, i represents the current domain, D_i represents the data distribution information corresponding to the i-th domain, E_i represents the value output by the expert module of the i-th domain, X_i represents the input features of the i-th domain, and A represents an optional data enhancement process (e.g., an intra-domain Mixup process).
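The following sketch is illustrative only: it assumes a binary-score expert, omits the optional data-enhancement step A, and uses hypothetical module and tensor names. It shows a per-domain expert as a multi-layer fully connected network and one possible form of the distillation target in which domain i provides pseudo labels that the K-1 other experts are fitted against:

```python
import torch
import torch.nn as nn

class DomainExpert(nn.Module):
    # One expert per domain: a multi-layer fully connected network whose
    # parameters are not shared with the experts of other domains.
    def __init__(self, in_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

def integrated_distillation_loss(experts, domain_batches):
    # experts: list of K DomainExpert modules, one per domain (K >= 2).
    # domain_batches: list of K feature tensors X_i, one per domain D_i.
    # Domain i acts as the pseudo-label provider on its own data; the other
    # K-1 experts are distilled towards its detached output E_i.
    k = len(experts)
    loss = 0.0
    for i, x_i in enumerate(domain_batches):
        pseudo = torch.sigmoid(experts[i](x_i)).detach()   # E_i on X_i
        for j in range(k):
            if j == i:
                continue
            pred = torch.sigmoid(experts[j](x_i))
            loss = loss + nn.functional.binary_cross_entropy(pred, pseudo)
    return loss / (k * (k - 1))
```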
The embodiment of the present disclosure provides a method for fusing data assets. A data asset set composed of data assets from one or more different data sources is obtained, the data asset set including a first data asset subset composed of data assets carrying tag information. Supervised model training is then performed on a target model based on the data assets carrying tag information in the first data asset subset and a preset loss function to obtain a trained target model, the loss function being a loss function constructed by one or more of an invariant risk minimization loss function, a distributed robust optimization loss function and an integrated distillation learning loss function. The data assets in the data asset set can then be respectively input into the trained target model, so that the data assets in the data asset set are predicted by the trained target model to obtain corresponding prediction results. Finally, the fused data asset corresponding to the data asset set can be determined based on the obtained prediction results, and the fused data asset can be provided to scenes other than the application scene corresponding to the target model for data application. In this way, for the mixed domain generalization (DG) problem in the multi-source data asset fusion process, the invariant risk minimization (IRM) loss function can effectively prevent drift of the conditional probability distribution, the distributed robust optimization (DRO) loss function can effectively cope with unknown distribution shifts by optimizing for the worst-performing group, and the integrated distillation learning loss function disassembles and combs the semantics of the multi-source data assets, so that better effects can be obtained on an unknown test set and the fused data asset also performs better when facing an unknown data distribution.
In addition, for situations in the data asset fusion process such as the coexistence of high-accuracy and low-accuracy data assets, inconsistent dimensions, conflicting results and overlapping semantics, the present scheme normalizes and decouples the data assets through data asset alignment, data asset effect evaluation and the like, eliminates or replaces data assets through admission settings, fuses the multi-source data assets in the data asset fusion stage in a supervised learning manner, and further disassembles and combs the semantics of the multi-source data assets through integrated distillation with expert modules.
The following provides a detailed description of the method for fusing data assets in the embodiments of the present disclosure in connection with a specific application scenario, in which the target model is a risk prevention and control model in a financial service and the data asset set is a financial-transaction-related data set. The data asset set may include data assets from one or more domains among privacy data of a user, behavior data of the user, and resource class information of the user, and the resource class information of the user may include, for example, the account information of both parties to a transaction, the transaction amount, the transaction time, and the transaction location. The tag information may include one or more of default, fraud, and illegal financial transaction.
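Purely for illustration (the field names below are hypothetical and not prescribed by this disclosure), a single data asset record in such a financial-transaction-related data set could be organized as:

```python
# A hypothetical record; every field name here is illustrative only.
transaction_record = {
    "account_id_payer": "A-0001",
    "account_id_payee": "B-0042",
    "transaction_amount": 1280.50,
    "transaction_time": "2023-09-20T10:15:00",
    "transaction_location": "Hangzhou",
    "user_behavior": {"login_device": "mobile", "session_length_s": 312},
    "label": "fraud",            # one of: default / fraud / illegal financial transaction
    "source_domain": "payments"  # which data source / domain the asset came from
}
```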
As shown in fig. 7, the embodiment of the present disclosure provides a data asset fusion method whose execution subject may be a terminal device or a server. The terminal device may be a mobile terminal device such as a mobile phone or a tablet computer, a computer device such as a notebook computer or a desktop computer, or an IoT device (specifically, a smart watch, an in-vehicle device, or the like). The server may be a separate server or a server cluster formed by multiple servers, and may be a background server of a financial service or online shopping service, a background server of an application program, or the like. In this embodiment, the execution subject is taken to be a server for the detailed description; for the case where the execution subject is a terminal device, reference may be made to the following description of the server case, which is not repeated here. The method specifically comprises the following steps:
in step S702, a set of financial transaction-related data comprising data assets from one or more different data sources is acquired, the set of financial transaction-related data comprising a first subset of financial transaction-related data comprising data assets carrying tag information.
In step S704, admission evaluation is performed on the data assets in the data set related to the financial transaction based on the preset data asset evaluation policy and the tag information of the data assets included in the data set related to the financial transaction, so as to obtain an admission evaluation result of each data asset, where the data asset evaluation policy is constructed by one or more of an information value IV, an area under curve AUC, KS statistics information, an F1 score, a stability index PSI, whether a production period of the data asset is delayed, and a non-empty rate of the data asset on a characteristic corresponding to the preset data asset.
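Two of the metrics named in this admission policy can be sketched as follows; the score/label arrays, the bucket count and the epsilon smoothing are assumptions made for illustration, not parameters fixed by the disclosure:

```python
import numpy as np

def ks_statistic(scores, labels):
    # KS: maximum gap between the cumulative score distributions of the
    # positive and negative samples of the evaluated data asset.
    scores, labels = np.asarray(scores, dtype=float), np.asarray(labels)
    order = np.argsort(scores)
    labels = labels[order]
    pos = np.cumsum(labels == 1) / max((labels == 1).sum(), 1)
    neg = np.cumsum(labels == 0) / max((labels == 0).sum(), 1)
    return float(np.max(np.abs(pos - neg)))

def psi(expected, actual, buckets=10):
    # PSI: population stability index between a reference distribution of the
    # feature and its current distribution, using equal-width buckets.
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    edges = np.linspace(min(expected.min(), actual.min()),
                        max(expected.max(), actual.max()), buckets + 1)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected) + 1e-6
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual) + 1e-6
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))
```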
In step S706, based on the admission evaluation result of each data asset, the data asset whose admission evaluation result exceeds the preset threshold value in the financial transaction related data set is deleted, and the deleted financial transaction related data set is obtained.
In step S708, the fusion of the data assets in the deleted financial transaction related data set is evaluated based on a preset asset fusion evaluation policy, so as to obtain a corresponding evaluation result, where the asset fusion evaluation policy is an evaluation policy constructed by a comprehensive effect evaluation sub-policy and/or a key segmentation effect evaluation sub-policy, and the comprehensive effect evaluation sub-policy is constructed by one or more of an information value IV, an area under a curve AUC, KS statistical information, and an F1 score.
In step S710, if the evaluation result indicates that the fusion processing of the data assets in the deleted financial transaction related data set is effective, supervised model training is performed on the risk prevention and control model based on the data assets carrying the tag information and the preset loss function, so as to obtain a trained risk prevention and control model.
Wherein the loss function is a loss function constructed from an invariant risk minimization loss function, a distributed robust optimization loss function and an integrated distillation learning loss function. The invariant risk minimization loss function is determined by minimizing the average risk over the environments corresponding to the financial transaction related data set and by requiring a preset parameter to reach its optimal value in every environment at the same time; the average risk over the environments corresponding to the financial transaction related data set is determined by a preset predictor in a nonlinear representation layer of the risk prevention and control model, the preset parameter is set in a linear classification layer of the risk prevention and control model, and whether the preset parameter reaches its optimal value in every environment at the same time is determined through the gradient of the invariant risk minimization loss function with respect to the preset parameter, applied to the output result of the predictor. The distributed robust optimization loss function is determined by minimizing the loss on the data assets of the domain with the largest expected loss, which is determined by the target loss function, the probability distribution, the random variables and the model parameters corresponding to each domain in the data asset set. The integrated distillation learning loss function is constructed from the number of domains contained in the financial transaction related data set, the number of domains other than the current domain, the information of the current domain, the data distribution information corresponding to each domain contained in the financial transaction related data set, the value output by the expert module of each such domain, and the input features of each such domain, where each expert module comprises a multi-layer fully connected network.
Based on the above, a unified model optimization objective is as follows:
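The unified objective itself is not reproduced in this text; one plausible reconstruction, treating the three losses described above as weighted additive terms (the weights λ₁, λ₂, λ₃ and the exact combination are an assumption, not the original formula), is:

```latex
% Assumed additive combination of the losses described in this embodiment.
\min_{\theta}\;
\underbrace{\tfrac{1}{|\mathcal{E}|}\sum_{e \in \mathcal{E}} R^{e}(\theta)}_{\text{average (ERM) risk}}
\;+\; \lambda_{1}\,\mathcal{L}_{\mathrm{IRM}}(\theta)
\;+\; \lambda_{2}\,\mathcal{L}_{\mathrm{DRO}}(\theta)
\;+\; \lambda_{3}\,\mathcal{L}_{\mathrm{distill}}(\theta)
```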
in step S712, the data assets in the deleted financial transaction related data set are respectively input into the trained risk prevention and control model, so as to predict the data assets in the deleted financial transaction related data set through the trained risk prevention and control model, and obtain a corresponding prediction result.
In step S714, a fused data asset corresponding to the financial transaction related data set is determined based on the obtained prediction result, and the fused data asset is provided to other scenes except the application scene corresponding to the risk prevention and control model for data application.
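Steps S702 to S714 can be summarized as the following pipeline sketch; all parameter names and the callables passed in are placeholders for the operations described in this embodiment and are not part of the original disclosure:

```python
from typing import Any, Callable, List

def fuse_financial_data_assets(
    assets: List[Any],
    admission_score: Callable[[Any], float],   # S704: IV/AUC/KS/F1/PSI-based score
    threshold: float,                          # S706: admission threshold
    fusion_is_effective: Callable[[List[Any]], bool],  # S708: fusion evaluation
    train_risk_model: Callable[[List[Any]], Any],      # S710: IRM/DRO/distillation training
) -> List[Any]:
    # S704/S706: admission evaluation and deletion of over-threshold assets.
    kept = [a for a in assets if admission_score(a) <= threshold]
    # S708: proceed only if fusing the kept assets is judged effective.
    if not fusion_is_effective(kept):
        return []
    # S710: supervised training of the risk prevention and control model.
    model = train_risk_model(kept)
    # S712/S714: predict on each kept asset; the predictions determine the
    # fused data asset provided to other scenes.
    return [model.predict(a) for a in kept]
```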
The embodiment of the present disclosure provides a method for fusing data assets. A data asset set composed of data assets from one or more different data sources is obtained, the data asset set including a first data asset subset composed of data assets carrying tag information. Supervised model training is then performed on a target model based on the data assets carrying tag information in the first data asset subset and a preset loss function to obtain a trained target model, the loss function being a loss function constructed by one or more of an invariant risk minimization loss function, a distributed robust optimization loss function and an integrated distillation learning loss function. The data assets in the data asset set can then be respectively input into the trained target model, so that the data assets in the data asset set are predicted by the trained target model to obtain corresponding prediction results. Finally, the fused data asset corresponding to the data asset set can be determined based on the obtained prediction results, and the fused data asset can be provided to scenes other than the application scene corresponding to the target model for data application. In this way, for the mixed domain generalization (DG) problem in the multi-source data asset fusion process, the invariant risk minimization (IRM) loss function can effectively prevent drift of the conditional probability distribution, the distributed robust optimization (DRO) loss function can effectively cope with unknown distribution shifts by optimizing for the worst-performing group, and the integrated distillation learning loss function disassembles and combs the semantics of the multi-source data assets, so that better effects can be obtained on an unknown test set and the fused data asset also performs better when facing an unknown data distribution.
In addition, for situations in the data asset fusion process such as the coexistence of high-accuracy and low-accuracy data assets, inconsistent dimensions, conflicting results and overlapping semantics, the present scheme normalizes and decouples the data assets through data asset alignment, data asset effect evaluation and the like, eliminates or replaces data assets through admission settings, fuses the multi-source data assets in the data asset fusion stage in a supervised learning manner, and further disassembles and combs the semantics of the multi-source data assets through integrated distillation with expert modules.
Based on the same concept as the above method for fusing data assets provided in the embodiments of the present disclosure, an embodiment of the present disclosure further provides a device for fusing data assets, as shown in fig. 8.
The fusion device of the data asset comprises: a data set acquisition module 801, a model training module 802, a prediction module 803, and a fusion module 804, wherein:
a data set acquisition module 801 that acquires a data asset set of data assets from one or more different data sources, the data asset set including a first subset of data assets that carry tag information;
a model training module 802 that performs supervised model training on a target model based on the data assets carrying tag information in the first data asset subset and a preset loss function to obtain a trained target model, wherein the loss function is a loss function constructed by one or more of an invariant risk minimization loss function, a distributed robust optimization loss function and an integrated distillation learning loss function;
the prediction module 803 is used for respectively inputting the data assets in the data asset set into the trained target model so as to predict the data assets in the data asset set through the trained target model and obtain corresponding prediction results;
and a fusion module 804, for determining a fusion data asset corresponding to the data asset set based on the obtained prediction result, and providing the fusion data asset to other scenes except the application scene corresponding to the target model for data application.
In an embodiment of the present disclosure, the apparatus further includes:
the first alignment module is used for carrying out alignment processing on the data assets of different domains in the data asset set through the checking yield to obtain an aligned data asset set; or,
The second alignment module is used for respectively carrying out standardization processing on the data assets in different domains in the data asset set through a preset standardization algorithm, and constructing an aligned data asset set based on the standardized data assets, wherein the standardization algorithm comprises a z-score standardization algorithm or a min-max standardization algorithm; or,
a third alignment module that maps data assets belonging to discrete values in the set of data assets to first data assets belonging to continuous values, and constructs an aligned set of data assets based on the first data assets and the data assets belonging to continuous values in the set of data assets; or,
a fourth alignment module that maps data assets belonging to the continuous type numerical value in the data asset set to second data assets belonging to the discrete type numerical value, and constructs an aligned data asset set based on the second data assets and the data assets belonging to the discrete type numerical value in the data asset set; or,
and a fifth alignment module, for setting corresponding weight for each data asset according to the importance information corresponding to the data assets of different domains in the data asset set, and performing alignment processing on the data assets of different domains in the data asset set based on the set weight to obtain an aligned data asset set.
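A minimal numpy sketch of some of the alignment operations the modules above describe (z-score and min-max standardization, and weight-based alignment); the array and dictionary names are illustrative, and the weighted alignment assumes the domains hold row-aligned samples for the same entities:

```python
import numpy as np

def z_score(values: np.ndarray) -> np.ndarray:
    # z-score standardization: zero mean, unit variance per feature column.
    return (values - values.mean(axis=0)) / (values.std(axis=0) + 1e-12)

def min_max(values: np.ndarray) -> np.ndarray:
    # min-max standardization into [0, 1] per feature column.
    lo, hi = values.min(axis=0), values.max(axis=0)
    return (values - lo) / (hi - lo + 1e-12)

def weighted_align(domain_values: dict, weights: dict) -> np.ndarray:
    # Scale each domain's standardized data asset by its importance weight
    # before concatenating the domains into one aligned data asset set.
    return np.concatenate(
        [weights[d] * z_score(v) for d, v in domain_values.items()], axis=1)
```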
In an embodiment of the present disclosure, the apparatus further includes:
the admission evaluation module is used for carrying out admission evaluation on the data assets in the data asset set based on a preset data asset evaluation strategy and the tag information of the data assets contained in the data asset set to obtain an admission evaluation result of each data asset, wherein the data asset evaluation strategy is constructed through one or more of an information value IV, an area under a curve AUC, KS statistical information, an F1 score, a stability index PSI, whether the output period of the data asset is delayed or not and the non-empty rate of the data asset on the corresponding characteristic of the preset data asset;
the screening module is used for deleting the data assets, the admission evaluation result of which exceeds a preset threshold value, in the data asset set based on the admission evaluation result of each data asset, and a deleted data asset set is obtained;
the model training module 802 performs supervised model training on the target model based on the data assets carrying tag information in the deleted data asset set and a preset loss function to obtain a trained target model;
the prediction module 803 inputs the data assets in the deleted data asset set into the trained target model respectively, so as to predict the data assets in the deleted data asset set through the trained target model, and obtain a corresponding prediction result.
In the embodiment of the present specification, the model training module 802 includes:
the fusion evaluation unit is used for evaluating fusion of the data assets in the data asset set or the deleted data asset set based on a preset asset fusion evaluation strategy to obtain a corresponding evaluation result, wherein the asset fusion evaluation strategy is an evaluation strategy constructed through a comprehensive effect evaluation sub-strategy and/or a key segmentation effect evaluation sub-strategy, and the comprehensive effect evaluation sub-strategy is constructed through one or more of an information value IV, an area under a curve AUC, KS statistical information and an F1 score;
and the model training unit is used for performing supervised model training on the target model based on the data asset carrying the tag information and a preset loss function to obtain a trained target model if the evaluation result indicates that the fusion processing of the data asset in the data asset set or the deleted data asset set is effective.
In this embodiment of the present disclosure, the constructed loss function includes an invariant risk minimization loss function. The invariant risk minimization loss function is determined by minimizing the average risk over the environments corresponding to the data asset set and by requiring a preset parameter to reach its optimal value in every environment at the same time; the average risk over the environments corresponding to the data asset set is determined by a preset predictor in a nonlinear representation layer of the target model, the preset parameter is set in a linear classification layer of the target model, and whether the preset parameter reaches its optimal value in every environment at the same time is determined through the gradient of the invariant risk minimization loss function with respect to the preset parameter, applied to the output result of the predictor.
In this embodiment of the present disclosure, the constructed loss function includes a distributed robust optimization loss function, where the distributed robust optimization loss function is determined by minimizing the loss on the data assets of the domain with the maximum expected loss, and the domain with the maximum expected loss is determined by a target loss function, a probability distribution, a random variable, and model parameters corresponding to each domain in the data asset set.
In this embodiment of the present disclosure, the constructed loss function includes an integrated distillation learning loss function, where the integrated distillation learning loss function is constructed by a number of domains included in the data asset set, a number of domains other than the current domain, information of the current domain, data distribution information corresponding to any domain in the data asset set, a numerical value output by an expert module of any domain in the data asset set, and an input feature of any domain in the data asset set, and the expert module includes a multi-layer fully connected network.
In this embodiment of the present disclosure, the target model is a risk prevention and control model in a financial service, the data asset set includes one or more fields of privacy data of a user, behavior data of the user, and resource class information of the user, and the tag information includes one or more of default, fraud, and illegal financial transactions.
The embodiment of the specification provides a fusion device of data assets. A data asset set formed by data assets from one or more different data sources is obtained; supervised model training can then be carried out on a target model based on the data assets carrying tag information and a preset loss function to obtain a trained target model, the loss function being a loss function constructed by one or more of an invariant risk minimization loss function, a distributed robust optimization loss function and an integrated distillation learning loss function; afterwards, the data assets in the data asset set can be respectively input into the trained target model so as to be predicted by the trained target model, obtaining corresponding prediction results; finally, the fused data asset corresponding to the data asset set can be determined based on the obtained prediction results and provided to scenes other than the application scene corresponding to the target model. In this way, for the mixed DG problem in the multi-source data asset fusion process, the invariant risk minimization (IRM) loss function can effectively prevent drift of the conditional probability distribution, the distributed robust optimization (DRO) loss function can effectively cope with unknown distribution shifts by optimizing for the worst-performing group, and the integrated distillation learning loss function disassembles and combs the semantics of the multi-source data assets, so that better effects can be obtained on an unknown test set and the fused data asset also performs better when facing an unknown data distribution.
Based on the same concept as the above device for fusing data assets provided in the embodiments of the present disclosure, an embodiment of the present disclosure further provides a fusion device of data assets, as shown in fig. 9.
The data asset fusion device may be the terminal device or the server or the like provided in the above embodiments.
The fusion device of data assets may vary widely in configuration or performance and may include one or more processors 901 and a memory 902, and one or more applications or data may be stored in the memory 902. The memory 902 may be transient storage or persistent storage. The application programs stored in the memory 902 may include one or more modules (not shown in the figure), and each module may include a series of computer-executable instructions for the fusion device of data assets. Still further, the processor 901 may be configured to communicate with the memory 902 and execute the series of computer-executable instructions in the memory 902 on the fusion device of data assets. The fusion device of data assets may also include one or more power supplies 903, one or more wired or wireless network interfaces 904, one or more input/output interfaces 905, and one or more keyboards 906.
In particular, in this embodiment, a fusion device for a data asset includes a memory, and one or more programs, where the one or more programs are stored in the memory, and the one or more programs may include one or more modules, and each module may include a series of computer-executable instructions for the fusion device for a data asset, and execution of the one or more programs by one or more processors includes computer-executable instructions for:
obtaining a set of data assets from one or more different data sources, the set of data assets including a first subset of data assets that carry tag information;
performing supervised model training on a target model based on data assets carrying tag information in the first data asset subset and a preset loss function to obtain a trained target model, wherein the loss function is a loss function constructed by one or more of an invariant risk minimization loss function, a distributed robust optimization loss function and an integrated distillation learning loss function;
Respectively inputting the data assets in the data asset set into a trained target model to predict the data assets in the data asset set through the trained target model so as to obtain corresponding prediction results;
and determining a fusion data asset corresponding to the data asset set based on the obtained prediction result, and providing the fusion data asset for other scenes except the application scene corresponding to the target model for data application.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for a data asset fusion device embodiment, since it is substantially similar to a method embodiment, the description is relatively simple, as relevant to see a partial description of the method embodiment.
The embodiment of the present disclosure provides a fusion device for data assets. A data asset set composed of data assets from one or more different data sources is obtained, the data asset set including a first data asset subset composed of data assets carrying tag information. Supervised model training is then performed on a target model based on the data assets carrying tag information in the first data asset subset and a preset loss function to obtain a trained target model, the loss function being a loss function constructed by one or more of an invariant risk minimization loss function, a distributed robust optimization loss function and an integrated distillation learning loss function. The data assets in the data asset set can then be respectively input into the trained target model, so that the data assets in the data asset set are predicted by the trained target model to obtain corresponding prediction results. Finally, the fused data asset corresponding to the data asset set can be determined based on the obtained prediction results, and the fused data asset can be provided to scenes other than the application scene corresponding to the target model for data application. In this way, for the mixed domain generalization (DG) problem in the multi-source data asset fusion process, the invariant risk minimization (IRM) loss function can effectively prevent drift of the conditional probability distribution, the distributed robust optimization (DRO) loss function can effectively cope with unknown distribution shifts by optimizing for the worst-performing group, and the integrated distillation learning loss function disassembles and combs the semantics of the multi-source data assets, so that better effects can be obtained on an unknown test set and the fused data asset also performs better when facing an unknown data distribution.
Further, based on the methods shown in fig. 2 to fig. 7, one or more embodiments of the present disclosure further provide a storage medium for storing computer-executable instruction information. In a specific embodiment, the storage medium may be a USB flash disk, an optical disc, a hard disk, or the like, and the computer-executable instruction information stored in the storage medium can implement the following flow when executed by a processor:
obtaining a set of data assets from one or more different data sources, the set of data assets including a first subset of data assets that carry tag information;
performing supervised model training on a target model based on data assets carrying tag information in the first data asset subset and a preset loss function to obtain a trained target model, wherein the loss function is a loss function constructed by one or more of an invariant risk minimization loss function, a distributed robust optimization loss function and an integrated distillation learning loss function;
respectively inputting the data assets in the data asset set into a trained target model to predict the data assets in the data asset set through the trained target model so as to obtain corresponding prediction results;
And determining a fusion data asset corresponding to the data asset set based on the obtained prediction result, and providing the fusion data asset for other scenes except the application scene corresponding to the target model for data application.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for one of the above-described storage medium embodiments, since it is substantially similar to the method embodiment, the description is relatively simple, and reference is made to the description of the method embodiment for relevant points.
The embodiment of the present disclosure provides a storage medium. A data asset set composed of data assets from one or more different data sources is obtained; supervised model training can then be performed on a target model based on the data assets carrying tag information and a preset loss function to obtain a trained target model, the loss function being a loss function constructed by one or more of an invariant risk minimization loss function, a distributed robust optimization loss function and an integrated distillation learning loss function; afterwards, the data assets in the data asset set can be respectively input into the trained target model so as to be predicted by the trained target model, obtaining corresponding prediction results; finally, the fused data asset corresponding to the data asset set can be determined based on the obtained prediction results and provided to scenes other than the application scene corresponding to the target model. In this way, for the mixed DG problem in the multi-source data asset fusion process, the invariant risk minimization (IRM) loss function can effectively prevent drift of the conditional probability distribution, the distributed robust optimization (DRO) loss function can effectively cope with unknown distribution shifts by optimizing for the worst-performing group, and the integrated distillation learning loss function disassembles and combs the semantics of the multi-source data assets, so that better effects can be obtained on an unknown test set and the fused data asset also performs better when facing an unknown data distribution.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
In the 1990s, an improvement to a technology could be clearly distinguished as an improvement in hardware (e.g., an improvement to a circuit structure such as a diode, a transistor or a switch) or an improvement in software (an improvement to a method flow). However, with the development of technology, many improvements of method flows today can be regarded as direct improvements of hardware circuit structures. Designers almost always obtain a corresponding hardware circuit structure by programming an improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement of a method flow cannot be realized by a hardware entity module. For example, a programmable logic device (Programmable Logic Device, PLD) (e.g., a field programmable gate array (Field Programmable Gate Array, FPGA)) is an integrated circuit whose logic function is determined by the user's programming of the device. A designer programs to "integrate" a digital system onto a PLD without requiring the chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, nowadays, instead of manually manufacturing integrated circuit chips, such programming is mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the original code before compiling must also be written in a specific programming language, called a hardware description language (Hardware Description Language, HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used. It will also be apparent to those skilled in the art that a hardware circuit implementing the logic method flow can be readily obtained by merely slightly programming the method flow into an integrated circuit using one of the hardware description languages described above.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor and a computer readable medium storing computer readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a programmable logic controller, or an embedded microcontroller; examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art will also appreciate that, in addition to implementing the controller purely as computer readable program code, it is entirely possible to implement the same functionality by logically programming the method steps such that the controller takes the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may thus be regarded as a kind of hardware component, and the means included therein for performing various functions may also be regarded as structures within the hardware component. Or even the means for performing various functions may be regarded both as software modules implementing the method and as structures within the hardware component.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being functionally divided into various units, respectively. Of course, the functionality of the units may be implemented in one or more software and/or hardware when implementing one or more embodiments of the present description.
It will be appreciated by those skilled in the art that embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, one or more embodiments of the present description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Moreover, one or more embodiments of the present description can take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
Embodiments of the present description are described with reference to flowchart illustrations and/or block diagrams of methods, apparatuses (systems), and computer program products according to the embodiments of the specification. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing device, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or nonvolatile memory, such as read only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article or apparatus that comprises the element.
It will be appreciated by those skilled in the art that embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, one or more embodiments of the present description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Moreover, one or more embodiments of the present description can take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
One or more embodiments of the present specification may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. One or more embodiments of the present description may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for system embodiments, since they are substantially similar to method embodiments, the description is relatively simple, as relevant to see a section of the description of method embodiments.
The foregoing is merely exemplary of the present disclosure and is not intended to limit the present disclosure. Various modifications and alterations to this specification will become apparent to those skilled in the art. Any modifications, equivalent substitutions, improvements, or the like, which are within the spirit and principles of the present description, are intended to be included within the scope of the claims of the present description.

Claims (10)

1. A method of fusing data assets, the method comprising:
obtaining a set of data assets from one or more different data sources, the set of data assets including a first subset of data assets that carry tag information;
performing supervised model training on a target model based on data assets carrying tag information in the first data asset subset and a preset loss function to obtain a trained target model, wherein the loss function is a loss function constructed by one or more of an invariant risk minimization loss function, a distributed robust optimization loss function and an integrated distillation learning loss function;
respectively inputting the data assets in the data asset set into a trained target model to predict the data assets in the data asset set through the trained target model so as to obtain corresponding prediction results;
and determining a fusion data asset corresponding to the data asset set based on the obtained prediction result, and providing the fusion data asset for other scenes except the application scene corresponding to the target model for data application.
2. The method of claim 1, after the acquiring the set of data assets from the one or more disparate data sources, the method further comprising:
Aligning the data assets of different domains in the data asset set through the checking yield to obtain an aligned data asset set; or,
respectively carrying out standardization processing on the data assets in different domains in the data asset set through a preset standardization algorithm, and constructing an aligned data asset set based on the standardized data assets, wherein the standardization algorithm comprises a z-score standardization algorithm or a min-max standardization algorithm; or,
mapping data assets belonging to discrete values in the data asset set to first data assets belonging to continuous values, and constructing an aligned data asset set based on the first data assets and the data assets belonging to continuous values in the data asset set; or,
mapping data assets belonging to continuous values in the data asset set to second data assets belonging to discrete values, and constructing an aligned data asset set based on the second data assets and the data assets belonging to discrete values in the data asset set; or,
and setting corresponding weights for each data asset according to the importance information corresponding to the data assets in different domains in the data asset set, and carrying out alignment processing on the data assets in different domains in the data asset set based on the set weights to obtain an aligned data asset set.
3. The method of claim 1, the method further comprising:
performing admission evaluation on the data assets in the data asset set based on a preset data asset evaluation strategy and tag information of the data assets contained in the data asset set to obtain an admission evaluation result of each data asset, wherein the data asset evaluation strategy is constructed through one or more of an information value IV, an area under a curve AUC, KS statistical information, an F1 score, a stability index PSI, whether a production period of the data asset is delayed or not and a non-empty rate of the data asset on a corresponding characteristic of the preset data asset;
deleting the data assets of which the admission evaluation results exceed a preset threshold value in the data asset set based on the admission evaluation results of each data asset, and obtaining a deleted data asset set;
the performing supervised model training on the target model based on the data assets carrying tag information in the first data asset subset and a preset loss function to obtain a trained target model, including:
performing supervised model training on the target model based on the data assets carrying the tag information in the deleted data asset set and a preset loss function to obtain a trained target model;
The data assets in the data asset set are respectively input into the trained target models, so that the data assets in the data asset set are predicted through the trained target models, and corresponding prediction results are obtained, and the method comprises the following steps:
and respectively inputting the data assets in the deleted data asset set into a trained target model to predict the data assets in the deleted data asset set through the trained target model, so as to obtain a corresponding prediction result.
4. A method according to claim 3, wherein the supervised model training of the target model based on the data asset carrying the tag information and the predetermined loss function results in a trained target model, comprising:
evaluating the fusion of the data assets in the data asset set or the deleted data asset set based on a preset asset fusion evaluation strategy to obtain a corresponding evaluation result, wherein the asset fusion evaluation strategy is an evaluation strategy constructed through a comprehensive effect evaluation sub-strategy and/or a key segmentation effect evaluation sub-strategy, and the comprehensive effect evaluation sub-strategy is constructed through one or more of an information value IV, an area under a curve AUC, KS statistical information and an F1 score;
And if the evaluation result indicates that the fusion processing of the data asset set or the data asset in the deleted data asset set is effective, performing supervised model training on the target model based on the data asset carrying the tag information and a preset loss function to obtain a trained target model.
5. The method of any of claims 1-4, the constructed loss function comprising an invariant risk minimization loss function determined by minimizing an average risk over the environments corresponding to the set of data assets and by a preset parameter concurrently reaching an optimal value within each of the environments, the average risk over the environments corresponding to the set of data assets being determined by a preset predictor in a nonlinear representation layer in the target model, the preset parameter being disposed in a linear classification layer in the target model, and the concurrent reaching of an optimal value within each of the environments being determined through the gradient of the invariant risk minimization loss function with respect to the preset parameter applied to the output result of the predictor.
6. The method of claim 5, the constructed loss function comprising a distributed robust optimization loss function determined by minimizing the loss on the data assets of the domain with the maximum expected loss, the domain with the maximum expected loss being determined by a target loss function, a probability distribution, a random variable, and model parameters corresponding to each domain in the set of data assets.
7. The method of claim 6, wherein the constructed loss function comprises an integrated distillation learning loss function constructed from a number of domains contained in the data asset set, a number of domains outside of a current domain, information of a current domain, data distribution information corresponding to any domain in the data asset set, values output by expert modules of any domain in the data asset set, input features of any domain in the data asset set, the expert modules comprising a multi-layer fully connected network.
8. The method of claim 7, the target model being a risk prevention and control model in a financial business, the set of data assets comprising data assets of one or more domains of user privacy data, user behavior data, and user resource class information, the tag information comprising one or more of default, fraud, and illegal financial transactions.
9. A fusion apparatus for data assets, the apparatus comprising:
a data set acquisition module that acquires a data asset set of data assets from one or more different data sources, the data asset set including a first subset of data assets that carry tag information;
The model training module is used for performing supervised model training on the target model based on the data assets carrying the tag information in the first data asset subset and a preset loss function to obtain a trained target model, wherein the loss function is a loss function constructed by one or more of an invariant risk minimization loss function, a distributed robust optimization loss function and an integrated distillation learning loss function;
the prediction module is used for respectively inputting the data assets in the data asset set into the trained target model so as to predict the data assets in the data asset set through the trained target model and obtain corresponding prediction results;
and the fusion module is used for determining fusion data assets corresponding to the data asset set based on the obtained prediction result, and providing the fusion data assets for other scenes except the application scene corresponding to the target model for data application.
10. A fusion device of data assets, the fusion device of data assets comprising:
a processor; and
a memory arranged to store computer executable instructions that, when executed, cause the processor to:
Obtaining a set of data assets from one or more different data sources, the set of data assets including a first subset of data assets that carry tag information;
performing supervised model training on a target model based on data assets carrying tag information in the first data asset subset and a preset loss function to obtain a trained target model, wherein the loss function is a loss function constructed by one or more of an invariant risk minimization loss function, a distributed robust optimization loss function and an integrated distillation learning loss function;
respectively inputting the data assets in the data asset set into a trained target model to predict the data assets in the data asset set through the trained target model so as to obtain corresponding prediction results;
and determining a fusion data asset corresponding to the data asset set based on the obtained prediction result, and providing the fusion data asset for other scenes except the application scene corresponding to the target model for data application.