CN110738476B - Sample migration method, device and equipment - Google Patents

Sample migration method, device and equipment

Info

Publication number
CN110738476B
CN110738476B CN201910905305.XA CN201910905305A CN110738476B CN 110738476 B CN110738476 B CN 110738476B CN 201910905305 A CN201910905305 A CN 201910905305A CN 110738476 B CN110738476 B CN 110738476B
Authority
CN
China
Prior art keywords
target domain
sample
same
source
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910905305.XA
Other languages
Chinese (zh)
Other versions
CN110738476A (en
Inventor
王骏
陈弢
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd
Priority to CN201910905305.XA
Publication of CN110738476A
Application granted
Publication of CN110738476B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q20/00 Payment architectures, schemes or protocols
    • G06Q20/08 Payment architectures
    • G06Q20/085 Payment architectures involving remote charge determination or related payment systems
    • G06Q20/0855 Payment architectures involving remote charge determination or related payment systems involving a third party
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/251 Fusion techniques of input or preprocessed data

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Image Analysis (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A sample migration method, apparatus and device are disclosed. In the scheme provided by the embodiments of this specification, source samples are extracted from a source domain in which the service is mature, together with a small number of target domain samples from a target domain. The features the two kinds of sample have in common (the same features) and the features present only in the source samples (the different features) are identified and handled separately: the same features are adapted after high-dimensional mapping, and the different features are completed. This yields corrected source samples and target domain samples, which are merged into a fused sample set that can be used in the target domain.

Description

Sample migration method, device and equipment
Technical Field
The embodiments of this specification relate to the field of information technology, and in particular to a sample migration method, device and equipment.
Background
Building a risk control model necessarily involves a model training stage, and training requires business data and labeled data accumulated over a certain period. In practice, a service that is already mature elsewhere often has to be launched from scratch in a new environment.
For example, third-party payment is mature in China, but when the business is to be launched in a foreign country, the business scenarios are similar while the environments differ, and only a very small number of samples have accumulated in the early stage of the business. As a result, in the early stage after launch it is difficult to train on local data and to establish effective risk control models and strategies for risk prevention and control.
Based on this, a reliable sample migration scheme is needed.
Disclosure of Invention
An object of the embodiments of the present application is to provide a reliable sample migration scheme.
In order to solve the above technical problem, the embodiment of the present application is implemented as follows:
a method of sample migration, comprising:
acquiring a source sample set and a target domain sample set, wherein the source sample and the target domain sample contain the same number of features, and the source sample and the target domain sample are applied to similar business fields;
determining the same features and the different features contained in the source sample and the target domain sample;
for the same features, mapping the source samples and the target domain samples to the same high-dimensional space, determining the closest feature value distribution of the source sample set and the target domain sample set under the same features, and, according to the closest feature value distribution, changing the feature value of each sample under the same features to its feature value in the high-dimensional space;
for the different features, completing the feature values of the different features in the target domain samples according to the values of the different features in the source sample set;
and merging the source sample set and the target domain sample set after the feature values are changed, to generate a fused sample set for model training in the target domain.
Correspondingly, embodiments of the present specification further provide a sample migration device, including:
a sample acquisition module, which acquires a source sample set and a target domain sample set, wherein the source sample and the target domain sample contain the same number of features, and the source sample and the target domain sample are applied to similar business fields;
the characteristic determining module is used for determining the same characteristics and different characteristics contained in the source sample and the target domain sample;
the same feature transformation module is used for mapping the source sample and the target domain sample to the same high-dimensional space aiming at the same feature, determining the closest feature value distribution in the source sample set and the target domain sample set under the same feature, and changing the feature value of each sample under the same feature into the feature value under the high-dimensional space according to the closest feature value distribution;
the different feature transformation module is used for completing the feature values of different features in the target domain sample according to the values of the different features in the source sample set aiming at the different features;
and the fusion module is used for merging the source sample set and the target domain sample set after the characteristic value is changed to generate a fusion sample set for model training in the target domain.
In the scheme provided by the embodiments of this specification, source samples are extracted from a source domain in which the service is mature, together with a small number of target domain samples from a target domain. The same features and the different features are identified and handled separately: the same features are adapted after high-dimensional mapping, and the different features are completed. This yields corrected source samples and target domain samples, which are merged into a fused sample set that can be used in the target domain. Because the target domain samples and the source samples belong to similar business scenarios, this approach makes the feature value distributions of the two approximately equivalent, which increases the size of the fused sample set and improves its reliability in the target domain.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of embodiments of the invention.
In addition, any one of the embodiments in the present specification is not required to achieve all of the effects described above.
Drawings
To illustrate the embodiments of this specification or the technical solutions in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some of the embodiments of this specification, and those skilled in the art can derive other drawings from them.
Fig. 1 is a schematic flow chart of a sample migration method provided in an embodiment of the present disclosure;
Fig. 2 is a schematic diagram of a feature description provided in an embodiment of the present disclosure;
Fig. 3 is a schematic diagram of filling in the features of a target domain sample, provided in an embodiment of the present disclosure;
Fig. 4 is a schematic structural diagram of a sample migration device provided in an embodiment of the present disclosure;
Fig. 5 is a schematic structural diagram of a device for configuring the method of an embodiment of the present disclosure.
Detailed Description
In order to make those skilled in the art better understand the technical solutions in the embodiments of the present specification, the technical solutions in the embodiments of the present specification will be described in detail below with reference to the drawings in the embodiments of the present specification, and it is obvious that the described embodiments are only a part of the embodiments of the present specification, and not all the embodiments. All other embodiments that can be derived by one of ordinary skill in the art from the embodiments given herein are intended to be within the scope of protection.
Some concepts involved in the embodiments of the present description are explained first:
Source domain: the business domain that serves as the source of sample migration, for example the mature third-party payment domain in China. In this domain, samples are plentiful and the trained models are mature and reliable.
Source sample: a sample used in the source domain. Each source sample has been labeled as either a fraudulent transaction or a normal transaction.
Source sample set: the set formed by the source samples; it contains a large number of elements.
Target domain: the domain into which samples need to be migrated, for example a third-party payment domain being developed in an overseas country. The business scenario in the target domain is similar to that in the source domain.
Target domain sample: a sample used in the target domain. There are few such samples because the business has only just been launched abroad.
Target domain sample set: the set formed by the target domain samples; it contains a small number of elements.
The technical solutions provided by the embodiments of the present description are described in detail below with reference to the accompanying drawings. Fig. 1 is a schematic flow chart of a sample migration method provided in an embodiment of the present specification; the flow specifically includes the following steps:
S101: acquire a source sample set and a target domain sample set, wherein the source sample and the target domain sample contain the same number of features.
It should be noted that a sample is composed of a plurality of features, and each feature has a corresponding feature value. In the source samples, both the features and the feature values are well defined because the business is mature, and each source sample may carry an explicit label. In other words, the number of features in the source samples is fixed, as is the value of each feature in each sample.
For the target domain samples, since sample migration is required, the number of features needs to be the same as in the source samples. In one embodiment, the number of features in the target domain samples is set equal to the number of features in the source samples when the target domain samples are created.
S103: determine the same features and the different features contained in the source sample and the target domain sample.
The different features referred to in the embodiments of the present specification are features that are present in the source samples but not in the target domain samples.
If a feature is present in the target domain samples but not in the source samples, it can be treated as a redundant feature and deleted during processing. In practice, because the business types are similar, the target domain samples, when created, contain the same features as the source samples or fewer. Fig. 2 is a schematic diagram of a feature description provided in an embodiment of the present disclosure. The overlapping portion in the figure represents the same features; obviously, there may be one or more same features.
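As an illustration of step S103, the following is a minimal sketch that assumes each sample is a Python dict keyed by feature name. The feature names and the helper names `split_features` and `drop_redundant` are hypothetical and do not appear in the specification.

```python
def split_features(source_features, target_features):
    """Return (same, different, redundant) lists of feature names.

    same:      present in both source and target domain samples
    different: present in the source samples only (completed later, in S107)
    redundant: present in the target domain samples only (deleted)
    """
    source_set = set(source_features)
    target_set = set(target_features)
    same = [f for f in source_features if f in target_set]
    different = [f for f in source_features if f not in target_set]
    redundant = [f for f in target_features if f not in source_set]
    return same, different, redundant


def drop_redundant(sample, redundant):
    """Return a copy of a target domain sample without the redundant features."""
    return {k: v for k, v in sample.items() if k not in redundant}


# Hypothetical feature names, for illustration only.
source_features = ["monthly_spend", "overdraft_limit", "account_age"]
target_features = ["monthly_spend", "overdraft_limit", "local_promo_flag"]

same, different, redundant = split_features(source_features, target_features)

target_sample = {"monthly_spend": 120.0, "overdraft_limit": 500.0, "local_promo_flag": 1}
cleaned = drop_redundant(target_sample, redundant)
```

Here `account_age` plays the role of a different feature and `local_promo_flag` the role of a redundant feature.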
S105: for the same features, map the source samples and the target domain samples to the same high-dimensional space, determine the closest feature value distribution of the source sample set and the target domain sample set under the same features, and, according to the closest feature value distribution, change the feature value of each sample under the same features to its feature value in the high-dimensional space.
Specifically, various function transformations may be used to map the source samples and the target domain samples to the same high-dimensional space at the same time, for example a linear transformation, a polynomial transformation, or a Gaussian transformation.
The purpose of high-dimensional mapping by function transformation is the following: in the low-dimensional space, the relationship between the feature values of the same feature is difficult to see because the environments of the source domain and the target domain differ. For example, for the feature "monthly spending limit" or the feature "trusted overdraft limit", directly comparing or migrating feature values between the domestic and foreign environments is obviously not feasible.
However, when the amount of sample data is large enough, the distribution of the "monthly spending limit" or the "trusted overdraft limit" follows a similar rule even in different domain environments. For example, it may follow a Gaussian distribution in both domains, with only the parameters of the distribution differing between them. Therefore, with a suitable high-dimensional mapping, the feature value distributions of the domestic "monthly spending limit" and the foreign "monthly spending limit" can be seen to be very close in some high-dimensional space. Concretely, after high-dimensional mapping of the monthly spending limit, the feature values of the source samples and the target domain samples are relatively close in the high-dimensional space, and the clustering effect is relatively obvious.
In practical applications the underlying rules are varied, so several different high-dimensional mappings can be applied separately and their clustering effects in the high-dimensional space compared, to obtain the closest feature value distribution of the source sample set and the target domain sample set. Then, for each same feature, the feature values of that feature in the source samples and the target domain samples are replaced by the feature values in the high-dimensional space.
One practical way is to set a calculation step size and map the source sample set and the target domain sample set multiple times, either with different mapping functions or by adjusting the mapping parameters of a mapping function. After each mapping, the difference between the average feature value of the source samples and the average feature value of the target domain samples in the high-dimensional space is calculated. The minimum of the differences obtained over the multiple mappings is then determined. When the difference takes its minimum value, the source samples and the target domain samples can be regarded as close enough in the mapped high-dimensional space, and the feature value distribution corresponding to the minimum is the closest feature value distribution.
Further, the mapping function corresponding to the minimum is the target distribution adaptive mapping function for the same feature. For any sample in the source sample set or the target domain sample set, the target distribution adaptive mapping function can be applied to the same feature to obtain the changed feature value of the sample.
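The search described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation of the specification: each candidate mapping turns a scalar feature value into a vector, and "the difference between the average feature values" is taken here to be the Euclidean distance between the two mean vectors. The function names, the candidate mappings and the sample values are all hypothetical.

```python
import math
from statistics import fmean


def polynomial_map(degree):
    """Map a scalar x to the vector (x, x^2, ..., x^degree)."""
    return lambda x: [x ** p for p in range(1, degree + 1)]


def gaussian_map(centers, width):
    """Map a scalar x to Gaussian responses around fixed centers."""
    return lambda x: [math.exp(-((x - c) ** 2) / (2 * width ** 2)) for c in centers]


def mean_vector(values, mapping):
    """Average the mapped vectors of all values, dimension by dimension."""
    mapped = [mapping(v) for v in values]
    return [fmean(column) for column in zip(*mapped)]


def mean_gap(source_values, target_values, mapping):
    """Distance between the source and target mean vectors (an assumption)."""
    return math.dist(mean_vector(source_values, mapping),
                     mean_vector(target_values, mapping))


def select_mapping(source_values, target_values, candidates):
    """Return (name, mapping, gap) for the candidate with the smallest gap."""
    scored = [(name, m, mean_gap(source_values, target_values, m))
              for name, m in candidates.items()]
    return min(scored, key=lambda item: item[2])


# Hypothetical values of one same feature in each domain.
source_values = [100.0, 200.0, 300.0, 400.0]
target_values = [10.0, 20.0, 30.0]

candidates = {
    "poly1": polynomial_map(1),
    "poly2": polynomial_map(2),
    "gauss": gaussian_map(centers=[0.0, 250.0, 500.0], width=200.0),
}
best_name, best_map, best_gap = select_mapping(source_values, target_values, candidates)

# The selected mapping plays the role of the target distribution adaptive
# mapping function: it replaces the feature values in both sample sets.
new_source = [best_map(v) for v in source_values]
new_target = [best_map(v) for v in target_values]
```

Note that a criterion based only on the gap between means favors mappings that compress all values into a narrow range, so a practical implementation would likely need to guard against degenerate candidates.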
S107: for the different features, complete the feature values of the different features in the target domain samples according to the values of the different features in the source sample set.
As described above, the different features are features that are present in the source samples but not in the target domain samples. In other words, either the target domain samples were created without any feature that is absent from the source samples, or features present in the target domain samples but absent from the source samples have been removed during processing.
Accordingly, to keep the target domain samples consistent with the source samples, the different features can now be added to the target domain samples. Fig. 3 is a schematic diagram of filling in the features of a target domain sample, provided in an embodiment of the present disclosure.
After the features are added, the feature values of the different features in the target domain samples can be filled in with a preset value (the preset value can be set based on experience). Alternatively, they can be filled in with a statistic (e.g., mean, median, mode, maximum or minimum) of the different feature in the source sample set. Generally, since the business scenarios are similar, using the mean is sufficient for practical applications.
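A minimal sketch of the completion step follows, again assuming dict-based samples. The function name `fill_different_features`, its parameters and the sample values are hypothetical; the mean is used by default, and a preset value takes precedence when one is supplied.

```python
from statistics import fmean, median


def fill_different_features(source_samples, target_samples, different,
                            stat=fmean, preset=None):
    """Add each different feature to every target domain sample.

    stat:   statistic computed over the source sample set (mean by default)
    preset: optional dict of experience-based values that override the statistic
    """
    preset = preset or {}
    fill_values = {}
    for feature in different:
        if feature in preset:
            fill_values[feature] = preset[feature]
        else:
            fill_values[feature] = stat([s[feature] for s in source_samples])
    # Build new dicts so the original target domain samples stay unchanged.
    return [{**t, **fill_values} for t in target_samples]


# Hypothetical samples, for illustration only.
source_samples = [
    {"monthly_spend": 100.0, "account_age": 2.0},
    {"monthly_spend": 300.0, "account_age": 4.0},
    {"monthly_spend": 200.0, "account_age": 9.0},
]
target_samples = [{"monthly_spend": 20.0}]

filled_mean = fill_different_features(source_samples, target_samples, ["account_age"])
filled_median = fill_different_features(source_samples, target_samples,
                                        ["account_age"], stat=median)
filled_preset = fill_different_features(source_samples, target_samples,
                                        ["account_age"], preset={"account_age": 1.0})
```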
S109: merge the source sample set and the target domain sample set after the feature values are changed, to generate a fused sample set for model training in the target domain.
In the manner described above, for the source samples in the source sample set, the feature values of the same features are changed through high-dimensional mapping. For the target domain samples in the target domain sample set, the feature values of the same features are changed through high-dimensional mapping, and the feature values of the different features are obtained through feature filling. This ensures that the source samples and the target domain samples have the same features and similar feature value distributions.
Therefore, the source sample set and the target domain sample set with changed feature values can be merged and used in the target domain as training samples for model training, to obtain a risk control model usable in the target domain.
In the scheme provided by the embodiments of this specification, source samples are extracted from a source domain in which the service is mature, together with a small number of target domain samples from a target domain. The same features and the different features are identified and handled separately: the same features are adapted after high-dimensional mapping, and the different features are completed. This yields corrected source samples and target domain samples, which are merged into a fused sample set that can be used in the target domain. Because the target domain samples and the source samples belong to similar business scenarios, this approach makes the feature value distributions of the two approximately equivalent, which increases the size of the fused sample set and improves its reliability in the target domain.
In one embodiment, when the source samples and the target domain samples are mapped to the same high-dimensional space, the mapping may be performed for each same feature one by one. That is, the feature values of each same feature of the source samples and the target domain samples are mapped to the same high-dimensional space in turn, and the closest feature value distribution of the source sample set and the target domain sample set in the high-dimensional space is determined for the current same feature.
In this way, each same feature has its own mapping function, and the target distribution adaptive mapping functions of different same features may differ in form and in parameters. Changing the feature values of different same features therefore requires different target distribution adaptive mapping functions. This ensures a sufficiently close feature distribution for every same feature in the source sample set and the target domain sample set, and the converted samples are more accurate.
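The one-by-one variant can be sketched as follows. This is a sketch under assumptions that the specification does not fix: a single Gaussian mapping family is used, its width is searched over a grid with a constant step, its centers are derived from the pooled feature values, and the gap between mean vectors is measured by Euclidean distance. All names and values are hypothetical.

```python
import math
from statistics import fmean


def gaussian_map(x, centers, width):
    """Map a scalar x to Gaussian responses around the given centers."""
    return [math.exp(-((x - c) ** 2) / (2 * width ** 2)) for c in centers]


def gap(source_values, target_values, centers, width):
    """Distance between the mean mapped vectors of the two sample sets."""
    def mean_vec(values):
        mapped = [gaussian_map(v, centers, width) for v in values]
        return [fmean(col) for col in zip(*mapped)]
    return math.dist(mean_vec(source_values), mean_vec(target_values))


def fit_per_feature(source_samples, target_samples, same_features, widths):
    """Choose one (centers, width) pair per same feature by minimizing the gap."""
    chosen = {}
    for feature in same_features:
        src = [s[feature] for s in source_samples]
        tgt = [t[feature] for t in target_samples]
        pooled = src + tgt
        centers = [min(pooled), fmean(pooled), max(pooled)]
        best_width = min(widths, key=lambda w: gap(src, tgt, centers, w))
        chosen[feature] = (centers, best_width)
    return chosen


def transform(sample, chosen):
    """Replace each same feature value by its mapped vector; keep other keys."""
    out = dict(sample)
    for feature, (centers, width) in chosen.items():
        out[feature] = gaussian_map(sample[feature], centers, width)
    return out


# Hypothetical labeled samples.
source_samples = [
    {"monthly_spend": 100.0, "overdraft_limit": 1000.0, "label": 0},
    {"monthly_spend": 300.0, "overdraft_limit": 3000.0, "label": 1},
]
target_samples = [{"monthly_spend": 20.0, "overdraft_limit": 150.0, "label": 0}]
same_features = ["monthly_spend", "overdraft_limit"]

widths = [50.0 * k for k in range(1, 11)]  # calculation step size of 50
chosen = fit_per_feature(source_samples, target_samples, same_features, widths)
converted = [transform(s, chosen) for s in source_samples + target_samples]
```

Because each feature keeps its own entry in `chosen`, the features can end up with different mapping parameters, which is the point of the one-by-one embodiment.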
In one embodiment, some or all of the same features may instead be mapped to the same high-dimensional space in a batch using a single function, for example a polynomial mapping function. In this case, the feature value distribution is determined in the high-dimensional space over that part or all of the same features, and once the distribution and the corresponding target distribution adaptive mapping function have been obtained, the feature values of that part or all of the same features are converted together. Treating several same features as a whole in this way can effectively improve computing efficiency and save time when there are many same features.
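A batch mapping of this kind might look like the sketch below, which assumes a polynomial expansion of the vector of same features (all monomials up to a given degree, without a constant term). The helper names and the output key `mapped` are hypothetical.

```python
import math
from itertools import combinations_with_replacement
from statistics import fmean


def poly_expand(vector, degree=2):
    """All monomials of the inputs up to the given degree (no constant term)."""
    out = []
    for d in range(1, degree + 1):
        for combo in combinations_with_replacement(vector, d):
            out.append(math.prod(combo))
    return out


def batch_transform(sample, same_features, degree=2):
    """Map all listed same features of one sample jointly into one vector."""
    vector = [sample[f] for f in same_features]
    rest = {k: v for k, v in sample.items() if k not in same_features}
    return {**rest, "mapped": poly_expand(vector, degree)}


def joint_gap(source_samples, target_samples, same_features, degree=2):
    """Distance between the mean mapped vectors of the two sample sets."""
    def mean_vec(samples):
        mapped = [poly_expand([s[f] for f in same_features], degree) for s in samples]
        return [fmean(col) for col in zip(*mapped)]
    return math.dist(mean_vec(source_samples), mean_vec(target_samples))


# Hypothetical samples with two same features, "a" and "b".
src = [{"a": 1.0, "b": 2.0}, {"a": 3.0, "b": 4.0}]
tgt = [{"a": 2.0, "b": 3.0}]

gap_degree_1 = joint_gap(src, tgt, ["a", "b"], degree=1)
gap_degree_2 = joint_gap(src, tgt, ["a", "b"], degree=2)
mapped_sample = batch_transform({"a": 2.0, "b": 3.0, "label": 1}, ["a", "b"])
```

In this toy data the two sets have identical means, so the degree-1 gap is zero, while the degree-2 expansion exposes a difference in the second-order terms.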
In one embodiment, since both the source samples and the target domain samples are labeled and the number of source samples is sufficient, the merging step may combine part or all of the source sample set with the full target domain sample set after the feature values are changed, to obtain the fused sample set.
Further, when only part of the source sample set is merged, the samples can be selected randomly, for example by randomly selecting 50% of the source samples, or selected according to certain conditions, for example by selecting the source samples that fall within the most recent time window. When the number of source samples is large enough, selecting only part of them improves the efficiency of risk model training on the fused sample set in the target domain while still guaranteeing a sufficient number of samples.
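The two selection strategies and the merge can be sketched as follows. The `timestamp` field, the function names and the parameter names are assumptions made for illustration; the specification does not prescribe a sample schema.

```python
import random


def select_source_subset(source_samples, fraction=None, since=None, seed=0):
    """Pick part of the source samples, by time window and/or at random.

    fraction: share of samples to draw at random (e.g. 0.5), or None for all
    since:    keep only samples whose "timestamp" is >= this value, or None
    """
    selected = list(source_samples)
    if since is not None:
        selected = [s for s in selected if s["timestamp"] >= since]
    if fraction is not None:
        k = round(len(selected) * fraction)
        selected = random.Random(seed).sample(selected, k)
    return selected


def fuse(source_samples, target_samples, **selection):
    """Merge the chosen source samples with the full target domain sample set."""
    return select_source_subset(source_samples, **selection) + list(target_samples)


# Hypothetical converted samples; "timestamp" is an illustrative field.
source_samples = [{"id": i, "timestamp": i, "label": i % 2} for i in range(10)]
target_samples = [{"id": 100 + i, "timestamp": 9, "label": 0} for i in range(3)]

fused_all = fuse(source_samples, target_samples)
fused_random = fuse(source_samples, target_samples, fraction=0.5)
fused_recent = fuse(source_samples, target_samples, since=7)
```

The target domain samples are always kept in full, since they are the scarce ones.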
Correspondingly, an embodiment of the present specification further provides a sample migration device. Fig. 4 is a schematic structural diagram of the sample migration device provided in the embodiment of the present specification, which includes:
a sample acquisition module 401, which acquires a source sample set and a target domain sample set, wherein the source sample and the target domain sample contain the same number of features, and the source sample and the target domain sample are applied to similar business fields;
a feature determination module 403, which determines the same features and the different features contained in the source sample and the target domain sample;
a same feature transformation module 405, which, for the same features, maps the source samples and the target domain samples to the same high-dimensional space, determines the closest feature value distribution of the source sample set and the target domain sample set under the same features, and, according to the closest feature value distribution, changes the feature value of each sample under the same features to its feature value in the high-dimensional space;
a different feature transformation module 407, which, for the different features, completes the feature values of the different features in the target domain samples according to the values of the different features in the source sample set;
and a fusion module 409, which merges the source sample set and the target domain sample set after the feature values are changed, to generate a fused sample set for model training in the target domain.
Further, the same feature transformation module 405 maps the feature values of the same feature to the same high-dimensional space by adjusting the mapping function or the parameters of the mapping function; and when the difference between the average feature value of the source sample set and the average feature value of the target domain sample set in the high-dimensional space is minimal, determines the feature value distribution at that point to be the closest feature value distribution.
Further, the same feature transformation module 405 maps feature values of the same features of the source sample and the target domain sample to the same high-dimensional space one by one, and determines the closest feature value distribution in the source sample set and the target domain sample set in the current high-dimensional space under the same features.
Further, the same feature transformation module 405 maps the feature values of all the same features of the source samples and the target domain samples to the same high-dimensional space together, and determines the closest feature value distribution of the source sample set and the target domain sample set in the high-dimensional space under all of the same features.
Further, the different feature transformation module 407 determines, for any different feature, an average value of the different feature in the source sample set, adds the different feature in the target domain sample, and takes a value of the different feature in the target domain sample as the average value.
Further, the fusion module 409 merges part or all of the source sample set and the full target domain sample set after the feature value is changed.
Embodiments of the present specification also provide a computer device, which at least includes a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the sample migration method shown in fig. 1 when executing the program.
Fig. 5 is a schematic diagram illustrating a more specific hardware structure of a computing device according to an embodiment of the present disclosure, where the computing device may include: a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050. Wherein the processor 1010, memory 1020, input/output interface 1030, and communication interface 1040 are communicatively coupled to each other within the device via bus 1050.
The processor 1010 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits, and is configured to execute related programs to implement the technical solutions provided in the embodiments of the present disclosure.
The Memory 1020 may be implemented in the form of a ROM (Read Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1020 may store an operating system and other application programs, and when the technical solution provided by the embodiments of the present specification is implemented by software or firmware, the relevant program codes are stored in the memory 1020 and called to be executed by the processor 1010.
The input/output interface 1030 is used for connecting an input/output module to input and output information. The input/output module may be configured as a component in the device (not shown in the drawings) or may be external to the device to provide the corresponding function. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output devices may include a display, a speaker, a vibrator, an indicator light, etc.
The communication interface 1040 is used for connecting a communication module (not shown in the drawings) to implement communication between the present device and other devices. The communication module can communicate in a wired manner (such as USB or a network cable) or in a wireless manner (such as a mobile network, Wi-Fi or Bluetooth).
Bus 1050 includes a path that transfers information between various components of the device, such as processor 1010, memory 1020, input/output interface 1030, and communication interface 1040.
It should be noted that although the above-mentioned device only shows the processor 1010, the memory 1020, the input/output interface 1030, the communication interface 1040 and the bus 1050, in a specific implementation, the device may also include other components necessary for normal operation. In addition, those skilled in the art will appreciate that the above-described apparatus may also include only those components necessary to implement the embodiments of the present description, and not necessarily all of the components shown in the figures.
Embodiments of the present description also provide a computer-readable storage medium on which a computer program is stored, which when executed by a processor implements the sample migration method shown in fig. 1.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer-readable medium does not include a transitory computer-readable medium such as a modulated data signal or a carrier wave.
From the above description of the embodiments, it is clear to those skilled in the art that the embodiments of the present specification can be implemented by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solutions of the embodiments of the present specification may be embodied, in whole or in part, in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and which includes several instructions for enabling a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods described in the embodiments, or in some parts of the embodiments, of the present specification.
The systems, methods, modules or units described in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. A typical implementation device is a computer, which may take the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email messaging device, game console, tablet computer, wearable device, or a combination of any of these devices.
The embodiments in the present specification are described in a progressive manner; for the same or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, since the apparatus embodiment is substantially similar to the method embodiment, it is described relatively briefly, and reference may be made to the corresponding description of the method embodiment for the relevant points. The above-described apparatus embodiments are merely illustrative: the modules described as separate components may or may not be physically separate, and the functions of the modules may be implemented in one or more pieces of software and/or hardware when implementing the embodiments of the present specification. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
The foregoing is only a specific implementation of the embodiments of the present specification. It should be noted that those skilled in the art can make a number of improvements and refinements without departing from the principle of the embodiments of the present specification, and these improvements and refinements should also be regarded as falling within the protection scope of the embodiments of the present specification.

Claims (13)

1. A method of sample migration, comprising:
acquiring a source sample set and a target domain sample set, wherein the source samples and the target domain samples contain a number of same features, and the source samples and the target domain samples are applied to similar business fields;
determining the same features and the different features contained in the source samples and the target domain samples;
for the same features, mapping the source samples and the target domain samples to a same high-dimensional space, determining the closest feature value distribution of the source sample set and the target domain sample set under the same features, and changing the feature value of each sample under the same features to the feature value in the high-dimensional space according to the closest feature value distribution; wherein, when the difference between the average values of the sample feature values of the source sample set and the target domain sample set is minimum in the same high-dimensional space, the feature value distribution at that moment is determined as the closest feature value distribution;
for the different features, completing the feature values of the different features in the target domain samples according to the values of the different features in the source sample set;
and merging the source sample set and the target domain sample set after the feature values are changed, to generate a fused sample set for model training in the target domain.
2. The method of claim 1, wherein mapping the source samples and the target domain samples to the same high-dimensional space for the same features comprises:
mapping the feature values of the same features to the same high-dimensional space by adjusting a mapping function or parameters in the mapping function.
3. The method of claim 1, wherein mapping the source samples and the target domain samples to the same high-dimensional space for the same features comprises:
mapping the feature values of the same features of the source samples and the target domain samples to the same high-dimensional space one by one; and correspondingly, determining the closest feature value distribution of the source sample set and the target domain sample set under the same features comprises:
determining the closest feature value distribution of the source sample set and the target domain sample set in the high-dimensional space under the current same feature.
4. The method of claim 1, wherein mapping the source samples and the target domain samples to the same high-dimensional space for the same features comprises:
mapping the feature values of some or all of the same features of the source samples and the target domain samples to the same high-dimensional space; and correspondingly, determining the closest feature value distribution of the source sample set and the target domain sample set under the same features comprises:
determining the closest feature value distribution of the source sample set and the target domain sample set in the high-dimensional space under some or all of the same features.
5. The method of claim 1, wherein completing the feature values of the different features in the target domain samples according to the values of the different features in the source sample set comprises:
for any different feature, determining the average value of that feature in the source sample set, adding that feature to the target domain samples, and setting the value of that feature in the target domain samples to the average value.
6. The method of claim 1, wherein merging the source sample set and the target domain sample set after the feature values are changed comprises:
merging some or all of the source sample set and the full target domain sample set after the feature values are changed.
7. A sample migration apparatus, comprising:
a sample acquisition module, configured to acquire a source sample set and a target domain sample set, wherein the source samples and the target domain samples contain a number of same features, and the source samples and the target domain samples are applied to similar business fields;
a feature determination module, configured to determine the same features and the different features contained in the source samples and the target domain samples;
a same feature transformation module, configured to, for the same features, map the source samples and the target domain samples to a same high-dimensional space, determine the closest feature value distribution of the source sample set and the target domain sample set under the same features, and change the feature value of each sample under the same features to the feature value in the high-dimensional space according to the closest feature value distribution; wherein, when the difference between the average values of the sample feature values of the source sample set and the target domain sample set is minimum in the same high-dimensional space, the feature value distribution at that moment is determined as the closest feature value distribution;
a different feature transformation module, configured to, for the different features, complete the feature values of the different features in the target domain samples according to the values of the different features in the source sample set;
and a fusion module, configured to merge the source sample set and the target domain sample set after the feature values are changed, to generate a fused sample set for model training in the target domain.
8. The apparatus of claim 7, wherein the same feature transformation module maps the feature values of the same features to the same high-dimensional space by adjusting a mapping function or parameters in the mapping function; and when the difference between the average values of the sample feature values of the source sample set and the target domain sample set is minimum in the high-dimensional space, determines the feature value distribution at that moment as the closest feature value distribution.
9. The apparatus of claim 7, wherein the same feature transformation module maps the feature values of the same features of the source samples and the target domain samples to the same high-dimensional space one by one, and determines the closest feature value distribution of the source sample set and the target domain sample set in the high-dimensional space under the current same feature.
10. The apparatus of claim 7, wherein the same feature transformation module maps the feature values of all of the same features of the source samples and the target domain samples to the same high-dimensional space, and determines the closest feature value distribution of the source sample set and the target domain sample set in the high-dimensional space under all of the same features.
11. The apparatus of claim 7, wherein the different feature transformation module determines, for any different feature, the average value of that feature in the source sample set, adds that feature to the target domain samples, and sets the value of that feature in the target domain samples to the average value.
12. The apparatus of claim 7, wherein the fusion module merges some or all of the source sample set and the full target domain sample set after the feature values are changed.
13. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any of claims 1 to 6 when executing the program.
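The steps recited in claims 1, 5 and 6 can be illustrated with a minimal Python sketch. It is not the patented implementation. The function names (`feature_map`, `mean_gap`, `migrate`), the cosine/sine mapping, and the candidate parameter set `gammas` are all hypothetical choices made for illustration, since the claims do not fix a particular mapping function. The sketch maps each same feature one by one, picks the mapping parameter for which the mapped means of the source set and the target domain set differ least, fills each different feature in the target domain samples with its source average, and merges the two sets.

```python
import math
from statistics import fmean


def feature_map(x, gamma):
    # Hypothetical mapping of a scalar feature value to a 2-D point;
    # gamma is the adjustable mapping parameter.
    return (math.cos(gamma * x), math.sin(gamma * x))


def mean_gap(source_vals, target_vals, gamma):
    # Distance between the mean mapped source value and the mean mapped
    # target value for one feature under one mapping parameter.
    src = [feature_map(v, gamma) for v in source_vals]
    tgt = [feature_map(v, gamma) for v in target_vals]
    return math.dist(
        [fmean(col) for col in zip(*src)],
        [fmean(col) for col in zip(*tgt)],
    )


def migrate(source, target, gammas=(0.5, 1.0, 2.0)):
    # source, target: non-empty lists of dicts (feature name -> float).
    # Assumption: every sample in a set carries the same feature names.
    same = sorted(set(source[0]) & set(target[0]))
    different = sorted(set(source[0]) - set(target[0]))
    src_out = [dict(s) for s in source]
    tgt_out = [dict(t) for t in target]

    # Same features: choose the parameter with the smallest mean gap,
    # then replace each raw value by its mapped value.
    for name in same:
        s_vals = [s[name] for s in source]
        t_vals = [t[name] for t in target]
        best = min(gammas, key=lambda g: mean_gap(s_vals, t_vals, g))
        for row in src_out + tgt_out:
            row[name] = feature_map(row[name], best)

    # Different features: add each one to the target domain samples,
    # valued at its average over the source sample set.
    for name in different:
        avg = fmean(s[name] for s in source)
        for row in tgt_out:
            row[name] = avg

    # Merge the full source set with the full target domain set.
    return src_out + tgt_out


source = [{"amount": 1.0, "age": 30.0}, {"amount": 3.0, "age": 50.0}]
target = [{"amount": 2.0}]
fused = migrate(source, target)
```

In this toy input, `amount` is a same feature and `age` is a different feature, so the single target domain sample receives the source average of `age`. The candidate set deliberately leaves out degenerate parameters such as `gamma = 0`, which would map every value to the same point and make the mean gap trivially zero.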
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910905305.XA CN110738476B (en) 2019-09-24 2019-09-24 Sample migration method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910905305.XA CN110738476B (en) 2019-09-24 2019-09-24 Sample migration method, device and equipment

Publications (2)

Publication Number Publication Date
CN110738476A CN110738476A (en) 2020-01-31
CN110738476B true CN110738476B (en) 2021-06-29

Family

ID=69269377

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910905305.XA Active CN110738476B (en) 2019-09-24 2019-09-24 Sample migration method, device and equipment

Country Status (1)

Country Link
CN (1) CN110738476B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111428783B (en) * 2020-03-23 2022-06-21 Alipay (Hangzhou) Information Technology Co., Ltd. Method and device for performing sample domain conversion on training samples of recommendation model

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107045640A (en) * 2017-03-31 2017-08-15 Nanjing University of Posts and Telecommunications Image recognition method based on neighborhood preservation and kernel space alignment
CN108460523A (en) * 2018-02-12 2018-08-28 Alibaba Group Holding Ltd. Risk control rule generation method and device
CN108898181A (en) * 2018-06-29 2018-11-27 MIGU Culture Technology Co., Ltd. Processing method, device and storage medium for image classification model
CN109034080A (en) * 2018-08-01 2018-12-18 Guilin University of Electronic Technology Multi-source domain adaptive face recognition method
CN109214421A (en) * 2018-07-27 2019-01-15 Alibaba Group Holding Ltd. Model training method, device and computer equipment
CN109902393A (en) * 2019-03-01 2019-06-18 Harbin University of Science and Technology Rolling bearing fault diagnosis method under variable working conditions based on deep features and transfer learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108399414B (en) * 2017-02-08 2021-06-01 Nanjing University of Aeronautics and Astronautics Sample selection method and device applied to cross-modal data retrieval field

Also Published As

Publication number Publication date
CN110738476A (en) 2020-01-31

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant