Detailed Description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the drawings in the embodiments of the present disclosure. Obviously, the described embodiments are only a part of the embodiments of the present disclosure, not all of them. All other embodiments obtained by a person skilled in the art based on the embodiments in the present specification without inventive effort shall fall within the scope of protection of the present specification.
Please refer to fig. 1. The embodiment of the specification provides a model training method, which may be executed by a computer device. The computer device includes, but is not limited to, a server, an industrial control computer, a personal computer (PC), an all-in-one machine, and the like. The model training method may include the following steps.
Step S10: migration variables and distinct variables are determined.
In this embodiment, the computer device may select migration variables and distinct variables from a preset plurality of variables based on at least one item of historical data of the source region and at least one item of historical data of the target region.
In this embodiment, the size of the source region may be flexibly set according to business needs; for example, the source region may be a street, a business district, a city, a country, or a region composed of multiple countries. Likewise, the size of the target region may be flexibly set according to business needs and may be, for example, a street, a business district, a city, a country, or a region composed of multiple countries. Each item of historical data of the source region and of the target region may be any type of data, such as transaction data, product review data, or chat data, and may have multiple items of characteristic information corresponding to multiple dimensions. The dimensions may differ depending on the type of historical data. For example, when the historical data is transaction data, the dimensions may include a transaction channel, a transaction scenario, a transaction time, a transaction amount, a payment account, a transaction device identifier, a transaction network address, and the like. The transaction channel may include wireless payment, PC payment, agreement payment, and the like. The transaction scenario may include on-the-spot payment, batch deduction, housing loan repayment, credit card repayment, and the like. As a specific example, the historical data of the source region may include DATA_A1 and DATA_A2, and the historical data of the target region may include DATA_B1 and DATA_B2, as shown in table 1 below.
TABLE 1
Taking the historical DATA_A1 in table 1 above as an example, its characteristic information corresponding to the dimensions of transaction channel, transaction scenario, transaction time, transaction amount, payment account, receiving account, transaction device identifier, and transaction network address may be wireless payment, on-the-spot payment, 20180430, 1000, Account1, Account2, ID1, and 222.92.xxx.xxx, respectively.
Each item of historical data of the source region and of the target region may be tagged data, that is, data labeled with a type tag. The type tag may include a first type and a second type. The first type may be the type that the target business data to be identified has, and the second type may include types other than the first type. For example, the target business data may be transaction data involving illegal content such as fraud; the first type may then be a risk type and the second type a normal type. As another example, the target business data may be product review data with a negative emotional property; the first type may then be the type whose emotional property is negative, and the second type may include the types whose emotional property is positive or neutral.
The at least one item of historical data of the source region and the at least one item of historical data of the target region may jointly be used to train a classification model for the target region. In practice, due to factors such as how long the service has been online, the source region typically has a large amount of historical data while the target region has little. Because the target region's historical data is scarce, a classification model trained for the target region on that data alone often cannot achieve a good classification effect. Because the business logic and the population covered by the business differ between the target region and the source region, a classification model trained for the target region on the source region's historical data alone also often cannot achieve a good classification effect. Training the classification model for the target region on the historical data of the source region and of the target region together makes full use of the source region's large amount of historical data while accounting for the differences in business logic, covered population, and the like between the two regions, so that the trained classification model can achieve a good classification effect.
In this embodiment, the preset plurality of variables may be grouped into at least one variable group. The variable groups may differ depending on the type of historical data. For example, when the historical data is transaction data, the variable groups may include a transaction amount variable group, a transaction count variable group, a transaction time variable group, and the like. The variables in these groups may, for example, be as shown in table 2 below.
TABLE 2
Each of the preset variables may be used to characterize one item of characteristic information of the historical data of the source region and of the target region. A migration variable may be used to characterize feature information that the historical data of the source region and the target region have in common, while a distinct variable may be used to characterize feature information unique to the historical data of the source region or of the target region. Common feature information may be feature information with a high degree of similarity between the historical data of the two regions; unique feature information may be feature information with a low degree of similarity between them.
In this embodiment, the computer device may obtain the historical data of the source region and of the target region within a specified time interval; the set formed by the obtained historical data of the source region may serve as a first historical data set, and the set formed by the obtained historical data of the target region may serve as a second historical data set. A first characteristic value of each of the preset plurality of variables may be calculated based on the first historical data set and the second historical data set, and, based on these first characteristic values, at least one variable may be selected from the preset plurality of variables as a migration variable and at least one as a distinct variable. The specified time interval may be of any length, for example 1 month, 1 quarter, 6 months, or 1 year.
The first characteristic value may include a mutual information (MI) value. The mutual information value of a variable may be used to represent the degree of similarity between the feature information the variable characterizes in the first historical data set and the feature information it characterizes in the second historical data set; the magnitude of the mutual information value may be positively correlated with the degree of similarity it represents. For example, the computer device may calculate the mutual information value of a variable by the formula

MI = Σ_{x∈X} Σ_{y∈Y} p(x, y) · log( p(x, y) / (p(x) · p(y)) )

where X represents the set of values the variable takes in the first historical data set; Y represents the set of values the variable takes in the second historical data set; x represents a value in the set X; y represents a value in the set Y; p(x, y) represents the joint probability that the variable takes the value x in the first historical data set and the value y in the second historical data set; p(x) represents the probability that the variable takes the value x in the first historical data set; and p(y) represents the probability that the variable takes the value y in the second historical data set. It should be noted that the above formula is only an example; other formulas or methods may also be used to calculate the mutual information value of a variable.
The computer device may select, from the preset plurality of variables, at least one variable whose mutual information value is greater than or equal to a first preset threshold as a migration variable, and may select at least one distinct variable from the remaining variables. The first preset threshold may be flexibly set according to actual needs. Alternatively, the computer device may select, from the preset plurality of variables, a first preset number of variables with the largest mutual information values as migration variables, and may select at least one distinct variable from the remaining variables. The first preset number may likewise be flexibly set according to actual needs.
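As a minimal runnable sketch of this selection step (the plug-in MI estimator, the variable names, and the example threshold are illustrative assumptions, not part of the embodiment):

```python
from collections import Counter
from math import log

def mutual_information(pairs):
    """Estimate MI between a variable's values in the first historical
    data set (x) and the second historical data set (y) from observed
    (x, y) value pairs, using empirical probabilities."""
    n = len(pairs)
    p_xy = Counter(pairs)                       # joint counts of (x, y)
    p_x = Counter(x for x, _ in pairs)          # marginal counts of x
    p_y = Counter(y for _, y in pairs)          # marginal counts of y
    return sum((c / n) * log((c / n) / ((p_x[x] / n) * (p_y[y] / n)))
               for (x, y), c in p_xy.items())

def split_variables(mi_by_variable, threshold):
    """Variables whose MI reaches the threshold become migration
    variables; the rest are treated as distinct variables."""
    migration = [v for v, mi in mi_by_variable.items() if mi >= threshold]
    distinct = [v for v in mi_by_variable if v not in migration]
    return migration, distinct
```

For perfectly co-varying values the estimate approaches the variable's entropy, while independent values give an MI near zero, matching the intuition that a high MI marks feature information shared between the two regions.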
It will be appreciated by those skilled in the art that the first characteristic value may also comprise other types of values. The computer device may then select at least one migration variable and at least one distinct variable from the preset plurality of variables based on those other values of the variables.
In an implementation manner of this embodiment, the computer device may calculate a second feature value of each variable of the preset plurality of variables based on the first historical data set and the second historical data set; at least one variable may be selected from the preset plurality of variables as a representative variable based on the second characteristic value of the variable.
The second characteristic value may include an information value (IV). The information value of a variable may be used to represent the amount of information the variable carries in the first and second historical data sets. The larger the information value of a variable, the more information the variable carries in the two data sets, and thus the larger the variable's contribution to identifying the type of business data. For example, the computer device may calculate the information value of a variable by the formula

IV = IV_1 + IV_2

for instance with

IV_1 = (G_1/G_T − B_1/B_T) · ln( (G_1/G_T) / (B_1/B_T) )
IV_2 = (G_2/G_T − B_2/B_T) · ln( (G_2/G_T) / (B_2/B_T) )

where G_1 represents the sum of the number of first-type historical data corresponding to each value of the variable in the first historical data set; B_1 represents the sum of the number of second-type historical data corresponding to each value of the variable in the first historical data set; G_T represents the sum of the number of first-type historical data corresponding to each value of the variable in the first and second historical data sets; B_T represents the sum of the number of second-type historical data corresponding to each value of the variable in the first and second historical data sets; G_2 represents the sum of the number of first-type historical data corresponding to each value of the variable in the second historical data set; and B_2 represents the sum of the number of second-type historical data corresponding to each value of the variable in the second historical data set. It should be noted that the above formula is only an example; other formulas or methods may also be used to calculate the information value of a variable.
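A runnable sketch of this computation, assuming the example weight-of-evidence form of IV_1 and IV_2 given above (the function names are illustrative, and all counts are assumed positive):

```python
from math import log

def iv_component(g, b, g_total, b_total):
    """One IV term: (share of first-type data minus share of
    second-type data) times the log of their ratio (the weight of
    evidence). Assumes all counts are positive."""
    pg, pb = g / g_total, b / b_total
    return (pg - pb) * log(pg / pb)

def information_value(g1, b1, g2, b2):
    """IV = IV_1 + IV_2, where IV_1 uses the first-set counts and IV_2
    the second-set counts, both normalised by the combined totals
    G_T = g1 + g2 and B_T = b1 + b2."""
    g_total, b_total = g1 + g2, b1 + b2
    return (iv_component(g1, b1, g_total, b_total)
            + iv_component(g2, b2, g_total, b_total))
```

A variable whose first-type and second-type shares coincide contributes an IV of zero, while a skewed split yields a strictly positive IV, consistent with IV measuring a variable's discriminating power.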
The computer device may select, from the preset plurality of variables, at least one variable whose information value is greater than or equal to a second preset threshold as a representative variable. The second preset threshold may be flexibly set according to actual needs. Alternatively, the computer device may select a second preset number of variables with the largest information values as representative variables. The second preset number may likewise be flexibly set according to actual needs.
It will be appreciated by those skilled in the art that the second characteristic value may also comprise other types of values. The computer device may then select at least one representative variable from the preset plurality of variables based on those other values of the variables.
Step S12: a first classification model constructed based on the migration variables and the distinct variables is trained based on historical data of the source region.
In this embodiment, a classification model may be a mathematical model for classifying unclassified business data into known types. The classification model may be a Bayesian classification model, a support vector machine (SVM) classification model, or a convolutional neural network (CNN) classification model, and may be, for example, a risk classification model, an emotion classification model, or a topic classification model. The risk classification model may be used to classify business data by degree of risk, the emotion classification model by emotion, and the topic classification model by the topic expressed.
The first classification model may be a classification model constructed based on the migration variables and the distinct variables. For example, the first classification model may take the form

J = L(u_1^T m + v_s^T n) + α·||u_1||^2

J may represent an objective function. The objective function may be used to represent how close the predicted and actual values are during machine learning, and the goal of machine learning may be to optimize the objective function; the loss term L may specifically be any type of function. m may represent a matrix formed by the distinct variables, for example [m_1 m_2 … m_i …], where m_1, m_2, …, m_i each represent one distinct variable. u_1 may represent a matrix formed by the weights of the distinct variables in the first classification model, for example [u_11 u_12 … u_1i …], where u_11, u_12, …, u_1i each represent the weight of one distinct variable. n may represent a matrix formed by the migration variables, for example [n_1 n_2 … n_i …], where n_1, n_2, …, n_i each represent one migration variable. v_s may represent a matrix formed by the weights of the migration variables in the first classification model, for example [v_s1 v_s2 … v_si …], where v_s1, v_s2, …, v_si each represent the weight of one migration variable. α may be an empirical value, for example 0.1, 0.3, or 0.6.
In one implementation of this embodiment, the first classification model may include a first weight constraint term. The first weight constraint term may be used to constrain the weights of the distinct variables in the first classification model, so that during training the first classification model learns as much as possible of the feature information that the historical data of the source region and the target region have in common. Continuing with the previous example, the first weight constraint term may be the term α·||u_1||^2 in the first classification model.
In this embodiment, the computer device may train the first classification model in any manner. Continuing with the previous example, the process of training the first classification model can be understood as solving the optimization problem

min_{u_1, v_s} J

The computer device may specifically use an algorithm such as stochastic gradient descent (SGD) to solve for the matrices u_1 and v_s so that the objective function J is optimized.
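As an illustrative, runnable sketch of this training step (the embodiment leaves the loss open; a logistic loss, full-batch gradient descent, and the hyperparameter defaults below are assumptions):

```python
import numpy as np

def train_first_model(M, N, y, alpha=0.1, lr=0.05, epochs=500, seed=0):
    """Minimise a logistic loss over source-region data plus the
    penalty alpha * ||u1||^2 on the distinct-variable weights u1.
    M: distinct-variable features, N: migration-variable features,
    y: 0/1 type labels. Returns (u1, v_s)."""
    rng = np.random.default_rng(seed)
    u1 = rng.normal(scale=0.01, size=M.shape[1])
    v_s = rng.normal(scale=0.01, size=N.shape[1])
    for _ in range(epochs):
        z = M @ u1 + N @ v_s
        p = 1.0 / (1.0 + np.exp(-z))            # sigmoid predictions
        grad = p - y                            # d(log-loss)/dz
        u1 -= lr * (M.T @ grad / len(y) + 2 * alpha * u1)  # penalised
        v_s -= lr * (N.T @ grad / len(y))                  # unpenalised
    return u1, v_s
```

Only u1 carries the ridge penalty, mirroring the first weight constraint term: the migration-variable weights v_s are free to absorb the signal shared across regions.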
Step S14: a second classification model constructed based on the migration variables and the distinct variables is trained based on historical data of the target region and the training result of the first classification model.
In this embodiment, the second classification model may be a classification model constructed based on the migration variables and the distinct variables, and is different from the first classification model. Specifically, the first classification model may be adjusted, and the adjusted first classification model may be used as the second classification model; the adjustment may include, for example, adding data terms. For example, the second classification model may take the form

J = L(u_2^T m + v^T n) + α·||u_2||^2 + β·||v − v_s||

J may represent an objective function. m may represent a matrix formed by the distinct variables. u_2 may represent a matrix formed by the weights of the distinct variables in the second classification model. n may represent a matrix formed by the migration variables. v may represent a matrix formed by the weights of the migration variables in the second classification model. v_s may represent a matrix formed by the weights of the migration variables in the first classification model. α and β may be empirical values.
In one implementation of this embodiment, the second classification model may include a second weight constraint term. The second weight constraint term may be used to constrain the weights of the distinct variables in the second classification model. Continuing with the previous example, the second weight constraint term may be the term α·||u_2||^2 in the second classification model.
In one implementation of this embodiment, the second classification model may include a difference constraint term. The difference constraint term may be used to constrain the difference in the weights of the migration variables between the first classification model and the second classification model, so that during training the second classification model learns as much as possible of the feature information unique to the historical data of the target region. Continuing with the previous example, the difference constraint term may be the term β·||v − v_s|| in the second classification model.
In this embodiment, the computer device may train the second classification model in any manner, and the trained second classification model may be used to identify the type of business data from the target region. Continuing with the previous example, the process of training the second classification model can be understood as solving the optimization problem

min_{u_2, v} J

The computer device may specifically use an algorithm such as stochastic gradient descent to solve for the matrices u_2 and v so that the objective function J is optimized.
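A runnable sketch of this fine-tuning step (a logistic loss and a squared form of the difference constraint, beta * ||v - v_s||^2, are assumed for differentiability; hyperparameters are illustrative):

```python
import numpy as np

def train_second_model(M, N, y, v_s, alpha=0.1, beta=1.0,
                       lr=0.05, epochs=500, seed=0):
    """Fine-tune on target-region data: logistic loss plus
    alpha * ||u2||^2 on the distinct-variable weights and a squared
    difference constraint keeping the migration-variable weights v
    close to the source-trained v_s. Returns (u2, v)."""
    rng = np.random.default_rng(seed)
    u2 = rng.normal(scale=0.01, size=M.shape[1])
    v = v_s.copy()                       # start from the source solution
    for _ in range(epochs):
        z = M @ u2 + N @ v
        p = 1.0 / (1.0 + np.exp(-z))
        grad = p - y
        u2 -= lr * (M.T @ grad / len(y) + 2 * alpha * u2)
        v -= lr * (N.T @ grad / len(y) + 2 * beta * (v - v_s))
    return u2, v
```

With a large beta the migration weights stay anchored near v_s (the shared feature information learned from the source region), so the target-specific signal is pushed into u2; with beta = 0 the anchor disappears and v drifts freely.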
In this embodiment, the computer device may determine migration variables and distinct variables; may train, based on historical data of a source region, a first classification model constructed based on the migration variables and the distinct variables; and may train, based on historical data of a target region and the training result of the first classification model, a second classification model constructed based on the migration variables and the distinct variables, the second classification model including a difference constraint term for constraining the difference in the weights of the migration variables between the first classification model and the second classification model. In this way, the second classification model can learn, from the historical data of the source region, the feature information that the historical data of the two regions have in common, and can learn, from the historical data of the target region, the feature information unique to the target region. The computer device can thus train a classification model for the target region using the historical data of the source region and of the target region together, improving the discriminating capability of the classification model trained for the target region.
Please refer to fig. 2. The embodiment of the specification provides a data type identification method, which may be executed by a computer device. The computer device includes, but is not limited to, a server, an industrial control computer, a personal computer (PC), an all-in-one machine, and the like. The data type identification method may include the following steps.
Step S20: the type of business data from a target region is identified using a classification model trained for the target region.
In this embodiment, the classification model may be obtained by training based on a model training method in an embodiment of this specification. The business data may be any type of data, and may be, for example, transaction data, product review data, or chat data.
In this embodiment, the computer device may identify the type of business data from the target region using a classification model trained for the target region. Specifically, the computer device may calculate a characteristic value of the business data using the classification model and identify the type of the business data based on that characteristic value. For example, the business data may be transaction data. Then, when the characteristic value of the business data is greater than or equal to a preset threshold, the computer device may identify the type of the business data as a risk type; when the characteristic value is smaller than the preset threshold, the computer device may identify the type of the business data as a normal type.
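The identification step can be sketched as follows (the 0.5 threshold, the sigmoid score, and the label names are illustrative assumptions):

```python
import numpy as np

def classify(M, N, u2, v, threshold=0.5):
    """Score target-region business data with the trained second model
    (weights u2 for distinct-variable features M, v for
    migration-variable features N) and flag records whose score
    reaches the threshold as the risk type."""
    scores = 1.0 / (1.0 + np.exp(-(M @ u2 + N @ v)))
    labels = np.where(scores >= threshold, "risk", "normal")
    return labels, scores
```

Raising the threshold trades recall for precision, which is one way to tune the false recognition rate mentioned below.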
In this embodiment, the computer device may identify the type of business data from the target region using a classification model trained for the target region, which can reduce the false recognition rate of the business data.
Please refer to fig. 3. The embodiment of the specification provides a computer device. The computer device may comprise the following elements.
A determination unit 30, configured to determine migration variables and distinct variables; the migration variables are used for characterizing feature information that historical data of a source region and a target region have in common; the distinct variables are used for characterizing feature information unique to the historical data of the source region and of the target region, respectively;
a first training unit 32, configured to train, based on historical data of the source region, a first classification model constructed based on the migration variables and the distinct variables;
a second training unit 34, configured to train, based on historical data of the target region and a training result of the first classification model, a second classification model constructed based on the migration variables and the distinct variables; the second classification model includes a difference constraint term; the difference constraint term is used for constraining the difference in the weights of the migration variables between the first classification model and the second classification model.
Please refer to fig. 4. The embodiment of the specification provides another computer device. The computer device may include a memory and a processor.
In this embodiment, the memory may be implemented in any suitable manner. For example, the memory may be a read-only memory, a mechanical hard disk, a solid-state drive, a USB flash drive, or the like. The memory may be used to store computer instructions.
In this embodiment, the processor may be implemented in any suitable manner. For example, the processor may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (ASIC), a programmable logic controller, an embedded microcontroller, and the like. The processor may execute the computer instructions to perform the following steps: determining migration variables and distinct variables, the migration variables being used for characterizing feature information that historical data of a source region and a target region have in common, and the distinct variables being used for characterizing feature information unique to the historical data of the source region and of the target region, respectively; training, based on historical data of the source region, a first classification model constructed based on the migration variables and the distinct variables; and training, based on historical data of the target region and a training result of the first classification model, a second classification model constructed based on the migration variables and the distinct variables, the second classification model including a difference constraint term used for constraining the difference in the weights of the migration variables between the first classification model and the second classification model.
Please refer to fig. 5. The embodiment of the specification provides a computer device. The computer device may comprise the following elements.
An identifying unit 50, configured to identify the type of business data from a target region using a classification model trained for the target region. The classification model can be obtained by training based on a model training method in an embodiment of the present specification.
Please refer to fig. 4. The embodiment of the specification provides another computer device. The computer device may include a memory and a processor.
In this embodiment, the memory may be implemented in any suitable manner. For example, the memory may be a read-only memory, a mechanical hard disk, a solid-state drive, a USB flash drive, or the like. The memory may be used to store computer instructions.
In this embodiment, the processor may be implemented in any suitable manner. For example, the processor may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (ASIC), a programmable logic controller, an embedded microcontroller, and the like. The processor may execute the computer instructions to perform the following step: identifying the type of business data from a target region using a classification model trained for the target region, the classification model being obtained by training based on a model training method in an embodiment of the present specification.
It should be noted that, in the present specification, each embodiment is described in a progressive manner, and the same or similar parts in each embodiment may be referred to each other, and each embodiment focuses on differences from other embodiments. In particular, for the embodiment of the computer device, since it is substantially similar to the embodiment of the model training method, the description is simple, and relevant points can be referred to the partial description of the embodiment of the model training method.
In addition, it is understood that one skilled in the art, after reading this specification document, may conceive of any combination of some or all of the embodiments listed in this specification without the need for inventive faculty, which combinations are also within the scope of the disclosure and protection of this specification.
In the 1990s, improvements to a technology could be clearly distinguished as either hardware improvements (for example, improvements to circuit structures such as diodes, transistors, and switches) or software improvements (improvements to method flows). However, as technology develops, many of today's method-flow improvements can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain a corresponding hardware circuit structure by programming an improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement of a method flow cannot be realized by hardware entity modules. For example, a programmable logic device (PLD), such as a field programmable gate array (FPGA), is an integrated circuit whose logic functions are determined by a user through programming of the device. A designer "integrates" a digital system onto a PLD by independent programming, without requiring a chip manufacturer to design and fabricate a dedicated integrated circuit chip. Moreover, instead of manually fabricating integrated circuit chips, this kind of programming is now mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development, and the original code to be compiled must be written in a specific programming language called a hardware description language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); at present, VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are most commonly used.
It will also be apparent to those skilled in the art that hardware circuitry that implements the logical method flows can be readily obtained by merely slightly programming the method flows into an integrated circuit using the hardware description languages described above.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
From the above description of the embodiments, it is clear to those skilled in the art that the present specification can be implemented by software plus a necessary general hardware platform. Based on such understanding, the technical solutions of the present specification may be essentially or partially implemented in the form of software products, which may be stored in a storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and include instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments of the present specification.
The description is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
This description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
While the specification has been described with examples, those skilled in the art will appreciate that there are numerous variations and permutations of the specification that do not depart from the spirit of the specification, and it is intended that the appended claims include such variations and modifications that do not depart from the spirit of the specification.