Disclosure of Invention
One or more embodiments of the present disclosure describe a feature encoding method and apparatus that preserve the information useful to the model while reducing the length of the feature encoding, and that provide a degree of generalization.
In a first aspect, a feature encoding method is provided, including:
obtaining variable values of characteristic variables related to a business target, wherein the variable values are of non-numerical types;
selecting a target coding mode corresponding to the variable value from the multiple characteristic coding modes according to a predetermined corresponding relation between multiple value sets of the characteristic variable and the multiple characteristic coding modes, wherein the multiple value sets are divided according to the pre-evaluated discrimination degree of various possible values of the characteristic variable on the service target, and the multiple characteristic coding modes are used for coding the values in the corresponding value sets into multiple vector spaces;
encoding the variable value into a target vector in a target vector space using the target encoding scheme, the target vector space being a vector space corresponding to the target encoding scheme among the plurality of vector spaces;
determining a feature vector of the feature variable based on the target vector.
In a possible implementation manner, the degree of distinction of the various possible values of the characteristic variable with respect to the business objective is evaluated according to the prior probability of each possible value with respect to achievement of the business objective.
In a possible implementation manner, the plurality of value sets include a first value set and a second value set, and an average discrimination degree of values included in the first value set with respect to the service target is higher than that of the second value set; the multiple feature encoding modes comprise a first feature encoding mode and a second feature encoding mode, the first feature encoding mode corresponds to the first value set, the second feature encoding mode corresponds to the second value set, and the encoding compression rate of the first feature encoding mode is smaller than that of the second feature encoding mode.
Further, the first feature encoding mode is a full-scale encoding mode and is used for encoding corresponding values into a first vector space, and the dimensionality of the first vector space is equal to the number of values in the first value set;
the second characteristic coding mode is a compression coding mode and is used for coding the corresponding values to a second vector space, and the dimensionality of the second vector space is smaller than the value number in the second value set.
Further, the first characteristic encoding mode is a one-hot encoding mode, and the second characteristic encoding mode is a hash encoding mode.
Further, the second feature encoding method includes:
mapping the values in the second value set into values in a value space by adopting a numerical hash function;
and taking the modulus of the value to a preset number p, and mapping a modulus result into a p-dimensional vector, wherein the preset number p is less than the value number in the second value set.
Further, the value sets are divided by the following steps:
sorting all possible values of the characteristic variables from large to small according to the discrimination of the values to the business targets;
sequentially selecting a first number of values as the first value set according to the sequence;
and taking other values except the first number as the second value set.
Further, the value sets are divided by the following steps:
acquiring the discrimination of all possible values of the characteristic variables to the service target;
adding the value with the discrimination degree larger than or equal to a first threshold value to the first value set;
and adding the value with the discrimination smaller than the first threshold value to the second value set.
Further, the second value set comprises a first subset and a second subset; the average discrimination of the values contained in the first subset to the service target is higher than that of the second subset; the second characteristic encoding mode includes a first compression encoding mode corresponding to the first subset and a second compression encoding mode corresponding to the second subset, and an encoding compression ratio of the first compression encoding mode is smaller than that of the second compression encoding mode.
Further, the second feature encoding mode is a hash encoding mode, and:
the first compression encoding method includes:
mapping the values in the first subset into a first value by adopting a numerical hash function;
taking a modulus of the first numerical value to a first number, and mapping a modulus result into a vector of a dimension of the first number;
the second compression encoding method includes:
mapping the values in the second subset into a second value by adopting the numerical hash function;
taking a modulus of the second numerical value to a second number, and mapping a modulus result into a vector of a second number dimension;
wherein the first number is greater than the second number.
In a possible implementation manner, determining the feature vector of the feature variable based on the target vector comprises:
filling at least one other vector space of the plurality of vector spaces except the target vector space with default values to obtain at least one default vector;
and splicing and combining the target vector and the at least one default vector to obtain the feature vector.
In a second aspect, a feature encoding apparatus is provided, including:
an obtaining unit, configured to obtain a variable value of a feature variable related to a business target, where the variable value is of a non-numerical type;
a selecting unit, configured to select, according to a predetermined correspondence between multiple value sets of the feature variables and multiple feature coding modes, a target coding mode corresponding to the variable value obtained by the obtaining unit from the multiple feature coding modes, where the multiple value sets are divided according to pre-evaluated degrees of distinction of various possible values of the feature variables with respect to the service target, and the multiple feature coding modes are used to code the values in the corresponding value sets into multiple vector spaces;
an encoding unit configured to encode the variable value acquired by the acquisition unit into a target vector in a target vector space, which is a vector space corresponding to the target encoding scheme among the plurality of vector spaces, using the target encoding scheme selected by the selection unit;
and the determining unit is used for determining the characteristic vector of the characteristic variable based on the target vector obtained by the encoding unit.
In a third aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of the first aspect.
In a fourth aspect, there is provided a computing device comprising a memory having stored therein executable code and a processor that, when executing the executable code, implements the method of the first aspect.
According to the method and apparatus provided by the embodiments of this specification, a variable value of a feature variable related to a business target is first obtained, the variable value being of a non-numerical type. A target encoding mode corresponding to the variable value is then selected from multiple feature encoding modes according to a predetermined correspondence between multiple value sets of the feature variable and the multiple feature encoding modes. Next, the variable value is encoded into a target vector in a target vector space using the target encoding mode, the target vector space being the vector space, among multiple vector spaces, that corresponds to the target encoding mode. Finally, a feature vector of the feature variable is determined based on the target vector. Because the value sets are divided according to the pre-evaluated discrimination degree of the various possible values of the feature variable with respect to the business target, and the feature encoding modes encode the values in their corresponding value sets into multiple vector spaces, the model does not lose useful information, the length of the feature encoding is reduced, and a degree of generalization is achieved.
Detailed Description
The scheme provided by the specification is described below with reference to the accompanying drawings.
Fig. 1 is a schematic view of an implementation scenario of an embodiment disclosed in this specification. As shown in fig. 1, when a machine learning model is trained, training data is input to the machine learning model. The training data includes variables of non-numerical types, such as a commodity name, a shop name, and the IP address of a buyer, and these non-numerical variables can be input to the machine learning model after feature encoding.
It is to be understood that the machine learning model shown in fig. 1 is only an example, and is not used to limit the machine learning algorithm in the embodiments of the present specification. For example, the machine learning algorithm may adopt supervised learning, unsupervised learning or reinforcement learning, which is not described herein.
Fig. 2 shows a flowchart of a feature encoding method according to an embodiment. As shown in fig. 2, the feature encoding method in this embodiment includes the following steps:
step 21, obtaining a variable value of a feature variable related to the business target, wherein the variable value is of a non-numerical type;
step 22, selecting a target encoding mode corresponding to the variable value from multiple feature encoding modes according to a predetermined correspondence between multiple value sets of the feature variable and the multiple feature encoding modes, wherein the multiple value sets are divided according to pre-evaluated discrimination degrees of the various possible values of the feature variable with respect to the business target, and the multiple feature encoding modes are used for encoding the values in the corresponding value sets into multiple vector spaces;
step 23, encoding the variable value into a target vector in a target vector space by using the target encoding mode, wherein the target vector space is the vector space corresponding to the target encoding mode among the multiple vector spaces;
step 24, determining a feature vector of the feature variable based on the target vector.
Specific execution modes of the above steps are described below.
First, in step 21, a variable value of a characteristic variable related to a business target is obtained, and the variable value is of a non-numerical type. It is understood that the feature variables related to the business target may include not only non-numerical type feature variables but also numerical type feature variables, and since the feature encoding method in the embodiment of the present specification is applied to encode non-numerical type feature variables, only obtaining variable values of non-numerical type feature variables is described in step 21.
Furthermore, the same variable often has different effects for different business objectives. For example, education level is a very typical non-numerical variable: the higher the education level, the lower the credit risk, yet education level is unlikely to be related to consumption preferences. When the business objective is to determine whether the credit risk is high or low, education level is a feature variable related to the business objective; when the business objective is to determine the consumption preferences of the user, education level is not a feature variable related to the business objective.
Next, in step 22, according to a predetermined correspondence relationship between a plurality of value sets of the characteristic variables and a plurality of characteristic encoding modes, a target encoding mode corresponding to the variable value is selected from the plurality of characteristic encoding modes, wherein the plurality of value sets are divided according to pre-evaluated degrees of distinction of various possible values of the characteristic variables with respect to the service target, and the plurality of characteristic encoding modes are used for encoding the values in the corresponding value sets into a plurality of vector spaces.
In one example, the degree of distinction of the various possible values of the feature variable with respect to the business objective is evaluated according to the prior probability of each possible value with respect to achievement of the business objective.
Specifically, once the specific business target to be solved is determined, the ability of each value of each feature variable to distinguish the business target can be evaluated, which yields the degree of distinction of the various possible values of the feature variable with respect to the business target. It can be understood that the stronger the distinguishing capability of a value for a business target, the higher the degree of distinction of that value with respect to the business target. For example, in a cash-out scenario, a perpetrator often purchases a large quantity of commodities from non-flagship stores, some sellers provide cash-out services for buyers, and some commodities are even sold specifically for cash-out. That is, the prior probability of cash-out differs from commodity to commodity and from seller to seller. If a feature variable has N different values, the respective distinguishing capabilities of the N values with respect to the business target can be obtained.
The prior probability refers to a probability obtained from past experience and analysis. The prior probability can be divided into objective prior probability and subjective prior probability, wherein the prior probability obtained by using past historical data is called objective prior probability; when the historical data is not obtained or the data is incomplete, the prior probability obtained by judgment of subjective experience of people is called subjective prior probability.
It is understood that the prior probability mentioned in the embodiments of the present specification may be an objective prior probability or a subjective prior probability.
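As an illustration only, the following sketch estimates an objective prior probability for each value from historical data and turns it into a degree of distinction. The scoring rule (absolute deviation of a value's prior probability from the overall base rate), the function name discrimination_scores, and the sample data are all assumptions made for this sketch; the specification does not fix a particular formula.

```python
from collections import Counter

def discrimination_scores(values, labels):
    """Score each distinct value by how far its objective prior probability
    deviates from the overall base rate (an assumed scoring rule)."""
    totals = Counter(values)
    positives = Counter(v for v, y in zip(values, labels) if y == 1)
    base_rate = sum(labels) / len(labels)
    # Prior probability of a value = positive occurrences / all occurrences.
    return {v: abs(positives[v] / n - base_rate) for v, n in totals.items()}

# Hypothetical history: label 1 means the business target (e.g. cash-out) occurred.
history_values = ["shop_a", "shop_a", "shop_b", "shop_b", "shop_c", "shop_c"]
history_labels = [1, 1, 0, 1, 0, 0]
scores = discrimination_scores(history_values, history_labels)
```

Under this assumed rule, a value whose prior probability equals the base rate scores 0, while values that are almost always or almost never associated with the business target score highest.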
In one example, the plurality of value sets include a first value set and a second value set, and an average discrimination of values included in the first value set with respect to the service objective is higher than that of the second value set; the multiple feature encoding modes comprise a first feature encoding mode and a second feature encoding mode, the first feature encoding mode corresponds to the first value set, the second feature encoding mode corresponds to the second value set, and the encoding compression rate of the first feature encoding mode is smaller than that of the second feature encoding mode. That is, if the discrimination of the value to the service target is high, a characteristic encoding mode with a low encoding compression rate is adopted.
It is to be understood that, in the embodiments of the present specification, the number of the value sets included in the plurality of value sets is not limited, and may be two, three, or more.
Further, the first feature encoding mode is a full-scale encoding mode and is used for encoding corresponding values into a first vector space, and the dimensionality of the first vector space is equal to the number of values in the first value set; the second characteristic coding mode is a compression coding mode and is used for coding the corresponding values to a second vector space, and the dimensionality of the second vector space is smaller than the value number in the second value set.
For example, the first characteristic encoding mode is a one-hot encoding mode, and the second characteristic encoding mode is a hash encoding mode.
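A minimal sketch of the full-scale (one-hot) encoding mode is given below, assuming the first value set is held as an ordered list; the function name one_hot_encode is hypothetical. The dimension of the resulting vector equals the number of values in the first value set.

```python
def one_hot_encode(value, first_value_set):
    """Encode a value as a vector whose dimension equals the size of the
    first value set; only the position of the value is set to 1."""
    vec = [0] * len(first_value_set)
    vec[first_value_set.index(value)] = 1
    return vec

first_value_set = ["a", "b", "c"]  # illustrative first value set
vector_b = one_hot_encode("b", first_value_set)
```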
In one example, the second feature encoding mode includes:
mapping the values in the second value set into values in a value space by adopting a numerical hash function;
and taking the modulus of the value to a preset number p, and mapping a modulus result into a p-dimensional vector, wherein the preset number p is less than the value number in the second value set.
In one example, the plurality of value sets are divided by:
sorting all possible values of the characteristic variables from large to small according to the discrimination of the values to the business targets;
sequentially selecting a first number of values as the first value set according to the sequence;
and taking other values except the first number as the second value set.
For example, suppose the feature variable represents commodities and there are 100,000 commodities. After the commodities are sorted in descending order of their degree of distinction with respect to the business target, the first 100 commodities with the highest degree of distinction can, for example, be preferentially selected for one-hot encoding.
In another example, the plurality of value sets are divided by:
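The sorting-based division described above can be sketched as follows. The function name split_top_k and the sample scores are assumptions for illustration; the degrees of distinction are taken as already evaluated.

```python
def split_top_k(scores, k):
    """Sort values by degree of distinction in descending order and take
    the first k as the first value set; the rest form the second set."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], ranked[k:]

# Hypothetical degrees of distinction for four commodities.
scores = {"item_1": 0.9, "item_2": 0.7, "item_3": 0.2, "item_4": 0.1}
first_set, second_set = split_top_k(scores, 2)
```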
acquiring the discrimination of all possible values of the characteristic variables to the service target;
adding the value with the discrimination degree larger than or equal to a first threshold value to the first value set;
and adding the value with the discrimination smaller than the first threshold value to the second value set.
For example, suppose the feature variable represents commodities and there are 100,000 commodities. The degree of distinction of each commodity with respect to the business target is determined according to its ability to distinguish the business target (for example, as a value between 0 and 10), and commodities with a degree of distinction greater than 5 may, for example, be preferentially selected for one-hot encoding.
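The threshold-based division can be sketched in the same spirit. The function name split_by_threshold, the 0 to 10 scores, and the threshold of 5 are illustrative assumptions; values at or above the first threshold go to the first value set, as in the steps listed above.

```python
def split_by_threshold(scores, first_threshold):
    """Values whose degree of distinction is >= the threshold form the
    first value set; all others form the second value set."""
    first = [v for v, s in scores.items() if s >= first_threshold]
    second = [v for v, s in scores.items() if s < first_threshold]
    return first, second

# Hypothetical degrees of distinction on a 0 to 10 scale.
scores = {"item_1": 9.0, "item_2": 5.0, "item_3": 4.9, "item_4": 0.0}
first_set, second_set = split_by_threshold(scores, 5)
```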
Further, the second value set comprises a first subset and a second subset; the average discrimination of the values contained in the first subset to the service target is higher than that of the second subset; the second characteristic encoding mode includes a first compression encoding mode corresponding to the first subset and a second compression encoding mode corresponding to the second subset, and an encoding compression ratio of the first compression encoding mode is smaller than that of the second compression encoding mode. That is, the value set adopting the compression encoding mode can be continuously divided into subsets adopting different compression encoding modes, so that the vector dimension after feature encoding is further compressed.
Specifically, the second characteristic encoding mode is a hash encoding mode, and:
the first compression encoding method includes:
mapping the values in the first subset into a first value by adopting a numerical hash function;
taking a modulus of the first numerical value to a first number, and mapping a modulus result into a vector of a dimension of the first number;
the second compression encoding method includes:
mapping the values in the second subset into a second value by adopting the numerical hash function;
taking a modulus of the second numerical value to a second number, and mapping a modulus result into a vector of a second number dimension;
wherein the first number is greater than the second number.
For example, the first K values (sorted in descending order of degree of distinction) of the N different values of a feature variable are encoded in the one-hot encoding mode, and the remaining N-K values are not one-hot encoded. The remaining N-K values are divided into two categories: the first category consists of values that have some discrimination capability, although weaker than that of the first K values (assuming there are M such values, namely the (K+1)-th to the (K+M)-th); the second category consists of values that have no discrimination capability at all (namely the (K+M+1)-th to the N-th).
For these two categories of values, hash (Hashing) encoding is used to compress the features. The specific method is as follows:
The feature vector corresponding to a value is determined through the formula idx = Hash(val) % p, where val represents the value of the feature variable and is of a non-numerical type, Hash() represents a hash function, % represents the modulo operation, p is a constant, and idx identifies the encoded feature vector by the position of the bit that is set to 1.
A value of the original non-numerical type is first mapped into a very large numerical space by a numerical hash function, and the result is then taken modulo p, so that the value is mapped into the interval [0, p-1]. The bit at the mapped position is set to 1 and all other bits are set to 0. This completes the numerical encoding of the non-numerical value.
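The hash encoding just described can be sketched as follows. The choice of MD5 from Python's hashlib as the numerical hash function is an assumption made so that the result is deterministic across runs (Python's built-in hash() is randomized per process for strings); the specification does not name a particular hash function, and the helper names are hypothetical.

```python
import hashlib

def hash_index(val: str, p: int) -> int:
    """idx = Hash(val) % p, with MD5 standing in for the numerical hash."""
    digest = hashlib.md5(val.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % p

def hash_encode(val: str, p: int) -> list:
    """Map val to a p-dimensional vector with a single 1 at position idx."""
    vec = [0] * p
    vec[hash_index(val, p)] = 1
    return vec

encoded = hash_encode("some_shop_name", 4)  # illustrative value and p
```

Because p is smaller than the number of distinct values, different values may share the same position; this collision is the price of the compression.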
Because the expressive capacities of the two categories of values, the (K+1)-th to the (K+M)-th and the (K+M+1)-th to the N-th, are not completely the same, hash-encoding both categories in the same way would not distinguish the two categories of features well. Different values of p can therefore be selected according to the actual number of values in each category, to control the granularity and the number of the feature codes. Using the hash encoding mode in effect compresses the original features, thereby reducing the dimension of the feature space.
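A sketch of using different values of p for the two categories is shown below. The moduli P_FIRST and P_SECOND, the function names, and the use of MD5 as the numerical hash function are illustrative assumptions; the only constraint taken from the text is that the first number is greater than the second.

```python
import hashlib

P_FIRST = 8   # hypothetical modulus for the more discriminative subset
P_SECOND = 2  # hypothetical modulus for the other subset; P_FIRST > P_SECOND

def hash_encode(val: str, p: int) -> list:
    """Hash val, take the result modulo p, and set that position to 1."""
    digest = hashlib.md5(val.encode("utf-8")).digest()
    vec = [0] * p
    vec[int.from_bytes(digest, "big") % p] = 1
    return vec

def encode_compressed(val: str, first_subset: set) -> list:
    """Use the same hash function for both subsets but a larger modulus
    (finer granularity) for values in the first subset."""
    p = P_FIRST if val in first_subset else P_SECOND
    return hash_encode(val, p)

first_subset = {"shop_x"}  # illustrative first subset
fine = encode_compressed("shop_x", first_subset)
coarse = encode_compressed("shop_y", first_subset)
```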
Next, in step 23, the variable value is encoded into a target vector in a target vector space by using the target encoding method, wherein the target vector space is a vector space corresponding to the target encoding method in the plurality of vector spaces. It can be understood that different encoding modes (e.g., one-hot encoding mode and hash encoding mode) correspond to different vector spaces, and are not described herein again.
Finally, in step 24, a feature vector of the feature variable is determined based on the target vector. It can be understood that, in general, feature vectors corresponding to different values of a feature variable should have the same dimension, and therefore, in this embodiment of the present specification, the feature vector of the feature variable may be further determined based on the target vector.
In one example, determining the feature vector of the feature variable based on the target vector comprises:
filling at least one other vector space of the plurality of vector spaces except the target vector space with default values to obtain at least one default vector;
and splicing and combining the target vector and the at least one default vector to obtain the feature vector.
For example, suppose a feature variable has 100 possible values which, in descending order of their degree of distinction with respect to the business target, are a, b, c, ..., d, e, f, and g. It is predetermined that the one-hot encoding mode is adopted for the value set (a, b, c) and that a hash encoding mode with a dimension of 4 is adopted for the value set of the remaining values. One-hot encoding the value a then yields the target vector (1, 0, 0); assuming that the default value is 0, the default vector (0, 0, 0, 0) is obtained; and splicing and combining the two yields the feature vector (1, 0, 0, 0, 0, 0, 0). The dimension of the feature vector obtained by this hierarchical encoding (7 in this example) is far smaller than the dimension of the feature vector obtained by directly applying one-hot encoding to the entire value range (100 in this example).
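The example above can be sketched end to end as follows: the vector space of the selected encoding mode receives the target vector, the other vector space is filled with the default value, and the two are concatenated. The constant and function names, and MD5 as the numerical hash function, are assumptions made for illustration.

```python
import hashlib

ONE_HOT_VALUES = ["a", "b", "c"]  # first value set from the example
HASH_DIM = 4                      # dimension of the hash vector space
DEFAULT = 0                       # default fill value

def hash_index(val: str, p: int) -> int:
    digest = hashlib.md5(val.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % p

def encode(val: str) -> list:
    """Encode val in its own vector space, fill the other space with the
    default value, and concatenate both parts into one feature vector."""
    one_hot_part = [DEFAULT] * len(ONE_HOT_VALUES)
    hash_part = [DEFAULT] * HASH_DIM
    if val in ONE_HOT_VALUES:
        one_hot_part[ONE_HOT_VALUES.index(val)] = 1
    else:
        hash_part[hash_index(val, HASH_DIM)] = 1
    return one_hot_part + hash_part

print(encode("a"))  # [1, 0, 0, 0, 0, 0, 0]
```

Every value, whichever mode encodes it, ends up as a 7-dimensional vector, so feature vectors of different values have the same dimension.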
It should be noted that the above numerical examples are only for convenience of understanding and are not intended to limit the embodiments of the present specification.
It can be understood that encoding all non-numerical variables in a similar way converts the whole variable space into numerical form, after which all non-numerical variables can be used as input to a machine learning method in specific services.
According to the method provided by the embodiments of this specification, a variable value of a feature variable related to a business target is first obtained, the variable value being of a non-numerical type. A target encoding mode corresponding to the variable value is then selected from multiple feature encoding modes according to a predetermined correspondence between multiple value sets of the feature variable and the multiple feature encoding modes. Next, the variable value is encoded into a target vector in a target vector space using the target encoding mode, the target vector space being the vector space, among multiple vector spaces, that corresponds to the target encoding mode. Finally, a feature vector of the feature variable is determined based on the target vector. Because the value sets are divided according to the pre-evaluated discrimination degree of the various possible values of the feature variable with respect to the business target, and the feature encoding modes encode the values in their corresponding value sets into multiple vector spaces, the model does not lose useful information, the length of the feature encoding is reduced, and a degree of generalization is achieved.
According to an embodiment of another aspect, a feature encoding apparatus is also provided. Fig. 3 shows a schematic block diagram of a feature encoding apparatus according to an embodiment. As shown in fig. 3, the apparatus 300 includes:
an obtaining unit 31, configured to obtain a variable value of a feature variable related to a service target, where the variable value is a non-numerical type;
a selecting unit 32, configured to select, according to a predetermined correspondence between multiple value sets of the feature variable and multiple feature encoding modes, a target encoding mode corresponding to the variable value obtained by the obtaining unit 31 from the multiple feature encoding modes, where the multiple value sets are divided according to pre-evaluated degrees of distinction of the various possible values of the feature variable with respect to the business target, and the multiple feature encoding modes are used to encode the values in the corresponding value sets into multiple vector spaces;
an encoding unit 33, configured to encode the variable value obtained by the obtaining unit 31 into a target vector in a target vector space by using the target encoding mode selected by the selecting unit 32, where the target vector space is the vector space corresponding to the target encoding mode among the multiple vector spaces;
a determining unit 34, configured to determine a feature vector of the feature variable based on the target vector obtained by the encoding unit 33.
In one example, the degree of distinction of the various possible values of the feature variable with respect to the business objective is evaluated according to the prior probability of each possible value with respect to achievement of the business objective.
In an example, the plurality of value sets include a first value set and a second value set, and an average discrimination degree of values included in the first value set with respect to the service objective is higher than that of the second value set; the multiple feature encoding modes comprise a first feature encoding mode and a second feature encoding mode, the first feature encoding mode corresponds to the first value set, the second feature encoding mode corresponds to the second value set, and the encoding compression rate of the first feature encoding mode is smaller than that of the second feature encoding mode.
Further, the first feature encoding mode is a full-scale encoding mode and is used for encoding corresponding values into a first vector space, and the dimensionality of the first vector space is equal to the number of values in the first value set;
the second characteristic coding mode is a compression coding mode and is used for coding the corresponding values to a second vector space, and the dimensionality of the second vector space is smaller than the value number in the second value set.
For example, the first characteristic encoding mode is a one-hot encoding mode, and the second characteristic encoding mode is a hash encoding mode.
Optionally, the second feature encoding manner includes:
mapping the values in the second value set into values in a value space by adopting a numerical hash function;
and taking the modulus of the value to a preset number p, and mapping a modulus result into a p-dimensional vector, wherein the preset number p is less than the value number in the second value set.
In one example, the plurality of value sets are divided by:
sorting all possible values of the characteristic variables from large to small according to the discrimination of the values to the business targets;
sequentially selecting a first number of values as the first value set according to the sequence;
and taking other values except the first number as the second value set.
In another example, the plurality of value sets are divided by:
acquiring the discrimination of all possible values of the characteristic variables to the service target;
adding the value with the discrimination degree larger than or equal to a first threshold value to the first value set;
and adding the value with the discrimination smaller than the first threshold value to the second value set.
Further, the second value set comprises a first subset and a second subset; the average discrimination of the values contained in the first subset to the service target is higher than that of the second subset; the second characteristic encoding mode includes a first compression encoding mode corresponding to the first subset and a second compression encoding mode corresponding to the second subset, and an encoding compression ratio of the first compression encoding mode is smaller than that of the second compression encoding mode.
For example, the second characteristic encoding method is a hash encoding method, and:
the first compression encoding method includes:
mapping the values in the first subset into a first value by adopting a numerical hash function;
taking a modulus of the first numerical value to a first number, and mapping a modulus result into a vector of a dimension of the first number;
the second compression encoding method includes:
mapping the values in the second subset into a second value by adopting the numerical hash function;
taking a modulus of the second numerical value to a second number, and mapping a modulus result into a vector of a second number dimension;
wherein the first number is greater than the second number.
In an example, the determining unit 34 is specifically configured to:
filling at least one other vector space of the plurality of vector spaces except the target vector space with default values to obtain at least one default vector;
and splicing and combining the target vector and the at least one default vector to obtain the feature vector.
With the apparatus provided in the embodiments of this specification, the obtaining unit 31 first obtains a variable value of a feature variable related to a business target, the variable value being of a non-numerical type. The selecting unit 32 then selects, from multiple feature encoding modes, the target encoding mode corresponding to the variable value obtained by the obtaining unit 31, according to the predetermined correspondence between multiple value sets of the feature variable and the multiple feature encoding modes. Next, the encoding unit 33 encodes the variable value into a target vector in a target vector space by using the target encoding mode selected by the selecting unit 32, where the target vector space is the vector space corresponding to the target encoding mode among the multiple vector spaces. Finally, the determining unit 34 determines the feature vector of the feature variable based on the target vector obtained by the encoding unit 33. Because the value sets are divided according to the pre-evaluated discrimination degree of the various possible values of the feature variable with respect to the business target, and the feature encoding modes encode the values in their corresponding value sets into multiple vector spaces, the model does not lose useful information, the length of the feature encoding is reduced, and a degree of generalization is achieved.
According to an embodiment of another aspect, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method described in connection with fig. 2.
According to an embodiment of yet another aspect, there is also provided a computing device comprising a memory having stored therein executable code, and a processor that, when executing the executable code, implements the method described in connection with fig. 2.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in this invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The embodiments described above further explain in detail the objects, technical solutions and advantages of the present invention. It should be understood that the above are only exemplary embodiments of the present invention and are not intended to limit the scope of the present invention; any modification, equivalent substitution, improvement and the like made on the basis of the technical solutions of the present invention shall be included in the scope of the present invention.