CN109146083B - Feature encoding method and apparatus - Google Patents

Info

Publication number
CN109146083B
Authority
CN
China
Prior art keywords
values
value
encoding
vector
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810887448.8A
Other languages
Chinese (zh)
Other versions
CN109146083A (en
Inventor
宋乐
李辉
葛志邦
黄鑫
王琳
朱冠胤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced New Technologies Co Ltd
Advantageous New Technologies Co Ltd
Original Assignee
Advanced New Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Advanced New Technologies Co Ltd filed Critical Advanced New Technologies Co Ltd
Priority to CN201810887448.8A priority Critical patent/CN109146083B/en
Publication of CN109146083A publication Critical patent/CN109146083A/en
Application granted granted Critical
Publication of CN109146083B publication Critical patent/CN109146083B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

Embodiments of this specification provide a feature encoding method and apparatus. The method comprises: obtaining a variable value of a feature variable related to a business target, the variable value being of a non-numerical type; selecting, from a plurality of feature encoding modes, a target encoding mode corresponding to the variable value according to a predetermined correspondence between a plurality of value sets of the feature variable and the plurality of feature encoding modes, wherein the value sets are divided according to a pre-evaluated degree of discrimination of each possible value of the feature variable with respect to the business target, and the feature encoding modes encode the values in their corresponding value sets into a plurality of vector spaces; encoding the variable value into a target vector in a target vector space using the target encoding mode, the target vector space being the one of the plurality of vector spaces that corresponds to the target encoding mode; and determining a feature vector of the feature variable based on the target vector. The method reduces the length of the feature encoding without causing the model to lose useful information, while retaining a degree of generalization.

Description

Feature encoding method and apparatus
Technical Field
One or more embodiments of the present description relate to the field of computers, and more particularly, to feature encoding methods and apparatus.
Background
In classical data modeling scenarios, much of the data is often presented as non-numerical variables. For example, a user purchases a commodity; the commodity belongs to a certain primary category and a certain secondary category and has its own name and type; the commodity belongs to a certain store (identified by the store's nickname or id); the user logs in from a certain Internet Protocol (IP) address, a certain Wi-Fi network, or a certain physical device; and the transaction is eventually found to be a false transaction, a cash-out transaction, a card-theft transaction, or the like.
The scenario described above contains a large amount of behavioral information that often appears in data structures as non-numerical variables. Classical machine learning algorithms usually require non-numerical variables to be feature-encoded before they can serve as model input, and the way such variables are encoded usually affects the final performance of the model.
In the prior art, one-hot encoding is generally used to encode non-numerical variables. In one-hot encoding, all categories are enumerated to form a vector whose length equals the number of categories; the bit corresponding to the category that is hit is 1, and all other bits are 0.
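The one-hot scheme just described can be sketched in a few lines of Python. The function name `one_hot` and the category names are hypothetical and chosen only for illustration.

```python
def one_hot(value, categories):
    """Return a one-hot list for `value` over the ordered list `categories`."""
    vec = [0] * len(categories)           # vector length = number of categories
    vec[categories.index(value)] = 1      # raises ValueError for an unseen value
    return vec


categories = ["flagship_store", "regular_store", "overseas_store"]  # hypothetical
encoded = one_hot("regular_store", categories)
print(encoded)  # [0, 1, 0]
```

With three categories the vector has three positions; with millions of IP addresses it would have millions, which is the cost the following paragraph describes.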
When a non-numerical variable has a very large number of possible values, as with IP addresses or device ids, applying one-hot encoding directly to all values often produces a very large, sparse matrix. The model then has too many parameters, which makes subsequent model deployment costly.
Therefore, it is desirable to have an improved scheme that allows the model to reduce the length of feature codes without losing useful information, and that has some generalization.
Disclosure of Invention
One or more embodiments of the present disclosure describe a feature encoding method and apparatus, which can not only make the model not lose useful information, but also reduce the length of feature encoding, and have a certain generalization.
In a first aspect, a feature encoding method is provided, including:
obtaining variable values of characteristic variables related to a business target, wherein the variable values are of non-numerical types;
selecting a target coding mode corresponding to the variable value from the multiple characteristic coding modes according to a predetermined corresponding relation between multiple value sets of the characteristic variable and the multiple characteristic coding modes, wherein the multiple value sets are divided according to the pre-evaluated discrimination degree of various possible values of the characteristic variable on the service target, and the multiple characteristic coding modes are used for coding the values in the corresponding value sets into multiple vector spaces;
encoding the variable value into a target vector in a target vector space using the target encoding scheme, the target vector space being a vector space corresponding to the target encoding scheme among the plurality of vector spaces;
determining a feature vector of the feature variable based on the target vector.
In a possible implementation manner, the degree of distinction of the various possible values of the characteristic variable from the business objective is evaluated according to a prior probability of the various possible values on achievement of the business objective.
In a possible implementation manner, the plurality of value sets include a first value set and a second value set, and an average discrimination degree of values included in the first value set with respect to the service target is higher than that of the second value set; the multiple feature encoding modes comprise a first feature encoding mode and a second feature encoding mode, the first feature encoding mode corresponds to the first value set, the second feature encoding mode corresponds to the second value set, and the encoding compression rate of the first feature encoding mode is smaller than that of the second feature encoding mode.
Further, the first feature encoding mode is a full-scale encoding mode and is used for encoding corresponding values into a first vector space, and the dimensionality of the first vector space is equal to the number of values in the first value set;
the second characteristic coding mode is a compression coding mode and is used for coding the corresponding values to a second vector space, and the dimensionality of the second vector space is smaller than the value number in the second value set.
Further, the first characteristic encoding mode is a one-hot encoding mode, and the second characteristic encoding mode is a hash encoding mode.
Further, the second feature encoding method includes:
mapping the values in the second value set into values in a value space by adopting a numerical hash function;
and taking the modulus of the value to a preset number p, and mapping a modulus result into a p-dimensional vector, wherein the preset number p is less than the value number in the second value set.
Further, the value sets are divided by the following steps:
sorting all possible values of the characteristic variables from large to small according to the discrimination of the values to the business targets;
sequentially selecting a first number of values as the first value set according to the sequence;
and taking other values except the first number as the second value set.
Further, the value sets are divided by the following steps:
acquiring the discrimination of all possible values of the characteristic variables to the service target;
adding the value with the discrimination degree larger than or equal to a first threshold value to the first value set;
and adding the value with the discrimination smaller than the first threshold value to the second value set.
Further, the second value set comprises a first subset and a second subset; the average discrimination of the values contained in the first subset to the service target is higher than that of the second subset; the second characteristic encoding mode includes a first compression encoding mode corresponding to the first subset and a second compression encoding mode corresponding to the second subset, and an encoding compression ratio of the first compression encoding mode is smaller than that of the second compression encoding mode.
Further, the second feature encoding mode is a hash encoding mode, and:
the first compression encoding method includes:
mapping the values in the first subset into a first value by adopting a numerical hash function;
taking a modulus of the first numerical value to a first number, and mapping a modulus result into a vector of a dimension of the first number;
the second compression encoding method includes:
mapping the values in the second subset into a second value by adopting the numerical hash function;
taking a modulus of the second numerical value to a second number, and mapping a modulus result into a vector of a second number dimension;
wherein the first number is greater than the second number.
In one possible embodiment, wherein determining the feature vector of the feature variable based on the target vector comprises:
filling at least one other vector space of the plurality of vector spaces except the target vector space with default values to obtain at least one default vector;
and splicing and combining the target vector and the at least one default vector to obtain the feature vector.
In a second aspect, a feature encoding apparatus is provided, including:
an obtaining unit, configured to obtain a variable value of a feature variable related to a business target, the variable value being of a non-numerical type;
a selecting unit, configured to select, according to a predetermined correspondence between multiple value sets of the feature variables and multiple feature coding modes, a target coding mode corresponding to the variable value obtained by the obtaining unit from the multiple feature coding modes, where the multiple value sets are divided according to pre-evaluated degrees of distinction of various possible values of the feature variables with respect to the service target, and the multiple feature coding modes are used to code the values in the corresponding value sets into multiple vector spaces;
an encoding unit configured to encode the variable value acquired by the acquisition unit into a target vector in a target vector space, which is a vector space corresponding to the target encoding scheme among the plurality of vector spaces, using the target encoding scheme selected by the selection unit;
and the determining unit is used for determining the characteristic vector of the characteristic variable based on the target vector obtained by the encoding unit.
In a third aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of the first aspect.
In a fourth aspect, there is provided a computing device comprising a memory having stored therein executable code and a processor that, when executing the executable code, implements the method of the first aspect.
According to the method and apparatus provided by the embodiments of this specification, a variable value of a feature variable related to a business target is first obtained, the variable value being of a non-numerical type. A target encoding mode corresponding to the variable value is then selected from a plurality of feature encoding modes according to the correspondence between a plurality of value sets of the feature variable and the plurality of feature encoding modes. The variable value is then encoded, using the target encoding mode, into a target vector in a target vector space, the target vector space being the one of the plurality of vector spaces that corresponds to the target encoding mode. Finally, a feature vector of the feature variable is determined based on the target vector. Because the value sets are divided according to the pre-evaluated degree of discrimination of each possible value of the feature variable with respect to the business target, and the feature encoding modes encode the values in their corresponding value sets into a plurality of vector spaces, the model does not lose useful information, the length of the feature encoding is reduced, and a degree of generalization is retained.
Drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram illustrating an implementation scenario of an embodiment disclosed herein;
FIG. 2 illustrates a flow diagram of a feature encoding method according to one embodiment;
FIG. 3 shows a schematic block diagram of a feature encoding apparatus according to an embodiment.
Detailed Description
The scheme provided by the specification is described below with reference to the accompanying drawings.
Fig. 1 is a schematic view of an implementation scenario of an embodiment disclosed in this specification. As shown in fig. 1, when a machine learning model is trained, training data is input to the model. The training data includes non-numerical variables, such as a commodity name, a shop name, and the IP address of a buyer, which can be input to the machine learning model after feature encoding.
It is to be understood that the machine learning model shown in fig. 1 is only an example, and is not used to limit the machine learning algorithm in the embodiments of the present specification. For example, the machine learning algorithm may adopt supervised learning, unsupervised learning or reinforcement learning, which is not described herein.
Fig. 2 shows a flowchart of a feature encoding method according to an embodiment. As shown in fig. 2, the feature encoding method in this embodiment includes the following steps:
step 21, obtaining variable values of characteristic variables related to the business target, wherein the variable values are of non-numerical type;
step 22, selecting a target coding mode corresponding to the variable value from the multiple feature coding modes according to a predetermined correspondence between multiple value sets of the feature variable and the multiple feature coding modes, wherein the multiple value sets are divided according to pre-evaluated discrimination degrees of various possible values of the feature variable for the service target, and the multiple feature coding modes are used for coding the values in the corresponding value sets into multiple vector spaces;
step 23, encoding the variable value into a target vector in a target vector space by using the target encoding mode, wherein the target vector space is a vector space corresponding to the target encoding mode in the plurality of vector spaces;
step 24, determining a feature vector of the feature variable based on the target vector.
Specific execution modes of the above steps are described below.
First, in step 21, a variable value of a characteristic variable related to a business target is obtained, and the variable value is of a non-numerical type. It is understood that the feature variables related to the business target may include not only non-numerical type feature variables but also numerical type feature variables, and since the feature encoding method in the embodiment of the present specification is applied to encode non-numerical type feature variables, only obtaining variable values of non-numerical type feature variables is described in step 21.
Furthermore, the same variable often has different effects for different business targets. For example, education level is a very typical non-numerical variable: the higher the education level, the lower the credit risk, but education level is unlikely to be related to consumption preferences. When the business target is to assess credit risk, education level is a feature variable related to the business target; when the business target is to determine the consumption preferences of the user, education level is not a feature variable related to the business target.
Next, in step 22, according to a predetermined correspondence relationship between a plurality of value sets of the characteristic variables and a plurality of characteristic encoding modes, a target encoding mode corresponding to the variable value is selected from the plurality of characteristic encoding modes, wherein the plurality of value sets are divided according to pre-evaluated degrees of distinction of various possible values of the characteristic variables with respect to the service target, and the plurality of characteristic encoding modes are used for encoding the values in the corresponding value sets into a plurality of vector spaces.
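Step 22 amounts to a lookup from a value to its encoding mode. The following is a minimal Python sketch, assuming the predetermined correspondence is stored as a dictionary. The names `build_correspondence` and `select_mode`, the mode labels, and the fallback used for values that appear in no value set are illustrative assumptions, not part of the specification.

```python
def build_correspondence(value_sets):
    """Flatten {mode_name: iterable_of_values} into {value: mode_name}."""
    lookup = {}
    for mode, values in value_sets.items():
        for value in values:
            lookup[value] = mode
    return lookup


def select_mode(value, lookup, fallback="hash"):
    """Return the encoding mode for `value`.

    Values missing from every value set fall back to `fallback`;
    this fallback rule is an assumption made for the sketch.
    """
    return lookup.get(value, fallback)


# Hypothetical value sets: the first is highly discriminative, the second is not.
value_sets = {"one_hot": ["a", "b", "c"], "hash": ["d", "e"]}
lookup = build_correspondence(value_sets)
print(select_mode("b", lookup))  # one_hot
print(select_mode("e", lookup))  # hash
```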
In one example, the degree of distinction of the various possible values of the characteristic variable from the business objective is evaluated according to a prior probability of the various possible values on achievement of the business objective.
Specifically, after a specific business target to be solved is determined, the distinguishing capability of different values of each characteristic variable for the business target can be effectively evaluated, and the distinguishing degree of the various possible values of the characteristic variable for the business target is obtained. It can be understood that the stronger the distinguishing capability of a value for a business target, the higher the distinguishing degree of the value for the business target. For example, in a cash-out scene, a perpetrator often purchases a large amount of commodities of a non-flagship store, sometimes a seller provides cash-out services for a buyer, and even some commodities are specialized in cash-out, that is, the distribution of each commodity and each seller on the prior probability of cash-out is different. If a certain characteristic variable has N different values, different distinguishing capabilities of the N different values on the service target can be obtained.
The prior probability refers to a probability obtained from past experience and analysis. The prior probability can be divided into objective prior probability and subjective prior probability, wherein the prior probability obtained by using past historical data is called objective prior probability; when the historical data is not obtained or the data is incomplete, the prior probability obtained by judgment of subjective experience of people is called subjective prior probability.
It is understood that the prior probability mentioned in the embodiments of the present specification may be an objective prior probability or a subjective prior probability.
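The specification does not give a formula for turning a prior probability into a discrimination degree. As one possible illustration, the Python sketch below estimates an objective prior from historical records and scores each value by how far its target rate deviates from the overall rate. The scoring rule, the function name `discrimination_scores`, and the sample records are all assumptions.

```python
from collections import defaultdict


def discrimination_scores(records):
    """Score each value from (value, label) pairs, label 1 if the target occurred.

    score = |P(target | value) - P(target)|, an illustrative rule only.
    """
    counts = defaultdict(lambda: [0, 0])  # value -> [occurrences, target hits]
    total = hits = 0
    for value, label in records:
        counts[value][0] += 1
        counts[value][1] += label
        total += 1
        hits += label
    base_rate = hits / total              # overall prior of the business target
    return {v: abs(h / n - base_rate) for v, (n, h) in counts.items()}


# Hypothetical history: item_a always cashes out, item_c never does.
history = [("item_a", 1), ("item_a", 1), ("item_b", 0), ("item_b", 1),
           ("item_c", 0), ("item_c", 0)]
scores = discrimination_scores(history)
print(scores)  # {'item_a': 0.5, 'item_b': 0.0, 'item_c': 0.5}
```

Under this rule a value that is strongly associated with the target and one that is strongly associated with its absence both score high, while a value that behaves like the average scores zero.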
In one example, the plurality of value sets include a first value set and a second value set, and an average discrimination of values included in the first value set with respect to the service objective is higher than that of the second value set; the multiple feature encoding modes comprise a first feature encoding mode and a second feature encoding mode, the first feature encoding mode corresponds to the first value set, the second feature encoding mode corresponds to the second value set, and the encoding compression rate of the first feature encoding mode is smaller than that of the second feature encoding mode. That is, if the discrimination of the value to the service target is high, a characteristic encoding mode with a low encoding compression rate is adopted.
It is to be understood that, in the embodiments of the present specification, the number of the value sets included in the plurality of value sets is not limited, and may be two, three, or more.
Further, the first feature encoding mode is a full-scale encoding mode and is used for encoding corresponding values into a first vector space, and the dimensionality of the first vector space is equal to the number of values in the first value set; the second characteristic coding mode is a compression coding mode and is used for coding the corresponding values to a second vector space, and the dimensionality of the second vector space is smaller than the value number in the second value set.
For example, the first characteristic encoding mode is a one-hot encoding mode, and the second characteristic encoding mode is a hash encoding mode.
In one example, the second feature encoding mode includes:
mapping the values in the second value set into values in a value space by adopting a numerical hash function;
and taking the modulus of the value to a preset number p, and mapping a modulus result into a p-dimensional vector, wherein the preset number p is less than the value number in the second value set.
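The two hash-encoding steps above can be sketched as follows in Python. SHA-256 is used here only as a convenient deterministic numerical hash; the specification does not name a particular hash function, and the names `numeric_hash` and `hash_encode` are hypothetical.

```python
import hashlib


def numeric_hash(val: str) -> int:
    """Map a string to a large non-negative integer (SHA-256 is an assumed choice)."""
    digest = hashlib.sha256(val.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def hash_encode(val: str, p: int) -> list:
    """Encode `val` as a p-dimensional vector with a single 1 at Hash(val) % p."""
    idx = numeric_hash(val) % p   # modulo result lies in [0, p-1]
    vec = [0] * p
    vec[idx] = 1
    return vec


vec = hash_encode("192.168.0.17", 8)   # hypothetical IP address value
print(len(vec), sum(vec))  # 8 1
```

Because p is smaller than the number of values in the set, distinct values may collide on the same position; that collision is the price of the compression.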
In one example, the plurality of value sets are divided by:
sorting all possible values of the characteristic variables from large to small according to the discrimination of the values to the business targets;
sequentially selecting a first number of values as the first value set according to the sequence;
and taking other values except the first number as the second value set.
For example, suppose the feature variable represents commodities and there are 100,000 commodities. After the commodities are sorted in descending order of their degree of discrimination with respect to the business target, the 100 commodities with the highest degree of discrimination, for example, can be selected for one-hot encoding.
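The rank-based division can be sketched in Python as below, assuming the discrimination degrees are already available as a dictionary. The function name `split_top_k` and the scores are hypothetical.

```python
def split_top_k(scores, k):
    """Return (first_set, second_set): the k highest-scoring values, then the rest."""
    ranked = sorted(scores, key=scores.get, reverse=True)  # descending by score
    return ranked[:k], ranked[k:]


scores = {"a": 9.1, "b": 7.4, "c": 3.0, "d": 0.2}  # hypothetical discrimination degrees
first_set, second_set = split_top_k(scores, 2)
print(first_set, second_set)  # ['a', 'b'] ['c', 'd']
```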
In another example, wherein the plurality of sets of values are divided by:
acquiring the discrimination of all possible values of the characteristic variables to the service target;
adding the value with the discrimination degree larger than or equal to a first threshold value to the first value set;
and adding the value with the discrimination smaller than the first threshold value to the second value set.
For example, suppose the feature variable represents commodities and there are 100,000 commodities. The degree of discrimination of each commodity with respect to the business target (for example, a value between 0 and 10) is determined according to the commodity's ability to distinguish the business target, and commodities with a degree of discrimination greater than 5, for example, can be selected for one-hot encoding.
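The threshold-based division can be sketched in the same way. This Python sketch follows the rule stated in the steps above (greater than or equal to the first threshold goes to the first value set); the function name `split_by_threshold` and the scores are hypothetical.

```python
def split_by_threshold(scores, threshold):
    """Return (first_set, second_set) split at `threshold`, inclusive on the first set."""
    first_set = [v for v, s in scores.items() if s >= threshold]
    second_set = [v for v, s in scores.items() if s < threshold]
    return first_set, second_set


scores = {"a": 9.1, "b": 5.0, "c": 3.0}  # hypothetical degrees on a 0 to 10 scale
first_set, second_set = split_by_threshold(scores, 5)
print(first_set, second_set)  # ['a', 'b'] ['c']
```

Unlike the rank-based division, the size of the first value set here is not fixed in advance; it depends on how many values clear the threshold.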
Further, the second value set comprises a first subset and a second subset; the average discrimination of the values contained in the first subset to the service target is higher than that of the second subset; the second characteristic encoding mode includes a first compression encoding mode corresponding to the first subset and a second compression encoding mode corresponding to the second subset, and an encoding compression ratio of the first compression encoding mode is smaller than that of the second compression encoding mode. That is, the value set adopting the compression encoding mode can be continuously divided into subsets adopting different compression encoding modes, so that the vector dimension after feature encoding is further compressed.
Specifically, the second characteristic encoding mode is a hash encoding mode, and:
the first compression encoding method includes:
mapping the values in the first subset into a first value by adopting a numerical hash function;
taking a modulus of the first numerical value to a first number, and mapping a modulus result into a vector of a dimension of the first number;
the second compression encoding method includes:
mapping the values in the second subset into a second value by adopting the numerical hash function;
taking a modulus of the second numerical value to a second number, and mapping a modulus result into a vector of a second number dimension;
wherein the first number is greater than the second number.
For example, the first K of the N different values of a feature variable (sorted in descending order of discrimination) are encoded using one-hot encoding, which leaves the remaining N-K values still to be encoded. The remaining N-K values are divided into two classes: the first class consists of values that have some discrimination ability, though weaker than that of the first K values (assume there are M such values, namely the (K+1)-th to the (K+M)-th); the second class consists of values that have no discrimination ability at all (namely the (K+M+1)-th to the N-th).
Both classes of values are hash-encoded to compress the features, as follows:
The position set in the feature vector for a value is determined by the formula idx = Hash(val) % p, where val is the value of the feature variable (a non-numerical type), Hash() is a hash function, % denotes the modulo operation, p is a constant, and idx is the index of the bit set to 1 in the encoded feature vector.
The original non-numerical value is first mapped into a very large numerical space by a numerical hash function and then reduced modulo p, so that it is mapped into the interval [0, p-1]. The bit at the mapped position is set to 1, and all other bits are 0. This completes the numerical encoding of the non-numerical value.
Because the two classes of values, the (K+1)-th to the (K+M)-th and the (K+M+1)-th to the N-th, do not have the same expressive capacity, the two classes cannot be distinguished well if the same hash function is used for both. A different p can therefore be chosen for each class according to the actual number of values in it, which controls the granularity and the length of the feature encoding. Hash encoding in effect compresses the original features once, reducing the dimension of the feature space.
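Choosing a different p per class can be sketched as below in Python. The layout (a block of p1 positions followed by a block of p2 positions), the use of SHA-256, and the names `hash_index` and `tiered_hash_encode` are assumptions made for illustration; the only constraint taken from the text is that the more discriminative class gets the larger modulus.

```python
import hashlib


def hash_index(val: str, p: int) -> int:
    """Hash(val) % p, with SHA-256 as an assumed numerical hash."""
    digest = hashlib.sha256(val.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % p


def tiered_hash_encode(val, first_subset, p1, p2):
    """Encode `val` into a (p1 + p2)-dimensional vector.

    Values in `first_subset` (moderately discriminative) use the finer p1 block;
    all other values use the coarser p2 block.
    """
    if p1 <= p2:
        raise ValueError("p1 must be greater than p2")
    vec = [0] * (p1 + p2)
    if val in first_subset:
        vec[hash_index(val, p1)] = 1
    else:
        vec[p1 + hash_index(val, p2)] = 1
    return vec


first_subset = {"item_d", "item_e"}  # hypothetical moderately discriminative values
v1 = tiered_hash_encode("item_d", first_subset, p1=6, p2=2)
v2 = tiered_hash_encode("item_z", first_subset, p1=6, p2=2)
print(sum(v1[:6]), sum(v1[6:]))  # 1 0
print(sum(v2[:6]), sum(v2[6:]))  # 0 1
```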
Next, in step 23, the variable value is encoded into a target vector in a target vector space by using the target encoding method, wherein the target vector space is a vector space corresponding to the target encoding method in the plurality of vector spaces. It can be understood that different encoding modes (e.g., one-hot encoding mode and hash encoding mode) correspond to different vector spaces, and are not described herein again.
Finally, in step 24, a feature vector of the feature variable is determined based on the target vector. It can be understood that, in general, feature vectors corresponding to different values of a feature variable should have the same dimension, and therefore, in this embodiment of the present specification, the feature vector of the feature variable may be further determined based on the target vector.
In one example, wherein determining the feature vector of the feature variable based on the target vector comprises:
filling at least one other vector space of the plurality of vector spaces except the target vector space with default values to obtain at least one default vector;
and splicing and combining the target vector and the at least one default vector to obtain the feature vector.
For example, suppose a feature variable has 100 possible values which, in descending order of discrimination with respect to the business target, are a, b, c, ..., d, e, f, and g. It is predetermined that one-hot encoding is used for the value set (a, b, c) and that hash encoding with a dimension of 4 is used for the remaining values. One-hot encoding the value a gives the target vector (1, 0, 0); assuming the default value is 0, the default vector (0, 0, 0, 0) is obtained for the hash vector space; and concatenating the two gives the feature vector (1, 0, 0, 0, 0, 0, 0). The dimension of the feature vector obtained by this hierarchical encoding (7 in this example) is far smaller than the dimension obtained by applying one-hot encoding directly to all values (100 in this example).
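The worked example above (three one-hot positions followed by four hash positions) can be reproduced with the short Python sketch below. The SHA-256 hash and the names `ONE_HOT_VALUES`, `HASH_DIM`, and `encode` are assumptions; only the value a has a position that can be stated in advance, since the position of a hashed value depends on the hash function chosen.

```python
import hashlib

ONE_HOT_VALUES = ["a", "b", "c"]  # value set encoded with one-hot
HASH_DIM = 4                      # dimension of the hash vector space
DEFAULT = 0                       # default fill value for the unused vector space


def encode(val: str) -> list:
    """Return the 7-dimensional feature vector: one-hot part followed by hash part."""
    one_hot_part = [DEFAULT] * len(ONE_HOT_VALUES)
    hash_part = [DEFAULT] * HASH_DIM
    if val in ONE_HOT_VALUES:
        one_hot_part[ONE_HOT_VALUES.index(val)] = 1
    else:
        digest = hashlib.sha256(val.encode("utf-8")).digest()
        hash_part[int.from_bytes(digest, "big") % HASH_DIM] = 1
    return one_hot_part + hash_part  # concatenate target and default vectors


print(encode("a"))       # [1, 0, 0, 0, 0, 0, 0]
print(len(encode("g")))  # 7
```

Every value, whether one-hot encoded or hashed, yields a vector of the same length, which is what allows the result to be fed to a model as a fixed-width input.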
It should be noted that the above numerical examples are only for convenience of understanding and are not intended to limit the embodiments of the present specification.
It can be understood that encoding all non-numerical variables in this way converts the entire variable space into numerical form, and all non-numerical variables, once encoded, can be used as input to a machine learning method in a specific business.
According to the method provided by the embodiments of this specification, a variable value of a feature variable related to a business target is first obtained, the variable value being of a non-numerical type. A target encoding mode corresponding to the variable value is then selected from a plurality of feature encoding modes according to the correspondence between a plurality of value sets of the feature variable and the plurality of feature encoding modes. The variable value is then encoded, using the target encoding mode, into a target vector in a target vector space, the target vector space being the one of the plurality of vector spaces that corresponds to the target encoding mode. Finally, a feature vector of the feature variable is determined based on the target vector. Because the value sets are divided according to the pre-evaluated degree of discrimination of each possible value of the feature variable with respect to the business target, and the feature encoding modes encode the values in their corresponding value sets into a plurality of vector spaces, the model does not lose useful information, the length of the feature encoding is reduced, and a degree of generalization is retained.
According to an embodiment of another aspect, a feature encoding apparatus is also provided. Fig. 3 shows a schematic block diagram of a feature encoding apparatus according to an embodiment. As shown in fig. 3, the apparatus 300 includes:
an obtaining unit 31, configured to obtain a variable value of a feature variable related to a service target, where the variable value is a non-numerical type;
a selecting unit 32, configured to select, according to a predetermined correspondence between multiple value sets of the feature variables and multiple feature coding modes, a target coding mode corresponding to the variable value acquired by the acquiring unit 31 from the multiple feature coding modes, where the multiple value sets are divided according to pre-evaluated degrees of distinction of various possible values of the feature variables with respect to the service target, and the multiple feature coding modes are used to code the values in the corresponding value sets into multiple vector spaces;
an encoding unit 33 configured to encode the variable value acquired by the acquisition unit into a target vector in a target vector space, which is a vector space corresponding to the target encoding scheme among the plurality of vector spaces, using the target encoding scheme selected by the selection unit 32;
a determining unit 34, configured to determine a feature vector of the feature variable based on the target vector obtained by the encoding unit 33.
In one example, the discrimination of the various possible values of the feature variable with respect to the business target is evaluated according to the prior probability of each possible value with respect to achievement of the business target.
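The specification does not give a formula for this evaluation. As one possible reading, the sketch below scores each value by how far its observed target rate departs from the overall target rate. The function name `discrimination_scores`, the `(value, achieved)` sample format, and the absolute-difference score are all assumptions made for illustration.

```python
from collections import defaultdict

def discrimination_scores(samples):
    """samples: iterable of (value, achieved) pairs, with achieved equal to 0 or 1."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for value, achieved in samples:
        totals[value] += 1
        hits[value] += achieved
    if not totals:
        return {}
    # Overall rate at which the business target is achieved.
    overall = sum(hits.values()) / sum(totals.values())
    # Assumed score: distance between the per-value rate and the overall rate.
    return {v: abs(hits[v] / totals[v] - overall) for v in totals}

samples = [("a", 1), ("a", 1), ("b", 0), ("b", 1), ("c", 0), ("c", 0)]
scores = discrimination_scores(samples)
print(scores)  # {'a': 0.5, 'b': 0.0, 'c': 0.5}
```

Under this assumed score, a value whose target rate equals the overall rate (here b) carries no discrimination, while values that are strongly associated with achieving or not achieving the target (a and c) score highest.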
In an example, the plurality of value sets include a first value set and a second value set, and an average discrimination degree of values included in the first value set with respect to the service objective is higher than that of the second value set; the multiple feature encoding modes comprise a first feature encoding mode and a second feature encoding mode, the first feature encoding mode corresponds to the first value set, the second feature encoding mode corresponds to the second value set, and the encoding compression rate of the first feature encoding mode is smaller than that of the second feature encoding mode.
Further, the first feature encoding mode is a full-scale encoding mode and is used for encoding corresponding values into a first vector space, and the dimensionality of the first vector space is equal to the number of values in the first value set;
the second characteristic coding mode is a compression coding mode and is used for coding the corresponding values to a second vector space, and the dimensionality of the second vector space is smaller than the value number in the second value set.
For example, the first characteristic encoding mode is a one-hot encoding mode, and the second characteristic encoding mode is a hash encoding mode.
Optionally, the second feature encoding manner includes:
mapping each value in the second value set to a numerical value in a numerical space by using a numerical hash function;
and taking the numerical value modulo a preset number p, and mapping the result to a p-dimensional vector, wherein the preset number p is less than the number of values in the second value set.
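A minimal sketch of these two steps follows. The specification does not name the numerical hash function or say how the remainder becomes a vector, so MD5 and a one-hot placement of the remainder are assumptions, as are the names `numeric_hash` and `hash_encode`.

```python
import hashlib

def numeric_hash(value):
    # Assumed stand-in for the numerical hash function: a stable integer per string.
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

def hash_encode(value, p):
    # Reduce the hash modulo p and mark that position in a p-dimensional vector.
    vector = [0] * p
    vector[numeric_hash(value) % p] = 1
    return vector

# 97 hypothetical low-discrimination values compressed into p = 4 dimensions.
second_set = ["value_%d" % i for i in range(97)]
p = 4
vectors = [hash_encode(v, p) for v in second_set]
print(len(vectors[0]))                   # 4
print(len({tuple(v) for v in vectors}))  # at most 4 distinct vectors
```

Because p is smaller than the number of values, several values necessarily share a vector. That is the compression the second encoding mode accepts for values that contribute little to the business target.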
In one example, the plurality of value sets are divided by:
sorting all possible values of the characteristic variables from large to small according to the discrimination of the values to the business targets;
sequentially selecting a first number of values as the first value set according to the sequence;
and taking other values except the first number as the second value set.
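The rank-based division above can be sketched as follows, assuming the discrimination of each value is already available as a dictionary. The name `split_top_n` and the example scores are hypothetical.

```python
def split_top_n(scores, first_number):
    # Sort values by discrimination, largest first.
    ranked = sorted(scores, key=scores.get, reverse=True)
    # The leading first_number values form the first set; the rest form the second.
    return ranked[:first_number], ranked[first_number:]

scores = {"a": 0.9, "b": 0.7, "c": 0.6, "d": 0.2, "e": 0.1}
first_set, second_set = split_top_n(scores, 3)
print(first_set)   # ['a', 'b', 'c']
print(second_set)  # ['d', 'e']
```

This form fixes the size of the first value set in advance, which in turn fixes the dimension of the corresponding vector space.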
In another example, the plurality of value sets are divided by:
acquiring the discrimination of all possible values of the characteristic variables to the service target;
adding the value with the discrimination degree larger than or equal to a first threshold value to the first value set;
and adding the value with the discrimination smaller than the first threshold value to the second value set.
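The threshold-based division can be sketched in the same spirit. The name `split_by_threshold` and the example scores are hypothetical; only the comparison against the first threshold comes from the text.

```python
def split_by_threshold(scores, first_threshold):
    # Values at or above the threshold go to the first set, the rest to the second.
    first_set = [v for v, s in scores.items() if s >= first_threshold]
    second_set = [v for v, s in scores.items() if s < first_threshold]
    return first_set, second_set

scores = {"a": 0.9, "b": 0.7, "c": 0.6, "d": 0.2, "e": 0.1}
first_set, second_set = split_by_threshold(scores, 0.6)
print(first_set)   # ['a', 'b', 'c']
print(second_set)  # ['d', 'e']
```

Here the size of the first value set is not fixed in advance; it depends on how many values reach the threshold.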
Further, the second value set comprises a first subset and a second subset; the average discrimination of the values contained in the first subset to the service target is higher than that of the second subset; the second characteristic encoding mode includes a first compression encoding mode corresponding to the first subset and a second compression encoding mode corresponding to the second subset, and an encoding compression ratio of the first compression encoding mode is smaller than that of the second compression encoding mode.
For example, the second feature encoding mode is a hash encoding mode, and:
the first compression encoding mode includes:
mapping each value in the first subset to a first numerical value by using a numerical hash function;
taking the first numerical value modulo a first number, and mapping the result to a vector whose dimension is the first number;
the second compression encoding mode includes:
mapping each value in the second subset to a second numerical value by using the numerical hash function;
taking the second numerical value modulo a second number, and mapping the result to a vector whose dimension is the second number;
wherein the first number is greater than the second number.
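The two compression encoding modes differ only in the modulus. A minimal sketch follows, in which the constants `FIRST_NUMBER` and `SECOND_NUMBER`, the choice of SHA-256 as the numerical hash function, and the one-hot placement of the remainder are assumptions made for illustration.

```python
import hashlib

FIRST_NUMBER = 8   # dimension for the more discriminative first subset (assumed)
SECOND_NUMBER = 2  # dimension for the less discriminative second subset (assumed)

def numeric_hash(value):
    # The same numerical hash function serves both subsets.
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)

def compress(value, dim):
    # Reduce the hash modulo dim and mark that position in a dim-dimensional vector.
    vector = [0] * dim
    vector[numeric_hash(value) % dim] = 1
    return vector

def encode_second_set(value, first_subset):
    # Lower compression (more dimensions) for the first subset, higher for the second.
    if value in first_subset:
        return compress(value, FIRST_NUMBER)
    return compress(value, SECOND_NUMBER)

first_subset = {"d", "e"}
print(len(encode_second_set("d", first_subset)))  # 8
print(len(encode_second_set("z", first_subset)))  # 2
```

A larger modulus means fewer collisions, so values in the first subset stay easier to tell apart than values in the second subset.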
In an example, the determining unit 34 is specifically configured to:
filling at least one other vector space of the plurality of vector spaces except the target vector space with default values to obtain at least one default vector;
and splicing and combining the target vector and the at least one default vector to obtain the feature vector.
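The filling and splicing steps generalize to any number of vector spaces. In the sketch below, the function name `splice` and its parameters are hypothetical; `space_dims` lists the dimension of each vector space in the fixed order in which the spaces are concatenated.

```python
def splice(target_vector, target_index, space_dims, default=0):
    # The target vector must fit the space it is said to belong to.
    if len(target_vector) != space_dims[target_index]:
        raise ValueError("target vector does not match its vector space")
    feature = []
    for index, dim in enumerate(space_dims):
        if index == target_index:
            feature.extend(target_vector)    # the encoded value
        else:
            feature.extend([default] * dim)  # default vector for every other space
    return feature

# Two spaces: a 3-dimensional one-hot space and a 4-dimensional hash space.
print(splice([0, 1, 0], 0, [3, 4]))     # [0, 1, 0, 0, 0, 0, 0]
print(splice([0, 0, 1, 0], 1, [3, 4]))  # [0, 0, 0, 0, 0, 1, 0]
```

Keeping the spaces in a fixed order means that a given position in the feature vector always refers to the same space, whichever encoding mode produced the target vector.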
With the apparatus provided by the embodiments of this specification, the obtaining unit 31 first obtains a variable value of a feature variable related to a business target, the variable value being of a non-numerical type. The selecting unit 32 then selects, from multiple feature encoding modes, the target encoding mode corresponding to that variable value according to the predetermined correspondence between multiple value sets of the feature variable and the multiple feature encoding modes. The encoding unit 33 encodes the variable value into a target vector in a target vector space using the selected target encoding mode, the target vector space being the one of the multiple vector spaces that corresponds to the target encoding mode, and the determining unit 34 determines the feature vector of the feature variable based on that target vector. Because the value sets are divided according to the pre-evaluated discrimination of the possible values of the feature variable with respect to the business target, and the feature encoding modes encode the values of their corresponding value sets into separate vector spaces, the model does not lose useful information, the length of the feature encoding is reduced, and a degree of generalization is retained.
According to an embodiment of another aspect, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method described in connection with fig. 2.
According to an embodiment of yet another aspect, there is also provided a computing device comprising a memory having stored therein executable code, and a processor that, when executing the executable code, implements the method described in connection with fig. 2.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in this invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The specific embodiments described above further explain the objects, technical solutions, and advantages of the present invention in detail. It should be understood that they are only specific embodiments of the present invention and are not intended to limit its scope of protection; any modification, equivalent substitution, or improvement made on the basis of the technical solutions of the present invention shall fall within that scope.

Claims (22)

1. A feature encoding method, comprising:
obtaining variable values of characteristic variables related to a business target, wherein the variable values are of non-numerical types;
selecting a target coding mode corresponding to the variable value from the multiple characteristic coding modes according to a predetermined corresponding relation between multiple value sets of the characteristic variable and the multiple characteristic coding modes, wherein the multiple value sets are divided according to the pre-evaluated discrimination degree of various possible values of the characteristic variable on the service target, and the multiple characteristic coding modes are used for coding the values in the corresponding value sets into multiple vector spaces;
encoding the variable value into a target vector in a target vector space using the target encoding scheme, the target vector space being a vector space corresponding to the target encoding scheme among the plurality of vector spaces;
determining a feature vector of the feature variable based on the target vector;
the plurality of value sets comprise a first value set and a second value set, and the average discrimination of the values contained in the first value set to the service target is higher than that of the second value set; the multiple feature encoding modes comprise a first feature encoding mode and a second feature encoding mode, the first feature encoding mode corresponds to the first value set, the second feature encoding mode corresponds to the second value set, and the encoding compression rate of the first feature encoding mode is smaller than that of the second feature encoding mode.
2. The method of claim 1, wherein the discrimination of the various possible values of the feature variable with respect to the business target is evaluated based on the prior probability of each possible value with respect to achievement of the business target.
3. The method of claim 1, wherein the first feature encoding mode is a full-scale encoding mode for encoding corresponding values into a first vector space, the dimension of the first vector space being equal to the number of values in the first value set;
the second characteristic coding mode is a compression coding mode and is used for coding the corresponding values to a second vector space, and the dimensionality of the second vector space is smaller than the value number in the second value set.
4. The method according to claim 3, wherein the first characteristic encoding scheme is a one-hot encoding scheme, and the second characteristic encoding scheme is a hash encoding scheme.
5. The method of claim 4, wherein the second feature encoding mode comprises:
mapping the values in the second value set into values in a value space by adopting a numerical hash function;
and taking the modulus of the value to a preset number p, and mapping a modulus result into a p-dimensional vector, wherein the preset number p is less than the value number in the second value set.
6. The method of any one of claims 1-5, wherein the plurality of sets of values are divided by:
sorting all possible values of the characteristic variables from large to small according to the discrimination of the values to the business targets;
sequentially selecting a first number of values as the first value set according to the sequence;
and taking other values except the first number as the second value set.
7. The method of any one of claims 1-5, wherein the plurality of sets of values are divided by:
acquiring the discrimination of all possible values of the characteristic variables to the service target;
adding the value with the discrimination degree larger than or equal to a first threshold value to the first value set;
and adding the value with the discrimination smaller than the first threshold value to the second value set.
8. The method of claim 3, wherein the second set of values comprises a first subset and a second subset; the average discrimination of the values contained in the first subset to the service target is higher than that of the second subset; the second characteristic encoding mode includes a first compression encoding mode corresponding to the first subset and a second compression encoding mode corresponding to the second subset, and an encoding compression ratio of the first compression encoding mode is smaller than that of the second compression encoding mode.
9. The method of claim 8, wherein the second feature encoding mode is a hash encoding mode, and:
the first compression encoding method includes:
mapping the values in the first subset into a first value by adopting a numerical hash function;
taking a modulus of the first numerical value to a first number, and mapping a modulus result into a vector of a dimension of the first number;
the second compression encoding method includes:
mapping the values in the second subset into a second value by adopting the numerical hash function;
taking a modulus of the second numerical value to a second number, and mapping a modulus result into a vector of a second number dimension;
wherein the first number is greater than the second number.
10. The method of claim 1, wherein determining a feature vector of the feature variable based on the target vector comprises:
filling at least one other vector space of the plurality of vector spaces except the target vector space with default values to obtain at least one default vector;
and splicing and combining the target vector and the at least one default vector to obtain the feature vector.
11. A feature encoding apparatus comprising:
an obtaining unit, configured to obtain a variable value of a feature variable related to a business target, wherein the variable value is of a non-numerical type;
a selecting unit, configured to select, according to a predetermined correspondence between multiple value sets of the feature variables and multiple feature coding modes, a target coding mode corresponding to the variable value obtained by the obtaining unit from the multiple feature coding modes, where the multiple value sets are divided according to pre-evaluated degrees of distinction of various possible values of the feature variables with respect to the service target, and the multiple feature coding modes are used to code the values in the corresponding value sets into multiple vector spaces;
an encoding unit configured to encode the variable value acquired by the acquisition unit into a target vector in a target vector space, which is a vector space corresponding to the target encoding scheme among the plurality of vector spaces, using the target encoding scheme selected by the selection unit;
a determining unit, configured to determine a feature vector of the feature variable based on the target vector obtained by the encoding unit;
the plurality of value sets comprise a first value set and a second value set, and the average discrimination of the values contained in the first value set to the service target is higher than that of the second value set; the multiple feature encoding modes comprise a first feature encoding mode and a second feature encoding mode, the first feature encoding mode corresponds to the first value set, the second feature encoding mode corresponds to the second value set, and the encoding compression rate of the first feature encoding mode is smaller than that of the second feature encoding mode.
12. The apparatus of claim 11, wherein the degree of discrimination of the business objective by the various possible values of the feature variable is evaluated according to a prior probability of the various possible values being on achievement of the business objective.
13. The apparatus of claim 11, wherein the first characteristic encoding is a full-scale encoding for encoding corresponding values into a first vector space, the first vector space having dimensions equal to the number of values in the first set of values;
the second characteristic coding mode is a compression coding mode and is used for coding the corresponding values to a second vector space, and the dimensionality of the second vector space is smaller than the value number in the second value set.
14. The apparatus according to claim 13, wherein the first characteristic encoding scheme is a one-hot encoding scheme, and the second characteristic encoding scheme is a hash encoding scheme.
15. The apparatus of claim 14, wherein the second feature encoding mode comprises:
mapping the values in the second value set into values in a value space by adopting a numerical hash function;
and taking the modulus of the value to a preset number p, and mapping a modulus result into a p-dimensional vector, wherein the preset number p is less than the value number in the second value set.
16. The apparatus of any one of claims 11-15, wherein the plurality of sets of values are divided by:
sorting all possible values of the characteristic variables from large to small according to the discrimination of the values to the business targets;
sequentially selecting a first number of values as the first value set according to the sequence;
and taking other values except the first number as the second value set.
17. The apparatus of any one of claims 11-15, wherein the plurality of sets of values are divided by:
acquiring the discrimination of all possible values of the characteristic variables to the service target;
adding the value with the discrimination degree larger than or equal to a first threshold value to the first value set;
and adding the value with the discrimination smaller than the first threshold value to the second value set.
18. The apparatus of claim 13, wherein the second set of values comprises a first subset and a second subset; the average discrimination of the values contained in the first subset to the service target is higher than that of the second subset; the second characteristic encoding mode includes a first compression encoding mode corresponding to the first subset and a second compression encoding mode corresponding to the second subset, and an encoding compression ratio of the first compression encoding mode is smaller than that of the second compression encoding mode.
19. The apparatus of claim 18, wherein the second feature encoding mode is a hash encoding mode, and:
the first compression encoding method includes:
mapping the values in the first subset into a first value by adopting a numerical hash function;
taking a modulus of the first numerical value to a first number, and mapping a modulus result into a vector of a dimension of the first number;
the second compression encoding method includes:
mapping the values in the second subset into a second value by adopting the numerical hash function;
taking a modulus of the second numerical value to a second number, and mapping a modulus result into a vector of a second number dimension;
wherein the first number is greater than the second number.
20. The apparatus of claim 11, wherein the determining unit is specifically configured to:
filling at least one other vector space of the plurality of vector spaces except the target vector space with default values to obtain at least one default vector;
and splicing and combining the target vector and the at least one default vector to obtain the feature vector.
21. A computer-readable storage medium, on which a computer program is stored, which, when the computer program is executed in a computer, causes the computer to carry out the method of any one of claims 1-10.
22. A computing device comprising a memory and a processor, wherein the memory has stored therein executable code that, when executed by the processor, implements the method of any of claims 1-10.
CN201810887448.8A 2018-08-06 2018-08-06 Feature encoding method and apparatus Active CN109146083B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810887448.8A CN109146083B (en) 2018-08-06 2018-08-06 Feature encoding method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810887448.8A CN109146083B (en) 2018-08-06 2018-08-06 Feature encoding method and apparatus

Publications (2)

Publication Number Publication Date
CN109146083A CN109146083A (en) 2019-01-04
CN109146083B true CN109146083B (en) 2021-07-23

Family

ID=64791926

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810887448.8A Active CN109146083B (en) 2018-08-06 2018-08-06 Feature encoding method and apparatus

Country Status (1)

Country Link
CN (1) CN109146083B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109934628B (en) * 2019-03-08 2021-03-19 智者四海(北京)技术有限公司 Feature processing method and device
CN110675931A (en) * 2019-08-28 2020-01-10 吉林金域医学检验所有限公司 Information coding method, device, equipment and storage medium for detection report
CN110970100A (en) * 2019-11-04 2020-04-07 广州金域医学检验中心有限公司 Method, device and equipment for detecting item coding and computer readable storage medium
CN111611449B (en) * 2020-05-08 2023-08-29 百度在线网络技术(北京)有限公司 Information encoding method, apparatus, electronic device, and computer-readable storage medium
CN113220947A (en) * 2021-05-27 2021-08-06 支付宝(杭州)信息技术有限公司 Method and device for encoding event characteristics

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106897776A (en) * 2017-01-17 2017-06-27 华南理工大学 A kind of continuous type latent structure method based on nominal attribute

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW201004361A (en) * 2008-07-03 2010-01-16 Univ Nat Cheng Kung Encoding device and method thereof for stereoscopic video

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106897776A (en) * 2017-01-17 2017-06-27 华南理工大学 A kind of continuous type latent structure method based on nominal attribute

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
基于Xgboost的商业销售预测;叶倩怡 等;《南昌大学学报(理科版)》;20170630;第41卷(第3期);第275-281页 *
非数值型特征如何进行编码;luguanyou;《非数值型特征如何进行编码?_luguanyou的博客-CSDN博客(https://blog.csdn.net/luguanyou/article/details/80598122)》;20180606;第1-9页 *

Also Published As

Publication number Publication date
CN109146083A (en) 2019-01-04

Similar Documents

Publication Publication Date Title
CN109146083B (en) Feature encoding method and apparatus
CN109684554B (en) Method for determining potential users of news and news pushing method
CN110659744B (en) Training event prediction model, and method and device for evaluating operation event
CN110866181B (en) Resource recommendation method, device and storage medium
CN111783875A (en) Abnormal user detection method, device, equipment and medium based on cluster analysis
CN110728290B (en) Method and device for detecting security of data model
CN105447498A (en) A client device configured with a neural network, a system and a server system
CN109903168B (en) Method for recommending insurance products based on machine learning and related equipment
US8805767B1 (en) Machine learning memory management and distributed rule evaluation
CN113255908B (en) Method, neural network model and device for service prediction based on event sequence
CN114626553A (en) Training method and device of financial data monitoring model and computer equipment
CN114782201A (en) Stock recommendation method and device, computer equipment and storage medium
US11995150B2 (en) Information processing method and information processing system
JP7056345B2 (en) Data analysis systems, methods, and programs
KR20230069578A (en) Sign-Aware Recommendation Apparatus and Method using Graph Neural Network
CN112634057A (en) Fund similarity calculation method, platform, device and readable storage medium
CN110275881B (en) Method and device for pushing object to user based on Hash embedded vector
CN115905864A (en) Abnormal data detection model training method and device and computer equipment
US20220164654A1 (en) Energy- and memory-efficient training of neural networks
CN118057373A (en) Data processing method, risk assessment device and electronic equipment
Azhar et al. Data compression techniques for stock market prediction
CN109191192B (en) Data estimation method, apparatus and computer-readable storage medium
CN110147881B (en) Language processing method, device, equipment and storage medium
CN114723960B (en) Additional verification method and system for enhancing bank account security
CN114115730B (en) Application container storage engine platform

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20200927

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant after: Innovative advanced technology Co.,Ltd.

Address before: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant before: Advanced innovation technology Co.,Ltd.

Effective date of registration: 20200927

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant after: Advanced innovation technology Co.,Ltd.

Address before: A four-storey 847 mailbox in Grand Cayman Capital Building, British Cayman Islands

Applicant before: Alibaba Group Holding Ltd.

GR01 Patent grant