CN110837894B - Feature processing method, device and storage medium - Google Patents

Info

Publication number
CN110837894B
Authority
CN
China
Prior art keywords
feature
segment
target
candidate
segments
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911029966.7A
Other languages
Chinese (zh)
Other versions
CN110837894A (en)
Inventor
郑立凡
吕培立
董井然
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201911029966.7A
Publication of CN110837894A
Application granted
Publication of CN110837894B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to a feature processing method, a device and a storage medium. The method comprises: acquiring object information of a plurality of objects, wherein the object information of each object comprises a label of the object and a plurality of items of feature information of the object, each item corresponding to one of a plurality of single features; segmenting the items of feature information of the plurality of objects that correspond to the same single feature, to obtain a candidate segment set for each single feature; screening the candidate segments in each candidate segment set based on the labels of the objects, to obtain a target segment set for each single feature; combining the target segments across the target segment sets; and constructing a target combined feature set based on the combination results. With this method, user features of any dimension can be cross-combined automatically even at large data volumes, so that a corresponding target combined feature set is generated.

Description

Feature processing method, device and storage medium
Technical Field
The present disclosure relates to the field of machine learning technologies, and in particular, to a feature processing method, a device, and a storage medium.
Background
A feature cross is a composite feature formed by combining two or more features. The combination can provide predictive power beyond what the individual features provide on their own, so feature crossing can enhance the expressive capability of a model and improve the predictive performance of a machine learning model.
Through data mining, various user features, such as age bracket, education level, and income bracket, can be extracted from the profile data and daily behavior of a large number of users, so the volume of data to be processed when crossing features is huge. Existing methods for feature cross combination at large data volumes require manual processing and do not allow the feature crossing to be customized. An effective feature processing method is therefore needed to solve these problems of the prior art.
Disclosure of Invention
The technical problem to be solved by the application is to provide a feature processing method, a device and a storage medium, which can automatically perform feature cross combination on user features with any dimension under the condition of large data volume, so as to generate a corresponding target combination feature set, and facilitate the subsequent direct determination of corresponding target combination features according to the acquired user feature information.
In order to solve the above technical problems, in one aspect, the present application provides a feature processing method, where the method includes:
acquiring object information of a plurality of objects, wherein the object information of each object comprises a label of the object and a plurality of items of feature information of the object, each item corresponding to one of a plurality of single features;
segmenting the items of feature information of the plurality of objects that correspond to the same single feature, to obtain a candidate segment set for each single feature; wherein each candidate segment set comprises at least two candidate segments;
screening candidate segments in each candidate segment set based on the labels of each object to obtain target segment sets corresponding to each single feature respectively;
combining the target segments in each target segment set;
and constructing a target combination feature set based on the combination result of each target segment.
In another aspect, the present application provides a feature processing apparatus, the apparatus including:
the object information acquisition module is used for acquiring object information of a plurality of objects, wherein the object information of each object comprises a label of the object and a plurality of items of feature information of the object, each item corresponding to one of a plurality of single features;
the candidate segment set construction module is used for segmenting the items of feature information of the plurality of objects that correspond to the same single feature, to obtain a candidate segment set for each single feature; wherein each candidate segment set comprises at least two candidate segments;
the target segment set construction module is used for screening candidate segments in each candidate segment set based on the labels of each object to obtain target segment sets corresponding to each single feature respectively;
the target segment combination module is used for combining target segments in each target segment set;
and the target combined feature set construction module is used for constructing a target combined feature set based on the combined result of each target segment.
In another aspect, the present application provides a computer storage medium having stored therein at least one instruction, at least one program, code set, or instruction set that is loaded by a processor and that performs a feature processing method as described above.
In another aspect, the present application provides an apparatus comprising a processor and a memory having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, loaded and executed by the processor to implement a feature processing method as described above.
The implementation of the embodiment of the application has the following beneficial effects:
after object information of a plurality of objects is acquired, the method automatically segments based on multi-item feature information corresponding to the same single feature to obtain a candidate segment set corresponding to each single feature; screening the candidate segments in each candidate set based on preset screening conditions to obtain target segment sets which correspond to each single feature and meet the screening conditions; and combining the target segments in each target segment set to generate a plurality of combined features, thereby constructing a target combined feature set. According to the method and the device, under the condition of large data volume, the user features in any dimension can be automatically subjected to feature cross combination, so that a corresponding target combination feature set is generated, corresponding target combination features can be conveniently and directly determined from the target combination feature set according to the acquired user feature information, and therefore the machine learning model effect of training based on the target combination features can be enhanced.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions and advantages of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are only some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic view of an application scenario provided in an embodiment of the present application;
FIG. 2 is a flow chart of a feature processing method provided in an embodiment of the present application;
FIG. 3 is a flowchart of a candidate segment set determination method according to an embodiment of the present application;
FIG. 4 is a flowchart of a method for segmenting feature information according to feature types according to an embodiment of the present application;
FIG. 5 is a flowchart of a method for generating a target segment set according to an embodiment of the present application;
FIG. 6 is a flowchart of a candidate segment contribution calculation method according to an embodiment of the present application;
FIG. 7 is a flowchart of a method for combining target segments according to an embodiment of the present application;
FIG. 8 is a flow chart of a method for model training based on combined features provided in an embodiment of the present application;
FIG. 9 is a schematic diagram of a feature processing apparatus according to an embodiment of the present application;
FIG. 10 is a schematic diagram of a candidate segment set construction module provided in an embodiment of the present application;
FIG. 11 is a schematic diagram of a segmentation processing module according to an embodiment of the present disclosure;
FIG. 12 is a schematic diagram of a target segment set construction module provided in an embodiment of the present application;
FIG. 13 is a schematic diagram of a contribution value determining module provided in an embodiment of the present application;
FIG. 14 is a schematic diagram of a target segment combining module provided in an embodiment of the present application;
FIG. 15 is a schematic diagram of a model training module provided in an embodiment of the present application;
FIG. 16 is a schematic view of an apparatus structure according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the present application will be described in further detail with reference to the accompanying drawings. It will be apparent that the described embodiments are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present application based on the embodiments herein.
It should be noted that the terms "first," "second," and the like in the description and claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or server that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The following explanation is first made on technical terms involved in the embodiments of the present application:
Machine learning (ML): machine learning is the science of enabling a computer to learn and act as a human does; a model learns latent knowledge from a large amount of data and is optimized with an optimization algorithm. It is currently widely applied in fields such as shopping recommendation, search ranking, advertisement click prediction, credit risk assessment, image recognition, and automatic driving.
WOE (Weight of Evidence): a measure of the difference between the distribution of normal samples and the distribution of default samples.
IV (Information Value): the sum of the KL distances between the distribution of positive samples over a feature and the distribution of negative samples over that feature. In short, the higher the IV of a feature, the stronger its predictive power.
Feature engineering: feature engineering is the process of using domain knowledge about the data to create features that enable machine learning algorithms to work, and is the basis of machine learning applications.
Big data processing: big data is characterized by huge data volume, diverse data types, parallel processing, and the like. Commonly used processing frameworks at present include Hadoop and Spark.
Referring to fig. 1, a schematic diagram of an application scenario provided in an embodiment of the present application is shown, where the scenario includes at least a server 110 and a terminal 120.
In the embodiment of the present disclosure, the server 110 may include a server that operates independently, or a distributed server, or a server cluster including a plurality of servers. Specifically, the server 110 may be configured to obtain object information from each terminal 120, and perform feature combination based on the object information.
In this embodiment of the present disclosure, the terminal 120 may include a smart phone, a desktop computer, a tablet computer, a notebook computer, a digital assistant, a smart wearable device, or other types of physical devices, or may include software running on the physical devices, such as an application program or a website. The operating system running on the terminal in the embodiment of the present application may include, but is not limited to, Android, iOS, Linux, and Windows.
In order to solve the problems in the prior art that only numerical features are supported, discrete features are not supported, a feature crossing mode cannot be customized, and feature crossing processing needs to be performed manually in feature crossing combination, the embodiment of the application provides a feature processing method, an execution subject of which may be a server in fig. 1, and specifically, referring to fig. 2, the method includes:
S210, acquiring object information of a plurality of objects, wherein the object information of each object comprises a label of the object and a plurality of items of feature information of the object, each item corresponding to one of a plurality of single features.
The object information acquired here can be regarded as user sample information, which includes a response label for each user and multiple items of feature information for each user. The response label indicates whether the user responded to a given information promotion, marketing campaign, or the like: when the user responded to the campaign, the response label is 1, and when the user did not respond, the response label is 0. The items of feature information of a user may include the user's age, gender, city, education level, purchasing behavior, and the like.
S220, segmenting the items of feature information of the plurality of objects that correspond to the same single feature, to obtain a candidate segment set for each single feature; wherein each candidate segment set includes at least two candidate segments.
In the user sample information, the same item of feature information taken across different users forms the feature information of the corresponding single feature in the current user sample. The values or categories of the feature information under each single feature may differ, so the feature information of each single feature needs to be segmented separately. Note that the feature information corresponding to one single feature can be regarded as a group, and segmentation divides the information within that group. For the specific segmentation method, refer to fig. 3, which shows a candidate segment set determining method, where the method includes:
S310, determining the feature type of each single feature.
For each single feature, the feature types are generally different, and the feature types of the single feature are determined according to the feature information corresponding to the single feature.
S320, carrying out segmentation processing on the feature information corresponding to the single feature according to the feature type of the single feature to obtain a plurality of segments.
For single features of different feature types, determining a segmentation method corresponding to the feature type, and generating a plurality of segments under each single feature.
S330, constructing the candidate segment set corresponding to the single feature based on a plurality of segments.
In the embodiment of the application, the feature type of a single feature may be a numerical type or a category type. For example, for age, the corresponding feature information is a number, such as 20 or 45; for gender, the corresponding feature information is a category, such as male or female; for city, the corresponding feature information is a category, such as Jiangsu Province or Guangdong Province. When the feature information is a number, the feature type is the numerical type; when the feature information is a category, the feature type is the category type.
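The type determination described above can be sketched as follows. This is a minimal illustration, not the method of the application: the function name `infer_feature_type`, the type labels, and the rule "all present values are numbers" are assumptions introduced here.

```python
from numbers import Number


def infer_feature_type(values):
    """Return "numerical" if every non-missing value is a number, else "category".

    Hypothetical helper: the rule used here is an assumption for illustration.
    """
    present = [v for v in values if v is not None]
    is_numeric = bool(present) and all(
        isinstance(v, Number) and not isinstance(v, bool) for v in present
    )
    return "numerical" if is_numeric else "category"


# Toy sample using the example features from the text.
samples = {
    "age": [20, 45, 33],
    "gender": ["male", "female", "male"],
    "city": ["Jiangsu Province", "Guangdong Province", "Guangdong Province"],
}
feature_types = {name: infer_feature_type(col) for name, col in samples.items()}
```

In this sketch, age is classified as numerical while gender and city are classified as category features, which matches the examples given above.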
Referring to fig. 4, a method for segmenting feature information according to feature type is shown, the method comprising:
S410, when the feature type of the single feature is a numerical type, segmenting the feature information based on the value range of the feature information corresponding to the single feature to obtain a plurality of segments.
Taking age characteristics as an example, in each item of age characteristic information of the corresponding user sample, the age range is 18-75 years old, and then the following segmentation can be performed on the age characteristic information:
age≤20;
21≤age≤30;
31≤age≤40;
41≤age≤50;
51≤age≤60;
61≤age≤70;
age≥71;
The number of segments to produce, for example 7, may be preset before segmentation. During segmentation, a first segment, age≤20, is split off from the age range, and the remainder, age≥21, forms one segment; it is then judged whether the number of existing segments has reached 7. If not, the remainder age≥21 is split further into 21≤age≤30 and the remainder age≥31, and the number of segments is checked again; the remainder age≥31 is then split in the same way, and so on, until the number of segments reaches the preset 7.
Each segmentation result can be adjusted according to actual conditions. For example, if the user sample information shows that few users over 60 years old responded, the segments 61≤age≤70 and age≥71 can be merged into age≥61. Segments can also be adjusted according to other conditions.
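The age segmentation above can be sketched as a lookup against a sorted list of inclusive upper bounds. The helper names `segment_index` and `segment_labels` are hypothetical, and integer ages are assumed; merging the last two segments then amounts to dropping the last bound.

```python
from bisect import bisect_left


def segment_index(value, upper_bounds):
    """Index of the segment containing value; upper_bounds are inclusive upper edges."""
    return bisect_left(upper_bounds, value)


def segment_labels(upper_bounds, name="age"):
    """Human-readable labels for the segments, assuming integer values."""
    labels = [f"{name}≤{upper_bounds[0]}"]
    for lo, hi in zip(upper_bounds, upper_bounds[1:]):
        labels.append(f"{lo + 1}≤{name}≤{hi}")
    labels.append(f"{name}≥{upper_bounds[-1] + 1}")
    return labels


# Six bounds give the 7 segments listed in the text.
bounds = [20, 30, 40, 50, 60, 70]
labels = segment_labels(bounds)

# Merging 61≤age≤70 and age≥71 into age≥61 drops the last bound.
merged_bounds = bounds[:-1]
merged_labels = segment_labels(merged_bounds)
```

For example, `segment_index(45, bounds)` selects the segment 41≤age≤50, and after merging any age above 60 falls into the single segment age≥61.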
S420, when the feature type of the single feature is a category type, segmenting the feature information based on the feature category contained in the feature information corresponding to the single feature to obtain a plurality of segments.
For the category type, each category under the single feature can form its own segment. For example, gender comprises the two categories male and female, so the feature information under this single feature is divided directly into two segments, male and female. Taking city as an example, each city may be placed in its own segment. In some cases, however, for example when different users are located in Guangzhou and Shenzhen, Guangzhou and Shenzhen are first divided into two different segments but may be merged in later processing because the two cities are relatively close.
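A minimal sketch of category segmentation follows, assuming a hypothetical helper `segment_categories` and an optional `merge_map` that stands in for the later merging step. The city "Nanjing" and the merged label are illustrative values, not taken from the application.

```python
def segment_categories(values, merge_map=None):
    """Group sample indices by category; merge_map optionally renames categories
    so that several categories share one segment (hypothetical parameter)."""
    merge_map = merge_map or {}
    segments = {}
    for idx, value in enumerate(values):
        label = merge_map.get(value, value)
        segments.setdefault(label, []).append(idx)
    return segments


cities = ["Guangzhou", "Shenzhen", "Nanjing", "Guangzhou"]

# One segment per city.
per_city = segment_categories(cities)

# Guangzhou and Shenzhen merged into one segment in later processing.
merged = segment_categories(
    cities,
    {"Guangzhou": "Guangzhou/Shenzhen", "Shenzhen": "Guangzhou/Shenzhen"},
)
```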
S230, screening candidate segments in each candidate segment set based on labels of the objects to obtain target segment sets corresponding to the single features respectively.
Based on the above, the initial segmentation may refine the feature information under each single feature as far as possible to obtain all possible segments. However, in view of data volume and the stability and interpretability of the features, it is generally necessary to select, from the segments of each single feature, the candidate segments that best represent the single feature, and to merge the other candidate segments. For example, if feature A has 25 segments and feature B has 30 segments, the feature AB obtained by crossing feature A and feature B may have 750 segments. Too many segments may cause the following problems:
1. The feature is unstable, that is, the feature value calculated at one time node differs greatly from the feature value calculated at the next time node;
2. The model may overfit;
3. The meaning of each segment is difficult to interpret.
It is therefore necessary to screen the candidate segments and to merge the corresponding segments.
A specific segment screening method may refer to fig. 5, which illustrates a target segment set generating method, where the method includes:
S510, determining the contribution value of each candidate segment in the candidate segment set corresponding to each single feature based on the label of each object.
In the embodiment of the application, the selection of each segment is realized based on the contribution value of each segment in the single feature.
S520, selecting candidate segments meeting preset conditions as target segments based on the contribution values of the candidate segments.
The specific method for selecting the candidate segment meeting the preset condition as the target segment can comprise the following steps:
sorting the candidate segments in descending order of contribution value, and selecting a preset number of the top-ranked candidate segments as the target segments;
or,
selecting the candidate segments whose contribution value is greater than a preset value as the target segments.
S530, merging the remaining candidate segments, other than the target segments, into the target segments.
For example, suppose the initial segmentation of a single feature yields 6 candidate segments: segment 1, segment 2, segment 3, segment 4, segment 5, and segment 6. The contribution value of each candidate segment is calculated. If the 4 target segments selected based on the contribution values are segment 1, segment 2, segment 3, and segment 4, then segment 5 and segment 6 may be merged into segment 4, giving 4 segments. If the 4 target segments selected are segment 1, segment 3, segment 4, and segment 6, then the remaining segment 2 may be merged into either segment 1 or segment 3, and the remaining segment 5 may be merged into either segment 4 or segment 6; the embodiments of the present application do not specifically limit this.
S540, generating a target segment set corresponding to the single feature based on each target segment.
After screening and merging, target segment sets corresponding to the single features are obtained, and subsequent segment combination among the single features can be performed based on the target segment sets.
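Steps S520 and S530 can be sketched as follows. The application leaves the merge destination open, so this sketch assumes one possible rule: each remaining segment is merged into the nearest target segment, with ties going to the lower-numbered neighbour. The function name `select_and_merge`, the contribution values, and the use of 0-based indices (index 0 is segment 1) are all assumptions for illustration.

```python
def select_and_merge(contributions, top_k):
    """Pick the top_k segments by contribution and assign every candidate
    segment to a target segment.

    contributions: contribution values listed in segment order.
    Returns {candidate index: target index}. The nearest-neighbour merge
    rule is an assumption; the text does not fix the merge destination.
    """
    ranked = sorted(
        range(len(contributions)), key=lambda i: contributions[i], reverse=True
    )
    targets = sorted(ranked[:top_k])
    return {
        i: min(targets, key=lambda t: (abs(t - i), t))
        for i in range(len(contributions))
    }


# Segments 1-4 selected: segments 5 and 6 are merged into segment 4.
first = select_and_merge([0.30, 0.25, 0.20, 0.15, 0.02, 0.01], top_k=4)

# Segments 1, 3, 4, 6 selected: segment 2 goes to segment 1, segment 5 to segment 4.
second = select_and_merge([0.30, 0.01, 0.20, 0.15, 0.02, 0.25], top_k=4)
```

The threshold variant of S520 would replace the top-k selection with a filter on `contributions[i] > preset_value`; the merge step stays the same.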
Referring to fig. 6, a candidate segment contribution value calculation method is shown, the method includes:
S610, determining the total number of the responding objects and the total number of the non-responding objects based on the labels of the objects.
In the user sample data information, according to the response label of each user, the total number of the responding users and the total number of the non-responding users in the sample can be determined.
S620, for each candidate segment in the candidate segment set corresponding to each single feature, determining a target object in each candidate segment, wherein the target object comprises a response object and an unresponsive object.
S630, based on the labels of the target objects, the number of the responding objects and the number of the non-responding objects in each candidate segment are respectively determined.
Taking the above age segmentation as an example, the following age segmentation is included:
age≤20;
21≤age≤30;
31≤age≤40;
41≤age≤50;
51≤age≤60;
age≥61;
and respectively determining the users in each age section based on the user sample information, and determining the number of responding users and the number of non-responding users in each age section according to the response labels of the users.
S640, determining the coding value of each candidate segment based on the number of the response objects, the number of the non-response objects, the total number of the response objects and the total number of the non-response objects in each candidate segment.
The calculation of the encoding value of each segment in the present application may specifically be calculation of the WOE value (Weight of Evidence, evidence weight) of each segment, and for the WOE value of a certain candidate segment of each single feature, the calculation may be performed by the following formula:
$$\mathrm{WOE}_i = \ln\frac{p_{y_i}}{p_{n_i}} = \ln\frac{y_i / y_T}{n_i / n_T} \tag{1}$$
where $p_{y_i}$ is the ratio of the number of responding users in the segment to the total number of responding users, $p_{n_i}$ is the ratio of the number of unresponsive users in the segment to the total number of unresponsive users, $y_i$ is the number of responding users in the segment, $y_T$ is the total number of responding users in the sample, $n_i$ is the number of unresponsive users in the segment, and $n_T$ is the total number of unresponsive users in the sample.
Based on the above formula (1), WOE values for each candidate segment are calculated.
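As a small worked sketch of the WOE calculation, the function below computes the logarithm of the ratio between a segment's share of responding users and its share of unresponsive users. The function name `woe` and the sample counts are hypothetical; zero counts make the value undefined and are rejected here, which is a choice of this sketch rather than something the text specifies.

```python
import math


def woe(y_i, n_i, y_total, n_total):
    """Weight of evidence of one segment: ln((y_i / y_total) / (n_i / n_total))."""
    if min(y_i, n_i, y_total, n_total) <= 0:
        # Assumption of this sketch: undefined cases are rejected, not smoothed.
        raise ValueError("WOE is undefined when a count is zero")
    p_y = y_i / y_total  # share of all responding users that fall in this segment
    p_n = n_i / n_total  # share of all unresponsive users that fall in this segment
    return math.log(p_y / p_n)


# Hypothetical counts: 100 responders and 400 non-responders overall;
# the segment holds 30 responders and 60 non-responders.
segment_woe = woe(y_i=30, n_i=60, y_total=100, n_total=400)
```

Here the segment holds 30% of responders but only 15% of non-responders, so its WOE is ln 2, a positive value indicating that responders are over-represented in the segment.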
S650, calculating information values of the candidate segments based on the coding values of the candidate segments.
Based on the WOE value of each candidate segment, the formula for calculating the information value IV (Information Value ) of each candidate segment is:
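$$\mathrm{IV}_i = \left(p_{y_i} - p_{n_i}\right)\cdot\mathrm{WOE}_i = \left(p_{y_i} - p_{n_i}\right)\ln\frac{p_{y_i}}{p_{n_i}}$$
where $p_{y_i}$ is the ratio of the number of responding users in the segment to the total number of responding users, and $p_{n_i}$ is the ratio of the number of unresponsive users in the segment to the total number of unresponsive users.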
S660, determining the information value of each segment as the contribution value of each candidate segment.
Based on the calculation of the information value of each candidate segment, the contribution value of each candidate segment is obtained, and the candidate segment is screened based on the size of the contribution value of the candidate segment.
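Steps S610 to S660 can be sketched end to end: count responders and non-responders overall and per segment, then compute each segment's information value as the difference of the two shares multiplied by the logarithm of their ratio. The function name, the toy data, and the choice to return `None` for segments with a zero count are assumptions of this sketch.

```python
import math
from collections import Counter


def segment_information_values(segment_ids, labels):
    """Per-segment information value used as the contribution value.

    segment_ids: segment of each user; labels: 1 = responded, 0 = did not.
    Segments with no responders or no non-responders get None, because the
    logarithm is undefined there (handling chosen for this sketch only).
    """
    y_total = sum(labels)
    n_total = len(labels) - y_total
    responders = Counter(s for s, y in zip(segment_ids, labels) if y == 1)
    non_responders = Counter(s for s, y in zip(segment_ids, labels) if y == 0)
    result = {}
    for seg in sorted(set(segment_ids)):
        p_y = responders[seg] / y_total
        p_n = non_responders[seg] / n_total
        if p_y == 0 or p_n == 0:
            result[seg] = None
            continue
        result[seg] = (p_y - p_n) * math.log(p_y / p_n)
    return result


# Toy sample: two segments of four users each.
segment_ids = ["a"] * 4 + ["b"] * 4
labels = [1, 1, 1, 0, 1, 0, 0, 0]
iv = segment_information_values(segment_ids, labels)
```

Each defined information value is non-negative, because the difference of the shares and the logarithm of their ratio always have the same sign.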
S240, combining the target segments in each target segment set.
In this embodiment, the combining refers to cross-combining the target segments in each target segment set, so as to obtain combined feature information formed by combining single feature information, and specifically, please refer to fig. 7, which shows a target segment combining method, where the method includes:
S710, enumerating all combinations of the target segments in the target segment sets based on a preset target segment combination method; the target segment combination method takes one target segment from each target segment set and combines them.
S720, obtaining a plurality of combined features based on the enumeration result of the target segment combinations.
For example, there are three single features, whose corresponding target segment sets are respectively:
Set A: {A1, A2, A3, A4};
Set B: {B1, B2};
Set C: {C1, C2, C3};
Each time, one target segment is taken from each set and the segments are combined into one combined feature, for example A1B1C1, A1B1C2, A1B1C3, and so on; in the end, 4 × 2 × 3 = 24 combined features are obtained.
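The enumeration in S710 and S720 is a Cartesian product of the target segment sets. A minimal sketch using the standard library's `itertools.product`, with the three example sets from the text (the variable names are illustrative):

```python
from itertools import product

# Target segment sets of the three single features from the example.
target_sets = {
    "A": ["A1", "A2", "A3", "A4"],
    "B": ["B1", "B2"],
    "C": ["C1", "C2", "C3"],
}

# Take one target segment from each set and join them into a combined feature.
combined_features = ["".join(combo) for combo in product(*target_sets.values())]
```

The list starts with A1B1C1, A1B1C2, A1B1C3 and contains 4 × 2 × 3 = 24 entries.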
S250, constructing a target combination feature set based on a combination result of each target segment.
For multiple combined features in the target combined feature set, each combined feature may be sequentially labeled as a corresponding combined feature segment, e.g., for the 24 combined features described above, may be sequentially labeled as combined feature segment 1 through combined feature segment 24.
The above operations are equivalent to building a feature combination table over the single features. Given the feature information of any user for the multiple single features, the combined feature corresponding to that user can be found by looking up the feature combination table. This provides an automatic multi-dimensional feature cross combination tool for large data scales.
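The feature combination table can be sketched as a dictionary from a tuple of target segments (one per single feature) to a combined feature segment number. The function name `build_combination_table` and the 1-based numbering in enumeration order are assumptions for illustration.

```python
from itertools import product


def build_combination_table(target_sets):
    """Map each tuple of target segments to a 1-based combined segment number."""
    return {
        combo: number
        for number, combo in enumerate(product(*target_sets), start=1)
    }


target_sets = [["A1", "A2", "A3", "A4"], ["B1", "B2"], ["C1", "C2", "C3"]]
table = build_combination_table(target_sets)

# A user whose single features fall in segments A2, B1 and C3.
user_segments = ("A2", "B1", "C3")
user_combined_segment = table[user_segments]
```

Once the table exists, determining a user's combined feature is a single dictionary lookup, which is what makes the approach practical at large data scales.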
Referring to fig. 8, a method for model training based on combined features is shown, the method comprising:
S810, acquiring object information of a test object, wherein the object information comprises a label of the test object and a plurality of items of feature information of the test object, each item corresponding to one of a plurality of single features.
S820, determining corresponding target combination features in the target combination feature set based on multiple feature information respectively corresponding to the test object and the multiple single features.
S830, training a preset machine learning model based on the target combination features and the labels of the test objects.
According to the above description of the embodiments of the present application, after the feature combination table is established, when new user information is acquired, the combined feature information corresponding to the new user may be determined directly based on multiple pieces of feature information of the new user.
The combined feature information of a large number of users is used as the input of a preset model, and the parameters of the preset model are adjusted continuously during training until the output of the preset model matches the labels of the users, yielding the trained machine learning model. Because combined feature information can provide predictive power beyond that of the single features, training the preset model with combined feature information as input enhances the expressive power of the model and makes the prediction results more accurate.
The implementation process of the application can be realized through the following algorithm:
Let the data set be $(Y, X_1, X_2, X_3, \dots, X_d)$, where $Y$ and $X_1, X_2, X_3, \dots, X_d$ are all column vectors, $Y$ is the label corresponding to each user, and $(X_1, X_2, X_3, \dots, X_d)$ is the single-feature set.
1. Preprocessing the features $(X_1, X_2, X_3, \dots, X_d)$, such as filling missing values and replacing abnormal values;
2. Segmenting according to the type of the feature;
For example, when the current feature is a numerical feature, it is first determined whether a specified segmentation exists. When none exists, the feature information of the current feature is segmented according to a list of percentage points, for example at 10% and 20% of the numerical range; when a specified segmentation exists, it is adopted directly, and the segments obtained in the specified manner can be adjusted afterwards.
3. Calculating, based on $(X_1, X_2, X_3, \dots, X_d)$, the IV (information value) of the candidate segments of each single feature, and selecting the target segments;
For each single feature, the set of all possible segments may first be generated, and the segments that meet the requirements are then selected from it based on the IV values.
Taking the cross A × B as an example, a segment set is generated for each of A and B. During segment screening, for A: target segments are screened out based on the IV value of each segment of A and a new segment set of A is generated, while the segments of B are kept unchanged; for B: target segments are screened out based on the IV value of each segment of B and a new segment set of B is generated, while the segments of A are kept unchanged. The algorithm uses IV values to measure the different feature segments, so that the best segments are selected repeatedly until the threshold on the number of segments is reached.
When the updating of the segments is not needed anymore, step 4 is performed.
4. Cross-combining the target segments;
5. Outputting and storing the combined feature segment information, so that subsequent data can be processed directly.
A specific procedure of the embodiment of the present application is described below with a concrete example. First, user sample data is provided as shown in table 1; it includes the city feature of each user, the number of times each user clicked a piece of information, and each user's response to a certain promotional activity:
Table 1 user sample data
Feature cross-combining the city feature and the click frequency feature in table 1 to obtain table 2:
Table 2 Feature cross-combination results

| Combined feature segment | City | Number of clicks |
| --- | --- | --- |
| 1 | Shanghai | x ≤ 100 |
| 2 | Shanghai | 100 < x ≤ 300 |
| 3 | Shanghai | x > 300 |
| 4 | Beijing | x ≤ 100 |
| 5 | Beijing | 100 < x ≤ 300 |
| 6 | Beijing | x > 300 |
| 7 | Guangzhou/Shenzhen | x ≤ 100 |
| 8 | Guangzhou/Shenzhen | 100 < x ≤ 300 |
| 9 | Guangzhou/Shenzhen | x > 300 |
The leftmost column in Table 2 is the segment identifier of each combined feature; for example, combined feature segment 1 corresponds to the feature "Shanghai, x ≤ 100". Guangzhou and Shenzhen are merged into one feature segment, mainly because the two cities are relatively similar.
Based on the mapping relation in table 2, the information of each user in table 1 can be mapped to one combined feature segment according to that user's feature information, as shown in table 3:
Table 3 User sample feature cross results

| User identification | Label | City | Number of clicks | Combined feature segment |
| --- | --- | --- | --- | --- |
| 1 | 0 | Shanghai | 100 | 1 |
| 2 | 0 | Guangzhou | 200 | 8 |
| 3 | 0 | Beijing | 307 | 6 |
| 4 | 1 | Beijing | 103 | 5 |
| 5 | 0 | Shenzhen | 300 | 8 |
For example, user 4 is in Beijing and has 103 clicks. Beijing is first found in Table 2, and 103 falls within 100 < x ≤ 300, so user 4 is determined to correspond to combined feature segment 5.
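The lookup described for Table 2 can be sketched as follows. The encoding is an assumption made for illustration: `CITY_GROUPS` lists the city groups in the row order of Table 2, `CLICK_UPPER_BOUNDS` holds the closed upper bounds of the first two click segments, and the function name `combined_segment` is hypothetical.

```python
from bisect import bisect_left

# Assumed encoding of Table 2: city groups in row order, and the upper bounds
# of the click-count segments (x <= 100, 100 < x <= 300, x > 300).
CITY_GROUPS = [{"Shanghai"}, {"Beijing"}, {"Guangzhou", "Shenzhen"}]
CLICK_UPPER_BOUNDS = [100, 300]

def combined_segment(city, clicks):
    """Return the 1-based combined feature segment for one user."""
    # Raises StopIteration for a city that is not in any group.
    city_idx = next(i for i, group in enumerate(CITY_GROUPS) if city in group)
    # bisect_left puts a value equal to a bound into the lower segment,
    # which matches the closed upper bounds (x <= 100, x <= 300).
    click_idx = bisect_left(CLICK_UPPER_BOUNDS, clicks)
    return city_idx * (len(CLICK_UPPER_BOUNDS) + 1) + click_idx + 1

print(combined_segment("Beijing", 103))   # user 4 in Table 3 -> 5
print(combined_segment("Shenzhen", 300))  # user 5 in Table 3 -> 8
```

Because the segment number is computed from the two per-feature indices, the nine rows of Table 2 do not need to be stored explicitly; a stored table keyed by the index pair would work equally well.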
Similarly, for user data in the test set, the combined feature segment of each test user record can be determined, as shown in table 4:
Table 4 Feature cross results for test set users

| User identification | City | Number of clicks | Combined feature segment |
| --- | --- | --- | --- |
| 1 | Shanghai | 106 | 2 |
| 2 | Guangzhou | 25 | 7 |
| 3 | Beijing | 700 | 6 |
Therefore, for any user, the combined feature segment corresponding to the user can be determined automatically from the user's items of feature information, and the corresponding combined feature is thereby determined. A prediction model obtained by training a preset model on the combined features can produce more specific predictions. For the example above, it might predict that users whose city is Beijing and whose click count is between 100 and 300 are more likely to respond to the relevant promotional activity, rather than only making predictions for users who have one of the relevant single features.
Feature engineering is one of the means of effectively improving a machine learning model, and feature crossing is an important part of feature engineering: crossed features can often provide predictive power beyond what the individual features provide. The embodiments of the application mainly provide the ability to cross features of different types (discrete and continuous), in a user-defined or automatic manner, on large data volumes and for any number of dimensions (that is, given enough computing power and storage space, N features can be crossed at once).
According to the method and the device, under the condition of large data volume, the user features in any dimension can be automatically subjected to feature cross combination, so that a corresponding target combination feature set is generated, corresponding target combination features can be conveniently and directly determined from the target combination feature set according to the acquired user feature information, and therefore the machine learning model effect of training based on the target combination features can be enhanced.
The embodiment further provides a feature processing apparatus, referring to fig. 9, the apparatus includes:
an object information obtaining module 910, configured to obtain object information of a plurality of objects, where the object information of each object includes a tag of the object and a plurality of feature information corresponding to the object and a plurality of single features respectively;
the candidate segment set construction module 920 is configured to segment multiple pieces of feature information corresponding to multiple objects and the same single feature, so as to obtain candidate segment sets respectively corresponding to each single feature; wherein each candidate segment set comprises at least two candidate segments;
the target segment set construction module 930 is configured to screen candidate segments in each candidate segment set based on the labels of each object, to obtain target segment sets corresponding to each single feature respectively;
A target segment combining module 940, configured to combine target segments in each target segment set;
the target combined feature set construction module 950 is configured to construct a target combined feature set based on the combined result of each target segment.
Referring to fig. 10, the candidate segment set construction module 920 includes:
a feature type determination module 1010 for determining a feature type of each individual feature;
the segmentation processing module 1020 is configured to perform segmentation processing on feature information corresponding to the single feature according to a feature type of the single feature, so as to obtain a plurality of segments;
a first construction module 1030 is configured to construct the candidate segment set corresponding to the single feature based on a plurality of segments.
The feature types of the single feature include a numeric type and a category type, and accordingly, referring to fig. 11, the segmentation processing module 1020 includes:
a first segmentation module 1110, configured to segment the feature information based on a numerical range of feature information corresponding to the single feature when the feature type of the single feature is a numerical type, so as to obtain a plurality of segments;
and the second segmentation module 1120 is configured to segment the feature information based on the feature class included in the feature information corresponding to the single feature to obtain a plurality of segments when the feature type of the single feature is a class type.
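The type-dependent segmentation performed by these two modules might look like the sketch below. The text only states that numerical features are segmented by numerical range and category features by the categories present; the equal-width rule, the default of three segments, and all function names are assumptions chosen for illustration.

```python
def numeric_segments(values, n_segments=3):
    """Split [min, max] into n equal-width intervals (an assumed rule)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_segments
    edges = [lo + i * width for i in range(n_segments)] + [hi]
    # Each pair (low, high) is one candidate segment.
    return list(zip(edges[:-1], edges[1:]))

def category_segments(values):
    """One candidate segment per distinct category, in first-seen order."""
    return list(dict.fromkeys(values))

def candidate_segments(values):
    """Dispatch on feature type: all-number columns are numeric, others categorical."""
    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    return numeric_segments(values) if is_numeric else category_segments(values)

print(candidate_segments(["Shanghai", "Beijing", "Shanghai"]))
print(candidate_segments([0, 30, 90, 300]))
```

A real pipeline would likely take the feature type from metadata rather than infer it from the values, and would accept user-specified cut points in place of the equal-width rule.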
Referring to fig. 12, the target segment set construction module 930 includes:
a contribution value determining module 1210, configured to determine a contribution value of each candidate segment in the candidate segment set corresponding to each single feature based on the label of each object;
the target segment selection module 1220 is configured to select, based on the contribution value of each candidate segment, a candidate segment that meets a preset condition as a target segment; specifically, it is configured to sort the candidate segments in descending order of contribution value and select a preset number of top-ranked candidate segments as the target segments, or to select the candidate segments whose contribution value is larger than a preset value as the target segments.
A segment merging module 1230, configured to merge remaining candidate segments of the candidate segments except the target segment into the target segment;
a target segment set generating module 1240, configured to generate a target segment set corresponding to the single feature based on each target segment.
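The two selection rules of the target segment selection module can be sketched as below. The function name, the parameter names `top_k` and `min_value`, and the contribution values in the usage lines are hypothetical. The sketch only returns the remaining candidate segments; how they are merged into the target segments is not detailed in this passage and is left out.

```python
def select_target_segments(contributions, top_k=None, min_value=None):
    """Split {segment: contribution value} into (target, remaining) segments.

    Exactly one rule must be chosen: the top_k highest contribution values,
    or every contribution value strictly greater than min_value.
    """
    if (top_k is None) == (min_value is None):
        raise ValueError("specify exactly one of top_k or min_value")
    # Sort candidate segments from the largest to the smallest contribution.
    ranked = sorted(contributions.items(), key=lambda kv: kv[1], reverse=True)
    if top_k is not None:
        targets = [seg for seg, _ in ranked[:top_k]]
    else:
        targets = [seg for seg, value in ranked if value > min_value]
    remaining = [seg for seg, _ in ranked if seg not in targets]
    return targets, remaining

# Hypothetical contribution values for four candidate segments.
values = {"s1": 0.30, "s2": 0.05, "s3": 0.12, "s4": 0.01}
print(select_target_segments(values, top_k=2))
print(select_target_segments(values, min_value=0.04))
```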
Referring to fig. 13, the contribution value determining module 1210 includes:
a first determining module 1310, configured to determine, based on the tags of the objects, a total number of the responsive objects and a total number of the non-responsive objects;
A target object determining module 1320, configured to determine, for each candidate segment in the candidate segment set corresponding to each single feature, a target object in each candidate segment, where the target object includes a responsive object and an unresponsive object;
a second determining module 1330 configured to determine, based on the labels of the target objects, the number of responsive objects and the number of non-responsive objects in each candidate segment, respectively;
a code value determining module 1340, configured to determine an encoded value of each candidate segment based on the number of responsive objects and the number of non-responsive objects in each candidate segment, the total number of responsive objects, and the total number of non-responsive objects;
an information value calculation module 1350, configured to calculate an information value of each candidate segment based on the encoded value of each candidate segment;
a third determining module 1360, configured to determine the information value of each candidate segment as the contribution value of that candidate segment.
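The chain of modules above can be illustrated with the commonly used weight-of-evidence (WOE) and information value (IV) formulas. The text does not spell out the formulas, so this sketch assumes the standard definitions: WOE is the logarithm of the ratio between a segment's share of responsive objects and its share of non-responsive objects, and the segment's IV is the difference of those shares multiplied by the WOE. The function name and the `floor` constant that replaces zero counts are also assumptions.

```python
import math

def segment_information_values(segments, labels, floor=0.5):
    """Return {segment: (woe, iv)} from per-object segment ids and 0/1 labels."""
    total_resp = sum(labels)
    total_nonresp = len(labels) - total_resp
    if total_resp == 0 or total_nonresp == 0:
        raise ValueError("need at least one responsive and one non-responsive object")

    # Count responsive / non-responsive objects inside each candidate segment.
    counts = {}
    for seg, label in zip(segments, labels):
        resp, nonresp = counts.get(seg, (0, 0))
        counts[seg] = (resp + label, nonresp + (1 - label))

    result = {}
    for seg, (resp, nonresp) in counts.items():
        # Zero counts are replaced by a small constant so the log is defined
        # (an assumption; other smoothing schemes are possible).
        p_resp = max(resp, floor) / total_resp
        p_nonresp = max(nonresp, floor) / total_nonresp
        woe = math.log(p_resp / p_nonresp)
        result[seg] = (woe, (p_resp - p_nonresp) * woe)
    return result

iv = segment_information_values(
    ["a", "a", "a", "b", "b", "b"], [1, 1, 0, 1, 0, 0]
)
print(iv)
```

The sign convention of the WOE varies between references, but the IV of a segment is the same under either convention because both factors change sign together.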
Referring to fig. 14, the target segment combining module 940 includes:
an exhaustion module 1410, configured to exhaust a combination manner of the target segments in each target segment set based on a preset target segment combination method; the target segment combination method is to respectively take one target segment from each target segment set for combination;
The combined feature generation module 1420 is configured to obtain a plurality of combined features based on the exhaustive result of the combination of the target segments.
Referring to fig. 15, the apparatus further includes a model training module 1500, the model training module 1500 includes:
a first obtaining module 1510, configured to obtain object information of a test object, where the object information includes a tag of the test object and multiple feature information corresponding to multiple single features of the test object;
a fourth determining module 1520, configured to determine, based on a plurality of feature information corresponding to the test object and the plurality of single features, a corresponding target combined feature in the target combined feature set;
the first training module 1530 is configured to train a preset machine learning model based on the target combined feature and the label of the test object.
The device provided in the above embodiment can execute the method provided in any embodiment of the present application, and has the corresponding functional modules and beneficial effects of executing the method. Technical details not described in detail in the above embodiments may be found in the methods provided in any of the embodiments of the present application.
The present embodiment also provides a computer readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, the at least one instruction, at least one program, set of codes, or set of instructions loaded by a processor and performing any of the methods described above in the present embodiment.
The embodiment also provides a device; refer to fig. 16 for its structural diagram. The device 1600 may vary considerably depending on configuration or performance, and may include one or more central processing units (CPU) 1622 (e.g., one or more processors), memory 1632, and one or more storage media 1630 (e.g., one or more mass storage devices) storing applications 1642 or data 1644. The memory 1632 and the storage medium 1630 may be transitory or persistent storage. The program stored on the storage medium 1630 may include one or more modules (not shown), each of which may include a series of instruction operations for the device. Further, the central processing unit 1622 may be configured to communicate with the storage medium 1630 and to execute, on the device 1600, the series of instruction operations in the storage medium 1630. The device 1600 may also include one or more power supplies 1626, one or more wired or wireless network interfaces 1650, one or more input/output interfaces 1658, and/or one or more operating systems 1641, e.g., Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc. Any of the methods described above for this embodiment may be implemented based on the device shown in fig. 16.
The present specification provides the method operation steps as described in the embodiments or flowcharts, but more or fewer operation steps may be included on the basis of conventional or non-inventive effort. The steps and order recited in the embodiments are merely one of many possible orders of execution and do not represent the only order. When an actual system or terminal product executes, the methods illustrated in the embodiments or figures may be performed sequentially or in parallel (e.g., in a parallel-processor or multi-threaded environment).
The structures shown in this embodiment are only partial structures related to the present application and do not constitute limitations of the apparatus to which the present application is applied, and a specific apparatus may include more or less components than those shown, or may combine some components, or may have different arrangements of components. It should be understood that the methods, apparatuses, etc. disclosed in the embodiments may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, and the division of the modules is merely a division of one logic function, and may be implemented in other manners, such as multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or unit modules.
Based on such understanding, the technical solution of the present application, in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium and including several instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The above embodiments are merely for illustrating the technical solution of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions.

Claims (14)

1. A feature processing method, comprising:
acquiring object information of a plurality of objects, wherein the object information of each object comprises a label of the object and a plurality of items of feature information of the object, each corresponding to one of a plurality of single features;
segmenting a plurality of objects and a plurality of pieces of feature information corresponding to the same single feature to obtain candidate segment sets respectively corresponding to each single feature; wherein each candidate segment set comprises at least two candidate segments;
determining the total number of the responding objects and the total number of the non-responding objects based on the labels of the objects; for each candidate segment in the candidate segment set corresponding to each single feature, determining a target object in each candidate segment, wherein the target object comprises a response object and an unresponsive object; determining the number of the response objects and the number of the non-response objects in each candidate segment based on the labels of the target objects; determining evidence weights for each candidate segment based on the number of responsive objects, the number of non-responsive objects, the total number of responsive objects, and the total number of non-responsive objects in each candidate segment; calculating the information value of each candidate segment based on the evidence weight of each candidate segment; determining the information value of each segment as the contribution value of each candidate segment;
Selecting candidate segments meeting preset conditions as target segments based on the contribution values of the candidate segments; merging remaining candidate segments of the candidate segments except the target segment into the target segment; generating a target segment set corresponding to the single feature based on each target segment;
combining the target segments in each target segment set;
and constructing a target combination feature set based on the combination result of each target segment.
2. The method according to claim 1, wherein segmenting the plurality of pieces of feature information corresponding to the same single feature from the plurality of objects to obtain the candidate segment sets corresponding to each single feature respectively comprises:
determining a feature type of each single feature;
according to the feature type of the single feature, carrying out segmentation processing on feature information corresponding to the single feature to obtain a plurality of segments;
the set of candidate segments corresponding to the single feature is constructed based on a plurality of segments.
3. A feature processing method according to claim 2, wherein the feature types of the single feature include a numeric type and a category type;
correspondingly, the step of carrying out segmentation processing on the feature information corresponding to the single feature according to the feature type of the single feature to obtain a plurality of segments comprises the following steps:
When the feature type of the single feature is a numerical value type, segmenting the feature information based on the numerical value range of the feature information corresponding to the single feature to obtain a plurality of segments;
and when the feature type of the single feature is a category type, segmenting the feature information based on the feature category contained in the feature information corresponding to the single feature to obtain a plurality of segments.
4. The method of claim 1, wherein selecting the candidate segment meeting the preset condition as the target segment based on the contribution value of each candidate segment comprises:
sequencing the candidate segments according to the sequence from big to small of the contribution value of the candidate segments, and selecting a preset number of candidate segments with the front sequencing as the target segments;
or,
and selecting the candidate segment with the contribution value larger than a preset value as the target segment.
5. The method of claim 1, wherein combining the target segments in each set of target segments comprises:
based on a preset target segment combination method, the combination modes of target segments in each target segment set are exhausted; the target segment combination method is to respectively take one target segment from each target segment set for combination;
Based on the exhaustive result of the target segment combination, a plurality of combination features are obtained.
6. A method of feature processing as claimed in claim 1, further comprising:
acquiring object information of a test object, wherein the object information comprises a label of the test object and multiple items of characteristic information respectively corresponding to the test object and multiple single characteristics;
determining corresponding target combination features in the target combination feature set based on multiple feature information respectively corresponding to the test object and the multiple single features;
training a preset machine learning model based on the target combination features and the labels of the test objects.
7. A feature processing apparatus, comprising:
the object information acquisition module is used for acquiring object information of a plurality of objects, wherein the object information of each object comprises a label of the object and a plurality of items of feature information of the object, each corresponding to one of a plurality of single features;
the candidate segment set construction module is used for segmenting a plurality of pieces of characteristic information corresponding to a plurality of objects and the same single characteristic to obtain candidate segment sets respectively corresponding to each single characteristic; wherein each candidate segment set comprises at least two candidate segments;
The target segment set construction module is used for determining the total number of the responding objects and the total number of the non-responding objects based on the labels of the objects; for each candidate segment in the candidate segment set corresponding to each single feature, determining a target object in each candidate segment, wherein the target object comprises a response object and an unresponsive object; determining the number of the response objects and the number of the non-response objects in each candidate segment based on the labels of the target objects; determining evidence weights for each candidate segment based on the number of responsive objects, the number of non-responsive objects, the total number of responsive objects, and the total number of non-responsive objects in each candidate segment; calculating the information value of each candidate segment based on the evidence weight of each candidate segment; determining the information value of each segment as the contribution value of each candidate segment; selecting candidate segments meeting preset conditions as target segments based on the contribution values of the candidate segments; merging remaining candidate segments of the candidate segments except the target segment into the target segment; generating a target segment set corresponding to the single feature based on each target segment;
The target segment combination module is used for combining target segments in each target segment set;
and the target combined feature set construction module is used for constructing a target combined feature set based on the combined result of each target segment.
8. The apparatus of claim 7, wherein the candidate segment set construction module comprises:
the feature type determining module is used for determining the feature type of each single feature;
the segmentation processing module is used for carrying out segmentation processing on the feature information corresponding to the single feature according to the feature type of the single feature to obtain a plurality of segments;
a first construction module for constructing the candidate segment set corresponding to the single feature based on a plurality of segments.
9. The apparatus of claim 8, wherein the feature type of the single feature comprises a numeric type and a category type, and wherein the segmentation processing module comprises:
the first segmentation module is used for segmenting the feature information based on the numerical range of the feature information corresponding to the single feature when the feature type of the single feature is a numerical type, so as to obtain a plurality of segments;
and the second segmentation module is used for segmenting the feature information based on the feature category contained in the feature information corresponding to the single feature to obtain a plurality of segments when the feature type of the single feature is a category type.
10. The apparatus of claim 7, wherein the target segment set construction module is configured to rank the candidate segments in order of from a higher to a lower contribution value of each candidate segment, and select a preset number of candidate segments ranked first as the target segment; or selecting the candidate segment with the contribution value larger than a preset value as the target segment.
11. The apparatus of claim 7, wherein the target segment combining module comprises:
the exhaustion module is used for exhausting the combination modes of the target segments in each target segment set based on a preset target segment combination method; the target segment combination method is to respectively take one target segment from each target segment set for combination;
and the combined characteristic generating module is used for obtaining a plurality of combined characteristics based on the exhaustion result of the target segment combination.
12. The apparatus of claim 7, further comprising a model training module, the model training module comprising:
the first acquisition module is used for acquiring object information of a test object, wherein the object information comprises a label of the test object and multiple pieces of characteristic information respectively corresponding to the test object and multiple single characteristics;
A fourth determining module, configured to determine, based on multiple feature information corresponding to the test object and multiple single features, a corresponding target combined feature in the target combined feature set;
and the first training module is used for training a preset machine learning model based on the target combination characteristics and the labels of the test objects.
13. A computer device, characterized in that it comprises a processor and a memory in which at least one instruction, at least one program, a code set or an instruction set is stored, which is loaded and executed by the processor to implement the feature processing method according to any one of claims 1 to 6.
14. A computer-readable storage medium, characterized in that at least one instruction, at least one program, a code set or an instruction set is stored in the storage medium, the at least one instruction, at least one program, code set or instruction set being loaded and executed by a processor to implement the feature processing method according to any one of claims 1 to 6.
CN201911029966.7A 2019-10-28 2019-10-28 Feature processing method, device and storage medium Active CN110837894B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911029966.7A CN110837894B (en) 2019-10-28 2019-10-28 Feature processing method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911029966.7A CN110837894B (en) 2019-10-28 2019-10-28 Feature processing method, device and storage medium

Publications (2)

Publication Number Publication Date
CN110837894A CN110837894A (en) 2020-02-25
CN110837894B true CN110837894B (en) 2024-02-13

Family

ID=69575625

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911029966.7A Active CN110837894B (en) 2019-10-28 2019-10-28 Feature processing method, device and storage medium

Country Status (1)

Country Link
CN (1) CN110837894B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113656697B (en) * 2021-08-24 2023-12-12 北京字跳网络技术有限公司 Object recommendation method, device, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018086470A1 (en) * 2016-11-10 2018-05-17 腾讯科技(深圳)有限公司 Keyword extraction method and device, and server
CN109598095A (en) * 2019-01-07 2019-04-09 平安科技(深圳)有限公司 Method for building up, device, computer equipment and the storage medium of scorecard model
CN109815267A (en) * 2018-12-21 2019-05-28 天翼征信有限公司 The branch mailbox optimization method and system, storage medium and terminal of feature in data modeling
CN110163378A (en) * 2019-03-04 2019-08-23 腾讯科技(深圳)有限公司 Characteristic processing method, apparatus, computer readable storage medium and computer equipment
CN110263265A (en) * 2019-04-10 2019-09-20 腾讯科技(深圳)有限公司 User tag generation method, device, storage medium and computer equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
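Fu Tao; Sun Wenjing; Sun Yamin. FCM algorithm based on binning statistics and its application in network intrusion detection (基于分箱统计的FCM算法及其在网络入侵检测中的应用). Computer Science (计算机科学), 2008, pp. 36-39. *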

Also Published As

Publication number Publication date
CN110837894A (en) 2020-02-25

Similar Documents

Publication Publication Date Title
CN110162593B (en) Search result processing and similarity model training method and device
CN109902849B (en) User behavior prediction method and device, and behavior prediction model training method and device
CN105095219B (en) Micro-blog recommendation method and terminal
CN106651057B (en) Mobile terminal user age prediction method based on installation package sequence list
US11514063B2 (en) Method and apparatus of recommending information based on fused relationship network, and device and medium
CN104778186B (en) Merchandise items are mounted to the method and system of standardized product unit
US20220075838A1 (en) Taxonomy-based system for discovering and annotating geofences from geo-referenced data
CN110009486B (en) Method, system, equipment and computer readable storage medium for fraud detection
WO2021068563A1 (en) Sample date processing method, device and computer equipment, and storage medium
WO2019233077A1 (en) Ranking of business object
CN112905897A (en) Similar user determination method, vector conversion model, device, medium and equipment
CN108389113B (en) Collaborative filtering recommendation method and system
CN111582912A (en) Portrait modeling method based on deep embedding clustering algorithm
CN110837894B (en) Feature processing method, device and storage medium
CN111737584B (en) Updating method and device of behavior prediction system
CN113128526A (en) Image recognition method and device, electronic equipment and computer-readable storage medium
CN111445280A (en) Model generation method, restaurant ranking method, system, device and medium
CN113743968A (en) Information delivery method, device and equipment
CN111667018A (en) Object clustering method and device, computer readable medium and electronic equipment
WO2023284516A1 (en) Information recommendation method and apparatus based on knowledge graph, and device, medium, and product
Yang et al. An academic social network friend recommendation algorithm based on decision tree
CN115376668A (en) Big data business analysis method and system applied to intelligent medical treatment
CN109033078A (en) The recognition methods of sentence classification and device, storage medium, processor
CN114330519A (en) Data determination method and device, electronic equipment and storage medium
CN114693404A (en) Collaborative measurement-based commodity personalized recommendation method and system

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40021486; Country of ref document: HK)
SE01 Entry into force of request for substantive examination
GR01 Patent grant