CN113792952A - Method and apparatus for generating a model - Google Patents

Method and apparatus for generating a model

Info

Publication number
CN113792952A
CN113792952A
Authority
CN
China
Prior art keywords
feature
evaluation index
model
evaluation
features
Prior art date
Legal status
Pending
Application number
CN202110200440.1A
Other languages
Chinese (zh)
Inventor
陈伯梁
Current Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Wodong Tianjun Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN202110200440.1A
Publication of CN113792952A


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/067Enterprise or organisation modelling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0639Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q10/06393Score-carding, benchmarking or key performance indicator [KPI] analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/02Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201Market modelling; Market analysis; Collecting market data
    • G06Q30/0202Market predictions or forecasting for commercial activities

Abstract

Embodiments of the present disclosure disclose methods, apparatuses, devices and storage media for generating a model. One embodiment of the method comprises: acquiring an original feature set; calculating an evaluation index for each feature in the set and sorting the features by evaluation index in descending order to obtain a feature sequence; determining a mutation point of the evaluation index in the feature sequence as an evaluation index threshold; screening out from the feature sequence the features whose evaluation indexes are larger than the threshold as features for model training; obtaining a training sample set from those features; and training a model with the training sample set to obtain a trained model. The method and apparatus can improve the accuracy and efficiency of feature screening, so that a highly accurate model can be trained.

Description

Method and apparatus for generating a model
Technical Field
Embodiments of the present disclosure relate to the field of computer technologies, and in particular, to a method and an apparatus for generating a model.
Background
In existing model training processes, the huge number of features makes training difficult. In a typical case, such as an e-commerce customer gender tag prediction scenario, the model features include user attributes, user level, attributes of the commodities the user has ordered, and user behavior features such as ordering, adding to cart, clicking and browsing. The behavior features in particular are extremely sparse: each user purchases and clicks only a handful of commodities across the different categories, so the remaining features are 0, which leads to the problem of a huge feature quantity.
Disclosure of Invention
Embodiments of the present disclosure propose methods and apparatuses for generating models.
In a first aspect, an embodiment of the present disclosure provides a method for generating a model, including: acquiring an original feature set; calculating an evaluation index for each feature in the feature set and sorting the features by evaluation index in descending order to obtain a feature sequence; determining a mutation point of the evaluation index from the feature sequence as an evaluation index threshold; screening out from the feature sequence the features whose evaluation indexes are larger than the evaluation index threshold as features for model training; obtaining a training sample set according to the features for model training; and performing model training with the training sample set to obtain a trained model.
In some embodiments, determining a mutation point of the evaluation index from the feature sequence as an evaluation index threshold comprises: taking the subscript index of each sorted feature as the abscissa and the evaluation index of each feature as the ordinate to locate each feature in a coordinate system, and calculating the slope between each feature and the previous feature as the slope corresponding to that feature; taking the subscript index of each sorted feature as the independent variable and the corresponding slope as the dependent variable, performing polynomial function fitting to obtain a fitted curve of all feature slopes; and analyzing the slope mutation point of the fitted curve, taking the evaluation index corresponding to the slope mutation point as the evaluation index threshold.
In some embodiments, determining a mutation point of the evaluation index from the feature sequence as an evaluation index threshold comprises: taking the subscript index of each sorted feature as the abscissa and the evaluation index of each feature as the ordinate to locate each feature in a coordinate system, and calculating the slope between each feature and the other features; finding a target feature in the feature sequence such that it divides the sequence into a first interval and a second interval and the ratio of the average slope between the features in the first interval to the average slope between the features in the second interval is maximized; and determining the evaluation index corresponding to the target feature as the evaluation index threshold.
In some embodiments, the method further comprises: and performing significance test on each feature according to the evaluation index.
In some embodiments, the model is a gender prediction model, the input to the model is a characteristic of the user, and the output is the gender of the user.
In some embodiments, the method further comprises: performing performance evaluation on the trained model to obtain an evaluation result; and if the evaluation result does not meet the target expectation, re-determining the evaluation index threshold.
In some embodiments, the evaluation index includes a chi-squared value or an information entropy.
In a second aspect, an embodiment of the present disclosure provides an apparatus for generating a model, including: an acquisition unit configured to acquire an original feature set; a computing unit configured to compute an evaluation index for each feature in the feature set and sort the features by evaluation index in descending order to obtain a feature sequence; a determining unit configured to determine a mutation point of the evaluation index from the feature sequence as an evaluation index threshold; a screening unit configured to screen out from the feature sequence the features whose evaluation indexes are larger than the evaluation index threshold as features for model training; and a training unit configured to obtain a training sample set according to the features for model training and perform model training with the training sample set to obtain a trained model.
In some embodiments, the determining unit is further configured to: take the subscript index of each sorted feature as the abscissa and the evaluation index of each feature as the ordinate to locate each feature in a coordinate system, and calculate the slope between each feature and the previous feature as the slope corresponding to that feature; take the subscript index of each sorted feature as the independent variable and the corresponding slope as the dependent variable, and perform polynomial function fitting to obtain a fitted curve of all feature slopes; and analyze the slope mutation point of the fitted curve, taking the evaluation index corresponding to the slope mutation point as the evaluation index threshold.
In some embodiments, the determining unit is further configured to: take the subscript index of each sorted feature as the abscissa and the evaluation index of each feature as the ordinate to locate each feature in a coordinate system, and calculate the slope between each feature and the other features; find a target feature in the feature sequence such that it divides the sequence into a first interval and a second interval and the ratio of the average slope between the features in the first interval to the average slope between the features in the second interval is maximized; and determine the evaluation index corresponding to the target feature as the evaluation index threshold.
In some embodiments, the apparatus further comprises a verification unit configured to: and performing significance test on each feature according to the evaluation index.
In some embodiments, the model is a gender prediction model, the input to the model is a characteristic of the user, and the output is the gender of the user.
In some embodiments, the apparatus further comprises an evaluation unit configured to: performing performance evaluation on the trained model to obtain an evaluation result; and if the evaluation result does not meet the target expectation, re-determining the evaluation index threshold.
In some embodiments, the evaluation index includes a chi-squared value or an information entropy.
In a third aspect, an embodiment of the present disclosure provides an electronic device for generating a model, including: one or more processors; a storage device having one or more programs stored thereon which, when executed by one or more processors, cause the one or more processors to implement a method as in any one of the first aspects.
In a fourth aspect, embodiments of the disclosure provide a computer readable medium having a computer program stored thereon, wherein the program when executed by a processor implements a method as in any one of the first aspect.
According to the method and apparatus for generating a model, the evaluation indexes of the features are calculated and the mutation point of the evaluation indexes is found for feature screening, which improves the accuracy and efficiency of feature screening. The training time of the model can therefore be reduced, and a highly accurate model can be obtained even when fewer features are used for training.
Drawings
Other features, objects and advantages of the disclosure will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present disclosure may be applied;
FIG. 2 is a flow diagram of one embodiment of a method for generating a model according to the present disclosure;
FIGS. 3a-3c are schematic diagrams of evaluation index mutation points in the method for generating a model according to the present disclosure;
FIG. 4 is a flow diagram of yet another embodiment of a method for generating a model according to the present disclosure;
FIG. 5 is a schematic diagram of an embodiment of an apparatus for generating models according to the present disclosure;
FIG. 6 is a schematic block diagram of a computer system suitable for use with an electronic device implementing embodiments of the present disclosure.
Detailed Description
The present disclosure is described in further detail below with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
FIG. 1 illustrates an exemplary system architecture 100 to which the methods and apparatus for generating models of embodiments of the present application may be applied.
As shown in fig. 1, the system architecture 100 may include terminals 101, 102, a network 103, a database server 104, and a server 105. The network 103 serves as a medium for providing communication links between the terminals 101, 102, the database server 104 and the server 105. Network 103 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user 110 may use the terminals 101, 102 to interact with the server 105 over the network 103 to receive or send messages or the like. The terminals 101, 102 may have various client applications installed thereon, such as a model training application, a shopping application, a payment application, a web browser, an instant messenger, and the like.
Here, the terminals 101 and 102 may be hardware or software. When the terminals 101 and 102 are hardware, they may be various electronic devices with display screens, including but not limited to smart phones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), laptop portable computers, desktop computers, and the like. When the terminals 101 and 102 are software, they can be installed in the electronic devices listed above. It may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module. And is not particularly limited herein.
Database server 104 may be a database server that provides various services. For example, a database server may have a sample set stored therein. The sample set contains a large number of samples. Wherein each sample may include various characteristics. In this way, the user 110 may also select characteristics of the sample from the sample set stored by the database server 104 via the terminals 101, 102.
The server 105 may also be a server providing various services, such as a background server providing support for various applications displayed on the terminals 101, 102. The background server may train the initial model using the characteristics of the samples in the sample set sent by the terminals 101, 102, and may send the training results (e.g., the generated shopper gender identification model) to the terminals 101, 102. In this way, the user may apply the generated model for gender detection.
Here, the database server 104 and the server 105 may be hardware or software. When they are hardware, they can be implemented as a distributed server cluster composed of a plurality of servers, or as a single server. When they are software, they may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein.
It should be noted that the method for generating the model provided in the embodiment of the present application is generally performed by the server 105. Accordingly, the means for generating the model is also typically provided in the server 105.
It is noted that database server 104 may not be provided in system architecture 100, as server 105 may perform the relevant functions of database server 104.
It should be understood that the number of terminals, networks, database servers, and servers in fig. 1 are merely illustrative. There may be any number of terminals, networks, database servers, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a method for generating a model according to the present disclosure is shown. The method for generating the model comprises the following steps:
step 201, an original feature set is obtained.
In this embodiment, the executing entity of the method for generating a model (e.g., the server shown in FIG. 1) may obtain the original feature set from the database server through a wired or wireless connection. Models for different purposes involve different original feature sets. Taking a gender identification model for e-commerce users as an example, gender-related features of the users are extracted and the feature records are labeled to obtain the model training data; the extracted original feature set covers the following:
1. user-related features:
v. user attributes
V user level
Age of user
User preference
Value score of users
...
2. User commodity related features:
commodity colour for ordering by user
Sales volume for check-out by the user
Commodity hot sales degree for check-out by user
Weight of commodity ordered by user
Commodity size for check-out by user
Number of commodity clicks for user to place order
...
3. User behavior related cross-features:
The user's purchase, click and browse behaviors over all commodities, multi-level categories and brands can yield tens of millions of such cross features.
The data is then labeled with the user's gender, as shown in Table 1 below:
TABLE 1 (contents not reproduced in this text)
Step 202, calculating an evaluation index for each feature in the feature set, and sorting the features by evaluation index in descending order to obtain a feature sequence.
In this embodiment, the evaluation index may be a chi-squared value, information entropy, or the like; the evaluation indexes used by existing filter methods, and their calculation methods, may be referred to. The features are then sorted by evaluation index in descending order; in the resulting feature sequence the subscript indexes run from small to large, i.e., a feature with a larger evaluation index has a smaller subscript index.
The chi-squared values calculated from the data in table 1 are shown in table 2, and the ranking results are shown in table 3.
TABLE 2 (contents not reproduced in this text)
TABLE 3 (contents not reproduced in this text)
Chi-squared calculation is used to screen all of the commodity, category and brand features, because this part contains a huge number of features, many of them useless. The calculation method is as follows:
In the table below, each row of the training set gives a user name, several commodity features, and a gender label (1: male, 0: female):
User      Shaver   Notebook   Lipstick   Gender
Zhang m     2         1          0         1
Wang h      1         0          0         1
Li y        0         1          4         0
...        ...       ...        ...       ...
TABLE 4
First, count the number of users of each gender for each feature field:
                   Male     Female    Total
Has the feature      A         B       A + B
Lacks the feature    C         D       C + D
Total               A + C     B + D      N
TABLE 5
Then compute the chi-squared statistic according to its formula (where A, B, C, D and N are the corresponding cells of the table above):

χ² = N(AD − BC)² / [(A + B)(C + D)(A + C)(B + D)]

This yields the chi-squared value of each commodity feature.
Other evaluation indexes can be calculated as well.
For example, in the information gain method, the information gain is the mutual information between the classes and the features in the training set. The procedure differs slightly from the chi-squared calculation in that it is computed directly on the two-dimensional training table:
(1) Input the training data set D.
(2) Compute the empirical entropy of the data set D:

H(D) = − Σ_{k=1}^{K} (|C_k| / |D|) · log₂(|C_k| / |D|)

(where C_k is the set of samples belonging to class k)
(3) Compute the empirical conditional entropy H(D|A) of feature A on the data set D:

H(D|A) = − Σ_{i=1}^{n} (|D_i| / |D|) Σ_{k=1}^{K} (|D_{ik}| / |D_i|) · log₂(|D_{ik}| / |D_i|)

(where K is the number of classes, n is the number of distinct values of feature A, D_i is the subset of records taking the i-th value of A, and D_{ik} is its intersection with class C_k)
(4) Compute the information gain: g(D, A) = H(D) − H(D|A)
(5) Output the information gain g(D, A) of feature A on the training data set D.
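Steps (1)–(5) can be sketched in a few lines (a minimal sketch; the function names `entropy` and `info_gain` and the toy data are ours):

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Empirical entropy H(D) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(labels, feature_values):
    """g(D, A) = H(D) - H(D|A): D is given as per-record class labels,
    A as the per-record value of the feature."""
    groups = defaultdict(list)
    for v, y in zip(feature_values, labels):
        groups[v].append(y)
    h_cond = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - h_cond

# a feature that perfectly separates the two classes gains one full bit
print(info_gain([1, 1, 0, 0], [1, 1, 0, 0]))  # → 1.0
```

An uninformative feature (its value independent of the class) yields a gain of 0, so ranking by information gain orders features the same way chi-squared ranking does in spirit.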
And step 203, determining a mutation point of the evaluation index from the characteristic sequence as an evaluation index threshold value.
In this embodiment, the evaluation index may be used directly as the dependent variable and the subscript index of each sorted feature as the independent variable; polynomial function fitting then yields a curve of the evaluation index, and the mutation point of the evaluation index can be solved mathematically. Alternatively, the problem of finding the mutation point of the evaluation index can be converted into finding the mutation point of the slope: the curve first drops steeply, and after a certain point the rate of descent suddenly slows and the curve gradually flattens. That turning point is the slope mutation point.
In some optional implementations of this embodiment, determining a mutation point of the evaluation index from the feature sequence as an evaluation index threshold includes: using the subscript index of each sorted feature as an abscissa, using the evaluation index of each feature as an ordinate to determine the position of each feature in a coordinate system, and calculating the slope between each feature and other features; finding a target feature from the feature sequence such that the feature sequence is divided into a first interval and a second interval by the target feature, wherein a ratio of an average slope between features in the first interval to an average slope between features in the second interval is the largest; and determining the evaluation index corresponding to the target feature as an evaluation index threshold value.
The downward trend of the class discrimination is represented by the slope of the line segment between two points:

k(i, i + n) = (y_{i+n} − y_i) / n

This parameter describes the general variation trend of the class discrimination over the interval [i, i + n].

The slope mutation point can be calculated by the following formula:

t = argmax_i |k(1, i)| / |k(i, i + 1)|

where k(1, i) = (y_i − y_1) / (i − 1) is the slope from the 1st point to the i-th point, i.e., the average rate of change over the points with ID {1, 2, ..., i}, and k(i, i + 1) = y_{i+1} − y_i is the slope from the i-th point to the (i+1)-th point, i.e., the average slope over the points with ID {i, i + 1}.

The local mutation point is solved by this method.
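This local search can be sketched directly over a descending list of evaluation indexes (a sketch under our reading of the formula above; the function name and toy scores are ours):

```python
def local_mutation_point(scores):
    """scores: evaluation indexes sorted in descending order.
    Returns the 0-based position i maximizing
    |average slope over points 0..i| / |slope from i to i+1|."""
    best_i, best_ratio = None, -1.0
    for i in range(1, len(scores) - 1):
        afore = abs(scores[i] - scores[0]) / i           # average rate of change up to i
        after = abs(scores[i + 1] - scores[i]) or 1e-12  # slope of the next segment
        if afore / after > best_ratio:
            best_ratio, best_i = afore / after, i
    return best_i

# the curve drops steeply, then flattens after the 4th value
scores = [100, 90, 80, 5, 4, 3]
print(local_mutation_point(scores))  # → 3 (evaluation index 5 becomes the threshold)
```

The returned position is where the steep descent gives way to the flat tail, so the evaluation index at that position serves as the screening threshold.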
In some optional implementations of this embodiment, determining a mutation point of the evaluation index from the feature sequence as an evaluation index threshold includes: taking the subscript index of each sequenced feature as an abscissa, taking the evaluation index of each feature as an ordinate to determine the position of each feature in a coordinate system, and calculating the slope between each feature and the previous feature as the slope corresponding to each feature; using the subscript index of each sorted characteristic as an independent variable and the corresponding slope as a dependent variable, and performing polynomial function fitting to obtain fitting curves of all characteristic slopes; and analyzing a slope mutation point of the fitting curve, and taking an evaluation index corresponding to the slope mutation point as an evaluation index threshold value.
First, the slope is calculated. After the descending sort, the slope of the line segment formed in the coordinate graph by each feature and its previous feature is computed as the slope corresponding to the current feature.

The calculation formula is as follows:

k_i = (y_i − y_{i−1}) / (x_i − x_{i−1})

where x_i denotes the subscript index of the current feature, y_i the evaluation index of the current feature, x_{i−1} the subscript index of the previous feature, and y_{i−1} the evaluation index of the previous feature. Since the first feature has no previous feature, the slope corresponding to the first feature is set to a very large value.
The calculated slopes are shown in table 6:
TABLE 6 (contents not reproduced in this text)
Then, taking the subscript index of each sorted feature as the independent variable and the corresponding slope as the dependent variable, polynomial function fitting is performed to obtain the fitted slope function F(x) over all features. The fitting may be performed with an optimization algorithm such as stochastic gradient descent to obtain the polynomial.
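The patent does not fix a particular fitting routine; as one self-contained possibility, a least-squares polynomial fit can be done via the normal equations (a sketch of our own — adequate for the low degrees used here, though numerically fragile for high degrees):

```python
def polyfit(xs, ys, degree):
    """Least-squares polynomial fit: returns coeffs with coeffs[i] * x**i."""
    m = degree + 1
    # normal equations: A @ coeffs = b
    A = [[float(sum(x ** (i + j) for x in xs)) for j in range(m)] for i in range(m)]
    b = [float(sum(y * x ** i for x, y in zip(xs, ys))) for i in range(m)]
    # Gaussian elimination with partial pivoting
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, m):
            f = A[r][col] / A[col][col]
            for c in range(col, m):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    coeffs = [0.0] * m
    for i in reversed(range(m)):
        coeffs[i] = (b[i] - sum(A[i][j] * coeffs[j] for j in range(i + 1, m))) / A[i][i]
    return coeffs

# recover an exact quadratic: y = 1 + x + x^2
coeffs = polyfit([0, 1, 2, 3], [1, 3, 7, 13], degree=2)
print(coeffs)  # ≈ [1.0, 1.0, 1.0]
```

In practice a library routine (e.g. a NumPy polynomial fit) would be used instead; the point is only that fitting yields a smooth F(x) that can then be analyzed for its mutation point.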
Applying the global mutation point analysis method provided by the invention, the chi-squared threshold is calculated by the following formula:

T = argmax_t [ (1 / (t − min)) ∫_min^t F(x) dx ] / [ (1 / (max − t)) ∫_t^max F(x) dx ]

where min and max denote the smallest and largest subscript indexes of the features sorted by evaluation index, and t denotes the mutation point to be solved.
[Equation derivation procedure]:
Fit a curve to obtain an approximate functional representation, denoted F(x).
To find the global mutation point, the problem is converted into comparing the average slope from a certain point back to the first point with the average slope from that point to the last point; the former is denoted afore_fun and the latter after_fun. Since F(x) is a continuous function, the average slope over an interval can be computed by integration:

afore_fun = (1 / (t − min)) ∫_min^t F(x) dx

after_fun = (1 / (max − t)) ∫_t^max F(x) dx

target = afore_fun / after_fun    (7)

Thus the final calculation formula is obtained:

T = argmax_t [ (1 / (t − min)) ∫_min^t F(x) dx ] / [ (1 / (max − t)) ∫_t^max F(x) dx ]

where T denotes the feature ID and evaluation index value corresponding to the final global mutation point, min and max denote the smallest and largest subscript indexes of the features sorted by evaluation index, t denotes the mutation point to be solved, and F(x) denotes the fitted function of the evaluation index slope.
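Formula (7) can be made concrete with trapezoidal integration and a grid search over t (a runnable sketch; the discontinuous test function stands in for a fitted F(x) and is our own, not from the patent):

```python
def mean_value(f, lo, hi, steps=1000):
    """Average value of f on [lo, hi] via trapezoidal integration."""
    h = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi)) + sum(f(lo + k * h) for k in range(1, steps))
    return total * h / (hi - lo)

def global_mutation_point(f, lo, hi, grid=100):
    """t in (lo, hi) maximizing mean(f on [lo, t]) / mean(f on [t, hi]),
    i.e. afore_fun / after_fun from the derivation above."""
    best_t, best_ratio = None, float("-inf")
    for k in range(1, grid):
        t = lo + (hi - lo) * k / grid
        ratio = mean_value(f, lo, t) / mean_value(f, t, hi)
        if ratio > best_ratio:
            best_ratio, best_t = ratio, t
    return best_t

# the function is large before x = 1 and small after, so the
# mutation point should land at x = 1
f = lambda x: 10.0 if x < 1 else 1.0
print(global_mutation_point(f, 0.0, 2.0))  # → 1.0
```

In the patent's setting F(x) would be the fitted slope polynomial and [lo, hi] the range of subscript indexes; a closed-form argmax could also be derived by differentiating the ratio, but a grid search is simple and sufficient.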
The data distribution corresponding to Table 6 is shown in FIG. 3a, and the plot of the fitted function in FIG. 3b. The abscissa is the subscript index of the feature; abscissas 1, 2 and 3 correspond in turn to the features named item_ord2, item_ord3 and item_ord4. The ordinate represents the slope.
The fitting function is:
F(x)=0.67x^5+0.82x^4-0.7x^3+0.2x^2-0.004x-0.0046967953
finally, threshold calculation is performed:
Figure BDA0002948466210000111
subscript 7 corresponds to a chi-squared value of 4.
Results under other evaluation indexes are shown in FIG. 3c.
This method obtains the global mutation point of the curve with high computational efficiency and a small resource footprint.
And 204, screening out the features of which the evaluation indexes are larger than the evaluation index threshold value from the feature sequence as features for model training.
In this embodiment, the features extracted from the chi-squared ranking according to the chi-squared threshold obtained in the above step are shown in Table 7:
TABLE 7 (contents not reproduced in this text)
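Step 204 then reduces to a one-line filter over the ranked scores (a sketch with hypothetical feature names):

```python
def screen_features(scored_features, threshold):
    """Keep the features whose evaluation index exceeds the threshold,
    in descending order of the index."""
    ranked = sorted(scored_features.items(), key=lambda kv: -kv[1])
    return [name for name, score in ranked if score > threshold]

scores = {"item_ord2": 9.0, "item_ord3": 6.5, "item_ord4": 4.0, "item_clk9": 0.5}
print(screen_features(scores, 4.0))  # → ['item_ord2', 'item_ord3']
```

Note the strict comparison: a feature whose evaluation index equals the threshold is dropped, matching "larger than the evaluation index threshold" in the claims.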
Step 205, obtaining a training sample set according to the features for model training.
In this embodiment, the selected features may include, for example, user attributes, user level, features related to the attributes of the commodities the user has ordered, and features related to the user's ordering, add-to-cart, click and browse behaviors. Each sample in the constructed sample set includes the features selected in step 204, and also a class label for the sample (e.g., 0 for male and 1 for female in a gender prediction model).
And step 206, performing model training by using the training sample set to obtain a trained model.
In this embodiment, a neural network model is trained with the features of the training samples as input and the class labels as the expected output, so that the model can classify according to the input features. For the gender prediction model, the user's features (user-related features, user-commodity-related features, user-behavior-related cross features, etc.) are input and the predicted gender is obtained; the prediction is compared with the expected output to compute a loss value, and if the loss exceeds a preset threshold, the network parameters of the model are adjusted and training continues. The training process itself is prior art and is not described here.
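The patent trains a neural network; as a minimal runnable stand-in that illustrates the same loop (predict, compare with the expected label, adjust parameters), here is a tiny logistic-regression classifier on the screened features (all names and data are our own, not from the patent):

```python
import math

def train_classifier(samples, labels, lr=0.5, epochs=200):
    """Gradient-descent training: screened features in, 0/1 gender label out."""
    w, b = [0.0] * len(samples[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))  # clamped sigmoid
            g = p - y  # gradient of the log loss: prediction vs expected output
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# toy screened features: [#razor orders, #lipstick orders]
samples = [[2, 0], [1, 0], [0, 4], [0, 2]]
labels = [1, 1, 0, 0]  # 1: male, 0: female, as in Table 4
w, b = train_classifier(samples, labels)
print([predict(w, b, x) for x in samples])  # → [1, 1, 0, 0]
```

The neural network of the embodiment follows the same scheme, only with a deeper parameterization and a framework-provided optimizer.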
The method provided by the above embodiment of the present disclosure offers a global mutation point statistical analysis method: a suitable threshold is found by analyzing the global mutation point and used for feature screening. This addresses the poor stability of prior Filter algorithms, in which the threshold of the feature evaluation index could only be set qualitatively.
In some optional implementations of this embodiment, the method further includes: each feature is tested for significance (e.g., p-value, etc.) based on the evaluation index.
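A significance test of this kind can be sketched as follows for a binary feature against a binary class label, using the Pearson chi-squared statistic. The closed-form p-value via `math.erfc` holds only for one degree of freedom (a 2x2 table); the function name and the 2x2 restriction are illustrative assumptions, not part of the disclosure.

```python
import math

def chi2_2x2(table):
    """Pearson chi-squared statistic and p-value (df = 1) for a 2x2
    contingency table [[a, b], [c, d]] relating a binary feature to a
    binary class label. A small p-value indicates the feature is
    significantly associated with the label."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row = [a + b, c + d]                       # row totals
    col = [a + c, b + d]                       # column totals
    obs = [[a, b], [c, d]]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            e = row[i] * col[j] / n            # expected count under independence
            chi2 += (obs[i][j] - e) ** 2 / e
    p = math.erfc(math.sqrt(chi2 / 2))         # chi-squared survival function, df = 1
    return chi2, p
```

For tables larger than 2x2 (or for the general case), `scipy.stats.chi2_contingency` computes the same statistic with the appropriate degrees of freedom.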
With further reference to FIG. 4, a flow 400 of yet another embodiment of a method for generating a model is shown. The process 400 of the method for generating a model includes the steps of:
step 401, an original feature set is obtained.
And 402, calculating the evaluation index of each feature in the feature set, and sequencing the features according to the evaluation indexes from large to small to obtain a feature sequence.
And step 403, determining a mutation point of the evaluation index from the characteristic sequence as an evaluation index threshold.
And step 404, screening out the features of which the evaluation indexes are larger than the evaluation index threshold value from the feature sequence as features for model training.
Steps 401 to 404 are substantially the same as steps 201 to 204, and their description is therefore omitted.
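Steps 402 to 404 (score, sort in descending order, screen against the threshold) can be sketched as follows. The evaluation indexes are assumed to be precomputed (e.g., chi-squared values); the function name is illustrative.

```python
def screen_features(scores, threshold):
    """Steps 402-404: sort features by their evaluation index in
    descending order, then keep those whose index exceeds the
    evaluation index threshold."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    selected = [name for name, s in ranked if s > threshold]
    return ranked, selected
```

For example, with `scores = {'age': 5.0, 'city': 1.0, 'level': 3.0}` and a threshold of 2.0, the ranked sequence is `age, level, city` and the selected features are `age` and `level`.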
Step 405, a training sample set is obtained according to the selected features.
In this embodiment, the selected features are the features obtained in step 404 for model training. For example, the selected features may include user attributes, user level, characteristics related to the attributes of commodities the user has ordered, and characteristics of the user's ordering, purchasing, clicking, and browsing behavior. Each sample in the constructed sample set includes the features selected in step 404, together with a category label (e.g., 0 for male and 1 for female in a gender prediction model).
And 406, performing model training by using the training sample set to obtain a trained model.
In this embodiment, the features of the training samples are used as input and the class labels as the expected output to train a neural network model, so that the trained model can classify inputs according to their features. The user's characteristics are input into the gender prediction model, and the prediction result is the user's gender. The predicted gender is compared with the expected output to calculate a loss value; if the loss value is greater than a preset threshold, the network parameters of the model are adjusted and training continues. The training process itself is prior art and will not be described herein.
And 407, performing performance evaluation on the trained model to obtain an evaluation result.
In this embodiment, the performance of the model can be evaluated by common techniques in the prior art, for example, by validating the model on a validation set to obtain evaluation results such as recall and accuracy.
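As a sketch of this evaluation step, recall and precision for the positive class can be computed directly from validation-set predictions. The helper below is illustrative, assumes binary 0/1 labels, and is not taken from the disclosure.

```python
def evaluate(y_true, y_pred):
    """Recall and precision for the positive class (label 1),
    computed from true and predicted labels on a validation set."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision
```

In practice, library implementations such as `sklearn.metrics.recall_score` and `precision_score` would typically be used instead.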
In step 408, if the evaluation result does not meet the target expectation, the evaluation index threshold is determined again.
In this embodiment, the target expectation may require that various indexes each reach a predetermined threshold, for example, a recall of at least 80%, an accuracy of at least 80%, and so on. If the evaluation result does not meet the target expectation, the model's performance is inadequate, possibly because the evaluation index threshold was chosen incorrectly. Step 403 may then be executed again to obtain a new evaluation index threshold, the features are re-screened according to the new threshold, the samples are re-selected, and the model is retrained. Steps 403 to 407 may be repeated until the evaluation result meets the target expectation.
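The re-determination of step 408 can be sketched as a loop over candidate thresholds. `train_and_eval` is a hypothetical callback (not from the disclosure) that screens features with a given threshold, retrains the model, and returns (recall, precision); the loop stops once both targets are met, otherwise it falls back to the best candidate seen.

```python
def tune_threshold(candidate_thresholds, train_and_eval,
                   recall_target=0.8, precision_target=0.8):
    """Step 408 as a loop: re-determine the evaluation index threshold
    and retrain until the model meets the target expectation, or the
    candidates are exhausted (then keep the best result seen)."""
    best = None
    for th in candidate_thresholds:
        recall, precision = train_and_eval(th)
        if recall >= recall_target and precision >= precision_target:
            return th, (recall, precision)       # target expectation met
        if best is None or recall + precision > best[1][0] + best[1][1]:
            best = (th, (recall, precision))     # remember best fallback
    return best
```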
As can be seen from fig. 4, compared with the embodiment corresponding to fig. 2, the flow 400 of the method for generating a model in this embodiment adds a step of evaluating the model. The scheme described in this embodiment can therefore re-screen the features according to the evaluation result, thereby improving the accuracy of the model.
With further reference to fig. 5, as an implementation of the methods shown in the above figures, the present disclosure provides an embodiment of an apparatus for generating a model, which corresponds to the method embodiment shown in fig. 2, and which is particularly applicable to various electronic devices.
As shown in fig. 5, the apparatus 500 for generating a model of the present embodiment includes: acquisition section 501, calculation section 502, determination section 503, screening section 504, and training section 505. The acquiring unit 501 is configured to acquire an original feature set; a calculating unit 502 configured to calculate an evaluation index of each feature in the feature set, and sort the features according to the order of the evaluation indexes from large to small to obtain a feature sequence; a determining unit 503 configured to determine a mutation point of the evaluation index from the feature sequence as an evaluation index threshold; a screening unit 504 configured to screen out, from the feature sequence, features with evaluation indexes larger than the evaluation index threshold value as features for model training; a training unit 505 configured to: acquiring a training sample set according to the characteristics used for model training; and carrying out model training by using the training sample set to obtain a trained model.
In this embodiment, the specific processing of the obtaining unit 501, the calculating unit 502, the determining unit 503, the screening unit 504, and the training unit 505 of the apparatus 500 for generating a model may refer to steps 201 to 206 in the corresponding embodiment of fig. 2.
In some optional implementations of this embodiment, the determining unit 503 is further configured to: taking the subscript index of each sequenced feature as an abscissa, taking the evaluation index of each feature as an ordinate to determine the position of each feature in a coordinate system, and calculating the slope between each feature and the previous feature as the slope corresponding to each feature; using the subscript index of each sorted characteristic as an independent variable and the corresponding slope as a dependent variable, and performing polynomial function fitting to obtain fitting curves of all characteristic slopes; and analyzing a slope mutation point of the fitting curve, and taking an evaluation index corresponding to the slope mutation point as an evaluation index threshold value.
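One plausible reading of this first mutation-point method is sketched below: compute the slope between each sorted feature and its predecessor, fit a polynomial to those slopes, and take the point where the fitted slope curve is most negative (the steepest drop) as the mutation point. The polynomial degree and the "steepest fitted drop" criterion are assumptions, since the disclosure does not specify how the slope mutation point is analyzed.

```python
import numpy as np

def mutation_point_by_slope_fit(scores, degree=3):
    """Sort evaluation indexes in descending order, fit a polynomial
    to the feature-to-predecessor slopes, and take the index where the
    fitted slope is most negative as the mutation point.
    Returns (index, evaluation index threshold)."""
    y = np.asarray(sorted(scores, reverse=True), dtype=float)
    x = np.arange(len(y))
    slopes = np.diff(y)                    # slope between each feature and its predecessor
    coeffs = np.polyfit(x[1:], slopes, degree)
    fitted = np.polyval(coeffs, x[1:])     # fitted curve of all feature slopes
    idx = int(np.argmin(fitted)) + 1       # steepest fitted drop = mutation point
    return idx, float(y[idx])
```

With a sharp elbow such as `[100, 95, 90, 85, 10, 9, 8, 7]`, the mutation point lands on the first feature after the drop, so screening with "index greater than the threshold" keeps the four high-scoring features.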
In some optional implementations of this embodiment, the determining unit 503 is further configured to: using the subscript index of each sorted feature as an abscissa, using the evaluation index of each feature as an ordinate to determine the position of each feature in a coordinate system, and calculating the slope between each feature and other features; finding a target feature from the feature sequence, so that the feature sequence is divided into a first interval and a second interval through the target feature, and the ratio of the average slope between the features in the first interval to the average slope between the features in the second interval is the largest; and determining the evaluation index corresponding to the target feature as an evaluation index threshold value.
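A simplified sketch of this second method follows: for each candidate target feature, the descending-sorted sequence is split into the interval up to and including the target and the interval after it, and the split maximizing the ratio of the first interval's average slope to the second's is chosen. For brevity, the mean of adjacent drops stands in for the pairwise average slope described in the text; that simplification and the interval-size bounds are assumptions.

```python
def mutation_point_by_slope_ratio(scores):
    """Find the target feature dividing the sorted sequence into two
    intervals such that the ratio of the first interval's average
    slope to the second's is largest.
    Returns (index of target feature, evaluation index threshold)."""
    y = sorted(scores, reverse=True)
    n = len(y)
    best_t, best_ratio = None, -1.0
    for t in range(1, n - 2):              # both intervals need >= 2 points
        s1 = (y[0] - y[t]) / t             # average drop in first interval
        s2 = (y[t + 1] - y[n - 1]) / (n - 2 - t)
        ratio = s1 / s2 if s2 else float('inf')
        if ratio > best_ratio:
            best_t, best_ratio = t, ratio
    return best_t, y[best_t]
```

On the same elbow-shaped example as before, this method agrees with the slope-fit variant and places the threshold at the first feature after the drop.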
In some optional implementations of this embodiment, the apparatus further comprises a verification unit (not shown in the drawings) configured to: and performing significance test on each feature according to the evaluation index.
In some optional implementations of this embodiment, the model is a gender prediction model, the input of the model is the characteristics of the user, and the output is the gender of the user.
In some optional implementations of this embodiment, the apparatus further comprises an evaluation unit (not shown in the drawings) configured to: performing performance evaluation on the trained model to obtain an evaluation result; and if the evaluation result does not meet the target expectation, re-determining the evaluation index threshold.
In some embodiments, the evaluation index includes a chi-squared value or an information entropy.
Referring now to FIG. 6, a schematic diagram of an electronic device (e.g., the server of FIG. 1) 600 suitable for use in implementing embodiments of the present disclosure is shown. The server shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 6, electronic device 600 may include a processing means (e.g., central processing unit, graphics processor, etc.) 601 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage means 608 into a Random Access Memory (RAM) 603. The RAM 603 also stores various programs and data necessary for the operation of the electronic device 600. The processing device 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
Generally, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 607 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 608 including, for example, tape, hard disk, etc.; and a communication device 609. The communication means 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. While fig. 6 illustrates an electronic device 600 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided. Each block shown in fig. 6 may represent one device or may represent multiple devices as desired.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 609, or may be installed from the storage means 608, or may be installed from the ROM 602. The computer program, when executed by the processing device 601, performs the above-described functions defined in the methods of embodiments of the present disclosure. It should be noted that the computer readable medium described in the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In embodiments of the disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. 
In embodiments of the present disclosure, however, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquiring an original characteristic set; calculating the evaluation index of each feature in the feature set, and sequencing the features according to the evaluation indexes from large to small to obtain a feature sequence; determining a mutation point of an evaluation index from the characteristic sequence as an evaluation index threshold; screening out the characteristics of which the evaluation indexes are larger than the evaluation index threshold value from the characteristic sequence as the characteristics for model training; obtaining a training sample set according to the characteristics for model training; and carrying out model training by using the training sample set to obtain a trained model.
Computer program code for carrying out operations for embodiments of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. The described units may also be provided in a processor, and may be described as: a processor includes an acquisition unit, a calculation unit, a determination unit, a screening unit, and a training unit. Where the names of these units do not in some cases constitute a limitation of the unit itself, for example, an acquisition unit may also be described as a "unit that acquires an original set of features".
The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention in the present disclosure is not limited to the specific combination of the above-mentioned features, but also encompasses other embodiments in which any combination of the above-mentioned features or their equivalents is possible without departing from the inventive concept. For example, the above features and (but not limited to) the features disclosed in this disclosure having similar functions are replaced with each other to form the technical solution.

Claims (10)

1. A method for generating a model, comprising:
acquiring an original characteristic set;
calculating the evaluation index of each feature in the feature set, and sequencing the features according to the order of the evaluation indexes from large to small to obtain a feature sequence;
determining a mutation point of an evaluation index from the characteristic sequence as an evaluation index threshold value;
screening out the features of which the evaluation indexes are larger than the evaluation index threshold value from the feature sequence as features for model training;
obtaining a training sample set according to the characteristics for model training;
and carrying out model training by using the training sample set to obtain a trained model.
2. The method of claim 1, wherein the determining an evaluation index mutation point from the signature sequence as an evaluation index threshold comprises:
taking the subscript index of each sequenced feature as an abscissa, taking the evaluation index of each feature as an ordinate to determine the position of each feature in a coordinate system, and calculating the slope between each feature and the previous feature as the slope corresponding to each feature;
using the subscript index of each sorted characteristic as an independent variable and the corresponding slope as a dependent variable, and performing polynomial function fitting to obtain fitting curves of all characteristic slopes;
and analyzing a slope mutation point of the fitting curve, and taking an evaluation index corresponding to the slope mutation point as an evaluation index threshold value.
3. The method of claim 1, wherein the determining an evaluation index mutation point from the signature sequence as an evaluation index threshold comprises:
using the subscript index of each sorted feature as an abscissa, using the evaluation index of each feature as an ordinate to determine the position of each feature in a coordinate system, and calculating the slope between each feature and other features;
finding a target feature from the feature sequence such that the feature sequence is divided into a first interval and a second interval by the target feature, wherein a ratio of an average slope between features in the first interval to an average slope between features in the second interval is the largest;
and determining the evaluation index corresponding to the target feature as an evaluation index threshold value.
4. The method of claim 1, wherein the method further comprises:
and performing significance test on each feature according to the evaluation index.
5. The method of claim 1, wherein the method further comprises:
performing performance evaluation on the trained model to obtain an evaluation result;
and if the evaluation result does not meet the target expectation, re-determining the evaluation index threshold value.
6. The method of claim 1, wherein the model is a gender prediction model, the input to the model is a characteristic of the user, and the output is the gender of the user.
7. The method of any of claims 1-6, wherein the evaluation index comprises a chi-squared value or an information entropy.
8. An apparatus for generating a model, comprising:
an acquisition unit configured to acquire an original feature set;
the computing unit is configured to compute an evaluation index of each feature in the feature set, and sort the features according to the evaluation indexes from large to small to obtain a feature sequence;
a determining unit configured to determine a mutation point of an evaluation index from the feature sequence as an evaluation index threshold;
a screening unit configured to screen out, from the feature sequence, features with evaluation indexes larger than the evaluation index threshold value as features for model training;
and the training unit is configured to acquire a training sample set according to the characteristics for model training, and perform model training by using the training sample set to obtain a trained model.
9. An electronic device for generating a model, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.
10. A computer-readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of claims 1-7.
CN202110200440.1A 2021-02-23 2021-02-23 Method and apparatus for generating a model Pending CN113792952A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110200440.1A CN113792952A (en) 2021-02-23 2021-02-23 Method and apparatus for generating a model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110200440.1A CN113792952A (en) 2021-02-23 2021-02-23 Method and apparatus for generating a model

Publications (1)

Publication Number Publication Date
CN113792952A true CN113792952A (en) 2021-12-14

Family

ID=78876835

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110200440.1A Pending CN113792952A (en) 2021-02-23 2021-02-23 Method and apparatus for generating a model

Country Status (1)

Country Link
CN (1) CN113792952A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115336977A (en) * 2022-08-03 2022-11-15 中南大学湘雅医院 Accurate ICU alarm grading evaluation method
CN115336977B (en) * 2022-08-03 2023-05-02 中南大学湘雅医院 Accurate ICU alarm grading evaluation method

Similar Documents

Publication Publication Date Title
US11188831B2 (en) Artificial intelligence system for real-time visual feedback-based refinement of query results
CN109492772B (en) Method and device for generating information
KR20200123015A (en) Information recommendation method, apparatus, device and medium
CN110084658B (en) Method and device for matching articles
CN111459992B (en) Information pushing method, electronic equipment and computer readable medium
CN113763019A (en) User information management method and device
CN114240555A (en) Click rate prediction model training method and device and click rate prediction method and device
CN110866625A (en) Promotion index information generation method and device
CN109978594B (en) Order processing method, device and medium
CN111488517A (en) Method and device for training click rate estimation model
WO2020211616A1 (en) Method and device for processing user interaction information
CN113792952A (en) Method and apparatus for generating a model
CN112836128A (en) Information recommendation method, device, equipment and storage medium
CN113743971A (en) Data processing method and device
CN117217284A (en) Data processing method and device
CN107357847B (en) Data processing method and device
WO2022156589A1 (en) Method and device for determining live broadcast click rate
CN114429384A (en) Intelligent product recommendation method and system based on e-commerce platform
CN110110199B (en) Information output method and device
CN113327132A (en) Multimedia recommendation method, device, equipment and storage medium
CN113094584A (en) Method and device for determining recommended learning resources
CN113743973A (en) Method and device for analyzing market hotspot trend
CN113450167A (en) Commodity recommendation method and device
CN112766995A (en) Article recommendation method and device, terminal device and storage medium
CN113516524A (en) Method and device for pushing information

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination