CN111459922A

CN111459922A - User identification method, device, equipment and storage medium

Info

Publication number: CN111459922A
Application number: CN202010097654.6A
Authority: CN
Inventors: 余雯; 黄承伟
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2020-02-17
Filing date: 2020-02-17
Publication date: 2020-07-28
Also published as: WO2021164232A1

Abstract

The application relates to the field of data analysis, in particular to identifying a target user by using a user classification model, and discloses a user identification method, apparatus, device and storage medium. The method comprises the following steps: acquiring offline image data and online data of a user to be identified; performing image processing on the offline image data to obtain offline data of the user to be identified; taking the offline data and the online data as user data of the user to be identified, and performing data preprocessing on the user data to obtain characteristic data of the user to be identified; inputting the characteristic data of the user to be identified into a pre-trained user classification model to obtain the classification probability of the user to be identified; and if the classification probability of the user to be identified is greater than the user classification threshold, determining that the user to be identified is the target user. The accuracy of offline data entry and the accuracy of target user identification based on user characteristics are thereby improved.

Description

User identification method, device, equipment and storage medium

Technical Field

The present application relates to the field of data analysis, and in particular, to a user identification method, apparatus, device, and storage medium.

Background

At present, when identifying target users among users, most existing methods perform feature analysis on the users based on business rules and identify the target users from the analyzed user features. However, such methods are influenced by personal subjectivity and are prone to data omission, so that not all features can be analyzed and the identified user data features are incomplete. Moreover, the business rules are established through manual observation and data comparison, so the user data features cannot be distinguished completely and accurately, and the identification accuracy is low when target users are identified according to the rules.

Therefore, how to identify the target user based on the characteristics of the user to improve the accuracy of identifying the target user becomes an urgent problem to be solved.

Disclosure of Invention

The application provides a user identification method, a user identification device, user identification equipment and a storage medium, so as to improve the accuracy of target user identification.

In a first aspect, the present application provides a user identification method, including:

acquiring offline image data and online data of a user to be identified;

performing image processing on the offline image data to obtain offline data of the user to be identified;

taking the offline data and the online data as user data of the user to be identified, and performing data preprocessing on the user data to obtain characteristic data of the user to be identified, wherein the preprocessing comprises characteristic factor quantization, abnormal value processing and data cleaning;

inputting the characteristic data of the user to be identified into a pre-trained user classification model to obtain the classification probability of the user to be identified;

and if the classification probability of the user to be identified is greater than the user classification threshold, determining that the user to be identified is the target user.

In a second aspect, the present application further provides a user identification device, including:

the data acquisition module is used for acquiring offline image data and online data of a user to be identified;

the image processing module is used for carrying out image processing on the offline image data to obtain the offline data of the user to be identified;

the characteristic data module is used for taking the offline data and the online data as user data of the user to be identified, and carrying out data preprocessing on the user data to obtain the characteristic data of the user to be identified, wherein the preprocessing comprises characteristic factor quantization, abnormal value processing and data cleaning;

the classification probability module is used for inputting the characteristic data of the user to be identified into a pre-trained user classification model so as to obtain the classification probability of the user to be identified;

and the user determination module is used for determining the user to be identified as the target user if the classification probability of the user to be identified is greater than a user classification threshold value.

In a third aspect, the present application further provides a computer device comprising a memory and a processor; the memory is used for storing a computer program; the processor is configured to execute the computer program and to implement the user identification method as described above when executing the computer program.

In a fourth aspect, the present application also provides a computer-readable storage medium storing a computer program, which when executed by a processor causes the processor to implement the user identification method as described above.

The application discloses a user identification method, apparatus, device and storage medium. Offline image data and online data of a user to be identified are acquired, and the offline image data is subjected to image processing to obtain offline data of the user to be identified. The offline data and the online data are taken as user data of the user to be identified, and the user data is preprocessed to obtain characteristic data of the user to be identified. The characteristic data is then input into a pre-trained user classification model to obtain the classification probability of the user to be identified; if the classification probability is greater than the user classification threshold, the user to be identified is determined to be the target user. Performing image processing on the offline image data improves the accuracy and speed of entering the offline data, and identifying and classifying the user features with the pre-trained user classification model allows the target user to be determined from the identification result of the user features, which improves the accuracy of determining the target user.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show some embodiments of the present application, and those skilled in the art can obtain other drawings based on these drawings without creative effort.

FIG. 1 is a schematic flow chart of a training method for a user classification model according to an embodiment of the present disclosure;

FIG. 2 is a flow chart illustrating steps of outputting weight values of feature factors using a random forest algorithm;

FIG. 3 is a schematic flow chart of a user identification method provided in an embodiment of the present application;

FIG. 4 is a schematic flow diagram of sub-steps of the user identification method provided in FIG. 3;

FIG. 5 is a schematic block diagram of a model training apparatus provided in an embodiment of the present application;

FIG. 6 is a schematic block diagram of a user identification device provided by an embodiment of the present application;

FIG. 7 is a schematic block diagram of a computer device according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The flow diagrams depicted in the figures are merely illustrative and do not necessarily include all of the elements and operations/steps, nor do they necessarily have to be performed in the order depicted. For example, some operations/steps may be decomposed, combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.

It is to be understood that the terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.

It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.

The embodiments of the application provide a user identification method, a user identification device, computer equipment and a storage medium. The user identification method can be used for identifying the user characteristics of a user and classifying the user, so that the target user is determined based on the identification result of the user characteristics, and the identification accuracy of the target user is improved. For convenience of description, the present application is described in detail by taking a repeat purchasing user as an example of the target user.

Some embodiments of the present application will be described in detail below with reference to the accompanying drawings. The embodiments described below and the features of the embodiments can be combined with each other without conflict.

Referring to FIG. 1, FIG. 1 is a schematic flowchart illustrating a method for training a user classification model according to an embodiment of the present disclosure. In this training method, the user classification model is trained on sample data, so that the identification accuracy of the model for the target user is improved.

As shown in fig. 1, the training method of the user classification model specifically includes:

S101, obtaining sample data of a sample user and carrying out data preprocessing on the sample data to obtain characteristic data of the sample user.

Specifically, the data preprocessing comprises characteristic factor quantification, abnormal value processing and data cleaning. And the data preprocessing is carried out on the sample data, so that the interference of original error data on the training of the user classification model can be reduced, and the identification accuracy of the user classification model obtained by training is improved. The feature data includes historical purchase records, feature factors, and corresponding feature values for the sample user. And after data preprocessing is carried out on the sample data, the characteristic data of the sample user is obtained.

The data cleaning may include out-of-range value detection, feature validity detection and null value detection on the historical purchase record data, followed by cleaning of the data. In a specific implementation, the null value detection may check for and delete records whose historical purchase record is 0, where a historical purchase record of 0 indicates that the sample user does not meet the characteristics.

The abnormal value processing may determine whether the sample data contains uncollected or abnormal data and then perform missing value processing, selecting different missing value processing methods according to different business rules. For example, with default value replacement, a missing amount or product grade in a historical purchase record is set to a default value according to common business rules for calculation; if the missing value is of a non-numeric type, the missing data is filled in with the mode of the attribute.
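The missing value processing described above can be sketched as follows. This is a minimal illustration only: the field names (`amount`, `product_grade`, `industry`) and the default values in `NUMERIC_DEFAULTS` are hypothetical stand-ins for the business rules, which the application does not specify.

```python
from collections import Counter

# Hypothetical defaults standing in for "common business rules" (assumption).
NUMERIC_DEFAULTS = {"amount": 0.0, "product_grade": 1}


def fill_missing(records, numeric_defaults=NUMERIC_DEFAULTS):
    """Return a copy of the records with None values filled in."""
    filled = [dict(r) for r in records]
    fields = {key for r in records for key in r}
    for field in fields:
        present = [r[field] for r in records if r.get(field) is not None]
        if field in numeric_defaults:
            # Numeric field: replace with the configured default value.
            fill_value = numeric_defaults[field]
        elif present:
            # Non-numeric field: replace with the mode of the attribute.
            fill_value = Counter(present).most_common(1)[0][0]
        else:
            continue  # nothing observed for this field, leave it untouched
        for r in filled:
            if r.get(field) is None:
                r[field] = fill_value
    return filled


sample = [
    {"amount": 1200.0, "product_grade": 2, "industry": "retail"},
    {"amount": None, "product_grade": None, "industry": "retail"},
    {"amount": 300.0, "product_grade": 1, "industry": None},
    {"amount": 80.0, "product_grade": 3, "industry": "finance"},
]
cleaned = fill_missing(sample)
```

The input list is copied rather than modified, so the original sample data remains available for comparison.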

The characteristic factor quantization may be to perform a numerical representation on each characteristic factor in sample data of the sample user. Specifically, the characteristic factors may include industry, industry subclass, industry major class, compensation level, and the like. Four characteristic factors of industry, industry subclass, industry major class and compensation grade can be respectively represented by 1, 2, 3 and 4.

S102, determining a user identifier corresponding to the sample user according to the historical purchase record of the sample user.

Specifically, the user identifier corresponding to the sample user is determined according to the historical purchase record in the feature data of the sample user. The user identification comprises a target user and a non-target user, wherein the target user is a repeated purchasing user, for example, and the non-target user is a non-repeated purchasing user, for example.

In some embodiments, sample users identified as repeat purchasing users may be treated as positive samples, and sample users identified as non-repeat purchasing users may be treated as negative samples. The user identifier corresponding to a sample user may be determined from the sample user's product purchases during a certain past period. For example, if a user bought a product one or more times before December 2018, it is checked whether the user purchased again between January and March 2019; if so, the user is determined to be a repeat purchasing user and treated as a positive sample, and if not, the user is determined to be a non-repeat purchasing user and treated as a negative sample.
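The labeling rule in this example can be sketched as below. The window boundaries are taken from the example in the text and are read here as "before December 2018" and "January to March 2019"; the function name `label_user` and the user IDs are illustrative assumptions.

```python
from datetime import date

# Windows from the example in the text (the exact boundaries are an assumption).
OBSERVATION_END = date(2018, 12, 1)   # purchases before this date count as earlier purchases
FOLLOW_UP_START = date(2019, 1, 1)
FOLLOW_UP_END = date(2019, 3, 31)


def label_user(purchase_dates):
    """Return 1 for a repeat purchaser, 0 for a non-repeat purchaser,
    or None if the user made no purchase in the observation period."""
    bought_before = any(d < OBSERVATION_END for d in purchase_dates)
    if not bought_before:
        return None  # not usable as a sample under this rule
    bought_again = any(FOLLOW_UP_START <= d <= FOLLOW_UP_END for d in purchase_dates)
    return 1 if bought_again else 0


history = {
    "u1": [date(2018, 5, 3), date(2019, 2, 14)],  # bought earlier and again in the follow-up window
    "u2": [date(2018, 8, 20)],                    # bought earlier only
    "u3": [date(2019, 1, 9)],                     # no earlier purchase
}
labels = {user: label_user(dates) for user, dates in history.items()}
```

Users without a purchase in the observation period are returned as `None` so that they can be excluded from both the positive and the negative samples.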

S103, training a classification model according to the user identification, the characteristic factors and the characteristic values corresponding to the characteristic factors of the sample user, and taking the classification model obtained through training as a pre-trained user classification model.

Specifically, a classification algorithm is adopted to train a classification model according to the user identification of the sample user, the characteristic factors and the characteristic values corresponding to the characteristic factors. The classification algorithm can be a logistic regression algorithm, an XGBoost algorithm, a Gaussian naive Bayes classification algorithm, a random forest algorithm, a GBDT algorithm and the like.
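As one possible instance of the listed algorithms, the following is a minimal logistic regression sketch trained by batch gradient descent. The toy feature values, learning rate and number of epochs are assumptions chosen for illustration; a practical system would more likely rely on an established library implementation.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(features, labels, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = p - y  # derivative of the log loss with respect to the logit
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict_probability(model, x):
    """Classification probability that x belongs to the positive class."""
    weights, bias = model
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Toy feature values per sample user (two hypothetical quantized factors).
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [3.0, 2.0], [4.0, 3.0], [5.0, 2.0]]
y = [0, 0, 0, 1, 1, 1]  # user identifiers: 1 = target user, 0 = non-target user
model = train_logistic_regression(X, y)
```

The value returned by `predict_probability` plays the role of the classification probability that is later compared with the user classification threshold.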

In another embodiment, after step S102, the method may further include:

S104, classifying the characteristic factors, the corresponding characteristic values and the corresponding user identifications of the sample users by adopting a random forest algorithm so as to output the weight values of the characteristic factors.

Specifically, a random forest algorithm is used to output the weight value of each characteristic factor, and the importance of each characteristic factor can be calculated by using the out-of-bag (OOB) data.

In some embodiments, referring to fig. 2, outputting the weight values of the feature factors by using the random forest algorithm specifically includes:

S1041, segmenting the characteristic factors, the corresponding characteristic values and the corresponding user identifications of the sample users by adopting a K-fold cross segmentation method to obtain a plurality of non-intersecting training subsets.

Specifically, the K-fold cross segmentation method is the StratifiedKFold method. Through stratified sampling, it ensures that the proportion of sample users with each user identifier in each training subset is the same as in the original data set. The original data set refers to the set of feature data of all sample users.
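The effect of stratified splitting can be illustrated with a plain-Python sketch. It is a stand-in written for illustration, not the library routine named above; the function name `stratified_k_fold` and the round-robin dealing strategy are assumptions.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k, seed=0):
    """Split sample indices into k disjoint folds that preserve label proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)
    folds = [[] for _ in range(k)]
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Deal the indices of this class round-robin so every fold gets an equal share.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# 6 repeat purchasers (1) and 12 non-repeat purchasers (0): a 1:2 ratio.
labels = [1] * 6 + [0] * 12
folds = stratified_k_fold(labels, k=3)
```

Each of the three folds here contains two positive and four negative samples, so the 1:2 ratio of the original data set is preserved in every training subset.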

S1042, performing random sampling with replacement on the training subsets multiple times by adopting a repeated sampling method to obtain multiple target training subsets, and taking the feature factors, the corresponding feature values and the corresponding user identifications of the sample users which are not drawn as the out-of-bag data.

Specifically, the sample data in the training subset is randomly sampled by using a repeated sampling method: after each draw the sample is put back before the next draw is made, and the drawn sample data forms a target training subset. The data size of the target training subset is the same as that of the training subset; elements may be repeated across different target training subsets and also within the same target training subset. The characteristic data of the sample users that are not drawn in a given round is taken as the out-of-bag data.

For example, if the training subset contains N samples, N draws with replacement are made, so that N samples are obtained as a target training subset.
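A minimal sketch of one round of sampling with replacement, together with the resulting out-of-bag data, is shown below. The function name `bootstrap_with_oob` and the placeholder sample names are illustrative assumptions.

```python
import random


def bootstrap_with_oob(subset, seed=0):
    """Draw len(subset) samples with replacement; return (target_subset, out_of_bag)."""
    rng = random.Random(seed)
    n = len(subset)
    drawn = [rng.randrange(n) for _ in range(n)]  # each draw puts the sample back
    target_subset = [subset[i] for i in drawn]
    drawn_set = set(drawn)
    # Samples that were never drawn form the out-of-bag data.
    out_of_bag = [subset[i] for i in range(n) if i not in drawn_set]
    return target_subset, out_of_bag


training_subset = [f"user_{i}" for i in range(10)]
target_subset, out_of_bag = bootstrap_with_oob(training_subset)
```

Repeating the call with different seeds yields the multiple target training subsets, each with its own out-of-bag data.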

In some embodiments, after step S1042, the method may further include: and carrying out sampling balance processing on the data in the target training subset to obtain a processed target training subset.

The sampling balance processing may refer to constructing new samples from the data in the target training subset by using the SMOTE algorithm. The SMOTE algorithm analyzes the minority-class samples and artificially synthesizes new samples from them to add to the data set. The algorithm flow can be as follows:

For each sample x in the minority class, the Euclidean distance from x to every sample in the minority-class sample set S_min is calculated, and the K nearest neighbors of x are obtained.

A sampling ratio is set according to the sample imbalance ratio to determine a sampling multiplying factor N. For each minority-class sample x, several samples are randomly selected from its K nearest neighbors; a selected neighbor is denoted x_n.

For each randomly selected neighbor x_n, a new sample is constructed together with the original sample x according to the following formula:

x_new = x + rand(0,1) * |x - x_n|
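The flow above can be sketched in plain Python as follows. Note one assumption made explicit here: the sketch interpolates with the signed difference (x_n - x), as in the commonly published form of SMOTE, so that each new sample lies on the segment between x and x_n; this differs from the absolute-value form written in the formula above. The parameter names `k` and `n_per_sample` are illustrative.

```python
import math
import random


def k_nearest_neighbors(x, minority, k):
    """Return the k minority samples closest to x by Euclidean distance."""
    others = [s for s in minority if s is not x]
    others.sort(key=lambda s: math.dist(x, s))
    return others[:k]


def smote(minority, k=2, n_per_sample=1, seed=0):
    """Synthesize n_per_sample new samples for every minority-class sample."""
    rng = random.Random(seed)
    synthetic = []
    for x in minority:
        neighbors = k_nearest_neighbors(x, minority, k)
        for _ in range(n_per_sample):
            x_n = rng.choice(neighbors)  # randomly selected neighbor
            gap = rng.random()           # rand(0, 1)
            # Signed interpolation between x and x_n (assumption, see lead-in).
            synthetic.append([a + gap * (b - a) for a, b in zip(x, x_n)])
    return synthetic


minority_samples = [[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]]
new_samples = smote(minority_samples, k=2, n_per_sample=2)
```

Here `n_per_sample` plays the role of the sampling multiplying factor N, and the synthesized samples would be appended to the target training subset.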

S1043, based on the characteristic factors and the target training subsets, obtaining a random forest model by adopting a random forest algorithm.

Specifically, for each target training subset, a random forest algorithm is adopted to generate a decision tree based on the characteristic factors, a random forest model is formed according to the generated decision trees, the final prediction result of the characteristic factors is determined according to the mean value of the prediction values of the decision trees, and the prediction result is used as the output result of the random forest model.

S1044, calculating the weight value of the characteristic factor by utilizing the out-of-bag data based on the output result of the random forest model.

According to the output result of the random forest model, the weight value of a characteristic factor is calculated as follows. For each decision tree, the corresponding out-of-bag data is used to calculate the out-of-bag error, recorded as errOOB1. Noise interference is then randomly added to the characteristic factor X of all samples of the out-of-bag data, and the out-of-bag error is calculated again, recorded as errOOB2. Assuming that there are K trees in the random forest, the importance of the characteristic factor X is ∑(errOOB2 - errOOB1)/K.

This expression can be used as the measure of importance of the corresponding characteristic factor because, if the out-of-bag accuracy drops greatly after noise is randomly added to a certain characteristic factor, that characteristic factor has a great influence on the classification result of the sample, that is, its importance is higher.
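The importance measure ∑(errOOB2 - errOOB1)/K can be sketched as below. Two assumptions are made for illustration: the "noise interference" is realized by shuffling the values of factor X across the out-of-bag samples, and the trees are hypothetical hand-written decision stumps rather than trees produced by a random forest algorithm.

```python
import random


def oob_error(tree, samples):
    """Fraction of out-of-bag samples that the tree misclassifies."""
    wrong = sum(1 for features, label in samples if tree(features) != label)
    return wrong / len(samples)


def feature_importance(forest, feature_index, seed=0):
    """forest is a list of (tree, oob_samples); returns sum(errOOB2 - errOOB1) / K."""
    rng = random.Random(seed)
    total = 0.0
    for tree, oob_samples in forest:
        err_oob1 = oob_error(tree, oob_samples)
        # Perturb factor X by shuffling its values across the OOB samples (assumption).
        column = [features[feature_index] for features, _ in oob_samples]
        rng.shuffle(column)
        noisy = []
        for (features, label), value in zip(oob_samples, column):
            perturbed = list(features)
            perturbed[feature_index] = value
            noisy.append((perturbed, label))
        err_oob2 = oob_error(tree, noisy)
        total += err_oob2 - err_oob1
    return total / len(forest)


# Two hypothetical trees (decision stumps) that only look at factor 0.
def stump_a(features):
    return 1 if features[0] > 0.5 else 0


def stump_b(features):
    return 1 if features[0] > 0.4 else 0


oob = [([0.1, 7.0], 0), ([0.2, 3.0], 0), ([0.8, 5.0], 1), ([0.9, 1.0], 1)]
forest = [(stump_a, oob), (stump_b, oob)]
importance_used = feature_importance(forest, feature_index=0)
importance_unused = feature_importance(forest, feature_index=1)
```

Because neither stump reads factor 1, perturbing it leaves the out-of-bag error unchanged and its importance is zero, whereas perturbing factor 0 can only keep or increase the error in this toy setup.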

S105, when the weight value of the characteristic factor is larger than a factor classification threshold value, determining the characteristic factor as a target characteristic factor.

Specifically, the factor classification threshold may be freely set by the user. The plurality of characteristic factors are sorted from high to low according to their weight values, and each characteristic factor whose weight value is larger than the factor classification threshold is taken as a target characteristic factor, wherein the target characteristic factor is used for identifying a target user.

S106, screening the characteristic data of the sample user according to the target characteristic factor to obtain a characteristic value corresponding to the sample user and the target characteristic factor.

Specifically, after the target characteristic factor is determined, the characteristic data of the sample user is screened according to the target characteristic factor, so that a characteristic value corresponding to the target characteristic factor of the sample user is obtained. And screening the characteristic data of the sample user according to the target characteristic factor, reducing the interference of irrelevant characteristic factors and characteristic values when training the user classification model, and reducing the computation load during training.
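Steps S105 and S106 can be sketched together as follows. The weight values, the threshold of 0.10 and the factor names are made-up illustrations, not values from the application.

```python
def select_target_factors(weights, threshold):
    """Sort factors by weight (high to low) and keep those above the threshold."""
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return [name for name, weight in ranked if weight > threshold]


def screen_features(feature_data, target_factors):
    """Keep only the feature values that correspond to the target factors."""
    return [{name: row[name] for name in target_factors} for row in feature_data]


# Hypothetical weight values output by the random forest step.
factor_weights = {
    "industry": 0.31,
    "industry_subclass": 0.05,
    "industry_major_class": 0.12,
    "compensation_level": 0.52,
}
target_factors = select_target_factors(factor_weights, threshold=0.10)
screened = screen_features(
    [{"industry": 3, "industry_subclass": 17, "industry_major_class": 2, "compensation_level": 4}],
    target_factors,
)
```

Only the screened feature values are then passed to the training in step S107, which is where the reduction in computation comes from.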

S107, training a classification model according to the user identification of the sample user, the target characteristic factor and the characteristic value corresponding to the target characteristic factor, and taking the classification model obtained through training as a pre-trained user classification model.

Specifically, the user classification model may be obtained by training using a logistic regression algorithm, an XGBoost algorithm, a Gaussian naive Bayes classification algorithm, a random forest algorithm, or a GBDT algorithm.

In the training method for the user classification model provided in the above embodiment, data preprocessing is performed on the obtained sample data: dirty data in the sample data is removed or repaired by data filling, so that the quality of the sample data is improved and the interference of original erroneous data on the trained user classification model is reduced. The weight values of the characteristic factors are output by a random forest algorithm; when a weight value is greater than the factor classification threshold, the characteristic factor is taken as a target characteristic factor and the characteristic data of the sample user is screened based on the target characteristic factor. The user classification model is then trained on the screened sample data, which reduces the interference of irrelevant characteristic factors and characteristic values during training and reduces the amount of computation during training.

Referring to fig. 3, fig. 3 is a schematic flowchart of a user identification method according to an embodiment of the present application. The user identification method can be used for carrying out image processing on the offline image data of the user to be identified, and identifying the user by utilizing a pre-trained user classification model according to the user data of the user to be identified, which is obtained after processing, so as to identify the target user from the user.

As shown in fig. 3, the user identification method specifically includes: step S201 to step S205.

S201, acquiring offline image data and online data of a user to be identified.

Specifically, the offline image data of the user to be identified may include an image of a basic information form filled in offline by the user to be identified, an image of a record of a service person's visit to the user to be identified, an image of an offline activity check-in form of the user to be identified, and the like. In some embodiments, the basic information form, the visit record and the offline activity check-in form of the user to be identified can be scanned or photographed to obtain the offline image data. The online data includes online purchase data, user portrait data, etc. of the user to be identified.

S202, performing image processing on the offline image data to obtain offline data of the user to be identified.

Specifically, the image processing of the offline image data refers to performing recognition on the offline image data of the user to be identified to obtain the offline data of the user to be identified.

In some embodiments, referring to fig. 4, step S202 specifically includes:

S2021, preprocessing the offline image data.

Specifically, the preprocessing includes binarization, noise removal, and tilt correction. The basic information form image filled in offline by the user to be identified, the image of the service person's visit record for the user to be identified, and the offline activity check-in form image of the user to be identified are each preprocessed, so as to obtain the preprocessed basic information form image, the preprocessed visit record image and the preprocessed offline activity check-in form image.

The binarization is to process the color image, so that the image only includes foreground information and background information, the foreground information is defined as black, and the background information is white. Since the information amount of the color image is large, in order to improve the accuracy and the recognition speed of character recognition, the color image may be first subjected to binarization processing.
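Binarization can be sketched as follows on an image represented as nested lists of RGB tuples. The fixed threshold of 128 and the grayscale conversion step are assumptions; the application does not state how the threshold is chosen, and adaptive methods are common in practice.

```python
def to_grayscale(rgb_image):
    """Convert RGB pixels to gray levels using the ITU-R BT.601 luma weights."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
            for row in rgb_image]


def binarize(gray_image, threshold=128):
    """Map dark pixels to 0 (black foreground) and light pixels to 255 (white background)."""
    return [[0 if pixel < threshold else 255 for pixel in row] for row in gray_image]


# A tiny 2 x 3 "scan": light paper with dark ink strokes.
color = [
    [(250, 250, 245), (20, 20, 30), (255, 255, 255)],
    [(30, 25, 25), (240, 238, 240), (15, 15, 15)],
]
binary = binarize(to_grayscale(color))
# binary is [[255, 0, 255], [0, 255, 0]]
```

After this step every pixel is either foreground or background, which is the reduced representation the subsequent character recognition works on.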

S2022, performing layout analysis and character recognition on the preprocessed offline image data to obtain a recognition result.

Specifically, the layout analysis refers to dividing the preprocessed offline image into paragraphs and lines. The character recognition refers to performing feature extraction on the offline image data so as to extract the character data in the offline image data, and the extracted character data is used as the recognition result.

Layout analysis and character recognition are performed respectively on the preprocessed basic information form image filled in offline by the user to be identified, the preprocessed visit record image and the preprocessed offline activity check-in form image, so as to obtain the recognition result corresponding to each image. The recognition result of the basic information form image includes basic information such as the name, gender and age of the user, and the recognition result of the visit record image includes information such as the user's evaluation of the currently purchased product and whether the user intends to continue purchasing the product.

S2023, determining offline data according to the recognition result.

Specifically, layout restoration and post-processing are performed on the recognition result, so that the recognition result is arranged in the same format as in the offline image data and is corrected according to the context of the specific language, thereby determining the offline data.

According to the recognition results corresponding to the basic information form image filled in by the user to be identified, the visit record image and the offline activity check-in form image, the form data corresponding to the basic information form image, the visit data corresponding to the visit record image and the check-in data corresponding to the offline activity check-in form image are respectively determined, and the form data, the visit data and the check-in data together are taken as the offline data.

S203, taking the offline data and the online data as user data of the user to be identified, and performing data preprocessing on the user data to obtain the characteristic data of the user to be identified.

Specifically, taking the offline data and the online data as the user data of the user to be identified means that the offline data identified through image identification is entered, so that the offline data and the online data are taken as the user data of the user to be identified together. The speed and efficiency of offline data entry are improved, and the accuracy of offline data entry is also improved.

The preprocessing comprises characteristic factor quantization, abnormal value processing and data cleaning. The data cleaning needs to be performed in combination with the characteristic factors of each user. Interaction behavior characteristic factors that cannot be collected or are missing can be filled with zero; user income and assets can be filled with average values; null values of gender, educational background, occupation and the like can be marked as unknown; for data such as the time since the last purchase, a larger value should be filled in to indicate that no purchase has been made recently.
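The per-factor filling rules above can be sketched as follows. The field names (`page_views`, `income`, `gender`, `days_since_last_purchase`) and the sentinel value 9999 are hypothetical choices made for illustration.

```python
from statistics import mean

# Hypothetical large value meaning "no recent purchase" (assumption).
LARGE_DAYS_SINCE_PURCHASE = 9999


def clean_user_data(users):
    """Return a copy of the user records with nulls filled per characteristic factor."""
    cleaned = [dict(u) for u in users]
    known_income = [u["income"] for u in users if u.get("income") is not None]
    income_mean = mean(known_income) if known_income else 0.0
    for u in cleaned:
        if u.get("page_views") is None:
            u["page_views"] = 0                 # interaction behavior: zero filling
        if u.get("income") is None:
            u["income"] = income_mean           # income or assets: average value
        if u.get("gender") is None:
            u["gender"] = "unknown"             # categorical attribute: unknown
        if u.get("days_since_last_purchase") is None:
            u["days_since_last_purchase"] = LARGE_DAYS_SINCE_PURCHASE
    return cleaned


users = [
    {"page_views": 12, "income": 8000.0, "gender": "F", "days_since_last_purchase": 30},
    {"page_views": None, "income": None, "gender": None, "days_since_last_purchase": None},
    {"page_views": 3, "income": 12000.0, "gender": "M", "days_since_last_purchase": 400},
]
cleaned_users = clean_user_data(users)
```

The average is computed only over the users whose income is known, so the filled value is not distorted by the missing entries themselves.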

The characteristic data of the user to be identified comprises a historical purchase record, characteristic factors and corresponding characteristic values of the user to be identified. The historical purchase record is a record of the user purchasing company products, and comprises the time of purchasing each product and detailed information of the product; for a personal insurance product, for example, this includes the product name, the purchase mode, the number of products purchased, and the user's insurance coverage.

In some embodiments, entering the offline data specifically includes:

associating the offline data of the same user to be identified according to the form data, the visit data and the sign-in data of the user to be identified; associating the table data of the same user to be identified with the online data according to the table data of the user to be identified and the online data; and recording the visit data and the check-in data of the user to be identified based on the incidence relation between the form data and the online data of the user to be identified.

Specifically, offline data of the same user to be identified is correlated according to form data, visit data and name, gender and other information in check-in data of the user to be identified, online data matched with the user to be identified is retrieved according to basic information such as name, gender, age and the like in the form data of the user to be identified, a correlation relationship between the form data and the online data of the user to be identified is established, and finally, the visit data and the check-in data correlated with the form data are recorded based on the correlation relationship between the form data and the online data of the user to be identified, so that the form data, the visit data and the check-in data and the online data are jointly used as user data of the user to be identified.

Retrieving the online data of the user to be identified through the form data improves the accuracy and matching degree of data entry; entering the offline data based on the association among the form data, the visit data and the check-in data, and on the association between the form data and the online data, improves the efficiency of data entry.
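The association steps described above can be sketched as follows. This is only an illustration under stated assumptions: records are plain dictionaries, offline records are matched on (name, gender), and online data is matched on (name, gender, age). The function name `link_user_data` and the field names are hypothetical.

```python
def link_user_data(forms, visits, checkins, online):
    """Associate offline records with each other and with matching online data."""
    users = []
    for form in forms:
        # Offline records of the same user are associated by name and gender.
        offline_key = (form["name"], form["gender"])
        # Online data is retrieved by the basic information in the form data.
        online_key = (form["name"], form["gender"], form["age"])
        match = next(
            (o for o in online
             if (o["name"], o["gender"], o["age"]) == online_key),
            None,  # no matching online record was found
        )
        users.append({
            "form": form,
            "visits": [v for v in visits
                       if (v["name"], v["gender"]) == offline_key],
            "checkins": [c for c in checkins
                         if (c["name"], c["gender"]) == offline_key],
            "online": match,
        })
    return users
```

In practice a name and gender pair is not guaranteed to be unique, so a real system would likely need a stronger key; the sketch only mirrors the fields named in the description.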

S204, inputting the characteristic data of the user to be recognized into a pre-trained user classification model to obtain the classification probability of the user to be recognized.

Specifically, the feature data of the user to be recognized is input into a pre-trained user classification model, the user to be recognized is predicted by the user classification model, and the classification probability of the user to be recognized is output. The classification probability is the probability that the user to be identified is a target user or the probability that the user to be identified is a non-target user.

S205, if the classification probability of the user to be identified is greater than a user classification threshold, determining that the user to be identified is a target user.

Specifically, if the classification probability of the user to be identified, which is output by the user classification model, is greater than the user classification threshold, it is determined that the user to be identified is the target user. The user classification threshold may be preset by a service person, or may be selected by classifying and predicting sample data. It should be noted that the user classification threshold may be adjusted according to the application scenario and the service requirement.
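As a small illustration of step S205, the comparison can be written as below. The sketch assumes the model outputs the probability of being a target user (rather than a non-target user) and that probabilities are held in a dictionary keyed by a hypothetical user ID.

```python
def identify_target_users(probabilities, threshold):
    """Return the IDs whose classification probability is strictly greater than the threshold."""
    return [uid for uid, p in probabilities.items() if p > threshold]
```

Note that a probability exactly equal to the threshold does not qualify, matching the "greater than" wording of the step.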

In some embodiments, when the user classification threshold is selected after performing classification prediction on the sample data, specifically, the method may include:

obtaining the classification probability of the pre-trained user classification model for the sample user; calculating, for each of a plurality of test thresholds, a corresponding confusion matrix; calculating a plurality of Kolmogorov-Smirnov test values from the plurality of confusion matrices; and taking the test threshold corresponding to the maximum value among the plurality of Kolmogorov-Smirnov test values as the user classification threshold.

Specifically, the classification probability of the pre-trained user classification model for each sample user is obtained, and a plurality of test thresholds are set. For a given test threshold, when the classification probability of a sample user output by the user classification model is greater than the test threshold, the sample user is defined as a target user; otherwise, the sample user is defined as a non-target user. The confusion matrix corresponding to each of the plurality of test thresholds is calculated accordingly. The confusion matrix is specifically shown in table 1:

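TABLE 1

                     Predicted positive      Predicted negative
True value positive  True Positive (TP)      False Negative (FN)
True value negative  False Positive (FP)     True Negative (TN)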

Wherein, True Positive (TP) is the number of samples whose true value is positive and which the model also predicts as positive; False Positive (FP) is the number of samples whose true value is negative but which the model predicts as positive; False Negative (FN) is the number of samples whose true value is positive but which the model predicts as negative; and True Negative (TN) is the number of samples whose true value is negative and which the model also predicts as negative.

The Kolmogorov-Smirnov test values, namely the KS values, corresponding to the plurality of test thresholds are then calculated from the obtained confusion matrices. The calculation formula is as follows:

KS = max(TPR - FPR)

wherein TPR represents the true positive rate, TPR = TP / (TP + FN), and FPR represents the false positive rate, FPR = FP / (FP + TN).

Since the KS value reflects the optimum discrimination effect of the model, the larger the KS value is, the better the prediction effect at that time is. Therefore, for the KS values calculated by the multiple test thresholds, the test threshold corresponding to the KS maximum value can be selected from the KS values, and the test threshold is used as the user classification threshold.
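The threshold selection described above can be sketched as follows, assuming labels of 1 for target users and 0 for non-target users. The function names are illustrative and not taken from the application.

```python
def confusion_matrix(labels, probs, threshold):
    """Count TP, FP, FN, TN when a probability above the threshold means 'target'."""
    tp = fp = fn = tn = 0
    for y, p in zip(labels, probs):
        predicted_positive = p > threshold
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive and y == 0:
            fp += 1
        elif not predicted_positive and y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def ks_value(tp, fp, fn, tn):
    """TPR - FPR for one confusion matrix (0.0 is used when a rate is undefined)."""
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr - fpr


def select_threshold(labels, probs, test_thresholds):
    """Return the test threshold whose confusion matrix gives the largest TPR - FPR."""
    return max(
        test_thresholds,
        key=lambda t: ks_value(*confusion_matrix(labels, probs, t)),
    )
```

If several test thresholds tie for the maximum, `max` returns the first one in `test_thresholds`; the application does not state a tie-breaking rule.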

In the user identification method provided in the above embodiment, offline image data and online data of a user to be identified are obtained, and image processing is performed on the offline image data to obtain offline data of the user to be identified. The offline data and the online data are then used as user data of the user to be identified, and data preprocessing is performed on the user data to obtain feature data of the user to be identified. The feature data is input into a pre-trained user classification model to obtain the classification probability of the user to be identified, and if the classification probability is greater than a user classification threshold, the user to be identified is determined to be a target user. Because the offline data is obtained by image processing and used together with the online data as user data, the speed and accuracy of offline data entry are improved. Because whether the user to be identified is a target user is judged according to the classification probability output by the user classification model, the identification accuracy of the target user is improved.

Referring to fig. 5, fig. 5 is a schematic block diagram of a model training apparatus according to an embodiment of the present application, which may be configured in a server for executing the aforementioned training method of a user classification model.

As shown in fig. 5, the model training apparatus 300 includes: a data processing module 301, a user identifier module 302, a model training module 303, a user classification module 304, a target feature module 305, a data screening module 306, and a target model training module 307.

The data processing module 301 is configured to obtain sample data of a sample user and perform data preprocessing on the sample data to obtain feature data of the sample user, where the feature data includes a historical purchase record, a feature factor, and a corresponding feature value of the sample user.

A user identifier module 302, configured to determine a user identifier corresponding to the sample user according to the historical purchase record of the sample user, where the user identifier includes a target user and a non-target user.

And the model training module 303 is configured to train a classification model according to the user identifier of the sample user, the feature factor, and the feature value corresponding to the feature factor, and use the classification model obtained through training as a pre-trained user classification model.

And the user classification module 304 is configured to classify the feature factors, the corresponding feature values, and the corresponding user identifiers of the sample users by using a random forest algorithm, so as to output weight values of the feature factors.

The user classification module 304 includes a subset division submodule 3041, a target subset submodule 3042, a random forest model submodule 3043, and a weight calculation submodule 3044.

Specifically, the subset partitioning submodule 3041 is configured to partition the feature factor, the corresponding feature value, and the corresponding user identifier of the sample user by using a K-fold cross-segmentation method, so as to obtain a plurality of disjoint training subsets.

The target subset submodule 3042 is configured to perform random sampling with replacement on the plurality of training subsets multiple times by using a repeated sampling method to obtain a plurality of target training subsets, and to use the feature factors, the corresponding feature values and the corresponding user identifiers of the sample users that are not drawn as out-of-bag data.

And a random forest model submodule 3043, configured to obtain a random forest model by using a random forest algorithm based on the feature factors and the plurality of target training subsets.

A weight calculating submodule 3044, configured to calculate, based on an output result of the random forest model, a weight value of the feature factor by using the out-of-bag data.
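The work of submodules 3041, 3042 and 3044 can be sketched as follows. The application does not state how the weight is computed from the out-of-bag data; the sketch assumes one common choice, the drop in out-of-bag accuracy after randomly permuting a feature, and treats `predict` as a hypothetical stand-in for the trained random forest. Sampling within each training subset is likewise one possible reading of the description.

```python
import random


def k_fold_split(samples, k, rng):
    """Shuffle the samples and split them into k disjoint subsets."""
    shuffled = samples[:]
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def bootstrap_with_oob(subset, rng):
    """Draw len(subset) items with replacement; items never drawn are out-of-bag."""
    indices = [rng.randrange(len(subset)) for _ in subset]
    drawn_indices = set(indices)
    drawn = [subset[i] for i in indices]
    oob = [item for i, item in enumerate(subset) if i not in drawn_indices]
    return drawn, oob


def accuracy(predict, rows, labels):
    """Fraction of rows for which the model's prediction equals the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)


def oob_permutation_weight(predict, oob_rows, oob_labels, feature, rng):
    """Assumed weight: out-of-bag accuracy lost when one feature is shuffled."""
    baseline = accuracy(predict, oob_rows, oob_labels)
    column = [r[feature] for r in oob_rows]
    rng.shuffle(column)
    permuted = [dict(r, **{feature: v}) for r, v in zip(oob_rows, column)]
    return baseline - accuracy(predict, permuted, oob_labels)
```

A feature the model ignores gets a weight of zero under this scheme, because shuffling it cannot change any prediction.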

A target feature module 305, configured to determine the feature factor as a target feature factor when the weight value of the feature factor is greater than a factor classification threshold.

And a data screening module 306, configured to screen feature data of the sample user according to the target feature factor, so as to obtain a feature value corresponding to the target feature factor for the sample user.

And the target model training module 307 is configured to train a classification model according to the user identifier of the sample user, the target feature factor, and the feature value corresponding to the target feature factor, and use the classification model obtained through training as a pre-trained user classification model.
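The screening performed by the target feature module 305 and the data screening module 306 can be illustrated with a short sketch. It assumes the weight values are held in a dictionary keyed by feature factor name and that each sample user's feature data is a dictionary; the function names are hypothetical.

```python
def select_target_factors(weights, factor_threshold):
    """Keep the feature factors whose weight value exceeds the factor classification threshold."""
    return [name for name, w in weights.items() if w > factor_threshold]


def screen_features(rows, target_factors):
    """Reduce each sample user's feature data to the target feature factors only."""
    return [{f: row[f] for f in target_factors} for row in rows]
```

The screened rows, together with the user identifiers, would then be what module 307 trains on.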

Referring to fig. 6, fig. 6 is a schematic block diagram of a user identification device according to an embodiment of the present application, where the user identification device is configured to perform the user identification method. Wherein, the user identification device can be configured in a server or a terminal.

The server may be an independent server or a server cluster. The terminal can be an electronic device such as a mobile phone, a tablet computer, a notebook computer, a desktop computer, a personal digital assistant and a wearable device.

As shown in fig. 6, the user identification apparatus 400 includes: a data acquisition module 401, an image processing module 402, a feature data module 403, a classification probability module 404, and a user determination module 405.

The data acquiring module 401 is configured to acquire offline image data and online data of a user to be identified.

An image processing module 402, configured to perform image processing on the offline image data to obtain offline data of the user to be identified.

The image processing module 402 includes a preprocessing sub-module 4021, a recognition result sub-module 4022, and a data determination sub-module 4023.

Specifically, the preprocessing submodule 4021 is configured to perform preprocessing on the offline image data, where the preprocessing includes binarization, noise removal, and tilt correction; the recognition result sub-module 4022 is configured to perform layout analysis and character recognition on the preprocessed offline image data to obtain a recognition result; the data determining sub-module 4023 is configured to determine offline data according to the recognition result.

A feature data module 403, configured to use the offline data and the online data as user data of the user to be identified, and perform data preprocessing on the user data to obtain feature data of the user to be identified, where the preprocessing includes feature factor quantization, abnormal value processing, and data cleaning.

A classification probability module 404, configured to input the feature data of the user to be identified into a pre-trained user classification model, so as to obtain a classification probability of the user to be identified.

A user determining module 405, configured to determine that the user to be identified is the target user if the classification probability of the user to be identified is greater than a user classification threshold.

It should be noted that, as will be clear to those skilled in the art, for convenience and brevity of description, the specific working processes of the model training apparatus and each module described above and the specific working processes of the user identification apparatus and each module described above may refer to the training method of the user classification model and the corresponding processes in the embodiment of the user identification method, and are not described herein again.

The user identification means described above may be implemented in the form of a computer program which is executable on a computer device as shown in fig. 7.

Referring to fig. 7, fig. 7 is a schematic block diagram of a computer device according to an embodiment of the present disclosure. The computer device may be a server or a terminal.

Referring to fig. 7, the computer device includes a processor, a memory, and a network interface connected through a system bus, wherein the memory may include a nonvolatile storage medium and an internal memory.

The non-volatile storage medium may store an operating system and a computer program. The computer program includes program instructions that, when executed, cause a processor to perform any of the user identification methods.

The processor is used for providing calculation and control capability and supporting the operation of the whole computer equipment.

The internal memory provides an environment for running the computer program stored in the non-volatile storage medium; when the computer program is executed by the processor, the processor is caused to perform any of the user identification methods.

The network interface is used for network communication, such as sending assigned tasks and the like. Those skilled in the art will appreciate that the architecture shown in fig. 7 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or fewer components than those shown, or may combine certain components, or have a different arrangement of components.

It should be understood that the Processor may be a Central Processing Unit (CPU), and the Processor may be other general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, etc. Wherein a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.

Wherein, in one embodiment, the processor is configured to execute a computer program stored in the memory to implement the steps of:

acquiring offline image data and online data of a user to be identified; performing image processing on the offline image data to obtain offline data of the user to be identified; taking the offline data and the online data as user data of the user to be identified, and performing data preprocessing on the user data to obtain characteristic data of the user to be identified, wherein the preprocessing comprises characteristic factor quantization, abnormal value processing and data cleaning; inputting the characteristic data of the user to be identified into a pre-trained user classification model to obtain the classification probability of the user to be identified; and if the classification probability of the user to be identified is greater than the user classification threshold, determining that the user to be identified is the target user.

In one embodiment, when implementing the image processing on the offline image data to obtain the offline data of the user to be identified, the processor is configured to implement:

preprocessing the offline image data, wherein the preprocessing comprises binarization, noise removal and inclination correction; performing layout analysis and character recognition on the preprocessed offline image data to obtain a recognition result; and determining offline data according to the recognition result.
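Two of the preprocessing steps named above, binarization and noise removal, can be sketched on a grayscale image held as a list of pixel rows. The fixed threshold of 128 and the isolated-pixel rule are assumptions made for illustration; the application does not specify the algorithms, and tilt correction is omitted here.

```python
def binarize(gray, threshold=128):
    """Map pixels darker than the threshold to 1 (ink) and all others to 0 (background)."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def remove_isolated_pixels(binary):
    """Clear ink pixels that have no ink among their 4-connected neighbours."""
    height, width = len(binary), len(binary[0])
    out = [row[:] for row in binary]
    for y in range(height):
        for x in range(width):
            if binary[y][x] != 1:
                continue
            neighbours = [(y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)]
            has_ink_neighbour = any(
                0 <= ny < height and 0 <= nx < width and binary[ny][nx] == 1
                for ny, nx in neighbours
            )
            if not has_ink_neighbour:
                out[y][x] = 0  # treat a lone dark pixel as noise
    return out
```

A production system would more likely rely on an image processing library with adaptive thresholding and filtering; the pure-Python version only shows the idea.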

In one embodiment, the processor is further configured to implement:

acquiring sample data of a sample user and carrying out data preprocessing on the sample data to obtain characteristic data of the sample user, wherein the characteristic data comprises a historical purchase record, a characteristic factor and a corresponding characteristic value of the sample user; determining a user identifier corresponding to the sample user according to the historical purchase record of the sample user, wherein the user identifier comprises a target user and a non-target user; and training a classification model according to the user identification, the characteristic factors and the characteristic values corresponding to the characteristic factors of the sample user, and taking the classification model obtained by training as a pre-trained user classification model.

In one embodiment, before implementing the training of the classification model according to the user identifications, the feature factors and the feature values corresponding to the feature factors of the sample users, the processor is further configured to implement:

classifying the characteristic factors, the corresponding characteristic values and the corresponding user identifications of the sample users by adopting a random forest algorithm so as to output weight values of the characteristic factors; when the weight value of the characteristic factor is larger than a factor classification threshold value, determining the characteristic factor as a target characteristic factor; screening the feature data of the sample user according to the target feature factor to obtain a feature value corresponding to the sample user and the target feature factor; when the processor is used for training the classification model according to the user identification, the characteristic factors and the characteristic values corresponding to the characteristic factors of the sample user, the processor is used for realizing that: and training a classification model according to the user identification of the sample user, the target characteristic factor and the characteristic value corresponding to the target characteristic factor.

In one embodiment, the processor is further configured to implement:

In one embodiment, the processor, when implementing the classifying the feature factors and the corresponding feature values of the sample users and the corresponding user identifiers by using the random forest algorithm to output the weight values of the feature factors, is configured to implement:

segmenting the characteristic factors, the corresponding characteristic values and the corresponding user identifications of the sample users by adopting a K-fold cross segmentation method to obtain a plurality of disjoint training subsets; performing random sampling with replacement on the training subsets multiple times by adopting a repeated sampling method to obtain a plurality of target training subsets, and taking the characteristic factors, the corresponding characteristic values and the corresponding user identifications of the sample users which are not drawn as out-of-bag data; obtaining a random forest model by adopting a random forest algorithm based on the characteristic factors and the plurality of target training subsets; and calculating the weight value of the characteristic factor by using the out-of-bag data based on the output result of the random forest model.

In one embodiment, before implementing the deriving a random forest model using a random forest algorithm based on the feature factors and the plurality of target training subsets, the processor is further configured to implement:

and carrying out sampling balance processing on the data in the target training subset to obtain a processed target training subset.
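The application does not say which sampling balance technique is used. As one possible illustration, the sketch below applies random undersampling, reducing every class to the size of the rarest class; the function name and this choice of technique are assumptions.

```python
import random


def balance_by_undersampling(rows, labels, rng):
    """Randomly undersample so that every class has as many rows as the rarest class."""
    by_class = {}
    for row, y in zip(rows, labels):
        by_class.setdefault(y, []).append(row)
    n = min(len(members) for members in by_class.values())

    balanced = []
    for y, members in by_class.items():
        # rng.sample draws n distinct rows of this class without replacement.
        for row in rng.sample(members, n):
            balanced.append((row, y))
    rng.shuffle(balanced)
    return [r for r, _ in balanced], [y for _, y in balanced]
```

Oversampling the minority class would be an equally plausible reading, and it keeps more of the majority-class data at the cost of repeated rows.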

The embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, where the computer program includes program instructions, and the processor executes the program instructions to implement any user identification method provided in the embodiment of the present application.

The computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiment, for example, a hard disk or a memory of the computer device. The computer readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the computer device.

While the invention has been described with reference to specific embodiments, the scope of the invention is not limited thereto, and those skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the invention. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A method for identifying a user, comprising:

acquiring offline image data and online data of a user to be identified;

2. The method according to claim 1, wherein the performing image processing on the offline image data to obtain offline data of the user to be identified comprises:

preprocessing the offline image data, wherein the preprocessing comprises binarization, noise removal and inclination correction;

performing layout analysis and character recognition on the preprocessed offline image data to obtain a recognition result;

and determining offline data according to the recognition result.

3. The user identification method according to claim 1, further comprising:

acquiring sample data of a sample user and carrying out data preprocessing on the sample data to obtain characteristic data of the sample user, wherein the characteristic data comprises a historical purchase record, a characteristic factor and a corresponding characteristic value of the sample user;

determining a user identifier corresponding to the sample user according to the historical purchase record of the sample user, wherein the user identifier comprises a target user and a non-target user;

and training a classification model according to the user identification, the characteristic factors and the characteristic values corresponding to the characteristic factors of the sample user, and taking the classification model obtained by training as a pre-trained user classification model.

4. The method according to claim 3, wherein before the training a classification model according to the user identifications, the feature factors and the feature values corresponding to the feature factors of the sample users, the method further comprises:

classifying the characteristic factors, the corresponding characteristic values and the corresponding user identifications of the sample users by adopting a random forest algorithm so as to output weight values of the characteristic factors;

when the weight value of the characteristic factor is larger than a factor classification threshold value, determining the characteristic factor as a target characteristic factor;

screening the feature data of the sample user according to the target feature factor to obtain a feature value corresponding to the sample user and the target feature factor;

the training of the classification model according to the user identification, the characteristic factors and the characteristic values corresponding to the characteristic factors of the sample user comprises:

and training a classification model according to the user identification of the sample user, the target characteristic factor and the characteristic value corresponding to the target characteristic factor.

5. The user identification method according to claim 3, further comprising:

obtaining the classification probability of the pre-trained user classification model for the sample user;

calculating, for each of a plurality of test thresholds, a corresponding confusion matrix;

calculating a plurality of Kolmogorov-Smirnov test values from the plurality of confusion matrices;

and taking the test threshold corresponding to the maximum value among the plurality of Kolmogorov-Smirnov test values as the user classification threshold.

6. The method for identifying users according to claim 4, wherein the classifying the characteristic factors and the corresponding characteristic values of the sample users and the corresponding user identifications by using a random forest algorithm to output the weight values of the characteristic factors comprises:

segmenting the characteristic factors, the corresponding characteristic values and the corresponding user identifications of the sample users by adopting a K-fold cross segmentation method to obtain a plurality of disjoint training subsets;

performing random sampling with replacement on the training subsets multiple times by adopting a repeated sampling method to obtain a plurality of target training subsets, and taking the characteristic factors, the corresponding characteristic values and the corresponding user identifications of the sample users which are not drawn as out-of-bag data;

obtaining a random forest model by adopting a random forest algorithm based on the characteristic factors and the plurality of target training subsets;

and calculating the weight value of the characteristic factor by using the out-of-bag data based on the output result of the random forest model.

7. The method as claimed in claim 6, wherein before the deriving a random forest model by using a random forest algorithm based on the feature factors and the plurality of target training subsets, the method further comprises:

8. A user identification device, comprising:

9. A computer device, wherein the computer device comprises a memory and a processor;

the memory is used for storing a computer program;

the processor for executing the computer program and implementing the user identification method according to any of claims 1 to 7 when executing the computer program.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to implement the user identification method according to any one of claims 1 to 7.