CN116244652A - User identification method and device, storage medium and electronic equipment - Google Patents

User identification method and device, storage medium and electronic equipment

Info

Publication number
CN116244652A
Authority
CN
China
Prior art keywords
data
target
user
model
sample set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310146495.8A
Other languages
Chinese (zh)
Inventor
张振轩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC filed Critical Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202310146495.8A priority Critical patent/CN116244652A/en
Publication of CN116244652A publication Critical patent/CN116244652A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/06Buying, selling or leasing transactions
    • G06Q30/0601Electronic shopping [e-shopping]
    • G06Q30/0631Item recommendations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Strategic Management (AREA)
  • Marketing (AREA)
  • Economics (AREA)
  • Development Economics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • General Business, Economics & Management (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application discloses a user identification method and device, a storage medium and electronic equipment, and relates to the technical field of artificial intelligence. The method comprises the following steps: obtaining a plurality of target data of a target user, wherein the target user is a user to be identified who has not purchased a target product, and the target data at least comprises: data representing identity characteristics of the target user and data representing behavioral characteristics of the target user; inputting the plurality of target data of the target user into a target model for recognition processing to obtain a recognition result for the target user, wherein the target model is a model obtained by learning and training a first model by adopting data processed by an oversampling algorithm, the recognition result indicates whether the target user is a first user, and the probability of a first user purchasing the target product is larger than a preset probability. By the method and the device, the problem of low accuracy in identifying potential users in the related technology is solved.

Description

User identification method and device, storage medium and electronic equipment
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a user identification method and apparatus, a storage medium, and an electronic device.
Background
5G commercialization is still at an early stage of development, and in practical applications the class distribution of 5G related data is often unbalanced; a data set whose classes differ greatly in size is called unbalanced data. This phenomenon reflects the objective development stage of current 5G services. In practice, however, operator staff are more inclined to focus on the situation described by the minority class of data, and want to make better use of the unbalanced historical data set to obtain a model with stronger classification capability. Therefore, the unbalanced data classification problem not only has important theoretical value, but also has strong practical value in the field of customer identification of operators.
Moreover, the related art identifies potential 5G users by directly inputting unbalanced data into an original classification model, so that the accuracy of identifying the potential 5G users is low.
No effective solution has yet been proposed for the problem of low accuracy in identifying potential users in the related art.
Disclosure of Invention
The main purpose of the application is to provide a user identification method and device, a storage medium and electronic equipment, so as to solve the problem of low accuracy of identifying potential users in the related technology.
In order to achieve the above object, according to one aspect of the present application, there is provided a user identification method. The method comprises the following steps: obtaining a plurality of target data of a target user, wherein the target user is a user to be identified, who does not purchase a target product, and the target data at least comprises: data representing an identity characteristic of the target user, data representing a behavioral characteristic of the target user; inputting a plurality of target data of the target user into a target model for recognition processing to obtain a recognition result for recognizing the target user, wherein the target model is a model obtained by learning and training a first model by adopting data processed by an oversampling algorithm, the recognition result indicates whether the target user is a first user or not, the probability of purchasing the target product by the first user is larger than a preset probability, and the first model is one of the following: linear model, bayesian predictive model, decision tree model, neural network model.
Further, the target model is obtained by: acquiring a plurality of first data, wherein the first data is data used for representing the characteristics of each second user, and the plurality of second users at least comprise: a user who has purchased a target product and a user who has not purchased a target product, the number of users who have purchased a target product being smaller than the number of users who have not purchased a target product, each second user being characterized by at least: identity characteristics of each second user, behavioral characteristics of each second user; obtaining a plurality of second data based on the oversampling algorithm and the plurality of first data, wherein the second data is data for representing the characteristics of each third user, and the third users are users in the users who have purchased target products; and learning and training the first model based on the first data and the second data to obtain the target model.
Further, based on the oversampling algorithm and the plurality of first data, obtaining a plurality of second data includes: summarizing a plurality of data of users who have purchased target products in the plurality of first data to obtain a first sample set; summarizing a plurality of data of users who do not purchase the target product in the plurality of first data to obtain a second sample set; and inputting the first sample set and the second sample set into the oversampling algorithm for processing to obtain the plurality of second data.
Further, inputting the first sample set and the second sample set into the oversampling algorithm for processing, and obtaining the plurality of second data includes: determining a third sample set from the first sample set according to the first sample set and the second sample set, wherein the difficulty of learning and training the third sample set is greater than a preset difficulty; acquiring the distribution density of each sample data in the third sample set; and weighting the sample data in the third sample set based on the distribution density of each sample data in the third sample set to obtain the plurality of second data.
Further, learning and training the first model based on the plurality of first data and the plurality of second data, and obtaining the target model includes: acquiring a plurality of third data in the plurality of second data, wherein the plurality of third data are data with influence degree smaller than a preset influence degree on an output result of the first model; deleting the plurality of third data from the plurality of second data to obtain a plurality of fourth data; and learning and training the first model by adopting the plurality of first data and the plurality of fourth data to obtain the target model.
Further, acquiring the plurality of first data includes: acquiring a plurality of original data representing characteristics of each second user; determining abnormal data in the plurality of original data, wherein the abnormal data at least comprises: missing data of numerical values and data needing to be converted; processing the abnormal data to obtain processed abnormal data; adding the processed abnormal data to the plurality of original data to obtain a plurality of fifth data; and performing amplification processing on the plurality of fifth data to obtain the plurality of first data.
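As a minimal sketch of the handling of abnormal data described above, the following Python fragment fills missing numerical values and converts categorical values into numbers. The mean imputation, the one-hot conversion, the function name `preprocess` and the field names are assumptions made for illustration; the application does not specify them, and the amplification processing is not shown.

```python
from statistics import mean

def preprocess(records, numeric_fields, categorical_fields):
    """Fill missing numeric values with the field mean and one-hot encode categories."""
    # Mean of the observed (non-missing) values of each numeric field.
    means = {}
    for f in numeric_fields:
        observed = [r[f] for r in records if r.get(f) is not None]
        means[f] = mean(observed) if observed else 0.0
    # Sorted category vocabulary per field, so the column order is deterministic.
    vocab = {f: sorted({r[f] for r in records if r.get(f) is not None})
             for f in categorical_fields}
    rows = []
    for r in records:
        row = [float(r[f]) if r.get(f) is not None else float(means[f])
               for f in numeric_fields]
        for f in categorical_fields:
            row.extend(1.0 if r.get(f) == c else 0.0 for c in vocab[f])
        rows.append(row)
    return rows

# Hypothetical user records; None marks a missing value.
records = [
    {"age": 30, "monthly_data_gb": 12.0, "plan": "4G"},
    {"age": None, "monthly_data_gb": 20.0, "plan": "4G-plus"},
    {"age": 50, "monthly_data_gb": None, "plan": "4G"},
]
features = preprocess(records, ["age", "monthly_data_gb"], ["plan"])
```

In this sketch the missing age is replaced by 40.0 and the missing data volume by 16.0, and the plan field becomes two indicator columns.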
Further, if the number of the target users is a plurality of, after the target data of the target users is input into the target model for recognition processing, the method further comprises: determining the purchase rate of each target user for purchasing the target product; pushing the target products according to the purchase rate of each target user purchasing the target products.
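One possible reading of pushing the target product according to the purchase rate is to rank the target users by that rate and push to the highest ranked users first. The sketch below assumes this reading; the function name `rank_for_push`, the `top_n` cut-off and the example rates are hypothetical.

```python
def rank_for_push(purchase_rates, top_n):
    """Return the top_n user ids ordered by descending purchase rate."""
    # Ties are broken by user id so that the order is deterministic.
    ranked = sorted(purchase_rates.items(), key=lambda kv: (-kv[1], kv[0]))
    return [user_id for user_id, _ in ranked[:top_n]]

# Hypothetical purchase rates output for four target users.
rates = {"u1": 0.12, "u2": 0.81, "u3": 0.47, "u4": 0.66}
push_list = rank_for_push(rates, top_n=2)
```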
Further, the target product is a mobile phone 5G package.
In order to achieve the above object, according to another aspect of the present application, there is provided an identification device of a user. The device comprises: the first acquisition unit is used for acquiring a plurality of target data of target users, wherein the target users are users to be identified, who do not purchase target products, and the target data at least comprise: data representing an identity characteristic of the target user, data representing a behavioral characteristic of the target user; the first processing unit is used for inputting a plurality of target data of the target user into a target model for recognition processing to obtain a recognition result for recognizing the target user, wherein the target model is a model obtained by learning and training a first model by adopting data processed by an oversampling algorithm, the recognition result indicates whether the target user is a first user or not, the probability of purchasing the target product by the first user is larger than a preset probability, and the first model is one of the following: linear model, bayesian predictive model, decision tree model, neural network model.
Further, the target model is obtained by: a second acquisition unit configured to acquire a plurality of first data, where the first data is data representing a feature of each second user, and the plurality of second users at least include: a user who has purchased a target product and a user who has not purchased a target product, the number of users who have purchased a target product being smaller than the number of users who have not purchased a target product, each second user being characterized by at least: identity characteristics of each second user, behavioral characteristics of each second user; a first determining unit, configured to obtain a plurality of second data based on the oversampling algorithm and the plurality of first data, where the second data is data for representing a feature of each third user, and the third user is a user among users who have purchased the target product; and the first training unit is used for learning and training the first model based on the plurality of first data and the plurality of second data to obtain the target model.
Further, the first determination unit includes: the first summarizing module is used for summarizing a plurality of data of the users who have purchased the target products in the plurality of first data to obtain a first sample set; the second summarizing module is used for summarizing a plurality of data of users who do not purchase the target product in the plurality of first data to obtain a second sample set; and the first processing module is used for inputting the first sample set and the second sample set into the oversampling algorithm for processing to obtain the plurality of second data.
Further, the first processing module includes: a first determining submodule, configured to determine a third sample set from the first sample set according to the first sample set and the second sample set, where a difficulty of learning training on the third sample set is greater than a preset difficulty; a first obtaining submodule, configured to obtain a distribution density of each sample data in the third sample set; and the first processing submodule is used for carrying out weighting processing on the sample data in the third sample set based on the distribution density of each sample data in the third sample set to obtain the plurality of second data.
Further, the first training unit includes: the first acquisition module is used for acquiring a plurality of third data in the plurality of second data, wherein the plurality of third data are data with influence degree smaller than a preset influence degree on an output result of the first model; the first deleting module is used for deleting the plurality of third data from the plurality of second data to obtain a plurality of fourth data; and the first training module is used for learning and training the first model by adopting the plurality of first data and the plurality of fourth data to obtain the target model.
Further, the second acquisition unit includes: a second acquisition module for acquiring a plurality of raw data representing a feature of each second user; the first determining module is configured to determine abnormal data in the plurality of original data, where the abnormal data at least includes: missing data of numerical values and data needing to be converted; the second processing module is used for processing the abnormal data to obtain processed abnormal data; the first adding module is used for adding the processed abnormal data into the plurality of original data to obtain a plurality of fifth data; and the third processing module is used for performing amplification processing on the plurality of fifth data to obtain the plurality of first data.
Further, if the number of the target users is plural, the apparatus further includes: a second determining unit configured to determine a purchase rate of each target user purchasing a target product after inputting a plurality of target data of the target user into a target model for identification processing; and the first pushing unit is used for pushing the target products according to the purchase rate of each target user for purchasing the target products.
Further, the target product is a mobile phone 5G package.
In order to achieve the above object, according to another aspect of the present application, there is provided a computer-readable storage medium storing a program, wherein the program performs the user identification method of any one of the above.
To achieve the above object, according to another aspect of the present application, there is provided an electronic device including one or more processors and a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method for identifying a user as set forth in any one of the above.
Through the application, the following steps are adopted: obtaining a plurality of target data of a target user, wherein the target user is a user to be identified who has not purchased a target product, and the target data at least comprises: data representing identity characteristics of the target user and data representing behavioral characteristics of the target user; inputting the plurality of target data of the target user into a target model for recognition processing to obtain a recognition result for the target user, wherein the target model is a model obtained by learning and training a first model by adopting data processed by an oversampling algorithm, the recognition result indicates whether the target user is a first user, the probability of a first user purchasing the target product is larger than a preset probability, and the first model is one of the following: a linear model, a Bayesian prediction model, a decision tree model, a neural network model. This solves the problem of low accuracy in identifying potential users in the related technology. By obtaining a plurality of data representing the characteristics of a user, inputting those data into a model obtained by learning and training an original model with data processed by an oversampling algorithm, and identifying whether the user is a potential user whose probability of purchasing the target product is larger than the preset probability, the effect of improving the accuracy of identifying potential users is achieved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application, illustrate and explain the application and are not to be construed as limiting the application. In the drawings:
FIG. 1 is a flow chart of a user identification method provided according to an embodiment of the present application;
FIG. 2 is a schematic diagram of filtering an original minority class sample according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a majority class sample of a boundary region nearest to a minority class sample in an embodiment of the present application;
FIG. 4 is a schematic diagram of a minority class sample of a border area nearest to the majority class sample in an embodiment of the present application;
FIG. 5 is a flow chart of an alternative user identification method provided in accordance with an embodiment of the present application;
FIG. 6 is a schematic diagram of a user identification device provided in accordance with an embodiment of the present application;
FIG. 7 is a schematic diagram of an electronic device provided according to an embodiment of the present application.
Detailed Description
It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other. The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
In order that those skilled in the art may better understand the solution of the present application, the technical solutions in the embodiments of the present application are described in detail below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by one of ordinary skill in the art based on the embodiments herein without inventive effort shall fall within the scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate in order to describe the embodiments of the present application described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be noted that, the user information (including, but not limited to, user equipment information, user personal information, etc.) and the data (including, but not limited to, data for analysis, stored data, presented data, data for representing identity characteristics of a user, data for representing behavioral characteristics of a user, etc.) referred to in the present application are information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the relevant data is required to comply with relevant laws and regulations and standards of relevant countries and regions, and is provided with corresponding operation entries for the user to select authorization or rejection.
The present invention is described below with reference to preferred implementation steps. FIG. 1 is a flowchart of a user identification method according to an embodiment of the present application. As shown in FIG. 1, the method includes the following steps:
Step S101, obtaining a plurality of target data of a target user, where the target user is a user to be identified who has not purchased a target product, and the target data at least includes: data representing the identity characteristics of the target user and data representing the behavioral characteristics of the target user.
For example, based on a big data platform of a telecom operator, data (a plurality of target data described above) such as basic information of a user, a 5G infrastructure condition, various behavioral characteristics of the user and the like can be obtained, and the target product can be a 5G package of a mobile phone, and the user can be a user to be identified who does not purchase the 5G package product.
Step S102, inputting a plurality of target data of a target user into a target model for recognition processing to obtain a recognition result for recognizing the target user, wherein the target model is a model obtained by learning and training a first model by adopting data processed by an oversampling algorithm, the recognition result indicates whether the target user is a first user, the probability of purchasing a target product by the first user is larger than a preset probability, and the first model is one of the following: linear model, bayesian predictive model, decision tree model, neural network model.
For example, the obtained data of a user who has not purchased the 5G package product, such as the basic information of the user, the 5G infrastructure condition and various behavior characteristics of the user, are input into a newly trained model (the target model described above) to identify whether that user is a potential 5G user. The newly trained model may be a model obtained by learning and training the original model (the first model described above) with data processed by an oversampling algorithm. Oversampling refers to increasing the number of samples of a certain class in the training set to reduce class imbalance; that is, oversampling can increase the number of minority class samples.
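To illustrate what oversampling means in its simplest form, the following sketch duplicates randomly chosen minority class samples until both classes have the same size. This is plain random oversampling, shown only for illustration; it is not the improved oversampling algorithm of the embodiment, and the function name and the sample values are hypothetical.

```python
import random

def random_oversample(minority, majority, seed=0):
    """Duplicate randomly chosen minority samples until both classes are the same size."""
    rng = random.Random(seed)
    missing = max(0, len(majority) - len(minority))
    extra = [rng.choice(minority) for _ in range(missing)]
    return minority + extra

# Hypothetical feature vectors: 2 purchasers (minority) and 6 non-purchasers (majority).
minority = [(0.9, 0.8), (0.7, 0.95)]
majority = [(0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.15, 0.4), (0.25, 0.05), (0.05, 0.3)]
balanced_minority = random_oversample(minority, majority)
```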
Through the steps S101 to S102, by acquiring a plurality of data for representing the features of the user, inputting the plurality of data for representing the features of the user into the model obtained by learning and training the original model by using the data processed by the oversampling algorithm, and identifying whether the user is a potential user with a probability of purchasing the target product greater than a preset probability, the effect of improving the accuracy of identifying the potential user is achieved.
Optionally, in the method for identifying a user provided in the embodiment of the present application, inputting the first sample set and the second sample set into an oversampling algorithm for processing, and obtaining a plurality of second data includes: determining a third sample set from the first sample set according to the first sample set and the second sample set, wherein the difficulty of learning and training the third sample set is greater than the preset difficulty; acquiring the distribution density of each sample data in the third sample set; and weighting the sample data in the third sample set based on the distribution density of each sample data in the third sample set to obtain a plurality of second data.
For example, the first sample set may be a minority class sample set and the second sample set may be a majority class sample set. Before the first sample set and the second sample set are input into the oversampling algorithm for processing, the oversampling algorithm needs to be constructed. In the embodiment of the application, the improved oversampling algorithm may be constructed as follows:
First, when training the model, the information provided by the minority class samples in the boundary region is more important for improving classification performance, because the minority class samples in the boundary region are more difficult to learn. Therefore, the minority class samples in the boundary region that are more difficult to learn can be found first, and new samples can be synthesized from them. In addition, since the distribution density differs between boundary regions, the number of new samples generated from each boundary sample needs to be determined according to the distribution density.
Moreover, the basic steps of the BADW-SMOTE algorithm (the improved oversampling algorithm) in the embodiments of the present application are divided into two stages. In the first stage, the algorithm follows a rule to find the hard-to-learn minority class samples in the original minority class sample set D_min and forms them into a set D_imin. In the second stage, the minority class samples in D_imin are weighted based on the sample distribution density, a certain number of samples are selected to synthesize new samples, and the new samples are placed in D_min.
The main steps of the improved oversampling algorithm are described as follows.
The algorithm may be:
BADW-SMOTE(D_maj, D_min, N, k1, min_k1, max_k2, k2, max_k3, k3, k4);
The inputs of the algorithm are as follows:
(1) D_maj: the majority class sample set;
(2) D_min: the minority class sample set;
(3) N: the total number of new samples to synthesize;
(4) k1: the neighbor value used to identify noise;
(5) min_k1: the determination threshold of noise;
(6) max_k2: the maximum neighbor search value of a minority class sample when generating the majority class set;
(7) k2: the neighbor value of a minority class sample when generating the majority class set;
(8) max_k3: the maximum neighbor search value of a majority class sample when generating the minority class set;
(9) k3: the neighbor value of a majority class sample when generating the minority class set;
(10) k4: the neighbor value used to calculate the minority class distribution density weight.
The procedure starts:
First stage: identification of the hard-to-learn minority class samples in the boundary region;
(1) For each minority class sample x_i in D_min, obtain its k1 nearest neighbors to form the set min_k1_NN(x_i); that is, min_k1_NN(x_i) contains the k1 samples closest to x_i.
(2) Remove the minority class samples for which the number of minority class samples in min_k1_NN(x_i) is lower than min_k1, to obtain the minority class sample set D_minf after noise removal:
D_minf = D_min - {x_i ∈ D_min : the number of minority class samples in min_k1_NN(x_i) is less than min_k1}
(3) For each minority class sample x_i in D_minf, search the samples within its max_k2 range for the set N_maj(x_i) of its k2 nearest majority class samples; N_maj(x_i) contains the k2 majority class samples with the smallest Euclidean distance to x_i that lie within the max_k2 neighbors of x_i.
(4) Merge the N_maj(x_i) sets to obtain the hard-to-learn majority class sample set at the decision boundary:
$$D_{bmaj} = \bigcup_{x_i \in D_{minf}} N_{maj}(x_i)$$
(5) For each majority class sample y_i in D_bmaj, search the samples within its max_k3 range for the set N_min(y_i) of its k3 nearest minority class samples; N_min(y_i) contains the k3 minority class samples with the smallest Euclidean distance to y_i that lie within the max_k3 neighbors of y_i.
(6) Merge the N_min(y_i) sets to obtain the hard-to-learn minority class sample set D_imin at the decision boundary:
$$D_{imin} = \bigcup_{y_i \in D_{bmaj}} N_{min}(y_i)$$
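The first stage can be sketched in Python as follows. This is an illustrative reading rather than a definitive implementation, and it rests on assumptions not stated in the text: label 1 marks the minority class and label 0 the majority class; "within the max_k2 range" and "within the max_k3 range" are interpreted as "among the max_k2 (max_k3) nearest neighbors"; after noise removal, removed samples are excluded from later neighbor searches. The names `knn` and `stage_one` and the sample coordinates are hypothetical.

```python
import math

def knn(idx, points, pool, k):
    """Indices from pool of the k points nearest to points[idx], excluding idx itself."""
    others = [j for j in pool if j != idx]
    others.sort(key=lambda j: math.dist(points[idx], points[j]))
    return others[:k]

def stage_one(points, labels, k1, min_k1, max_k2, k2, max_k3, k3):
    """Return the sorted indices of the boundary minority samples (D_imin)."""
    everyone = list(range(len(points)))
    minority = [i for i in everyone if labels[i] == 1]
    majority = [i for i in everyone if labels[i] == 0]
    # Steps (1)-(2): keep minority samples with at least min_k1 minority
    # samples among their k1 nearest neighbours (noise removal, D_minf).
    d_minf = [i for i in minority
              if sum(labels[j] == 1 for j in knn(i, points, everyone, k1)) >= min_k1]
    pool = majority + d_minf  # assumption: removed noise is ignored from here on
    # Steps (3)-(4): union of the k2 nearest majority samples found among
    # the max_k2 nearest neighbours of each filtered minority sample (D_bmaj).
    d_bmaj = set()
    for i in d_minf:
        near_majority = [j for j in knn(i, points, pool, max_k2) if labels[j] == 0]
        d_bmaj.update(near_majority[:k2])
    # Steps (5)-(6): union of the k3 nearest minority samples found among
    # the max_k3 nearest neighbours of each boundary majority sample (D_imin).
    d_imin = set()
    for j in sorted(d_bmaj):
        near_minority = [i for i in knn(j, points, pool, max_k3) if labels[i] == 1]
        d_imin.update(near_minority[:k3])
    return sorted(d_imin)

# Hypothetical samples on a line: indices 0-3 majority, 4-6 minority,
# index 7 an isolated minority sample surrounded by majority samples.
xs = [0.0, 1.0, 2.1, 3.3, 4.0, 4.6, 5.5, 0.4]
points = [(x, 0.0) for x in xs]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
d_imin = stage_one(points, labels, k1=3, min_k1=1, max_k2=3, k2=1, max_k3=3, k3=2)
```

With these values the sketch returns the indices [4, 5]: sample 7 is removed as noise, and sample 6 is left out because it lies farther from the decision boundary.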
Second stage: weighting the samples based on the distribution density;
(1) For each minority class sample x_i in D_imin, calculate the distance d_ij between x_i and x_j (x_j ∈ D_imin), and obtain the k4 nearest neighbors of x_i.
(2) Calculate the distribution density of sample x_i:
[Formula image BDA0004089322900000091: distribution density of x_i]
(3) Calculate the density weight of sample x_i:
[Formula image BDA0004089322900000092: density weight of x_i]
(4) Do for j = 1, …, N:
(a) from the minority class sample set D_imin, select one sample x_i according to the density weight, and randomly select a sample y from the k4 neighbors of x_i;
(b) generate a new sample s according to the formula s = x_i + r × (y - x_i), where r is a random number between 0 and 1;
(c) put s into the set D_mino: D_mino = D_mino ∪ {s}.
(5) End of the loop.
Output: the new minority class sample set D_mino obtained by BADW-SMOTE processing.
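The synthesis loop of the second stage can be sketched as follows. Because the density and density weight formulas are not reproduced in this text, the sketch takes the density weights as an input instead of computing them; the weights shown, the function name `synthesize` and the sample coordinates are hypothetical.

```python
import math
import random

def synthesize(d_imin, weights, n_new, k4, seed=0):
    """Create n_new synthetic samples by weighted seed selection and interpolation."""
    rng = random.Random(seed)
    new_samples = []
    for _ in range(n_new):
        # Select a seed sample x with probability proportional to its density weight.
        i = rng.choices(range(len(d_imin)), weights=weights, k=1)[0]
        x = d_imin[i]
        # The k4 nearest neighbours of x inside d_imin (x itself excluded).
        neighbours = sorted((j for j in range(len(d_imin)) if j != i),
                            key=lambda j: math.dist(x, d_imin[j]))[:k4]
        y = d_imin[rng.choice(neighbours)]
        r = rng.random()  # random number in [0, 1)
        # s = x + r * (y - x), applied coordinate by coordinate.
        new_samples.append(tuple(a + r * (b - a) for a, b in zip(x, y)))
    return new_samples

d_imin = [(4.0, 1.0), (4.6, 1.2), (5.0, 0.8)]
weights = [0.5, 0.3, 0.2]  # hypothetical density weights, one per sample
d_mino = synthesize(d_imin, weights, n_new=5, k4=2)
```

Each synthetic sample lies on the line segment between a selected sample and one of its neighbours, so it stays inside the region already occupied by the boundary minority samples.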
The BADW-SMOTE method in the embodiment of the application is based on BN-SMOTE (an oversampling algorithm for unbalanced data) and ADASYN (adaptive synthetic sampling), and improves on the way these two oversampling algorithms process minority class samples that are difficult to learn.
In addition, for a sample point x_i, the set NN(x_i) contains the k sample points nearest to x_i. Both majority class samples and minority class samples can enter the set NN(x_i). The set N_maj(x_i) contains only majority class samples; similarly, the set N_min(x_i) contains only minority class samples.
D_imin is obtained through the following steps:
(1) Remove noise. For each minority class sample x_i in D_min, compute its k1-nearest-neighbor set min_k1_NN(x_i), which contains the k1 samples nearest to x_i. Screening yields the filtered minority class sample set D_minf.
In addition, FIG. 2 is a schematic diagram of the filtering operation on the original minority class samples in the embodiment of the present application. As FIG. 2 shows, only one of the 5 sample points nearest to sample point A, namely B, is a minority class sample, and this number is less than min_k1 = 2. The algorithm therefore determines that A is a noise sample and removes it from the set D_min. Similarly, B is also a noise sample and is removed.
(2) For each minority class sample x_i in D_minf, search among the samples within its max_k2 range for the set N_maj(x_i) of its k2 nearest majority class samples. N_maj(x_i) contains the k2 majority class samples nearest to x_i that also lie within the max_k2 nearest neighbors of x_i. Merging all N_maj(x_i) sets yields the set of majority class samples located at the decision boundary. FIG. 3 is a schematic diagram of the majority class samples in the boundary region nearest to the minority class samples in the embodiment of the present application; as shown in FIG. 3, assuming k2 = 3, the majority class sample set D_bmaj nearest to the minority class samples at the decision boundary can be obtained.
(3) For each majority class sample y_i in D_bmaj, search among the samples within its max_k3 range for the set N_min(y_i) of its k3 nearest minority class samples. N_min(y_i) contains the k3 minority class samples nearest to y_i that also lie within the max_k3 nearest neighbors of y_i. Merging all N_min(y_i) sets yields the minority class sample set D_imin, which contains the important information at the decision boundary. The larger k3 is, the more minority class samples near the decision boundary D_imin contains and the more important classification information it provides to the model, although this also increases the computational complexity of model training. FIG. 4 is a schematic diagram of the minority class samples in the boundary region nearest to the majority class samples in the embodiment of the present application; as shown in FIG. 4, when k3 = 3, the target minority class sample set D_imin of the first stage of the algorithm can be obtained.
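The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions: neighbors are searched in the pool formed by the (filtered) minority samples and all majority samples, a minority sample is treated as noise when fewer than min_k1 of its k1 nearest neighbors are minority samples (the rule suggested by FIG. 2), and all sample points are distinct. The function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def ranked_neighbors(x, pool):
    """Labeled samples (point, label) in pool other than x, nearest first."""
    return sorted((item for item in pool if item[0] != x),
                  key=lambda item: euclidean(x, item[0]))


def filter_noise(d_min, d_maj, k1, min_k1):
    """Step (1): drop minority samples with too few minority neighbors."""
    pool = [(p, "min") for p in d_min] + [(p, "maj") for p in d_maj]
    kept = []
    for x in d_min:
        nearest = ranked_neighbors(x, pool)[:k1]
        if sum(1 for _, label in nearest if label == "min") >= min_k1:
            kept.append(x)
    return kept


def nearest_of_class(x, pool, window, k, wanted):
    """Up to k samples of class `wanted` among the `window` nearest to x."""
    in_window = ranked_neighbors(x, pool)[:window]
    return [p for p, label in in_window if label == wanted][:k]


def boundary_minority(d_min, d_maj, k1, min_k1, k2, max_k2, k3, max_k3):
    """Return (D_minf, D_bmaj, D_imin) for the first stage."""
    d_minf = filter_noise(d_min, d_maj, k1, min_k1)
    pool = [(p, "min") for p in d_minf] + [(p, "maj") for p in d_maj]
    d_bmaj = []  # step (2): union of N_maj(x_i)
    for x in d_minf:
        for y in nearest_of_class(x, pool, max_k2, k2, "maj"):
            if y not in d_bmaj:
                d_bmaj.append(y)
    d_imin = []  # step (3): union of N_min(y_i)
    for y in d_bmaj:
        for x in nearest_of_class(y, pool, max_k3, k3, "min"):
            if x not in d_imin:
                d_imin.append(x)
    return d_minf, d_bmaj, d_imin
```

With one-dimensional toy data, an isolated minority point surrounded by majority points is filtered out in step (1), and only the minority point adjacent to the class boundary ends up in D_imin.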
With this scheme, the improved oversampling algorithm can quickly and accurately obtain a certain number of samples from which new samples are synthesized.
Optionally, in the method for identifying a user provided in the embodiment of the present application, obtaining a plurality of second data based on an oversampling algorithm and a plurality of first data includes: summarizing a plurality of data of users who have purchased target products in the plurality of first data to obtain a first sample set; summarizing a plurality of data of users who do not purchase the target product in the plurality of first data to obtain a second sample set; the first sample set and the second sample set are input into an oversampling algorithm for processing, and a plurality of second data are obtained.
Optionally, in the method for identifying a user provided in the embodiment of the present application, the target product is a 5G package of a mobile phone.
For example, the target product may be a 5G package of a mobile phone, the first sample set may be a minority sample set, and the second sample set may be a majority sample set. The basic information of the user who has purchased the 5G package, the 5G infrastructure condition, various behavior characteristics of the user and the like are taken as a minority sample set, the basic information of the user who has not purchased the 5G package, the 5G infrastructure condition, various behavior characteristics of the user and the like are taken as a majority sample set, and the minority sample set and the majority sample set are input into an improved oversampling algorithm to obtain a certain number of samples (a plurality of second data). For example, the number of samples of the users who purchase the 5G package in the minority sample set is 10, and the number of samples of the users who do not purchase the 5G package in the majority sample set is 90, after the sample data is input into the improved oversampling algorithm, 5 users who purchase the 5G package in the minority sample set are obtained, and the data of basic information, 5G infrastructure condition, various behavior characteristics of the users and the like of the 5 users are used as the second data.
According to the scheme, the characteristic data of a certain number of users can be quickly and accurately selected from a few sample sets according to the improved oversampling algorithm.
Optionally, in the method for identifying a user provided in the embodiment of the present application, acquiring a plurality of first data includes: acquiring a plurality of original data representing characteristics of each second user; determining abnormal data in the plurality of original data, wherein the abnormal data at least comprises: missing data of numerical values and data needing to be converted; processing the abnormal data to obtain processed abnormal data; adding the processed abnormal data into a plurality of original data to obtain a plurality of fifth data; and performing amplification processing on the fifth data to obtain first data.
For example, data collection is first performed, data such as basic information of the user, 5G infrastructure conditions, various behavioral characteristics of the user, etc. are collected, and the collected data are shown in table 1.
TABLE 1
Figure BDA0004089322900000111
Figure BDA0004089322900000121
The collected data are then preprocessed: exploratory analysis is performed on the data, outliers are screened, missing values are filled, and field variables are converted.
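As a minimal sketch of the preprocessing just described, the following fills missing numerical values and converts a categorical field into integer codes. Mean filling and first-seen integer coding are illustrative assumptions; the embodiment does not specify which filling or conversion method is used.

```python
def fill_missing_with_mean(values):
    """Replace None entries with the mean of the values that are present."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]


def encode_categories(values):
    """Map each distinct category to an integer code in first-seen order."""
    codes = {}
    return [codes.setdefault(v, len(codes)) for v in values]
```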
Feature transformation is then performed on the preprocessed data. Specifically, because the original features sometimes cannot reflect deeper patterns, new features can be constructed by combining specific business knowledge and experience, for example by creating an interaction term between two strongly correlated features such as gender and terminal brand.
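An interaction term of this kind can be sketched as follows. The field names `gender` and `terminal_brand` and the separator are hypothetical; the sketch simply crosses two categorical fields into one new feature.

```python
def add_interaction(record, left, right, sep="_x_"):
    """Return a copy of record with a crossed feature built from two fields."""
    out = dict(record)
    out[f"{left}{sep}{right}"] = f"{record[left]}{sep}{record[right]}"
    return out


# Hypothetical user record used only for illustration.
user = {"gender": "F", "terminal_brand": "BrandA"}
crossed = add_interaction(user, "gender", "terminal_brand")
```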
With this scheme, the model being trained can mine classification information more deeply.
Optionally, in the method for identifying a user provided in the embodiment of the present application, the target model is obtained by: acquiring a plurality of first data, wherein the first data is data representing the characteristics of each second user, the plurality of second users at least include users who have purchased the target product and users who have not purchased the target product, the number of users who have purchased the target product is smaller than the number of users who have not purchased the target product, and the characteristics of each second user at least include: identity characteristics of each second user and behavioral characteristics of each second user; obtaining a plurality of second data based on an oversampling algorithm and the plurality of first data, wherein the second data is data representing the characteristics of each third user, and the third users are users among the users who have purchased the target product; and learning and training the first model based on the plurality of first data and the plurality of second data to obtain the target model.
For example, when the data are imbalanced, BADW-SMOTE (the improved oversampling algorithm) may be applied and, under a specific classification algorithm, the values of the parameters k1, min_k1, max_k2, k2, max_k3, k3, k4 and N may be adjusted repeatedly to find the values that give the best classification effect within a local parameter space. For example, if there are 10 samples of users who have purchased the 5G package and 90 samples of users who have not purchased the 5G package, the sample data are input into the improved oversampling algorithm to obtain 5 users who have purchased the 5G package. The original model (a linear model, Bayesian model, tree model, neural network, etc.) is then trained using the basic information, 5G infrastructure conditions, behavior characteristics and other data of these 5 users (the second data above), together with the sample data of the 10 users who have purchased the 5G package and the 90 users who have not, to finally obtain the target model. Here, the plurality of first data may be the collected sample data of the 10 users who have purchased the 5G package and the 90 users who have not purchased the 5G package.
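The parameter adjustment described above can be sketched as an exhaustive search over a small grid. The callable `evaluate` is a hypothetical stand-in for "oversample with these parameters, train the chosen classifier, and return a classification score"; how that score is computed is not specified in the embodiment.

```python
import itertools


def search_parameters(grid, evaluate):
    """Try every parameter combination in grid and keep the best-scoring one.

    grid maps a parameter name (e.g. "k1", "k4") to a list of candidate values.
    evaluate(params) is assumed to return a score where larger is better.
    """
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for combo in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

Because the grid covers only the listed candidate values, the result is the best value in that local parameter space, not a global optimum.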
In addition, the data set contains many behavioral features, and user behavior data are very sparse; for example, only a small proportion of users use a given app (application program). For features such as the number of days a user is active on an app, Euclidean distance is therefore a poor measure of user similarity and does not reflect user preferences well.
Therefore, when sample similarity is analyzed for behavior data, the cosine distance and the Jaccard similarity coefficient can be used to measure user similarity. The cosine distance and the Jaccard similarity coefficient can each serve as the distance measurement function when nearest-neighbor samples are computed in the BADW-SMOTE algorithm, so that the algorithm is adapted to similarity analysis of behavior data.
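The two measures can be sketched as follows for vectors of per-app active days. Treating an app as "used" when its active-day count is positive, and returning a cosine distance of 1.0 for an all-zero vector, are assumptions made for the illustration.

```python
import math


def cosine_distance(a, b):
    """1 minus the cosine of the angle between vectors a and b."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # assumption: an inactive user is maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def jaccard_similarity(a, b):
    """Shared used apps divided by all apps used by either user."""
    used_a = {i for i, v in enumerate(a) if v > 0}
    used_b = {i for i, v in enumerate(b) if v > 0}
    if not used_a and not used_b:
        return 0.0
    return len(used_a & used_b) / len(used_a | used_b)
```

Unlike Euclidean distance, neither measure is dominated by the many apps that both users never open, which is why they suit sparse behavior data.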
Moreover, model training may use ten-fold cross-validation, which makes the results more reliable as a reference. Depending on the number of samples and the optimization results, various types of models can be tried, such as linear models, Bayesian models, tree models and neural networks.
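Ten-fold cross-validation can be sketched as an index splitter that yields ten train/test partitions. The shuffling seed and the function name `k_fold_indices` are illustrative assumptions; any model of the types listed above could be trained on each training partition.

```python
import random


def k_fold_indices(n_samples, n_folds=10, seed=0):
    """Yield (train_indices, test_indices) for each of n_folds folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into n_folds folds.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train_idx, test_idx
```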
Through the scheme, the method can be combined with various classification models, and classification model parameters can be customized.
Optionally, in the method for identifying a user provided in the embodiment of the present application, learning and training the first model based on the plurality of first data and the plurality of second data, and obtaining the target model includes: acquiring a plurality of third data in the plurality of second data, wherein the plurality of third data are data with influence degree smaller than a preset influence degree on an output result of the first model; deleting a plurality of third data from the plurality of second data to obtain a plurality of fourth data; and learning and training the first model by adopting a plurality of first data and a plurality of fourth data to obtain a target model.
For example, the number of features in the new data set increases greatly after feature engineering. An excessively high dimension easily leads to the curse of dimensionality, which hinders subsequent modeling analysis, and too many features cause redundancy and consume training time; the data are therefore input into the model for training only after feature screening. The fourth data are the data that have undergone feature screening.
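Feature screening against a preset influence degree can be sketched as follows. The influence scores are assumed to be supplied by some external procedure (for example, a feature-importance estimate); how they are computed is not specified in the embodiment, and the names below are hypothetical.

```python
def screen_features(rows, influence, min_influence):
    """Keep only the features whose influence score reaches the threshold.

    rows is a list of dicts mapping feature name to value; influence maps
    feature name to a score (assumed given). Returns (screened_rows, kept).
    """
    kept = [name for name, score in influence.items() if score >= min_influence]
    return [{name: row[name] for name in kept} for row in rows], kept
```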
Through the scheme, the model is trained by utilizing the data subjected to feature screening, so that training time can be conveniently saved, and the accuracy of the training model is improved.
Optionally, in the method for identifying a user provided in the embodiment of the present application, if the number of target users is multiple, after the multiple target data of the target users are input into the target model for identification processing, the method further includes: determining the purchase rate of each target user for purchasing the target product; pushing the target products according to the purchase rate of each target user purchasing the target products.
For example, the output of the model may be the probability that each user converts to a 5G package customer, ranked from high to low. Marketers or an intelligent system may use this information to start personalized recommendation through channels such as the operator's mobile app, e-commerce platforms, short messages or telephone calls.
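The ranking step can be sketched as follows, assuming the model output is available as a mapping from a user identifier to a purchase probability. The identifiers and the threshold value are hypothetical.

```python
def rank_targets(probabilities, preset_probability):
    """Users whose probability exceeds the threshold, highest first."""
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    return [(uid, p) for uid, p in ranked if p > preset_probability]


# Hypothetical model outputs used only for illustration.
push_list = rank_targets({"u1": 0.2, "u2": 0.9, "u3": 0.6}, 0.5)
```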
Through the scheme, the product can be quickly and accurately pushed to the user.
For example, fig. 5 is a flowchart of an alternative user identification method provided according to an embodiment of the present application, and as shown in fig. 5, the alternative user identification method includes the following steps:
(1) Data collection:
A data acquisition device can be used to collect the original data, which may include basic information of the user, 5G infrastructure conditions, various behavior characteristics of the user, and the like. Characteristic behavior data are extracted using a log processing device.
(2) Data preprocessing:
Exploratory analysis is performed on the data, outliers are screened, missing values are filled, and field variables are converted.
(3) Feature construction:
The original features sometimes cannot reflect deeper patterns, so new features can be constructed by combining specific business knowledge and experience to help the model being trained mine classification information more deeply, for example by creating an interaction term between two strongly correlated features, such as a cross of gender and terminal brand. That is, this feature construction step corresponds to the step of feature engineering combined with the business system in FIG. 5.
(4) Imbalance processing:
BADW-SMOTE is used for processing, and the values of the parameters k1, min_k1, max_k2, k2, max_k3, k3, k4 and N are adjusted repeatedly under the CART decision tree algorithm to find the values that give the best classification effect within a local parameter space.
The data set also contains many behavioral features, and user behavior data are very sparse; for example, only a small proportion of users use a given app. For features such as the number of days a user is active on an app, Euclidean distance is therefore a poor measure of user similarity and does not reflect user preferences well.
Therefore, when sample similarity is analyzed for behavior data, the cosine distance and the Jaccard similarity coefficient can be used to measure user similarity, and either can serve as the distance measurement function when nearest-neighbor samples are computed in the BADW-SMOTE algorithm, so that the algorithm is adapted to similarity analysis of behavior data.
(5) Feature screening:
The number of features in the new data set increases greatly after feature engineering. An excessively high dimension easily leads to the curse of dimensionality, which hinders subsequent modeling analysis, and too many features cause redundancy and consume training time; the data set therefore needs to undergo feature screening before being input into the model for training.
(6) Model training:
Model training uses ten-fold cross-validation, which makes the results more reliable as a reference. Depending on the number of samples and the optimization results, various types of models can be tried, such as linear models, Bayesian models, tree models and neural networks. The method can be combined with various classification models, and the user can customize the parameters of the classification model.
(7) Outputting a result:
The model outputs the probability that each user converts to a 5G package customer, ranked from high to low. Marketers or an intelligent system can use this information to start personalized recommendation through channels such as the operator's mobile app, e-commerce platforms, WeChat official accounts, short messages or telephone calls.
In summary, in the telecom operator customer identification scenario, combining the oversampling algorithm with different classification models widens the choice of 5G customer prediction models and demonstrates the broad effectiveness of the improved oversampling method. Regarding the problem of data imbalance in telecom operators' potential 5G user prediction models, many customer prediction approaches work at the algorithm level, in particular by modifying the loss function from a cost-sensitive learning perspective; this embodiment instead modifies the similarity measurement function according to user preference behavior data, which gives the system user more custom parameters to adjust.
In addition, based on a telecom operator's big data platform, data such as basic user information, 5G infrastructure conditions and various user behavior characteristics can be obtained, and the improved oversampling algorithm is combined with different classification models to predict customers' 5G package purchase behavior and form a database. When 5G promotion is carried out later, the latest data are input into the model, which outputs high-priority potential 5G customers; customers with purchase intention are then targeted with precise marketing through channels such as the operator's mobile app, major e-commerce platforms and WeChat official accounts, which improves the promotion conversion rate.
Moreover, telecom operators' 5G user data generally suffer from class imbalance, that is, the number of 5G users is far lower than the number of non-5G users, yet identifying potential 5G users brings great benefit to the company and has long-term significance. Addressing the data imbalance problem when identifying potential 5G users can therefore better help telecom operators achieve precise marketing and seize the emerging 5G market.
The embodiment has the following advantages:
(1) It can be combined with various classification techniques: the improved oversampling algorithm expands the original data in the data processing stage and can be combined with many existing classification models during model training, so the latest classification techniques can be used to improve recognition capability, and the method can continue to be used as classification techniques develop;
(2) It can be combined with various similarity calculation functions: this embodiment supports a custom similarity calculation function. Telecom user data include not only basic identity information but also user behavior characteristics; if a single similarity measure is used to compute sample distances for both behavior features and identity features, the resulting sets of similar samples have larger errors. The similarity calculation function provided by this embodiment can be passed as an input parameter so that similar sample sets are distinguished better;
(3) High accuracy: the original data are processed with the improved oversampling algorithm and combined with different classification models; verification on part of the sample data shows higher recognition accuracy than a classification model without imbalanced-data processing.
In summary, in the user identification method provided by the embodiment of the present application, a plurality of target data of a target user are obtained, where the target user is a user to be identified who has not purchased a target product, and the target data at least includes: data representing identity characteristics of the target user and data representing behavioral characteristics of the target user. The plurality of target data of the target user are input into a target model for recognition processing to obtain a recognition result for the target user, where the target model is a model obtained by learning and training a first model using data processed by an oversampling algorithm, the recognition result indicates whether the target user is a first user, the probability of the first user purchasing the target product is greater than a preset probability, and the first model is one of the following: a linear model, a Bayesian prediction model, a decision tree model, or a neural network model. This solves the problem in the related art of low accuracy in identifying potential users. By obtaining a plurality of data representing the characteristics of a user and inputting them into a model obtained by training an original model on data processed by an oversampling algorithm, the method identifies whether the user is a potential user whose probability of purchasing the target product is greater than the preset probability, thereby improving the accuracy of identifying potential users.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer executable instructions, and that although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in an order other than that illustrated herein.
The embodiment of the application also provides a user identification device, and the user identification device of the embodiment of the application can be used for executing the user identification method provided by the embodiment of the application. The following describes a user identification device provided in an embodiment of the present application.
Fig. 6 is a schematic diagram of an identification device of a user according to an embodiment of the present application. As shown in fig. 6, the apparatus includes: a first acquisition unit 601 and a first processing unit 602.
Specifically, the first obtaining unit 601 is configured to obtain a plurality of target data of a target user, where the target user is a user to be identified who does not purchase a target product, and the target data at least includes: data representing identity characteristics of the target user, data representing behavioral characteristics of the target user;
the first processing unit 602 is configured to input a plurality of target data of a target user into a target model for recognition processing, to obtain a recognition result for recognizing the target user, where the target model is a model obtained by learning and training a first model by using data processed by an oversampling algorithm, the recognition result indicates whether the target user is a first user, a probability of purchasing a target product by the first user is greater than a preset probability, and the first model is one of: linear model, bayesian predictive model, decision tree model, neural network model.
In summary, in the user identification device provided in the embodiment of the present application, the first acquiring unit 601 acquires a plurality of target data of a target user, where the target user is a user to be identified who has not purchased a target product, and the target data at least includes: data representing identity characteristics of the target user and data representing behavioral characteristics of the target user. The first processing unit 602 inputs the plurality of target data of the target user into a target model for recognition processing to obtain a recognition result for the target user, where the target model is a model obtained by learning and training a first model using data processed by an oversampling algorithm, the recognition result indicates whether the target user is a first user, the probability of the first user purchasing the target product is greater than a preset probability, and the first model is one of the following: a linear model, a Bayesian prediction model, a decision tree model, or a neural network model. This solves the problem in the related art of low accuracy in identifying potential users. By obtaining a plurality of data representing the characteristics of a user and inputting them into a model obtained by training an original model on data processed by an oversampling algorithm, the device identifies whether the user is a potential user whose probability of purchasing the target product is greater than the preset probability, thereby improving the accuracy of identifying potential users.
Optionally, in the user identification device provided in the embodiment of the present application, the target model is obtained by the following units: a second acquisition unit configured to acquire a plurality of first data, where the first data is data representing the characteristics of each second user, the plurality of second users at least include users who have purchased the target product and users who have not purchased the target product, the number of users who have purchased the target product is smaller than the number of users who have not purchased the target product, and the characteristics of each second user at least include: identity characteristics of each second user and behavioral characteristics of each second user; a first determining unit configured to obtain a plurality of second data based on an oversampling algorithm and the plurality of first data, where the second data is data representing the characteristics of each third user, and the third users are users among the users who have purchased the target product; and a first training unit configured to learn and train the first model based on the plurality of first data and the plurality of second data to obtain the target model.
Optionally, in the user identification device provided in the embodiment of the present application, the first determining unit includes: the first summarizing module is used for summarizing a plurality of data of the users who purchase the target products in the plurality of first data to obtain a first sample set; the second summarizing module is used for summarizing a plurality of data of users who do not purchase the target product in the plurality of first data to obtain a second sample set; the first processing module is used for inputting the first sample set and the second sample set into an oversampling algorithm for processing to obtain a plurality of second data.
Optionally, in the user identification device provided in the embodiment of the present application, the first processing module includes: the first determining submodule is used for determining a third sample set from the first sample set according to the first sample set and the second sample set, wherein the difficulty of learning and training the third sample set is greater than the preset difficulty; the first acquisition submodule is used for acquiring the distribution density of each sample data in the third sample set; and the first processing submodule is used for carrying out weighting processing on the sample data in the third sample set based on the distribution density of each sample data in the third sample set to obtain a plurality of second data.
Optionally, in the user identification device provided in the embodiment of the present application, the first training unit includes: the first acquisition module is used for acquiring a plurality of third data in the plurality of second data, wherein the plurality of third data are data with influence degree smaller than a preset influence degree on an output result of the first model; the first deleting module is used for deleting a plurality of third data from the plurality of second data to obtain a plurality of fourth data; and the first training module is used for learning and training the first model by adopting a plurality of first data and a plurality of fourth data to obtain a target model.
Optionally, in the user identification device provided in the embodiment of the present application, the second obtaining unit includes: a second acquisition module for acquiring a plurality of raw data representing a feature of each second user; the first determining module is configured to determine abnormal data in the plurality of original data, where the abnormal data at least includes: missing data of numerical values and data needing to be converted; the second processing module is used for processing the abnormal data to obtain processed abnormal data; the first adding module is used for adding the processed abnormal data into the plurality of original data to obtain a plurality of fifth data; and the third processing module is used for performing amplification processing on the plurality of fifth data to obtain a plurality of first data.
Optionally, in the user identification device provided in the embodiment of the present application, if the number of target users is multiple, the device further includes: a second determining unit configured to determine a purchase rate at which each target user purchases the target product after inputting a plurality of target data of the target user into the target model for identification processing; and the first pushing unit is used for pushing the target products according to the purchase rate of each target user for purchasing the target products.
Optionally, in the user identification device provided in the embodiment of the present application, the target product is a 5G package of a mobile phone.
The user identification device includes a processor and a memory, the first acquisition unit 601 and the first processing unit 602 are stored in the memory as program units, and the processor executes the program units stored in the memory to realize corresponding functions.
The processor includes a kernel, and the kernel fetches the corresponding program unit from the memory. One or more kernels may be provided, and the accuracy of identifying potential users is improved by adjusting kernel parameters.
The memory may include forms of computer-readable media such as volatile memory, random access memory (RAM), and/or nonvolatile memory, such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
An embodiment of the present invention provides a computer-readable storage medium having stored thereon a program that, when executed by a processor, implements the user identification method.
The embodiment of the invention provides a processor which is used for running a program, wherein the program runs to execute the user identification method.
As shown in fig. 7, an embodiment of the present invention provides an electronic device, where the device includes a processor, a memory, and a program stored in the memory and executable on the processor, and when the processor executes the program, the following steps are implemented: obtaining a plurality of target data of a target user, wherein the target user is a user to be identified, who does not purchase a target product, and the target data at least comprises: data representing an identity characteristic of the target user, data representing a behavioral characteristic of the target user; inputting a plurality of target data of the target user into a target model for recognition processing to obtain a recognition result for recognizing the target user, wherein the target model is a model obtained by learning and training a first model by adopting data processed by an oversampling algorithm, the recognition result indicates whether the target user is a first user or not, the probability of purchasing the target product by the first user is larger than a preset probability, and the first model is one of the following: linear model, bayesian predictive model, decision tree model, neural network model.
The processor also realizes the following steps when executing the program: the target model is obtained by the following steps: acquiring a plurality of first data, wherein the first data is data used for representing the characteristics of each second user, and the plurality of second users at least comprise: a user who has purchased a target product and a user who has not purchased a target product, the number of users who have purchased a target product being smaller than the number of users who have not purchased a target product, each second user being characterized by at least: identity characteristics of each second user, behavioral characteristics of each second user; obtaining a plurality of second data based on the oversampling algorithm and the plurality of first data, wherein the second data is data for representing the characteristics of each third user, and the third users are users in the users who have purchased target products; and learning and training the first model based on the first data and the second data to obtain the target model.
The processor also realizes the following steps when executing the program: based on the oversampling algorithm and the plurality of first data, obtaining a plurality of second data includes: summarizing a plurality of data of users who have purchased target products in the plurality of first data to obtain a first sample set; summarizing a plurality of data of users who do not purchase the target product in the plurality of first data to obtain a second sample set; and inputting the first sample set and the second sample set into the oversampling algorithm for processing to obtain the plurality of second data.
The processor also realizes the following steps when executing the program: inputting the first sample set and the second sample set into the oversampling algorithm for processing, and obtaining the plurality of second data comprises: determining a third sample set from the first sample set according to the first sample set and the second sample set, wherein the difficulty of learning and training the third sample set is greater than a preset difficulty; acquiring the distribution density of each sample data in the third sample set; and weighting the sample data in the third sample set based on the distribution density of each sample data in the third sample set to obtain the plurality of second data.
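The oversampling steps above can be sketched as follows. The application does not name a specific algorithm, so this is only one possible reading, loosely in the spirit of ADASYN-style methods: "difficulty" is assumed to be the share of non-purchasers among a sample's k nearest neighbours, "distribution density" is approximated by the mean distance to the other hard samples, and new samples are produced by interpolation. All function names and parameters are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def split_by_label(rows: List[Tuple[Vector, int]]) -> Tuple[List[Vector], List[Vector]]:
    """Return (first sample set, second sample set): label 1 = purchased, 0 = not."""
    first = [x for x, y in rows if y == 1]
    second = [x for x, y in rows if y == 0]
    return first, second


def hard_samples(first, second, k=3, preset_difficulty=0.5):
    """Third sample set: purchasers whose k nearest neighbours are mostly non-purchasers."""
    hard = []
    for i, x in enumerate(first):
        neighbours = [(math.dist(x, m), 1) for j, m in enumerate(first) if j != i]
        neighbours += [(math.dist(x, n), 0) for n in second]
        nearest = sorted(neighbours, key=lambda t: t[0])[:k]
        if not nearest:
            continue
        difficulty = sum(1 for _, label in nearest if label == 0) / len(nearest)
        if difficulty > preset_difficulty:
            hard.append(x)
    return hard


def density_weights(hard):
    """Assumed density proxy: mean distance to the other hard samples,
    normalised to sum to 1 (sparser samples receive larger weights)."""
    if len(hard) < 2:
        return [1.0] * len(hard)
    spreads = [
        sum(math.dist(x, o) for j, o in enumerate(hard) if j != i) / (len(hard) - 1)
        for i, x in enumerate(hard)
    ]
    total = sum(spreads)
    if total == 0:
        return [1.0 / len(hard)] * len(hard)
    return [s / total for s in spreads]


def oversample(first, second, n_new, k=3, preset_difficulty=0.5, seed=0):
    """Generate n_new synthetic purchaser samples (the 'second data')."""
    rng = random.Random(seed)
    hard = hard_samples(first, second, k, preset_difficulty)
    if not hard:
        return []
    weights = density_weights(hard)
    synthetic = []
    # Pick base samples in proportion to their weights, then interpolate
    # towards a randomly chosen purchaser sample.
    for base in rng.choices(hard, weights=weights, k=n_new):
        partner = rng.choice(first)
        t = rng.random()
        synthetic.append([b + t * (p - b) for b, p in zip(base, partner)])
    return synthetic
```

Because every synthetic sample is an interpolation between two purchaser samples, its coordinates stay within the range spanned by the first sample set.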
The processor also realizes the following steps when executing the program: learning and training the first model based on the plurality of first data and the plurality of second data, the obtaining the target model comprising: acquiring a plurality of third data in the plurality of second data, wherein the plurality of third data are data with influence degree smaller than a preset influence degree on an output result of the first model; deleting the plurality of third data from the plurality of second data to obtain a plurality of fourth data; and learning and training the first model by adopting the plurality of first data and the plurality of fourth data to obtain the target model.
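The pruning step can be sketched as a simple filter. The application does not say how the "influence degree" is measured, so the influence function is passed in by the caller; `linear_influence` is a purely hypothetical example that uses the magnitude of a linear score.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def prune_low_influence(second_data: List[Vector],
                        influence: Callable[[Vector], float],
                        preset_influence: float) -> Tuple[List[Vector], List[Vector]]:
    """Split the second data into (fourth data, third data).

    Third data: samples whose influence is below the preset influence.
    Fourth data: the samples that remain after deleting the third data.
    """
    fourth, third = [], []
    for sample in second_data:
        (third if influence(sample) < preset_influence else fourth).append(sample)
    return fourth, third


def linear_influence(weights: Sequence[float]) -> Callable[[Vector], float]:
    """Hypothetical influence measure: |w . x| for a linear first model."""
    def influence(sample: Vector) -> float:
        return abs(sum(w * x for w, x in zip(weights, sample)))
    return influence
```

The first model would then be trained on the first data together with the returned fourth data.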
The processor also realizes the following steps when executing the program: acquiring the plurality of first data includes: acquiring a plurality of original data representing characteristics of each second user; determining abnormal data in the plurality of original data, wherein the abnormal data at least comprises: missing data of numerical values and data needing to be converted; processing the abnormal data to obtain processed abnormal data; adding the processed abnormal data to the plurality of original data to obtain a plurality of fifth data; and performing amplification processing on the plurality of fifth data to obtain the plurality of first data.
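A minimal sketch of the abnormal-data handling, assuming that missing numeric values are filled with the field mean and that "data needing to be converted" means categorical values mapped to integer codes. Both choices, the field names, and the function name are assumptions; the amplification step is not specified in the text and is left out.

```python
from statistics import mean
from typing import Dict, List, Sequence, Union

Raw = Dict[str, Union[float, str, None]]


def clean_records(records: List[Raw],
                  numeric_fields: Sequence[str],
                  categorical_fields: Sequence[str]) -> List[Dict[str, float]]:
    """Fill missing numeric values and convert categorical values to codes."""
    # Mean of the present values per numeric field (0.0 if none are present).
    means = {}
    for f in numeric_fields:
        present = [r[f] for r in records if r.get(f) is not None]
        means[f] = mean(present) if present else 0.0
    # Stable integer code per category, assigned in sorted order.
    codes = {
        f: {v: i for i, v in enumerate(
            sorted({r[f] for r in records if r.get(f) is not None}))}
        for f in categorical_fields
    }
    cleaned = []
    for r in records:
        row = {}
        for f in numeric_fields:
            v = r.get(f)
            row[f] = float(means[f] if v is None else v)
        for f in categorical_fields:
            # A missing category is encoded as -1.
            row[f] = float(codes[f].get(r.get(f), -1))
        cleaned.append(row)
    return cleaned
```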
The processor also realizes the following steps when executing the program: if the number of the target users is a plurality of, after the target data of the target users are input into a target model for identification processing, the method further comprises the following steps: determining the purchase rate of each target user for purchasing the target product; pushing the target products according to the purchase rate of each target user purchasing the target products.
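When there are several target users, the push step can be sketched as ranking by purchase rate. The function name, the optional `top_n` cap, and the default threshold are hypothetical; the text only states that pushing is done according to each user's purchase rate.

```python
from typing import Dict, List, Optional


def select_push_targets(purchase_rates: Dict[str, float],
                        preset_probability: float = 0.5,
                        top_n: Optional[int] = None) -> List[str]:
    """Return user IDs ordered by descending purchase rate, keeping only
    users whose rate exceeds the preset probability."""
    candidates = [(u, p) for u, p in purchase_rates.items() if p > preset_probability]
    candidates.sort(key=lambda t: t[1], reverse=True)
    ids = [u for u, _ in candidates]
    return ids if top_n is None else ids[:top_n]
```

For example, with rates `{"u1": 0.9, "u2": 0.3, "u3": 0.7}` the product would be pushed to `u1` first and then `u3`, while `u2` is skipped.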
The processor also realizes the following steps when executing the program: the target product is a 5G package of the mobile phone.
The device herein may be a server, PC, PAD, cell phone, etc.
The present application also provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the following method steps: obtaining a plurality of target data of a target user, wherein the target user is a user to be identified who has not purchased a target product, and the target data at least comprises: data representing an identity characteristic of the target user and data representing a behavioral characteristic of the target user; and inputting the plurality of target data of the target user into a target model for recognition processing to obtain a recognition result for the target user, wherein the target model is a model obtained by learning and training a first model using data processed by an oversampling algorithm, the recognition result indicates whether the target user is a first user, the probability that a first user purchases the target product is greater than a preset probability, and the first model is one of the following: a linear model, a Bayesian predictive model, a decision tree model, or a neural network model.
When executed on a data processing device, the computer program product is further adapted to execute a program initialized with the following method steps: the target model is obtained by the following steps: acquiring a plurality of first data, wherein the first data is data used for representing the characteristics of each second user, and the plurality of second users at least comprise: users who have purchased the target product and users who have not purchased the target product, the number of users who have purchased the target product being smaller than the number of users who have not purchased the target product, and the characteristics of each second user at least comprise: identity characteristics of each second user and behavioral characteristics of each second user; obtaining a plurality of second data based on the oversampling algorithm and the plurality of first data, wherein the second data is data for representing the characteristics of each third user, and the third users are users among the users who have purchased the target product; and learning and training the first model based on the first data and the second data to obtain the target model.
When executed on a data processing device, the computer program product is further adapted to execute a program initialized with the following method steps: obtaining the plurality of second data based on the oversampling algorithm and the plurality of first data includes: summarizing the data of users who have purchased the target product in the plurality of first data to obtain a first sample set; summarizing the data of users who have not purchased the target product in the plurality of first data to obtain a second sample set; and inputting the first sample set and the second sample set into the oversampling algorithm for processing to obtain the plurality of second data.
When executed on a data processing device, the computer program product is further adapted to execute a program initialized with the following method steps: inputting the first sample set and the second sample set into the oversampling algorithm for processing to obtain the plurality of second data includes: determining a third sample set from the first sample set according to the first sample set and the second sample set, wherein the difficulty of learning and training the third sample set is greater than a preset difficulty; acquiring the distribution density of each sample data in the third sample set; and weighting the sample data in the third sample set based on the distribution density of each sample data in the third sample set to obtain the plurality of second data.
When executed on a data processing device, the computer program product is further adapted to execute a program initialized with the following method steps: learning and training the first model based on the plurality of first data and the plurality of second data to obtain the target model includes: acquiring a plurality of third data in the plurality of second data, wherein the plurality of third data are data whose degree of influence on an output result of the first model is smaller than a preset degree of influence; deleting the plurality of third data from the plurality of second data to obtain a plurality of fourth data; and learning and training the first model using the plurality of first data and the plurality of fourth data to obtain the target model.
When executed on a data processing device, the computer program product is further adapted to execute a program initialized with the following method steps: acquiring the plurality of first data includes: acquiring a plurality of original data representing characteristics of each second user; determining abnormal data in the plurality of original data, wherein the abnormal data at least comprises: data with missing values and data needing to be converted; processing the abnormal data to obtain processed abnormal data; adding the processed abnormal data to the plurality of original data to obtain a plurality of fifth data; and performing amplification processing on the plurality of fifth data to obtain the plurality of first data.
When executed on a data processing device, the computer program product is further adapted to execute a program initialized with the following method steps: if there are a plurality of target users, after the target data of the target users are input into the target model for recognition processing, the method further comprises: determining the purchase rate at which each target user purchases the target product; and pushing the target product according to the purchase rate of each target user.
When executed on a data processing device, the computer program product is further adapted to execute a program initialized with the following method step: the target product is a 5G package for a mobile phone.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash RAM. Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. Computer-readable media, as defined herein, do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and changes may be made to the present application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc. which are within the spirit and principles of the present application are intended to be included within the scope of the claims of the present application.

Claims (11)

1. A method of identifying a user, comprising:
obtaining a plurality of target data of a target user, wherein the target user is a user to be identified who has not purchased a target product, and the target data at least comprises: data representing an identity characteristic of the target user, data representing a behavioral characteristic of the target user;
inputting the plurality of target data of the target user into a target model for recognition processing to obtain a recognition result for the target user, wherein the target model is a model obtained by learning and training a first model using data processed by an oversampling algorithm, the recognition result indicates whether the target user is a first user, the probability that a first user purchases the target product is greater than a preset probability, and the first model is one of the following: a linear model, a Bayesian predictive model, a decision tree model, or a neural network model.
2. The method according to claim 1, characterized in that the target model is obtained by:
acquiring a plurality of first data, wherein the first data is data used for representing the characteristics of each second user, and the plurality of second users at least comprise: a user who has purchased a target product and a user who has not purchased a target product, the number of users who have purchased a target product being smaller than the number of users who have not purchased a target product, each second user being characterized by at least: identity characteristics of each second user, behavioral characteristics of each second user;
obtaining a plurality of second data based on the oversampling algorithm and the plurality of first data, wherein the second data is data for representing the characteristics of each third user, and the third users are users among the users who have purchased the target product;
and learning and training the first model based on the first data and the second data to obtain the target model.
3. The method of claim 2, wherein deriving a plurality of second data based on the oversampling algorithm and the plurality of first data comprises:
summarizing a plurality of data of users who have purchased target products in the plurality of first data to obtain a first sample set;
summarizing a plurality of data of users who have not purchased the target product in the plurality of first data to obtain a second sample set;
and inputting the first sample set and the second sample set into the oversampling algorithm for processing to obtain the plurality of second data.
4. The method according to claim 3, wherein inputting the first sample set and the second sample set into the oversampling algorithm for processing to obtain the plurality of second data comprises:
determining a third sample set from the first sample set according to the first sample set and the second sample set, wherein the difficulty of learning and training the third sample set is greater than a preset difficulty;
acquiring the distribution density of each sample data in the third sample set;
and weighting the sample data in the third sample set based on the distribution density of each sample data in the third sample set to obtain the plurality of second data.
5. The method of claim 2, wherein learning and training the first model based on the plurality of first data and the plurality of second data to obtain the target model comprises:
acquiring a plurality of third data in the plurality of second data, wherein the plurality of third data are data whose degree of influence on an output result of the first model is smaller than a preset degree of influence;
deleting the plurality of third data from the plurality of second data to obtain a plurality of fourth data;
and learning and training the first model by adopting the plurality of first data and the plurality of fourth data to obtain the target model.
6. The method of claim 2, wherein acquiring the plurality of first data comprises:
acquiring a plurality of original data representing characteristics of each second user;
determining abnormal data in the plurality of original data, wherein the abnormal data at least comprises: data with missing values and data needing to be converted;
processing the abnormal data to obtain processed abnormal data;
adding the processed abnormal data to the plurality of original data to obtain a plurality of fifth data;
and performing amplification processing on the plurality of fifth data to obtain the plurality of first data.
7. The method according to claim 1, wherein if the number of the target users is plural, after inputting plural target data of the target users into a target model for recognition processing, the method further comprises:
determining the purchase rate of each target user for purchasing the target product;
pushing the target products according to the purchase rate of each target user purchasing the target products.
8. The method of claim 1, wherein the target product is a 5G package for a cell phone.
9. A user identification device, comprising:
a first acquisition unit, configured to acquire a plurality of target data of a target user, wherein the target user is a user to be identified who has not purchased a target product, and the target data at least comprises: data representing an identity characteristic of the target user, data representing a behavioral characteristic of the target user;
a first processing unit, configured to input the plurality of target data of the target user into a target model for recognition processing to obtain a recognition result for the target user, wherein the target model is a model obtained by learning and training a first model using data processed by an oversampling algorithm, the recognition result indicates whether the target user is a first user, the probability that a first user purchases the target product is greater than a preset probability, and the first model is one of the following: a linear model, a Bayesian predictive model, a decision tree model, or a neural network model.
10. A computer-readable storage medium storing a program, wherein the program performs the user identification method of any one of claims 1 to 8.
11. An electronic device comprising one or more processors and a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of identifying a user of any of claims 1-8.
CN202310146495.8A 2023-02-08 2023-02-08 User identification method and device, storage medium and electronic equipment Pending CN116244652A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310146495.8A CN116244652A (en) 2023-02-08 2023-02-08 User identification method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310146495.8A CN116244652A (en) 2023-02-08 2023-02-08 User identification method and device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN116244652A true CN116244652A (en) 2023-06-09

Family

ID=86632625

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310146495.8A Pending CN116244652A (en) 2023-02-08 2023-02-08 User identification method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN116244652A (en)

Similar Documents

Publication Publication Date Title
CN109063966B (en) Risk account identification method and device
CN107122369B (en) Service data processing method, device and system
US11868861B2 (en) Offline security value determination system and method
CN115238173B (en) Behavior analysis and medical service pushing method, equipment and medium based on big data
CN110990560B (en) Judicial data processing method and system
CN115641162A (en) Prediction data analysis system and method based on construction project cost
CN117409419A (en) Image detection method, device and storage medium
CN115358481A (en) Early warning and identification method, system and device for enterprise ex-situ migration
CN116244652A (en) User identification method and device, storage medium and electronic equipment
CN116257798A (en) Click rate prediction model training and click rate prediction method, system and equipment
CN116958622A (en) Data classification method, device, equipment, medium and program product
CN114385876A (en) Model search space generation method, device and system
CN114049202A (en) Operation risk identification method and device, storage medium and electronic equipment
CN114330369A (en) Local production marketing management method, device and equipment based on intelligent voice analysis
CN112768090A (en) Filtering system and method for chronic disease detection and risk assessment
CN114596106A (en) Method, device and equipment for identifying pre-silent user
CN113723522B (en) Abnormal user identification method and device, electronic equipment and storage medium
CN110990522B (en) Legal document determining method and system
CN115831339B (en) Medical system risk management and control pre-prediction method and system based on deep learning
CN117649236A (en) Risk prediction method, apparatus and storage medium for transaction
CN116758614A (en) Image detection method and device, storage medium and electronic device
CN115687456A (en) Regional data mining method, device and medium
CN117670370A (en) Financial service providing method and device, storage medium and electronic equipment
CN115049497A (en) Abnormal fund transaction data identification method and device and electronic equipment
CN117522579A (en) Risk user identification method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination