CN110472137B - Negative sample construction method, device and system of recognition model - Google Patents

Negative sample construction method, device and system of recognition model

Info

Publication number
CN110472137B
Authority
CN
China
Prior art keywords
user
article
probability distribution
item
candidate set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910606078.0A
Other languages
Chinese (zh)
Other versions
CN110472137A (en)
Inventor
孙召伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Life Insurance Company of China Ltd
Priority to CN201910606078.0A
Publication of CN110472137A
Application granted
Publication of CN110472137B
Current legal status: Active
Anticipated expiration


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/95 - Retrieval from the web
    • G06F16/953 - Querying, e.g. by the use of web search engines
    • G06F16/9535 - Search customisation based on user profiles and personalisation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 - Machine learning

Abstract

The invention relates to the technical field of machine learning, and in particular to a negative sample construction method, device and system for an identification model. The method comprises: acquiring a user set and an item set, and taking the Cartesian product of the user set and the item set to obtain an item candidate set, wherein the item candidate set characterizes the items in the item set that each user in the user set can select; collecting historical behavior data of each user in the user set and characteristic data of each item in the item set, and adjusting the item candidate set according to the historical behavior data and the characteristic data to obtain an adjusted item candidate set; and calculating a user liveness probability distribution parameter and an item popularity probability distribution parameter according to the historical behavior data of each user and the characteristic data of each item in the item candidate set. In this way, the constructed sample classes are balanced, and the output of the negative sample model is not adversely affected.

Description

Negative sample construction method, device and system of recognition model
Technical Field
The present invention relates to the field of machine learning technologies, and in particular, to a method, an apparatus, and a system for negative sample construction of an identification model.
Background
With the explosive growth of internet data, effective extraction of the internet data to provide information matched with user behavior is a problem to be solved.
At present, personalized information recommendation is an effective way to address this problem: the user's historical behavior is tracked, the user's interest characteristics are extracted, a negative sample model is constructed, and the degree of match between information and the user's characteristics is analyzed, so that information the user may be interested in can be recommended. However, when the negative sample model is constructed, feature matching relies only on the user's historical behavior, so the sample information of the negative samples is too monotonous. This causes class imbalance in the data and affects the output of the negative sample model.
Disclosure of Invention
The object of the present invention is to solve at least one of the above technical drawbacks, and in particular, the technical drawbacks of the prior art in which the sample information for constructing the negative sample is too monotonous, so that the class imbalance affects the output of the negative sample model.
The invention provides a negative sample construction method of an identification model, which comprises the following steps:
acquiring a user set and an article set, and carrying out Cartesian product on the user set and the article set to obtain an article candidate set; wherein the candidate set of items characterizes a set of items in a set of items that each user of the set of users can select;
collecting historical behavior data of each user in the user set and characteristic data of each item in the item set, and adjusting the item candidate set according to the historical behavior data and the characteristic data to obtain an adjusted item candidate set;
calculating a user activity probability distribution parameter and an item popularity probability distribution parameter according to the historical behavior data of each user in the item candidate set and the characteristic data of each item, and generating a user activity table and an item popularity table;
and associating the user activity table with the item popularity table, generating a list of random numbers, and taking as negative samples of the identification model the users and items whose corresponding random number is smaller than both the user activity probability distribution value and the item popularity probability distribution value.
In one embodiment, the historical behavior data includes a number of days a user logged into the platform;
before the step of adjusting the article candidate set according to the historical behavior data and the characteristic data to obtain the adjusted article candidate set, the method further comprises the following steps:
collecting, through a big data platform, the number of days on which a user logs in to the platform, and calculating the user liveness from the number of login days within a given time period;
and carrying out probability distribution statistics on the user liveness to obtain the first probability distribution parameter, calculated as follows:
$P(u) = \frac{N(u)}{|T|}$
wherein P(u) represents the first probability distribution parameter, u represents a user, N(u) represents the number of days on which user u logged in during the period T, |T| represents the length of the period T, and P(u) ∈ (0, 1).
In one embodiment, the characteristic data includes the number of users who clicked the item;
before the step of adjusting the article candidate set according to the historical behavior data and the characteristic data to obtain the adjusted article candidate set, the method further comprises the following steps:
collecting the number of users who click on the article through a big data platform, and calculating the popularity of the article according to the number of users who click on the article in a period of time;
and carrying out probability distribution statistics according to the popularity of the article to obtain a calculation formula of a second probability distribution parameter as follows:
wherein P(i) represents the second probability distribution parameter, i represents an item, N_i represents the number of users who clicked item i over a period of time, S represents the item set, and P(i) ∈ (0, 1].
In one embodiment, the step of adjusting the candidate set of items according to the historical behavior data includes:
acquiring a first probability distribution score threshold of user liveness through the first probability distribution parameters;
determining abnormally active users according to the historical behavior data, and undersampling them, wherein an abnormally active user is a user whose probability distribution score of user liveness is greater than the first probability distribution score threshold;
and adjusting the user set in the item candidate set according to the undersampling result.
In one embodiment, the step of adjusting the candidate set of items according to the feature data includes:
acquiring a second probability distribution score threshold of item popularity through the second probability distribution parameter;
determining unpopular ("cold") items according to the characteristic data, and oversampling them, wherein an unpopular item is an item whose probability distribution score of item popularity is less than the second probability distribution score threshold;
and adjusting the article set of the article candidate set according to the oversampling result.
In one embodiment, the step of calculating the user liveness probability distribution parameters according to the historical behavior data of each user in the item candidate set comprises the following steps:
acquiring the first probability distribution parameter of each user in the item candidate set and the user set in the adjusted item candidate set, the first probability distribution parameter being calculated from the historical behavior data of the user;
and calculating the user liveness probability distribution parameter according to the first probability distribution parameter and the user set in the adjusted item candidate set.
In one embodiment, the step of calculating the item popularity probability distribution parameter according to the characteristic data of each item in the item candidate set includes:
acquiring the second probability distribution parameter of each item in the item candidate set and the item set in the adjusted item candidate set, the second probability distribution parameter being calculated from the characteristic data of the item;
and calculating the item popularity probability distribution parameter according to the second probability distribution parameter and the item set in the adjusted item candidate set.
In one embodiment, the step of adjusting the article candidate set according to the historical behavior data and the characteristic data further includes:
determining silent users according to the historical behavior data, oversampling the silent users, and adjusting a user set in the article candidate set according to the oversampling result;
and determining hot articles according to the characteristic data, undersampling the hot articles, and adjusting the article set of the article candidate set according to the undersampling result.
The invention also provides a negative sample construction device of the identification model, which comprises:
the first processing module is used for acquiring a user set and an article set, and carrying out Cartesian product on the user set and the article set to obtain an article candidate set; wherein the candidate set of items characterizes a set of items in a set of items that each user of the set of users can select;
the adjustment module is used for collecting historical behavior data of each user in the user set and characteristic data of each article in the article set, and adjusting the article candidate set according to the historical behavior data and the characteristic data to obtain an adjusted article candidate set;
the second processing module is used for calculating a user activity probability distribution parameter and an article popularity probability distribution parameter according to the historical behavior data of each user in the article candidate set and the characteristic data of each article, and generating a user activity table and an article popularity table;
and the sampling module is used for associating the user activity table with the item popularity table, generating a list of random numbers, and taking as negative samples of the identification model the users and items whose corresponding random number is smaller than both the user activity probability distribution value and the item popularity probability distribution value.
The invention also provides a negative-sample construction system of an identification model, comprising a computer memory, a computer processor and a computer program stored in the computer memory and executable in the computer processor, the computer processor implementing the steps of the method according to any of the embodiments above when the computer program is executed.
The negative sample construction method, device and system of the identification model first acquire a user set and an item set, and take the Cartesian product of the two to obtain an item candidate set, wherein the item candidate set characterizes the items in the item set that each user in the user set can select; then collect historical behavior data of each user in the user set and characteristic data of each item in the item set, and adjust the item candidate set according to the historical behavior data and the characteristic data to obtain an adjusted item candidate set; then calculate a user activity probability distribution parameter and an item popularity probability distribution parameter from the historical behavior data of each user and the characteristic data of each item in the item candidate set, and generate a user activity table and an item popularity table; and finally associate the user activity table with the item popularity table, generate a list of random numbers, and take as negative samples of the identification model the users and items whose corresponding random number is smaller than both the user activity and the item popularity probability distribution values.
Because user liveness and item popularity are both taken into account during negative sample construction, the constructed data set can approximate the real sample distribution as closely as possible. Taking the item popularity distribution into account keeps the proportion of long-tail items among the constructed negative samples moderate, so the real data distribution is not distorted. Taking the user liveness distribution into account ensures that the sample size of active users is not underestimated and that of silent users is not overestimated. The constructed sample classes are therefore balanced, and the output of the negative sample model is not adversely affected.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a diagram of an application environment for an embodiment of the present invention;
FIG. 2 is a flow diagram of a negative sample construction method of an identification model of one embodiment;
FIG. 3 is a schematic diagram of a negative-sample construction device of an identification model of an embodiment;
FIG. 4 is a schematic diagram of the internal structure of a computer device in one embodiment.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It will be understood by those skilled in the art that all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs unless defined otherwise. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
Referring to FIG. 1, FIG. 1 is a diagram of an application environment of an embodiment of the present invention. In this embodiment, the technical solution of the present invention may be implemented on the server 120; for example, in FIG. 1, the server 120 and the user terminal 110 exchange data through a network. The server 120 obtains sample information from the user terminal 110 through the network, performs probability distribution statistics and proportion adjustment on the obtained sample information, and then performs related operations, such as conditional queries on random numbers, in a database. The server 120 referred to herein is a device that implements various background functions, and the database referred to herein is a database that can be queried with SQL.
In one embodiment, as shown in fig. 2, fig. 2 is a flowchart of a negative-sample construction method of an identification model according to one embodiment, and in this embodiment, a negative-sample construction method of an identification model is provided, which specifically may include the following steps:
S110: Acquiring a user set and an item set, and taking the Cartesian product of the user set and the item set to obtain an item candidate set, wherein the item candidate set characterizes the items in the item set that each user in the user set can select.
In this step, the registered users of the application platform are collected through the big data platform, and all registered users together form the user set.
The items displayed in the application platform are then collected through the big data platform, and all displayed items together form the item set.
Based on the above description, the user set and the item set are obtained, and the manner of carrying out Cartesian product on the user set and the item set in the database is as follows:
Suppose the user set is (u1, u2) and the item set is (i1, i2, i3). The Cartesian product of the user set and the item set is then:
u1 i1
u1 i2
u1 i3
u2 i1
u2 i2
u2 i3
The purpose of taking the Cartesian product of the user set and the item set is to turn two unrelated sets into a single associated item candidate set, which characterizes the items in the item set that each user in the user set can select.
In this embodiment, after the Cartesian product of the user set and the item set has been taken, abnormal data such as abnormally active users, silent users, unpopular (cold) items and hot items can be screened out of the resulting item candidate set.
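The construction of the item candidate set can be sketched in a few lines of Python. This is only an illustration: the identifiers u1, u2, i1, i2 and i3 stand in for real user and item keys, and the embodiment itself performs this step inside a database rather than in application code.

```python
from itertools import product

# Hypothetical identifiers standing in for the user set and the item set.
users = ["u1", "u2"]
items = ["i1", "i2", "i3"]

# Every (user, item) pair: the item candidate set before any adjustment.
candidate_set = list(product(users, items))

for user, item in candidate_set:
    print(user, item)
```

The number of pairs is the product of the two set sizes, so for large platforms the pairing would normally stay in the database, for example as a cross join.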
S120: Collecting historical behavior data of each user in the user set and characteristic data of each item in the item set, and adjusting the item candidate set according to the historical behavior data and the characteristic data to obtain an adjusted item candidate set.
In this step, historical behavior data of each user is collected through the big data platform. The historical behavior data includes the number of times the user logs in to the platform, the time spent browsing web pages, and the like, from which the user's liveness-related information is determined.
For example, the user liveness of each user is determined from information such as the number of logins and the browsing time, and the item candidate set is adjusted according to the user liveness of the different users. The liveness-related information may include one or more of the above features and may also include other features, which is not limited by the embodiment of the present invention.
The adjustment may be performed by calculating a probability distribution of user liveness from the users' historical behavior data, and undersampling the abnormally active users or oversampling the silent users according to that distribution.
Characteristic data of each item is likewise collected through the big data platform. The characteristic data includes information such as the click rate and purchase rate of items that users have acted on, from which item popularity information is determined, and the item candidate set is adjusted according to the item popularity information.
Here, an item acted on by a user is an item that the user has clicked, browsed or purchased, and item popularity information means that the popularity of an item is determined from the number of users who clicked, browsed or purchased that item. The characteristic data may also include other data, which is not limited by the embodiment of the present invention.
The adjustment may likewise be performed by calculating a probability distribution of item popularity from the items' characteristic data, and oversampling the unpopular (cold) items or undersampling the hot items according to that distribution.
In this embodiment, historical behavior data of each user in the user set and characteristic data of each item in the item set are collected, and the item candidate set is adjusted accordingly to obtain the adjusted item candidate set. The more active a user is, the more negative sample items are drawn for that user, and the more popular an item is, the higher its probability of being drawn, so the data set constructed in this way better matches a long-tail distribution.
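As a rough illustration of the oversampling side of this adjustment, the sketch below repeats each cold item in the item list. The text does not say how oversampling is carried out, so simple duplication, the function name `oversample_cold_items`, the `threshold` value and the `copies` factor are all assumptions made for the example.

```python
def oversample_cold_items(popularity, threshold, copies=2):
    """Return the item list with each cold item repeated `copies` times.

    `popularity` maps an item to its popularity score; an item counts as
    cold when its score is below `threshold` (both are assumed inputs).
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    adjusted = []
    for item, score in popularity.items():
        repeat = copies if score < threshold else 1
        adjusted.extend([item] * repeat)
    return adjusted


# Hypothetical popularity scores for three items.
popularity = {"i1": 0.6, "i2": 0.3, "i3": 0.1}
adjusted_items = oversample_cold_items(popularity, threshold=0.2, copies=3)
print(adjusted_items)
```

Undersampling hot items would follow the same pattern in reverse, keeping only a fraction of the rows that involve items above a popularity threshold.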
S130: Calculating a user liveness probability distribution parameter and an item popularity probability distribution parameter according to the historical behavior data of each user in the item candidate set and the characteristic data of each item, and generating a user liveness table and an item popularity table.
Before this step, S120 adjusts the proportions of the items and users in the item candidate set and handles abnormally active users and overly unpopular items, which prevents some long-tail items from being missed and reduces the influence of abnormally active users and unpopular items on the negative sample model.
Abnormally active users are users whose platform logins, web browsing time and the like exceed a set threshold, and unpopular (cold) items are items for which the number of users who clicked, browsed or purchased the item is smaller than a set threshold.
After the proportions of the items and users in the item candidate set have been adjusted to obtain the adjusted item candidate set, probability distributions are calculated from the historical behavior data of each user and the characteristic data of each item in the adjusted item candidate set, yielding a user liveness probability distribution parameter for each user and an item popularity probability distribution parameter for each item. The user liveness probability distribution parameters of all users are collected into a user liveness table, and the item popularity probability distribution parameters of all items are collected into an item popularity table.
S140: Associating the user activity table with the item popularity table, generating a list of random numbers, and taking as negative samples of the identification model the users and items whose corresponding random number is smaller than both the user activity and the item popularity probability distribution values.
In this step, the user activity table and the item popularity table generated in step S130 are associated, and a list of random numbers is generated, and the specific process is as follows:
It should be noted that the random number values in the association table are given merely as an example; other values may be determined in other ways, which are not illustrated one by one and are not limited by the embodiment of the present invention.
After the random numbers are generated, rows are selected in the database through an SQL conditional query, with the following pseudo-code logic:
SELECT user, item
WHERE r < user_liveness AND r < item_popularity
This pseudo-code implements the selection by random number, that is, the sampling of negative samples, where user_liveness denotes the user liveness probability, item_popularity denotes the item popularity probability, and r denotes the random number. When the random number is smaller than both the user liveness and the item popularity probability distribution values, the corresponding user and item are taken as a negative sample.
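The association and conditional query of step S140 can be tried out with an in-memory SQLite database, as in the minimal sketch below. The table name `assoc`, the column names, the function name `sample_negatives` and the use of one uniform random number per (user, item) row are assumptions made for the illustration; the text only states that a list of random numbers is generated and compared against the two probability values.

```python
import random
import sqlite3


def sample_negatives(user_liveness, item_popularity, seed=0):
    """Pair every user with every item, attach a uniform random number r
    to each pair, and keep the pairs where r is below both probabilities."""
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE assoc ("
        "user_id TEXT, item_id TEXT, p_user REAL, p_item REAL, r REAL)"
    )
    rows = [
        (user, item, p_user, p_item, rng.random())
        for user, p_user in user_liveness.items()
        for item, p_item in item_popularity.items()
    ]
    conn.executemany("INSERT INTO assoc VALUES (?, ?, ?, ?, ?)", rows)
    negatives = conn.execute(
        "SELECT user_id, item_id FROM assoc "
        "WHERE r < p_user AND r < p_item"
    ).fetchall()
    conn.close()
    return negatives


# Hypothetical probability tables for two users and two items.
negatives = sample_negatives({"u1": 0.9, "u2": 0.1}, {"i1": 0.6, "i2": 0.3}, seed=42)
print(negatives)
```

Under this reading, a pair is kept with probability equal to the smaller of its two values, so active users paired with popular items are drawn most often.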
In this embodiment, a user set and an item set are first acquired, and the Cartesian product of the two is taken to obtain an item candidate set, wherein the item candidate set characterizes the items in the item set that each user in the user set can select. Historical behavior data of each user in the user set and characteristic data of each item in the item set are then collected, and the item candidate set is adjusted according to the historical behavior data and the characteristic data to obtain an adjusted item candidate set. A user activity probability distribution parameter and an item popularity probability distribution parameter are calculated from the historical behavior data of each user and the characteristic data of each item in the item candidate set, and a user activity table and an item popularity table are generated. Finally, the user activity table is associated with the item popularity table, a list of random numbers is generated, and the users and items whose corresponding random number is smaller than both the user activity and the item popularity probability distribution values are taken as negative samples of the identification model.
Because user liveness and item popularity are both taken into account during negative sample construction, the constructed data set can approximate the real sample distribution as closely as possible. Taking the item popularity distribution into account keeps the proportion of long-tail items among the constructed negative samples moderate, so the real data distribution is not distorted. Taking the user liveness distribution into account ensures that the sample size of active users is not underestimated and that of silent users is not overestimated. The constructed sample classes are therefore balanced, and the output of the negative sample model is not adversely affected.
In one embodiment, the historical behavior data includes a number of days a user logged into the platform; before the step of adjusting the article candidate set according to the historical behavior data and the feature data to obtain the adjusted article candidate set in step S120, the method may further include:
(1) Collecting, through the big data platform, the number of days on which a user logs in to the platform, and calculating the user liveness from the number of login days within a given time period;
(2) Carrying out probability distribution statistics on the user liveness to obtain the first probability distribution parameter, calculated as follows:
$P(u) = \frac{N(u)}{|T|}$
wherein P(u) represents the first probability distribution parameter, u represents a user, N(u) represents the number of days on which user u logged in during the period T, |T| represents the length of the period T, and P(u) ∈ (0, 1).
In the above process, the number of days on which a user logs in to the platform is collected from log information in the big data platform, and the user liveness is calculated from the number of login days within a given time period. The above formula then gives the probability distribution parameter of user liveness, namely the first probability distribution parameter. A probability distribution of user liveness can be obtained from the first probability distribution parameter, and negative samples can be obtained by random sampling according to that distribution.
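A small sketch of this calculation follows. It assumes that the first probability distribution parameter is the ratio of login days N(u) to the window length |T|, which is what the symbol definitions suggest; the function name `user_liveness`, the 30-day window and the login counts are hypothetical.

```python
def user_liveness(login_days, period_days):
    """Assumed form of the first parameter: P(u) = N(u) / |T|."""
    if period_days <= 0:
        raise ValueError("period_days must be positive")
    if not 0 <= login_days <= period_days:
        raise ValueError("login_days must lie between 0 and period_days")
    return login_days / period_days


# Hypothetical login-day counts over a 30-day window.
login_days = {"u1": 27, "u2": 3, "u3": 15}
liveness = {user: user_liveness(days, 30) for user, days in login_days.items()}
print(liveness)
```

A user who never logs in gets 0 and one who logs in every day gets 1, so a real implementation would need to decide how to treat those two boundary cases relative to the stated range of P(u).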
In one embodiment, the characteristic data includes the number of users who clicked the item; before the step of adjusting the item candidate set according to the historical behavior data and the characteristic data to obtain the adjusted item candidate set in step S120, the method may further include:
(1) Collecting, through the big data platform, the number of users who clicked the item, and calculating the item popularity from the number of users who clicked the item over a period of time;
(2) Carrying out probability distribution statistics on the item popularity to obtain the second probability distribution parameter, calculated as follows:
wherein P(i) represents the second probability distribution parameter, i represents an item, N_i represents the number of users who clicked item i over a period of time, S represents the item set, and P(i) ∈ (0, 1].
In the above process, the number of users who clicked the item is collected through a web crawler or through tracking points embedded in the display page, and the item popularity is calculated from the number of users who clicked the item over a period of time. The calculation formula then gives the probability distribution parameter of item popularity, namely the second probability distribution parameter. A probability distribution of item popularity can be obtained from the second probability distribution parameter, and negative samples can be obtained by random sampling according to that distribution.
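The formula for the second probability distribution parameter is not reproduced in this text, so the sketch below is only one plausible reading of the symbol definitions: each item's click-user count N_i is divided by the total over the item set S. Dividing by the largest count instead would also fit the stated range (0, 1]. The function name `item_popularity` and the counts are hypothetical.

```python
def item_popularity(click_users):
    """Assumed normalization: P(i) = N_i / sum of N_j over all items j in S."""
    total = sum(click_users.values())
    if total <= 0:
        raise ValueError("at least one item must have been clicked")
    # An item with no clicks would get 0, which lies outside (0, 1].
    return {item: count / total for item, count in click_users.items()}


# Hypothetical numbers of distinct users who clicked each item.
popularity = item_popularity({"i1": 60, "i2": 30, "i3": 10})
print(popularity)
```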
In one embodiment, the step in S120 of adjusting the item candidate set according to the historical behavior data may include:
(1) Obtaining a first probability distribution score threshold of user activity from the first probability distribution parameters;
(2) Identifying abnormally active users according to the historical behavior data and undersampling them, where an abnormally active user is a user whose user-activity probability distribution score is greater than the first probability distribution score threshold;
(3) Adjusting the user set in the item candidate set according to the undersampling result.
In the above process, the first probability distribution parameter of each user is determined from the user's historical behavior data, the probability distribution score of user activity is obtained from the first probability distribution parameter, and the first probability distribution score threshold is determined from these scores. The first probability distribution score threshold is the probability distribution score threshold for abnormally active users.
For example, if the first probability distribution score threshold for abnormally active users is set to 99%, a user whose user-activity probability distribution score is greater than 99% is an abnormally active user.
It should be noted that 99% is only one possible value of the first probability distribution score threshold; other values may also be used, which is not limited by the embodiment of the present invention.
After the abnormally active users are determined, they are undersampled, which reduces their influence on the negative sample model.
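The thresholding and undersampling steps above can be sketched as follows. Several details are assumptions made for illustration: the threshold is read as the 99th-percentile activity value (nearest-rank method), and undersampling is implemented by keeping each abnormally active user with probability `keep_ratio`. The function names and parameters are hypothetical.

```python
import math
import random


def quantile_threshold(values, q):
    """Nearest-rank quantile: smallest value with at least a fraction q of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def undersample_active_users(activity, q=0.99, keep_ratio=0.5, seed=0):
    """Return the users kept after undersampling those above the q-quantile."""
    threshold = quantile_threshold(activity.values(), q)
    rng = random.Random(seed)
    kept = []
    for user, score in activity.items():
        is_abnormal = score > threshold
        if is_abnormal and rng.random() >= keep_ratio:
            continue  # drop this abnormally active user
        kept.append(user)
    return kept


# 100 hypothetical users with activity 0.01, 0.02, ..., 1.00.
activity = {f"u{k}": k / 100 for k in range(1, 101)}
kept_users = undersample_active_users(activity, q=0.99, keep_ratio=0.0)
```

With `keep_ratio=0.0` every user above the threshold is dropped, which makes the effect easy to inspect; an intermediate ratio thins that group instead of removing it.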
In one embodiment, the step in S130 of calculating the user activity probability distribution parameter according to the historical behavior data of each user in the adjusted item candidate set may include:
(1) Obtaining the first probability distribution parameter of each user in the item candidate set and the user set in the adjusted item candidate set, where the first probability distribution parameters are calculated from the users' historical behavior data;
(2) Calculating the user activity probability distribution parameter according to the first probability distribution parameters and the user set in the adjusted item candidate set.
After the abnormally active users are undersampled, the user activity probability distribution parameter is calculated as follows:
where $U$ denotes the user set in the adjusted item candidate set, $P(u)$ denotes the first probability distribution parameter with $P(u) \in (0, 1]$, and the quantity being computed is the user activity probability distribution parameter.
The first probability distribution parameters are recalculated after sampling, and the user activity probability distribution parameters are obtained using the above calculation formula; this calculation is implemented in SQL code.
It should be noted that the undersampling ratio of 0.75 in the above formula may be adjusted according to the actual situation and is not limited herein.
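To illustrate how a ratio such as 0.75 can damp the weight of very active users, the sketch below raises each first probability distribution parameter to the power of the ratio and renormalizes over the user set. This power-and-renormalize form is an assumption borrowed from common negative-sampling practice; it is not confirmed to be the formula of this embodiment, and the name `smooth_distribution` is hypothetical.

```python
def smooth_distribution(params, ratio=0.75):
    """Raise each parameter to `ratio` and renormalize so the values sum to 1.

    A ratio below 1 shrinks large values more than small ones, which
    reduces the relative weight of the most active users.
    """
    powered = {key: value ** ratio for key, value in params.items()}
    total = sum(powered.values())
    return {key: value / total for key, value in powered.items()}


first_params = {"u1": 1.0, "u2": 0.0625}  # hypothetical P(u) values
smoothed = smooth_distribution(first_params, ratio=0.75)
raw_share_u2 = first_params["u2"] / sum(first_params.values())
```

In this example the less active user `u2` holds a larger share after smoothing than its raw share, while the shares still sum to 1.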
In one embodiment, the step in S120 of adjusting the item candidate set according to the feature data may include:
(1) Obtaining a second probability distribution score threshold of item popularity from the second probability distribution parameters;
(2) Determining cold items according to the feature data and oversampling them, where a cold item is an item whose item-popularity probability distribution score is smaller than the second probability distribution score threshold;
(3) Adjusting the item set of the item candidate set according to the oversampling result.
In the above process, the second probability distribution parameter of each item is determined from the feature data of the item, the probability distribution score of item popularity is obtained from the second probability distribution parameter, and the second probability distribution score threshold is determined from these scores. The second probability distribution score threshold is the probability distribution score threshold for cold items.
For example, if the second probability distribution score threshold for cold items is set to 1%, an item whose popularity percentage is less than 1% is a cold item.
It should be noted that 1% is only one possible value of the second probability distribution score threshold; other values may also be used, which is not limited by the embodiment of the present invention.
After the cold items are determined, they are oversampled, which improves the sample distribution of cold items in the negative sample model.
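Oversampling of cold items can be sketched by repeating their rows in the candidate set. The approach shown, duplicating each (user, item) pair of a cold item `factor` times, is an assumption chosen for simplicity; the embodiment may instead reweight probabilities. The function name, the `factor` parameter, and the sample values are hypothetical.

```python
def oversample_cold_items(candidates, popularity, threshold, factor=2):
    """Repeat candidate pairs whose item popularity is below `threshold`.

    candidates: list of (user, item) pairs from the item candidate set.
    popularity: item -> popularity value used to decide which items are cold.
    """
    adjusted = []
    for user, item in candidates:
        copies = factor if popularity[item] < threshold else 1
        adjusted.extend([(user, item)] * copies)
    return adjusted


candidates = [("u1", "i1"), ("u1", "i2")]
popularity = {"i1": 0.9, "i2": 0.005}
adjusted = oversample_cold_items(candidates, popularity, threshold=0.01)
```

Only `i2` falls below the threshold, so its pair appears twice in the adjusted candidate set while the pair for `i1` is unchanged.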
In one embodiment, the step in S130 of calculating the item popularity probability distribution parameter according to the feature data of each item in the adjusted item candidate set may include:
(1) Obtaining the second probability distribution parameter of each item in the item candidate set and the item set in the adjusted item candidate set, where the second probability distribution parameters are calculated from the feature data of the items;
(2) Calculating the item popularity probability distribution parameter according to the second probability distribution parameters and the item set in the adjusted item candidate set.
The second probability distribution parameters are recalculated after sampling: the calculation formula for the user activity probability distribution parameter is reused with the ratio value reset for oversampling, which yields the item popularity probability distribution parameters. This calculation is implemented in SQL code.
In one embodiment, the step in S120 of adjusting the item candidate set according to the historical behavior data and the feature data may further include:
(1) Determining silent users according to the historical behavior data, oversampling the silent users, and adjusting the user set in the item candidate set according to the oversampling result;
(2) Determining hot items according to the feature data, undersampling the hot items, and adjusting the item set of the item candidate set according to the undersampling result.
In this embodiment, after the first probability distribution parameter of each user is determined from the user's historical behavior data, the probability distribution score of user activity can be obtained from the first probability distribution parameter, and the probability distribution score threshold for silent users can be determined from these scores.
For example, if the probability distribution score threshold for silent users is set to 2%, a user whose user-activity percentage is less than 2% is considered a silent user. Oversampling the silent users improves their sample distribution in the negative sample model.
Likewise, after the second probability distribution parameter of each item is determined from the feature data of the item, the probability distribution score of item popularity can be obtained from the second probability distribution parameter, and the probability distribution score threshold for hot items can be determined from these scores.
For example, if the probability distribution score threshold for hot items is set to 90%, an item whose popularity percentage is greater than 90% is a hot item. Undersampling the hot items reduces their influence on the negative sample model.
In one embodiment, as shown in Fig. 3, which is a schematic structural diagram of a negative sample construction device of an identification model, a negative sample construction device of an identification model is provided. The device includes a first processing module 210, an adjustment module 220, a second processing module 230, and a sampling module 240, wherein:
The first processing module 210 is configured to obtain a user set and an item set, and to take the Cartesian product of the user set and the item set to obtain an item candidate set, where the item candidate set characterizes the set of items in the item set that each user in the user set can select.
In this module, the registered users of the application platform are collected through the big data platform, and all registered users together form the user set.
The items displayed in the application platform are then collected through the big data platform, and all displayed items together form the item set.
Based on the above, the user set and the item set are obtained, and their Cartesian product is computed in the database as follows:
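Suppose the user set is $(u_1, u_2)$ and the item set is $(i_1, i_2, i_3)$. The Cartesian product of the user set and the item set then consists of the six pairs $(u_1, i_1)$, $(u_1, i_2)$, $(u_1, i_3)$, $(u_2, i_1)$, $(u_2, i_2)$, $(u_2, i_3)$.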
The purpose of taking the Cartesian product of the user set and the item set is to turn two unrelated sets into an associated item candidate set, where the item candidate set characterizes the set of items in the item set that each user in the user set can select.
In this embodiment, after the Cartesian product of the user set and the item set is taken, abnormal data such as abnormally active users, silent users, cold items, and hot items can be screened from the resulting item candidate set.
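The construction of the item candidate set can be sketched with the standard library. The embodiment computes the product in a database; the in-memory version below is only an illustration, and the variable names are hypothetical.

```python
from itertools import product

users = ["u1", "u2"]
items = ["i1", "i2", "i3"]

# Every user is paired with every item, giving len(users) * len(items) rows.
candidate_set = list(product(users, items))

for user, item in candidate_set:
    print(user, item)
```

For two users and three items the candidate set has six rows. Because the size grows multiplicatively, a database-side cross join is the more practical choice for real user and item sets.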
The adjustment module 220 is configured to collect historical behavior data of each user in the user set and feature data of each item in the item set, and to adjust the item candidate set according to the historical behavior data and the feature data to obtain an adjusted item candidate set.
In this module, the historical behavior data of the users are collected through the big data platform. The historical behavior data include the number of times a user logs in to the platform, the time spent browsing web pages, and the like, and the activity-related information of the user is determined from these data.
For example, the activity-related information includes the collected number of platform logins and web-page browsing time. The user activity of each user is determined from this information, and the item candidate set is adjusted according to the user activity of the different users. The activity-related information may include one or more of the features described above and may further include other features, which is not limited by the embodiment of the present invention.
The adjustment may be performed by calculating a probability distribution map of user activity from the users' historical behavior data, and undersampling abnormally active users or oversampling silent users according to that map.
The module also collects feature data of the items through the big data platform. The feature data include information such as the click rate and purchase rate of items acted on by users; item popularity information is determined from this information, and the item candidate set is then adjusted according to the item popularity information.
An item acted on by a user is an item that the user has clicked, browsed, or purchased. Item popularity information refers to the popularity of an item as determined by the number of users who click, browse, or purchase that item. The feature data may also include other data, which is not limited by the embodiment of the present invention.
The adjustment may also be performed by calculating a probability distribution map of item popularity from the feature data of the items, and oversampling cold items or undersampling hot items according to that map.
In this embodiment, the historical behavior data of each user in the user set and the feature data of each item in the item set are collected, and the item candidate set is adjusted according to these data to obtain an adjusted item candidate set. In the resulting set, the more active a user is, the more negative-sample items are drawn for that user, and the more popular an item is, the higher its probability of being drawn, so the constructed data set is more consistent with a long-tail distribution.
The second processing module 230 is configured to calculate the user activity probability distribution parameters and the item popularity probability distribution parameters according to the historical behavior data of each user and the feature data of each item in the adjusted item candidate set, and to generate a user activity table and an item popularity table.
In this module, step S120 above is used to adjust the proportions of the items and users in the item candidate set and to handle abnormally active users and overly cold items, so that long-tail items are not missed and the influence of abnormally active users and cold items on the negative sample model is reduced.
Abnormally active users are users whose platform logins, web-page browsing time, and the like exceed a set threshold; cold items are items for which the number of users who click, browse, or purchase the item is below a set threshold.
After the proportions of the items and users in the item candidate set are adjusted to obtain the adjusted item candidate set, the probability distributions are calculated from the historical behavior data of each user and the feature data of each item in the adjusted item candidate set. This yields the user activity probability distribution parameter of each user and the item popularity probability distribution parameter of each item. The user activity probability distribution parameters are compiled into the user activity table, and the item popularity probability distribution parameters are compiled into the item popularity table.
The sampling module 240 is configured to associate the user activity table with the item popularity table, generate a list of random numbers, and take as negative samples of the identification model the users and items for which the random number is smaller than the corresponding probability distribution values of user activity and item popularity.
In this module, the user activity table and the item popularity table generated in step S130 are associated, and a list of random numbers is generated. The specific process is as follows:
It should be noted that the random number values in the association table are given only as examples; other values may be generated in other ways, which are not enumerated here and do not limit the embodiments of the present invention.
After the random numbers are generated, rows are selected in the database by an SQL conditional query, with the following pseudocode logic:
SELECT user, item
FROM association_table
WHERE r < user_activity AND r < item_popularity
The above pseudocode performs the selection by random number, that is, the sampling of negative samples. Here user_activity denotes the user activity probability distribution value, item_popularity denotes the item popularity probability distribution value, and r denotes the random number; the table and column names are placeholders. When the random number is smaller than both the user activity and the item popularity probability distribution values, the corresponding user and item are taken as a negative sample.
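The conditional selection described above can also be sketched outside the database. The sketch assumes one uniform random number in [0, 1) per associated (user, item) row; the function name `sample_negatives` and the example tables are hypothetical, and the real embodiment performs this step with an SQL query.

```python
import random


def sample_negatives(user_activity, item_popularity, seed=None):
    """Keep (user, item) rows whose random number is below both probabilities."""
    rng = random.Random(seed)
    negatives = []
    for user, p_user in user_activity.items():
        for item, p_item in item_popularity.items():
            r = rng.random()  # one random number per associated row
            if r < p_user and r < p_item:
                negatives.append((user, item))
    return negatives


# Hypothetical tables: probability 1.0 is always selected, 0.0 never.
user_table = {"u1": 1.0, "u2": 0.0}
item_table = {"i1": 1.0}
negatives = sample_negatives(user_table, item_table, seed=42)
```

Because both conditions use the same random number, a row is kept with probability equal to the smaller of the two values, so a pair is drawn often only when the user is active and the item is popular.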
The negative sample construction device of the identification model first obtains a user set and an item set and takes their Cartesian product to obtain an item candidate set, where the item candidate set characterizes the set of items in the item set that each user in the user set can select. It then collects historical behavior data of each user in the user set and feature data of each item in the item set, and adjusts the item candidate set according to these data to obtain an adjusted item candidate set. Next, it calculates the user activity probability distribution parameters and the item popularity probability distribution parameters according to the historical behavior data of each user and the feature data of each item in the adjusted item candidate set, and generates a user activity table and an item popularity table. Finally, it associates the user activity table with the item popularity table, generates a list of random numbers, and takes as negative samples of the identification model the users and items for which the random number is smaller than the corresponding probability distribution values of user activity and item popularity.
In the above method and device, both user activity and item popularity are taken into account during negative sample construction, so that the constructed data set approximates the real sample distribution as closely as possible. Taking the item popularity distribution into account keeps the proportion of long-tail items among the negative samples moderate without distorting the real data distribution. Taking the user activity distribution into account ensures that the sample size of active users is not underestimated and the sample size of silent users is not overestimated. The constructed sample classes are therefore balanced, and the output of the negative sample model is not adversely affected.
For specific limitations on the negative sample construction device of the identification model, reference may be made to the above description of the negative sample construction method of the identification model, which is not repeated here. Each module in the above device may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call and execute the operations corresponding to the modules.
Fig. 4 is a schematic diagram of the internal structure of the computer device in one embodiment. The computer device 310 includes a processor 314, a non-volatile storage medium 315, a memory 311, and a network interface 312 connected by a system bus 313. The non-volatile storage medium 315 of the computer device 310 stores an operating system 317 and a computer program 316, where the computer program 316, when executed by the processor 314, causes the processor 314 to implement a negative sample construction method of an identification model. The processor 314 of the computer device 310 provides computing and control capabilities and supports the operation of the entire computer device 310. The memory 311 of the computer device 310 stores a computer program 316 which, when executed by the processor 314, causes the processor 314 to perform the negative sample construction method of the identification model. The network interface 312 of the computer device 310 is used to connect to and communicate with a mobile terminal.
Those skilled in the art will appreciate that the structure shown in Fig. 4 is only a block diagram of the parts relevant to the present solution and does not limit the computer device to which the solution is applied; a particular computer device may include more or fewer components than shown, may combine some of the components, or may have a different arrangement of components.
In one embodiment, a negative sample construction system of an identification model is provided, comprising a computer memory, a computer processor, and a computer program stored in the computer memory and executable on the computer processor, where the computer processor, when executing the computer program, implements the steps of the negative sample construction method of an identification model according to any of the above embodiments.
It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited in order and may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include a plurality of sub-steps or stages, which are not necessarily performed at the same time and may be performed at different times; they also need not be performed sequentially and may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
The foregoing describes only some embodiments of the present invention. It should be noted that those skilled in the art can make modifications and adaptations without departing from the principles of the present invention, and such modifications and adaptations also fall within the scope of protection of the present invention.

Claims (6)

1. The negative sample construction method of the identification model is characterized by comprising the following steps of:
acquiring a user set and an article set, and carrying out Cartesian product on the user set and the article set to obtain an article candidate set; wherein the candidate set of items characterizes a set of items in a set of items that each user of the set of users can select;
collecting historical behavior data of each user in the user set and characteristic data of each article in the article set, and adjusting the article candidate set according to the historical behavior data and the characteristic data to obtain an adjusted article candidate set;
calculating a user activity probability distribution parameter and an item popularity probability distribution parameter according to the historical behavior data of each user in the item candidate set and the characteristic data of each item, and generating a user activity table and an item popularity table;
associating the user activity table with the item popularity table, generating a list of random numbers, and taking users and items, the values of which are smaller than probability distribution values of the user activity and the item popularity corresponding to the random numbers, as negative samples of an identification model;
The historical behavior data comprise the number of days when a user logs in a platform;
before the step of adjusting the article candidate set according to the historical behavior data and the characteristic data to obtain the adjusted article candidate set, the method further comprises the following steps:
collecting the number of days for a user to log on a platform through a big data platform, and calculating the activity of the user according to the number of logging on days in a certain time period;
and carrying out probability distribution statistics according to the user liveness to obtain a calculation formula of a first probability distribution parameter as follows:
where $P(u)$ denotes the first probability distribution parameter, $u$ denotes a user, and the remaining terms denote the number of days on which user $u$ logged in during the time period and the length of that time period;
The characteristic data comprises the number of users who click on the article;
before the step of adjusting the article candidate set according to the historical behavior data and the characteristic data to obtain the adjusted article candidate set, the method further comprises the following steps:
collecting the number of users who click on the article through a big data platform, and calculating the popularity of the article according to the number of users who click on the article in a period of time;
and carrying out probability distribution statistics according to the popularity of the article to obtain a calculation formula of a second probability distribution parameter as follows:
where $P(i)$ denotes the second probability distribution parameter, $i$ denotes an item, $N_i$ denotes the number of users who clicked on item $i$ over a period of time, $S$ denotes the item set, and $P(i) \in (0, 1]$;
The step of adjusting the article candidate set according to the historical behavior data comprises the following steps:
acquiring a first probability distribution score threshold of user liveness through the first probability distribution parameters;
determining an abnormally active user according to the historical behavior data, and undersampling the abnormally active user, wherein the abnormally active user is a user of which the probability distribution score value of the user activity is greater than a first probability distribution score value threshold;
adjusting a user set in the item candidate set according to the undersampling result;
the step of adjusting the article candidate set according to the characteristic data comprises the following steps:
acquiring a second probability distribution score threshold of the item popularity through the second probability distribution parameter;
determining cold items according to the characteristic data, and oversampling the cold items; wherein a cold item is an item whose item-popularity probability distribution score is smaller than the second probability distribution score threshold;
and adjusting the article set of the article candidate set according to the oversampling result.
2. The method of claim 1, wherein the step of calculating a user activity probability distribution parameter from the historical behavior data of each user in the adjusted item candidate set comprises:
acquiring first probability distribution parameters of each user in the item candidate set and the user set in the item candidate set; the first probability distribution parameters are obtained through calculation according to historical behavior data of the user;
and calculating a user liveness probability distribution parameter according to the first probability distribution parameter and the user set in the adjusted item candidate set.
3. The method of claim 2, wherein the step of calculating item popularity probability distribution parameters from the characteristic data of each item in the candidate set of adjusted items comprises:
acquiring a second probability distribution parameter of each item in the item candidate set and adjusting the item set in the item candidate set; the second probability distribution parameters are calculated according to the characteristic data of the article;
and calculating the item popularity probability distribution parameters according to the second probability distribution parameters and the item set in the item candidate set.
4. The method of claim 3, wherein the step of adjusting the article candidate set based on the historical behavioral data and characteristic data further comprises:
determining silent users according to the historical behavior data, oversampling the silent users, and adjusting a user set in the article candidate set according to the oversampling result;
and determining hot articles according to the characteristic data, undersampling the hot articles, and adjusting the article set of the article candidate set according to the undersampling result.
5. A negative-sample construction apparatus of an identification model, characterized by being applied to the negative-sample construction method of an identification model according to any one of claims 1 to 4, the apparatus comprising:
the first processing module is used for acquiring a user set and an article set, and carrying out Cartesian product on the user set and the article set to obtain an article candidate set; wherein the candidate set of items characterizes a set of items in a set of items that each user of the set of users can select;
the adjustment module is used for collecting historical behavior data of each user in the user set and characteristic data of each article in the article set, and adjusting the article candidate set according to the historical behavior data and the characteristic data to obtain an adjusted article candidate set;
The second processing module is used for calculating a user activity probability distribution parameter and an article popularity probability distribution parameter according to the historical behavior data of each user in the article candidate set and the characteristic data of each article, and generating a user activity table and an article popularity table;
and the sampling module is used for associating the user activity table with the item popularity table and generating a list of random numbers, and taking the users and the items, the values of which are smaller than the probability distribution values of the user activity and the item popularity corresponding to the random numbers, as negative samples of the identification model.
6. A negative sample construction system of an identification model, characterized by comprising a computer memory, a computer processor and a computer program stored in the computer memory and executable on the computer processor, the computer processor implementing the steps of the method according to any one of claims 1 to 4 when executing the computer program.
CN201910606078.0A 2019-07-05 2019-07-05 Negative sample construction method, device and system of recognition model Active CN110472137B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910606078.0A CN110472137B (en) 2019-07-05 2019-07-05 Negative sample construction method, device and system of recognition model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910606078.0A CN110472137B (en) 2019-07-05 2019-07-05 Negative sample construction method, device and system of recognition model

Publications (2)

Publication Number Publication Date
CN110472137A CN110472137A (en) 2019-11-19
CN110472137B true CN110472137B (en) 2023-07-25

Family

ID=68506775

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910606078.0A Active CN110472137B (en) 2019-07-05 2019-07-05 Negative sample construction method, device and system of recognition model

Country Status (1)

Country Link
CN (1) CN110472137B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113111085B (en) * 2021-04-08 2024-01-30 达观数据有限公司 Automatic hierarchical exploration method and device based on stream data

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106294762A (en) * 2016-08-11 2017-01-04 齐鲁工业大学 Entity recognition method based on learning
WO2017143919A1 (en) * 2016-02-26 2017-08-31 阿里巴巴集团控股有限公司 Method and apparatus for establishing data identification model
CN107424007A (en) * 2017-07-12 2017-12-01 北京京东尚科信息技术有限公司 Method and apparatus for building electronic ticket susceptibility identification model
CN108616491A (en) * 2016-12-13 2018-10-02 北京酷智科技有限公司 Method and system for identifying malicious users

Also Published As

Publication number Publication date
CN110472137A (en) 2019-11-19

Similar Documents

Publication Publication Date Title
CN111079022B (en) Personalized recommendation method, device, equipment and medium based on federated learning
US11710054B2 (en) Information recommendation method, apparatus, and server based on user data in an online forum
US10762288B2 (en) Adaptive modification of content presented in electronic forms
US11100421B2 (en) Customized website predictions for machine-learning systems
US9288124B1 (en) Systems and methods of classifying sessions
US20150161139A1 (en) Data search processing
CN106997549A (en) Method and system for pushing advertising information
CN105989074A (en) Method and device for recommending cold start through mobile equipment information
US11361239B2 (en) Digital content classification and recommendation based upon artificial intelligence reinforcement learning
US20170140023A1 (en) Techniques for Determining Whether to Associate New User Information with an Existing User
CN110825977A (en) Data recommendation method and related equipment
CN111967914A (en) User portrait based recommendation method and device, computer equipment and storage medium
CN112579854A (en) Information processing method, device, equipment and storage medium
Gisselbrecht et al. Whichstreams: A dynamic approach for focused data capture from large social media
CN110472137B (en) Negative sample construction method, device and system of recognition model
CN110851708B (en) Negative sample extraction method, device, computer equipment and storage medium
JP6872853B2 (en) Detection device, detection method and detection program
WO2011008282A2 (en) Evaluation of website visitor based on value grade
CN108920492B (en) Webpage classification method, system, terminal and storage medium
CN111340062A (en) Mapping relation determining method and device
CN107622125B (en) Information crawling method and device and electronic equipment
CN115858815A (en) Method for determining mapping information, advertisement recommendation method, device, equipment and medium
CN115034826A (en) Advertisement putting method and device, electronic equipment and readable storage medium
CN113850416A (en) Advertisement promotion cooperation object determining method and device
CN111882360A (en) User group expansion method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant