CN112925973A - Data processing method and device

Info

Publication number: CN112925973A
Application number: CN201911243337.4A
Authority: CN
Inventors: 张美娜; 仲济源
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Wodong Tianjun Information Technology Co Ltd
Current assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Wodong Tianjun Information Technology Co Ltd
Priority date: 2019-12-06
Filing date: 2019-12-06
Publication date: 2021-06-08
Anticipated expiration: 2039-12-06
Also published as: CN112925973B

Abstract

The invention discloses a data processing method and device, and relates to the technical field of computers. The method comprises the following steps: in response to the triggering of a crowd expansion task, constructing a candidate user set for crowd expansion; extracting some users from the candidate user set according to a first extraction rule, and taking the extracted users together with a seed user set as positive sample users; extracting some users as negative sample users according to a second extraction rule; training a first machine learning model on the user feature data of the positive sample users and the negative sample users to obtain a trained first machine learning model; and screening out an expanded user set from the candidate user set according to the trained first machine learning model. These steps can improve the training effect of the machine learning model used in crowd expansion and thereby improve the accuracy of crowd expansion.

Description

Data processing method and device

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a data processing method and apparatus.

Background

Crowd expansion is often used when placing advertisements or running marketing campaigns for merchants. For example, in advertisement delivery, the seed crowd provided by an advertiser is often small, so delivering advertisements to the seed crowd alone gives limited coverage and cannot reach the expected traffic. To overcome this, an advertisement data platform or a shopping data platform (DMP) analyzes the salient characteristics of the seed crowd, expands the seed crowd according to those characteristics, and then delivers advertisements to the expanded crowd, so as to improve the click conversion rate or the purchase conversion rate.

Existing crowd expansion schemes mainly fall into two types. The first performs crowd expansion based on user portraits: portrait feature labels are assigned to users through user portrait analysis, the portrait feature labels shared by most users in the seed crowd are identified, and users in the database whose portrait feature labels are highly similar are then listed as the expanded crowd. The second performs crowd expansion based on a classification algorithm: a classification model is trained with the seed crowd as positive samples and the candidate crowd as negative samples, and the candidate crowd is then screened by the trained classification model to obtain the expanded crowd.

In the process of implementing the invention, the inventors found at least the following problems in the prior art. In the first scheme, relying entirely on user portraits to expand the crowd leads to problems such as low accuracy and poor timeliness. In the second scheme, the seed crowd may have been selected according to a specific rule, so using the seed crowd alone as positive samples easily causes the model to overfit; moreover, coarse sampling of negative samples easily degrades model training, which in turn harms the final crowd expansion result.

Disclosure of Invention

In view of this, the invention provides a data processing method and device, which can improve the training effect of a machine learning model in population expansion and improve the accuracy of population expansion.

To achieve the above object, according to one aspect of the present invention, a data processing method is provided.

The data processing method of the invention comprises the following steps: in response to triggering of a crowd expansion task, determining a set of candidate users for crowd expansion; extracting partial users from the candidate user set according to a first extraction rule, and then taking the extracted partial users and a seed user set as positive sample users; extracting part of users as negative sample users according to a second extraction rule; training a first machine learning model according to the user characteristic data of the positive sample user and the negative sample user to obtain a trained first machine learning model; and screening out an expanded user set from the candidate user set according to the trained first machine learning model.

Optionally, the determining a set of candidate users for population expansion comprises: acquiring business activity information needing population expansion; inquiring a database table according to the business activity information to obtain a candidate user set corresponding to the business activity information; the business activity information comprises at least one of brand identification of target commodities related to business activities, category identification of target commodities related to business activities and shop identification related to business activities.

Optionally, the candidate user set comprises: a short-term interest user set and a medium-to-long-term interest user set; the short-term interest user set is a set of users who are interested in the target commodity, screened out based on the users' short-term behavior feature data; the medium-to-long-term interest user set is a set of users who are interested in the target commodity, screened out based on the users' medium-to-long-term behavior feature data.

Optionally, the set of short-term interest users comprises: a first short-term interest user set, a second short-term interest user set, and a third short-term interest user set; the method further comprises the following steps: screening a first short-term interest user set from a first user set which has a first type of operation behaviors on a target commodity recently; determining similar commodities of the target commodity, and screening a second short-term interest user set from a second user set which has a first type of operation behavior on the similar commodities recently; and screening out a third short-term interest user set from a third user set which has a second type of operation behaviors on the target commodity or the similar commodity in the near future.

Optionally, the screening out of a first short-term interest user set from a first user set that recently had a first type of operation behavior on the target commodity includes: acquiring the first user set that recently had the first type of operation behavior on the target commodity; determining the preference degree of each user in the first user set for the target commodity according to a trained second machine learning model; and taking all users whose preference degree is greater than a preset threshold, or a preset number of users with the highest preference degrees, as the first short-term interest user set.

Optionally, the medium-to-long-term interest user set includes: a first medium-to-long-term interest user set and a second medium-to-long-term interest user set; the method further comprises the following steps: counting the value distribution of the portrait labels of the users in the seed user set to determine a group portrait corresponding to the seed user set; constructing the first medium-to-long-term interest user set from users similar to the group portrait; and constructing the second medium-to-long-term interest user set from users who have not purchased the target commodity recently but purchased it in the past.

Optionally, the screening out an extended user set from the candidate user set according to the trained first machine learning model includes: determining the preference degree of each user in the candidate user set to the target commodity according to the trained first machine learning model; and taking all users with the preference degrees larger than a preset threshold value or the users with the maximum preference degrees in a preset number as an expanded user set.

To achieve the above object, according to another aspect of the present invention, there is provided a data processing apparatus.

The data processing apparatus of the present invention includes: a determining module, configured to determine a candidate user set for crowd expansion in response to the triggering of a crowd expansion task; an extraction module, configured to extract some users from the candidate user set according to a first extraction rule and take the extracted users together with the seed user set as positive sample users, and further configured to extract some users as negative sample users according to a second extraction rule; a training module, configured to train a first machine learning model according to the user feature data of the positive sample users and the negative sample users to obtain a trained first machine learning model; and a screening module, configured to screen out an expanded user set from the candidate user set according to the trained first machine learning model.

To achieve the above object, according to still another aspect of the present invention, there is provided an electronic apparatus.

The electronic device of the present invention includes: one or more processors; and storage means for storing one or more programs; when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the data processing method of the present invention.

To achieve the above object, according to still another aspect of the present invention, there is provided a computer-readable medium.

The computer-readable medium of the present invention has stored thereon a computer program which, when executed by a processor, implements the data processing method of the present invention.

One embodiment of the above invention has the following advantages or benefits: the method comprises the steps of constructing a candidate user set for crowd expansion, extracting partial users from the candidate user set according to a first extraction rule, taking the extracted partial users and seed user set as positive sample users, extracting partial users as negative sample users according to a second extraction rule, and training a first machine learning model according to user characteristic data of the positive sample users and the negative sample users, so that the training effect of the machine learning model in the crowd expansion can be improved, and the precision of the crowd expansion is further improved.

Further effects of the optional implementations mentioned above are described below in connection with the specific embodiments.

Drawings

The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:

FIG. 1 is a schematic diagram of the main flow of a data processing method according to a first embodiment of the present invention;

FIG. 2 is a schematic diagram of the main flow of a data processing method according to a second embodiment of the present invention;

FIG. 3 is a schematic diagram of the main blocks of a data processing apparatus according to a third embodiment of the present invention;

FIG. 4 is a schematic diagram of the main blocks of a data processing apparatus according to a fourth embodiment of the present invention;

FIG. 5 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;

FIG. 6 is a schematic block diagram of a computer system suitable for use with the electronic device to implement an embodiment of the invention.

Detailed Description

Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.

Fig. 1 is a schematic diagram of a main flow of a data processing method according to a first embodiment of the present invention. As shown in fig. 1, the data processing method according to the embodiment of the present invention includes:

Step S101, in response to the triggering of the crowd expansion task, determining a candidate user set for crowd expansion.

For example, execution of the crowd expansion task, that is, the data processing flow of the embodiment of the present invention, may begin after a crowd expansion request submitted by a demander (such as an advertiser or a marketer) is received. The crowd expansion request may include: the business activity information for which crowd expansion is needed and the seed user set related to the business activity. Further, the business activity information may include at least one of the following: a brand identification of the target commodity involved in the business activity, a category identification of the target commodity involved in the business activity, and a shop identification involved in the business activity.

In an alternative embodiment, determining the candidate user set for crowd expansion comprises: acquiring the business activity information for which crowd expansion is needed; and querying a database table according to the business activity information to obtain the candidate user set corresponding to the business activity information. The database table stores the candidate user information (such as candidate user identifiers) corresponding to the business activity information.

In a specific example of this optional implementation, the obtained business activity information is specifically a brand identifier of a target product related to a business activity, and then the database table may be queried according to the brand identifier of the target product to find a candidate user identifier corresponding to the target product, and further construct a candidate user set based on the found candidate user identifier.

In another specific example of the optional implementation manner, if the obtained business activity information is specifically a category identifier of a target product related to a business activity, the database table may be queried according to the category identifier of the target product to find a candidate user identifier corresponding to the target product, and then a candidate user set is constructed based on the found candidate user identifier.
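
The lookup described above can be sketched as follows. This is a minimal illustration only: the in-memory dictionary standing in for the database table, the `(identifier type, identifier value)` key layout, the function name `query_candidates`, and the choice to take the union when several identifiers are supplied are all assumptions made for the example, not details given in the text.

```python
from typing import Dict, Set, Tuple

# Hypothetical stand-in for the database table: it maps
# (identifier type, identifier value) to candidate user identifiers.
CandidateTable = Dict[Tuple[str, str], Set[str]]


def query_candidates(table: CandidateTable, activity_info: Dict[str, str]) -> Set[str]:
    """Collect the candidate users stored for each identifier in the request."""
    candidates: Set[str] = set()
    for id_type, id_value in activity_info.items():
        # Assumption: results for several identifiers are merged by union.
        candidates |= table.get((id_type, id_value), set())
    return candidates


table: CandidateTable = {
    ("brand", "brand_001"): {"u1", "u2", "u3"},
    ("category", "cat_010"): {"u3", "u4"},
}
by_brand = query_candidates(table, {"brand": "brand_001"})
by_both = query_candidates(table, {"brand": "brand_001", "category": "cat_010"})
```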

Step S102, extracting partial users from the candidate user set according to a first extraction rule, and then taking the extracted partial users and the seed user set as positive sample users; and extracting part of the users as negative sample users according to a second extraction rule.

Considering that the seed user set provided by the demander may come from a specific strong rule, for example users who recently purchased commodities of the brand or category, directly using the seed user set as the positive samples easily causes the model to overfit. In view of this, in the embodiment of the present invention, some users are extracted from the candidate user set according to the first extraction rule so that the seed user set is appropriately generalized, which mitigates overfitting and enhances the generalization capability of the machine learning model.

In an alternative embodiment, the first extraction rule may include: determining the users to be extracted from the candidate user set according to the purchase conversion rate and/or the click conversion rate. For example, if the candidate user set contains candidate user sets of multiple categories, the users who have both purchased and clicked on the commodity brand involved in the business activity within the last month may be counted first, the proportion of these users in the candidate user set of each category may be analyzed, and some users may then be extracted from the candidate user set of each category according to that proportion as a supplement to the positive samples. As another example, under the same grouping, the users who have purchased or clicked on the commodities involved in the business activity within the last month may be counted instead, the proportion of these users in the candidate user set of each category analyzed, and some users extracted from the candidate user set of each category according to that proportion as a supplement to the positive samples.

Further, in the above optional embodiment, the first extraction rule may further include: the number of users extracted from the candidate user set is not lower than one fifth of the number of seed users and not higher than one third of the number of seed users. Setting the first extraction rule in this way keeps the generalized positive samples similar to the seed users while limiting their number, so that the seed users are not diluted, which helps improve the model training effect.
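
A rough sketch of such a first extraction rule is given below. It reads "according to the proportion" as giving each category a quota proportional to its share of recently converting users and drawing from those users at random; that reading, the function name `extract_positive_supplement`, and the fallback of returning no supplement when the bounds cannot be met are assumptions for illustration.

```python
import math
import random
from typing import Dict, List, Set


def extract_positive_supplement(
    candidates_by_category: Dict[str, List[str]],
    recent_converters: Set[str],
    seed_count: int,
    rng: random.Random,
) -> List[str]:
    """Pick extra positive samples, aiming for 1/5 to 1/3 of the seed count."""
    lower = math.ceil(seed_count / 5)  # not lower than one fifth of the seeds
    upper = seed_count // 3            # not higher than one third of the seeds
    # Users in each category's candidate set who recently purchased or clicked.
    hits = {
        category: [u for u in users if u in recent_converters]
        for category, users in candidates_by_category.items()
    }
    total_hits = sum(len(users) for users in hits.values())
    if total_hits < lower or upper < lower:
        return []  # assumption: fall back to the seeds alone
    budget = min(upper, total_hits)
    picked: List[str] = []
    for users in hits.values():
        # Quota proportional to this category's share of the converters.
        quota = min(len(users), round(budget * len(users) / total_hits))
        picked.extend(rng.sample(users, quota))
    # Drop duplicates across categories and enforce the upper bound.
    return list(dict.fromkeys(picked))[:upper]


candidates = {
    "A": [f"a{i}" for i in range(10)],
    "B": [f"b{i}" for i in range(10)],
}
converters = {"a0", "a1", "a2", "a3", "a4", "a5", "b0", "b1", "b2", "b3"}
supplement = extract_positive_supplement(candidates, converters, 30, random.Random(0))
```

Because the per-category quotas are rounded, the total can land slightly below the budget; a caller that needs the lower bound strictly enforced would have to check the returned length.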

In the prior art, all users other than the seed users are used as negative samples; this coarse selection of negative samples results in poor model training. In view of this, in the embodiment of the present invention, users who are related to the positive samples yet differ substantially from them are extracted as negative sample users according to the second extraction rule, so as to improve the training effect of the model.

In an alternative embodiment, the second extraction rule comprises: taking as negative sample users those users who have made purchases recently (such as within half a year or within a year) and have clicked on the brand or category of commodities involved in the business activity, but have not purchased it. In another alternative embodiment, the second extraction rule comprises: taking as negative sample users those users who have made purchases recently and have clicked on the brand involved in the business activity and on similar brands, but have not purchased them.
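
The second extraction rule can be illustrated with a small filter. The `UserActivity` record and its fields are hypothetical; the sketch assumes that the purchase and click histories have already been restricted to the recent window, and that `target_brands` holds either the brand involved in the business activity alone or that brand plus its similar brands.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set


@dataclass
class UserActivity:
    user_id: str
    purchased_brands: Set[str] = field(default_factory=set)  # recent purchases
    clicked_brands: Set[str] = field(default_factory=set)    # recent clicks


def extract_negative_samples(
    activities: Iterable[UserActivity], target_brands: Set[str]
) -> List[str]:
    """Users who buy recently and clicked the target brands, yet never bought them."""
    negatives: List[str] = []
    for activity in activities:
        has_recent_purchase = bool(activity.purchased_brands)
        clicked_target = bool(activity.clicked_brands & target_brands)
        bought_target = bool(activity.purchased_brands & target_brands)
        if has_recent_purchase and clicked_target and not bought_target:
            negatives.append(activity.user_id)
    return negatives


activities = [
    UserActivity("u1", {"other"}, {"target"}),   # clicked target, bought elsewhere
    UserActivity("u2", {"target"}, {"target"}),  # bought the target brand
    UserActivity("u3", set(), {"target"}),       # no recent purchase at all
    UserActivity("u4", {"other"}, {"other"}),    # never clicked the target brand
]
negatives = extract_negative_samples(activities, {"target"})
```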

Step S103, training a first machine learning model according to the user feature data of the positive sample users and the negative sample users to obtain the trained first machine learning model.

Wherein the user characteristic data can be constructed based on user characteristics in one or more of the following dimensions: the user portrait type characteristics, the behavior characteristics of the user to the target commodity related to the business activity (such as behavior characteristics of purchasing, shopping cart adding, clicking, attention and the like), the behavior characteristics of the user to similar commodities, and the word segmentation characteristics of the purchased commodities.

Illustratively, the first machine learning model may be an XGBoost model (XGBoost stands for eXtreme Gradient Boosting, a boosting algorithm). The first machine learning model may also be another machine learning model without affecting the implementation of the present invention.

Step S104, screening out an expanded user set from the candidate user set according to the trained first machine learning model.

In this step, the preference degree of each user in the candidate user set for the target commodity can be determined according to the trained first machine learning model; all users whose preference degree is greater than a preset threshold, or a preset number of users with the highest preference degrees, are then taken as the expanded user set. In specific implementation, the preset threshold and the preset number can be set by the demander, or set flexibly by the party executing the crowd expansion task according to specific business requirements.
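
The two screening options in this step can be sketched as below. The preference degrees are assumed to have already been produced by the trained model; the function name `screen_expanded_users` and its parameters are illustrative only.

```python
from typing import Dict, List, Optional


def screen_expanded_users(
    preferences: Dict[str, float],
    threshold: Optional[float] = None,
    top_n: Optional[int] = None,
) -> List[str]:
    """Keep users above a threshold, or the top_n users by preference degree."""
    if (threshold is None) == (top_n is None):
        raise ValueError("specify exactly one of threshold or top_n")
    # Rank users from the highest to the lowest preference degree.
    ranked = sorted(preferences, key=lambda user: preferences[user], reverse=True)
    if threshold is not None:
        return [user for user in ranked if preferences[user] > threshold]
    return ranked[:top_n]


preferences = {"u1": 0.9, "u2": 0.4, "u3": 0.7}
above_threshold = screen_expanded_users(preferences, threshold=0.5)
top_one = screen_expanded_users(preferences, top_n=1)
```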

In the embodiment of the invention, crowd expansion is realized through the steps. Compared with the prior art, the method provided by the embodiment of the invention can improve the training effect of the machine learning model in population expansion and improve the accuracy of population expansion. In addition, the embodiment of the invention does not need to rely on the social network of the user when carrying out crowd expansion, thereby improving the universality of the crowd expansion.

Fig. 2 is a schematic diagram of a main flow of a data processing method according to a second embodiment of the present invention. As shown in fig. 2, the data processing method according to the embodiment of the present invention includes:

Step S201, constructing a database table, where the database table is used for storing the candidate user set corresponding to a commodity brand identification or a commodity category identification.

In an alternative embodiment, the candidate user set comprises: a short-term interest user set and a medium-to-long-term interest user set; the short-term interest user set is a set of users who are interested in the target commodity, screened out based on the users' short-term behavior feature data; the medium-to-long-term interest user set is a set of users who are interested in the target commodity, screened out based on the users' medium-to-long-term behavior feature data. In the embodiment of the invention, both users with a short-term interest in the target commodity and users with a long-term interest in it are considered when the candidate user set is determined. Taking the users' long-term and short-term preferences and their diversity into account improves the timeliness and accuracy of crowd expansion, and further improves the effect of business activities (such as advertisement delivery) based on the crowd expansion.

Illustratively, the set of short-term interest users may include: a first set of short-term interest users, a second set of short-term interest users, and a third set of short-term interest users. Wherein the first short-term interest user set is a user set selected from a first user set having a first type of operation behavior on the target commodity in the near term (for example, within a last month), and may be referred to as a "high-potential user set of the target commodity" for short; the second short-term interest user set is a user set screened from a second user set which has a first type of operation behavior on similar commodities of the target commodity in the near future, and can be called a high-potential user set of the similar commodities; the third short-term interest user set is a user set filtered from a third user set having a second type of operation behavior on the target commodity or the similar commodity in the near future. The first type of operation behavior can be behaviors of purchasing, shopping cart adding, attention, clicking and the like; the second type of operational behavior may be a search behavior. When the second type of operational behavior is search behavior, the third set of short-term interest users may be referred to simply as a "search recall user set". Furthermore, the short-term interest user set may include only one or two of the first to third short-term interest user sets without affecting the implementation of the present invention.

Illustratively, the medium-to-long-term interest user set includes: a first medium-to-long-term interest user set and a second medium-to-long-term interest user set. The first medium-to-long-term interest user set is a user set constructed from users similar to the group portrait of the seed user set, and may be referred to as the "portrait label similar user set"; the second medium-to-long-term interest user set is a user set constructed from users who have not purchased the target commodity recently but purchased it in the past, and may be referred to as the "attrition user set" for short. In addition, the medium-to-long-term interest user set may include only one of the first and second medium-to-long-term interest user sets without affecting the implementation of the present invention.

In a specific example, the candidate user set specifically includes: the "high potential user set of target commodities", the "high potential user set of similar commodities", the "search recall user set", the "portrait label similar user set", and the "attrition user set". The following describes the construction process of these five candidate user sets.

In this specific example, the process of building the "high potential user set of target commodities" includes: acquiring a first user set that had the first type of operation behavior on the target commodity recently (such as within the last month); determining the preference degree of each user in the first user set for the target commodity according to a trained second machine learning model; and taking all users whose preference degree is greater than a preset threshold, or a preset number of users with the highest preference degrees, as the first short-term interest user set. The second machine learning model may be an XGBoost model (XGBoost stands for eXtreme Gradient Boosting, a boosting algorithm). The second machine learning model may also be another machine learning model without affecting the implementation of the present invention.

In this specific example, the construction process of the "high potential user set of similar commodities" includes: determining similar commodities of the target commodity, and screening out the high potential user set of similar commodities from a second user set that recently had the first type of operation behavior on the similar commodities.

In an alternative embodiment, similar commodities of the target commodity may be determined as follows: querying a correspondence table of commodity brands and their similar brands according to the brand identification of the target commodity, and taking a preset number of the similar brands found (such as the top 10, the top 5, or another number of brands with the highest similarity) as the similar commodities of the target commodity. In specific implementation, the users' behavior sequences over one month can be obtained and processed by a text processing algorithm (such as the word2vec algorithm) to obtain an embedding for each brand; the similarity between brands is calculated from these embeddings, the similar brands of each brand are determined, and the correspondence table of commodity brands and their similar brands is generated based on the similarity.
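
The table-building step can be sketched as follows, assuming the brand embeddings have already been learned (for example by word2vec over behavior sequences, which is not shown here). Cosine similarity is used as the similarity measure; the text does not name a specific measure, so that choice, like the function names, is an assumption.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_similar_brand_table(
    embeddings: Dict[str, Sequence[float]], top_k: int
) -> Dict[str, List[str]]:
    """Map each brand to its top_k most similar other brands."""
    table: Dict[str, List[str]] = {}
    for brand, vector in embeddings.items():
        scored = [
            (cosine(vector, other_vector), other)
            for other, other_vector in embeddings.items()
            if other != brand
        ]
        scored.sort(reverse=True)  # highest similarity first
        table[brand] = [other for _, other in scored[:top_k]]
    return table


# Toy two-dimensional embeddings, made up for the example.
embeddings = {"A": [1.0, 0.0], "B": [0.9, 0.1], "C": [0.0, 1.0]}
similar_brands = build_similar_brand_table(embeddings, top_k=1)
```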

In another alternative embodiment, similar commodities of the target commodity may be determined as follows: querying a correspondence table of commodity categories and their related categories according to the category identification of the target commodity, and taking a preset number of the related categories found as the similar commodities of the target commodity. The related categories are obtained based on the "shopping basket" concept: for example, a customer may buy a toothbrush when buying toothpaste, so toothpaste and toothbrush are related categories. In specific implementation, frequent pattern mining can be used to mine the related categories of each commodity category. For example, the users' order data for the last year may be obtained first, the lift may be calculated from the order data, and the related categories may then be determined based on the lift. The lift measures the degree of correlation between commodity categories. For example, the lift between commodity category A and commodity category B may be defined as the ratio of (the number of users who purchased both category A and category B divided by the number of users who purchased category A) to (the number of users who purchased category B divided by the number of all users).
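
The lift between two categories can be computed directly from per-user purchase records, as in the sketch below. The input layout (a mapping from user to the set of categories that user ordered) and the choice to return 0.0 when a denominator is empty are assumptions for illustration.

```python
from typing import Dict, Set


def lift(orders_by_user: Dict[str, Set[str]], cat_a: str, cat_b: str) -> float:
    """lift(A, B) = (N(A and B) / N(A)) / (N(B) / N), counted over users."""
    total = len(orders_by_user)
    n_a = sum(1 for cats in orders_by_user.values() if cat_a in cats)
    n_b = sum(1 for cats in orders_by_user.values() if cat_b in cats)
    n_ab = sum(1 for cats in orders_by_user.values() if cat_a in cats and cat_b in cats)
    if total == 0 or n_a == 0 or n_b == 0:
        return 0.0  # assumption: treat undefined lift as no association
    return (n_ab / n_a) / (n_b / total)


orders = {
    "u1": {"toothpaste", "toothbrush"},
    "u2": {"toothpaste", "toothbrush"},
    "u3": {"toothpaste"},
    "u4": {"shampoo"},
}
paste_brush = lift(orders, "toothpaste", "toothbrush")  # (2/3) / (2/4)
paste_shampoo = lift(orders, "toothpaste", "shampoo")   # (0/3) / (1/4)
```

A lift above 1 indicates that buyers of category A purchase category B more often than users in general, which is the signal used to treat the two as related categories.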

In this specific example, the construction process of the "search recall user set" includes: obtaining the search keyword records of all users in the last month, finding the users who searched for the target commodity or similar commodities according to the search keyword records, and constructing the "search recall user set" based on these users.
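
A minimal sketch of this recall step follows. It assumes the search log is a sequence of `(user, query)` pairs already limited to the last month, and it matches queries by case-insensitive substring against a hypothetical keyword list for the target and similar commodities; the actual matching logic is not specified in the text.

```python
from typing import Iterable, Set, Tuple


def build_search_recall_users(
    search_log: Iterable[Tuple[str, str]], keywords: Set[str]
) -> Set[str]:
    """Users whose queries mention any keyword of the target or similar commodities."""
    lowered = {keyword.lower() for keyword in keywords}
    return {
        user
        for user, query in search_log
        if any(keyword in query.lower() for keyword in lowered)
    }


search_log = [
    ("u1", "BrandX running shoes"),
    ("u2", "wireless headphones"),
    ("u3", "brandy trail shoes"),
    ("u1", "brandx socks"),
]
recalled_users = build_search_recall_users(search_log, {"BrandX", "BrandY"})
```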

In this specific example, the construction process of the "portrait label similar user set" includes: counting the value distribution of the portrait labels of the users in the seed user set to determine the group portrait corresponding to the seed user set; then constructing the "portrait label similar user set" from users similar to the group portrait.
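
One way to picture the group portrait is as a per-label value distribution over the seed users. The sketch below builds that distribution and scores a user by the average share of seed users who hold the same value on each label. The text does not specify how similarity to the group portrait is measured, so this scoring rule, the label names, and the function names are assumptions.

```python
from collections import Counter
from typing import Dict, List

Portrait = Dict[str, str]  # portrait label name -> label value


def build_group_portrait(seed_portraits: List[Portrait]) -> Dict[str, Dict[str, float]]:
    """Value distribution of every portrait label across the seed users."""
    counts: Dict[str, Counter] = {}
    for portrait in seed_portraits:
        for label, value in portrait.items():
            counts.setdefault(label, Counter())[value] += 1
    return {
        label: {value: n / sum(counter.values()) for value, n in counter.items()}
        for label, counter in counts.items()
    }


def portrait_similarity(user: Portrait, group: Dict[str, Dict[str, float]]) -> float:
    """Average share of seed users holding the user's value on each label."""
    if not group:
        return 0.0
    shares = [dist.get(user.get(label, ""), 0.0) for label, dist in group.items()]
    return sum(shares) / len(group)


seeds = [
    {"gender": "F", "age": "25-34"},
    {"gender": "F", "age": "35-44"},
    {"gender": "M", "age": "25-34"},
    {"gender": "F", "age": "25-34"},
]
group_portrait = build_group_portrait(seeds)
close_user = portrait_similarity({"gender": "F", "age": "25-34"}, group_portrait)
far_user = portrait_similarity({"gender": "M", "age": "35-44"}, group_portrait)
```

Users whose score exceeds some cutoff would then form the "portrait label similar user set"; the cutoff itself would be a business decision.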

Further, in this specific example, the construction process of the "attrition user set" includes: acquiring users who have not purchased the target commodity in the last year but purchased it before that, and constructing the "attrition user set" based on these users.
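
The attrition rule reduces to a date comparison per user, as the sketch below shows. The input layout (each user's purchase dates for the target commodity), the 365-day window, and the function name are assumptions chosen to mirror the "last year" example in the text.

```python
from datetime import date, timedelta
from typing import Dict, List


def build_attrition_users(
    purchase_dates: Dict[str, List[date]], today: date, window_days: int = 365
) -> List[str]:
    """Users who bought the target commodity before, but not within the window."""
    cutoff = today - timedelta(days=window_days)
    attrition = [
        user
        for user, dates in purchase_dates.items()
        if dates and max(dates) < cutoff  # last purchase predates the window
    ]
    return sorted(attrition)


purchase_dates = {
    "u1": [date(2018, 1, 1)],                     # bought long ago only
    "u2": [date(2018, 1, 1), date(2019, 11, 1)],  # bought again recently
    "u3": [],                                     # never bought
}
attrition_users = build_attrition_users(purchase_dates, today=date(2019, 12, 6))
```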

Step S202, responding to the triggering of the crowd expansion task, and acquiring business activity information and a seed user set which need to be subjected to crowd expansion.

For example, execution of the crowd expansion task, that is, step S202, may begin after a crowd expansion request submitted by a demander (such as an advertiser or a marketer) is received. The crowd expansion request may include: the business activity information for which crowd expansion is needed and the seed user set related to the business activity. Further, the business activity information may include at least one of the following: a brand identification of the target commodity involved in the business activity, a category identification of the target commodity involved in the business activity, and a shop identification involved in the business activity. In addition, the crowd expansion request may further include the recall proportion that the demander specifies for each candidate user set. For example, assuming there are five types of candidate user sets in total, namely the "high potential user set of target commodities", the "high potential user set of similar commodities", the "search recall user set", the "portrait label similar user set", and the "attrition user set", the demander can flexibly set the recall proportion, for example to 3:3:2:1:1.
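
How the recall proportion is applied is not detailed in the text. One plausible reading, sketched below, is that a total recall size is split across the five candidate user sets in the given ratio and each set contributes its highest-ranked users up to its quota. The set names, the `total` parameter, and the de-duplication across sets are assumptions for illustration.

```python
from typing import Dict, List, Set


def recall_by_ratio(
    ranked_sets: Dict[str, List[str]], ratio: Dict[str, int], total: int
) -> List[str]:
    """Take users from each ranked candidate set in proportion to its ratio."""
    ratio_sum = sum(ratio.values())
    if ratio_sum == 0:
        return []
    recalled: List[str] = []
    seen: Set[str] = set()
    for name, users in ranked_sets.items():
        quota = total * ratio.get(name, 0) // ratio_sum
        taken = 0
        for user in users:
            if taken == quota:
                break
            if user not in seen:  # a user may appear in several sets
                seen.add(user)
                recalled.append(user)
                taken += 1
    return recalled


ranked_sets = {
    "high_potential_target": ["t1", "t2", "t3", "t4"],
    "high_potential_similar": ["s1", "s2", "s3", "s4"],
    "search_recall": ["q1", "q2", "q3"],
    "portrait_similar": ["p1", "p2"],
    "attrition": ["a1", "a2"],
}
ratio = {
    "high_potential_target": 3,
    "high_potential_similar": 3,
    "search_recall": 2,
    "portrait_similar": 1,
    "attrition": 1,
}
recalled = recall_by_ratio(ranked_sets, ratio, total=10)
```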

Step S203, querying the database table according to the business activity information to obtain the candidate user set corresponding to the business activity information.

In one example, a database table may be queried according to the brand identifier of the target product to find a candidate user identifier corresponding thereto, and then a candidate user set may be constructed based on the found candidate user identifier.

In another example, a database table may be queried according to the category identification of the target product to find a candidate user identification corresponding thereto, and then a candidate user set may be constructed based on the found candidate user identification.
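
The two lookups above can be sketched with an in-memory SQLite table. The table name `candidate_users`, its three columns, and the helper names are hypothetical; the text only says that the database table maps a brand or category identifier to candidate user identifiers.

```python
import sqlite3


def build_table(conn, rows):
    """Create the (hypothetical) lookup table and load (key_type, key_id, user_id) rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS candidate_users ("
        "key_type TEXT, key_id TEXT, user_id TEXT)"
    )
    conn.executemany("INSERT INTO candidate_users VALUES (?, ?, ?)", rows)


def query_candidates(conn, key_type, key_id):
    """Return the candidate user set stored for a brand or category identifier."""
    cursor = conn.execute(
        "SELECT DISTINCT user_id FROM candidate_users "
        "WHERE key_type = ? AND key_id = ?",
        (key_type, key_id),
    )
    return {row[0] for row in cursor}


conn = sqlite3.connect(":memory:")
build_table(conn, [
    ("brand", "B001", "u1"),
    ("brand", "B001", "u2"),
    ("category", "C010", "u2"),
    ("category", "C010", "u3"),
])
brand_candidates = query_candidates(conn, "brand", "B001")
category_candidates = query_candidates(conn, "category", "C010")
```

Parameterized queries are used so that identifiers taken from a crowd expansion request are never interpolated into the SQL string.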

Step S204, extracting partial users from the candidate user set according to a first extraction rule, and taking the extracted partial users and the seed user set as positive sample users.

In an alternative embodiment, the first extraction rule may include: determining the users to be extracted from the candidate user set according to the purchase conversion rate and/or the click conversion rate. For example, the recent purchase conversion rate of each user in the candidate user set on the brand of the commodity involved in the business activity may be analyzed, and the 10 users with the highest purchase conversion rate may be selected and used together with the seed user set as positive sample users. As another example, the recent click conversion rate of each user in the candidate user set on the commodity category involved in the business activity may be analyzed, and the 20 users with the highest click conversion rate may be selected and used together with the seed user set as positive sample users.
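
A minimal sketch of this first extraction rule, assuming the per-user conversion rates have already been computed and are supplied as a dictionary; the function and parameter names are illustrative.

```python
def positive_samples(seed_users, conversion_rate, top_k=10):
    """Union of the seed users and the top_k candidates by conversion rate."""
    ranked = sorted(
        (uid for uid in conversion_rate if uid not in seed_users),
        key=lambda uid: conversion_rate[uid],
        reverse=True,
    )
    return set(seed_users) | set(ranked[:top_k])


rates = {"u1": 0.30, "u2": 0.05, "u3": 0.22, "u4": 0.10, "s1": 0.90}
positives = positive_samples({"s1", "s2"}, rates, top_k=2)
```

Seed users are skipped when ranking so that the extracted users always add new members to the positive sample set.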

Step S205, extracting a part of users as negative sample users according to a second extraction rule.

Step S206, training the first machine learning model according to the user characteristic data of the positive sample users and the negative sample users to obtain a trained first machine learning model.

Step S207, screening out an expanded user set from the candidate user set according to the trained first machine learning model.
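
Steps S206 and S207 can be illustrated with a small stand-in model. The text does not say what kind of model the first machine learning model is, so the plain-Python logistic regression below, its hyperparameters, and the two-dimensional feature vectors are assumptions chosen only to keep the sketch self-contained.

```python
import math


def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(samples, labels, epochs=500, lr=0.5):
    """Fit weights and bias with stochastic gradient descent on the log loss."""
    weights, bias = [0.0] * len(samples[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            p = _sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            grad = p - y  # derivative of the log loss with respect to the logit
            weights = [w - lr * grad * xi for w, xi in zip(weights, x)]
            bias -= lr * grad
    return weights, bias


def preference(model, x):
    """Score in (0, 1) interpreted here as the user's preference degree."""
    weights, bias = model
    return _sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Label 1 marks positive sample users, label 0 marks negative sample users.
features = [[1.0, 1.0], [0.9, 0.8], [0.0, 0.0], [0.1, 0.2]]
labels = [1, 1, 0, 0]
model = train_logistic(features, labels)
```

Each candidate user's feature vector can then be passed to `preference`, and the resulting scores used to screen the expanded user set.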

Further, the method of the embodiment of the present invention may further include the step of evaluating the quality of the expanded user set screened out based on the trained first machine learning model. In a specific implementation, the evaluation can be based on multiple indexes such as accuracy and recall rate.
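
As an illustration of such an evaluation, the sketch below computes precision and recall for an expanded user set. It assumes that "accuracy" is meant in the sense of precision and that a ground-truth set of users who actually converted is available after the campaign; neither assumption is stated in the text.

```python
def precision_recall(expanded, converted):
    """Precision and recall of the expanded user set against observed converters."""
    expanded, converted = set(expanded), set(converted)
    hits = len(expanded & converted)
    precision = hits / len(expanded) if expanded else 0.0
    recall = hits / len(converted) if converted else 0.0
    return precision, recall


precision, recall = precision_recall({"a", "b", "c", "d"}, {"a", "b", "e"})
```

In this example two of the four expanded users converted, and two of the three converters were covered by the expanded set.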

Fig. 3 is a schematic diagram of main blocks of a data processing apparatus according to a third embodiment of the present invention. As shown in fig. 3, a data processing apparatus 300 according to an embodiment of the present invention includes: a determination module 301, an extraction module 302, a training module 303, and a screening module 304.

A determining module 301, configured to determine a candidate user set for crowd expansion in response to a trigger of a crowd expansion task.

Illustratively, the data processing apparatus 300 may start to perform the crowd expansion task, that is, start to determine the candidate user set for crowd expansion through the determining module 301, after receiving a crowd expansion request submitted by the demander (such as an advertiser or a marketer). The crowd expansion request may include: the business activity information that needs crowd expansion and the seed user set related to the business activity. Further, the business activity information that needs crowd expansion may include at least one of the following: a brand identifier related to the business activity, a category identifier related to the business activity, and a shop identifier related to the business activity.

In an alternative embodiment, the determining module 301 determining the candidate user set for crowd expansion includes: the determining module 301 obtains the business activity information that needs crowd expansion; the determining module 301 queries a database table according to the business activity information to obtain a candidate user set corresponding to the business activity information. The database table stores candidate user information (such as candidate user identifiers) corresponding to the business activity information.

In a specific example of this optional embodiment, the business activity information obtained by the determining module 301 is specifically a brand identifier of a target product related to the business activity. The determining module 301 may then query the database table according to the brand identifier of the target product to find the corresponding candidate user identifiers, and construct a candidate user set based on the candidate user identifiers found.

In another specific example of this optional embodiment, the business activity information obtained by the determining module 301 is specifically a category identifier of a target product related to the business activity. The determining module 301 may then query the database table according to the category identifier of the target product to find the corresponding candidate user identifiers, and construct a candidate user set based on the candidate user identifiers found.

An extracting module 302, configured to extract a part of users from the candidate user set according to a first extraction rule, and then use the extracted part of users and a seed user set as positive sample users; and extracting part of the users as negative sample users according to a second extraction rule.

Considering that the seed user set provided by the demander may come from a specific strong rule, such as users who have recently purchased commodities of the brand or category, directly using the seed user set as the positive samples easily causes the model to overfit. In view of this, in the embodiment of the present invention, the extraction module 302 extracts part of the users from the candidate user set according to the first extraction rule, so that the seed user set is appropriately generalized, which mitigates overfitting and enhances the generalization capability of the machine learning model.

In the prior art, all users other than the seed users are used as negative samples; this selection is coarse and leads to a poor training effect. In view of this, in the embodiment of the present invention, the extraction module 302 extracts, according to the second extraction rule, users who are related to the positive samples but differ substantially from them as negative sample users, so as to improve the training effect of the model.

The training module 303 is configured to train the first machine learning model according to the user feature data of the positive sample user and the negative sample user to obtain the trained first machine learning model.

A screening module 304, configured to screen out an expanded user set from the candidate user set according to the trained first machine learning model.

For example, the screening module 304 may determine, according to the trained first machine learning model, the preference degree of each user in the candidate user set for a target commodity; the screening module 304 then takes either all users whose preference degree is greater than a preset threshold, or a preset number of users with the highest preference degrees, as the expanded user set. In a specific implementation, the preset threshold and the preset number can be set by the demander, or flexibly set by the party executing the crowd expansion task according to specific business requirements.
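
The two screening options can be sketched as follows, assuming the preference degrees have already been computed and are supplied as a dictionary keyed by user identifier; the function name and the rule that exactly one option must be given are illustrative choices.

```python
def screen(preferences, threshold=None, top_n=None):
    """Select users above a preference threshold, or the top_n users by preference."""
    if (threshold is None) == (top_n is None):
        raise ValueError("set exactly one of threshold or top_n")
    if threshold is not None:
        return {uid for uid, score in preferences.items() if score > threshold}
    ranked = sorted(preferences, key=preferences.get, reverse=True)
    return set(ranked[:top_n])


scores = {"u1": 0.91, "u2": 0.40, "u3": 0.75, "u4": 0.62}
by_threshold = screen(scores, threshold=0.6)
by_count = screen(scores, top_n=2)
```

The threshold option lets the size of the expanded user set vary with the score distribution, while the count option fixes the size regardless of how high the scores are.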

In the embodiment of the invention, crowd expansion is realized through the device. Compared with the prior art, the device provided by the embodiment of the invention can improve the training effect of the machine learning model in population expansion and improve the accuracy of population expansion. In addition, the embodiment of the invention does not need to rely on the social network of the user when carrying out crowd expansion, thereby improving the universality of the crowd expansion.

Fig. 4 is a schematic diagram of main blocks of a data processing apparatus according to a fourth embodiment of the present invention. As shown in fig. 4, a data processing apparatus 400 according to an embodiment of the present invention includes: a construction module 401, a determination module 402, an extraction module 403, a training module 404, and a screening module 405.

A building module 401, configured to build a database table, where the database table is used to store a candidate user set corresponding to a brand identifier or a category identifier of a commodity.

A determination module 402 configured to determine a set of candidate users for crowd expansion in response to a trigger of a crowd expansion task.

Illustratively, the data processing apparatus 400 may begin determining the candidate user set for crowd expansion through the determining module 402 upon receiving a crowd expansion request submitted by the demander (such as an advertiser or a marketer). The crowd expansion request may include: the business activity information that needs crowd expansion and the seed user set related to the business activity. Further, the business activity information that needs crowd expansion may include at least one of the following: a brand identifier related to the business activity, a category identifier related to the business activity, and a shop identifier related to the business activity.

In an embodiment of the present invention, the determining module 402 determining the candidate user set for crowd expansion includes: the determining module 402 acquires the business activity information that needs crowd expansion; the determining module 402 queries the database table according to the business activity information to obtain a candidate user set corresponding to the business activity information. The database table stores candidate user information (such as candidate user identifiers) corresponding to the business activity information.

An extracting module 403, configured to extract a part of users from the candidate user set according to a first extraction rule, and then use the extracted part of users and a seed user set as positive sample users; and extracting part of the users as negative sample users according to a second extraction rule.

Considering that the seed user set provided by the demander may come from a specific strong rule, such as users who have recently purchased commodities of the brand or category, directly using the seed user set as the positive samples easily causes the model to overfit. In view of this, in the embodiment of the present invention, the extraction module 403 extracts part of the users from the candidate user set according to the first extraction rule, so that the seed user set is appropriately generalized, which mitigates overfitting and enhances the generalization capability of the machine learning model.

In the prior art, all users other than the seed users are used as negative samples; this selection is coarse and leads to a poor training effect. In view of this, in the embodiment of the present invention, the extraction module 403 extracts, according to the second extraction rule, users who are related to the positive samples but differ substantially from them as negative sample users, so as to improve the training effect of the model.

A training module 404, configured to train the first machine learning model according to the user feature data of the positive sample user and the negative sample user, so as to obtain a trained first machine learning model.

Wherein the user characteristic data can be constructed based on user characteristics in the following dimensions: the user portrait type characteristics, the behavior characteristics of the user to the target commodity related to the business activity (such as behavior characteristics of purchasing, shopping cart adding, clicking, attention and the like), the behavior characteristics of the user to similar commodities, and the word segmentation characteristics of the purchased commodities. In specific implementation, the user characteristic data can be constructed in advance, for example, the step of constructing the user characteristic data can be executed routinely every day, so that the user characteristic data can be directly acquired from the database after the crowd expansion task is triggered, and the execution efficiency of the crowd expansion task is improved.
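
One way to assemble the four feature dimensions into a single vector is sketched below. The function name, the argument layout, the behavior keys (with "follow" standing in for the "attention" behavior), and the bag-of-words encoding of the word segmentation features are assumptions made for illustration.

```python
BEHAVIORS = ("purchase", "add_to_cart", "click", "follow")


def build_feature_vector(portrait, target_behavior, similar_behavior, title_tokens, vocabulary):
    """Concatenate portrait, behavior-count, and word-presence features into one list."""
    vector = [float(value) for value in portrait]
    # Behavior counts on the target commodity, then on similar commodities.
    vector += [float(target_behavior.get(name, 0)) for name in BEHAVIORS]
    vector += [float(similar_behavior.get(name, 0)) for name in BEHAVIORS]
    # One indicator per vocabulary word found in the purchased commodities' titles.
    tokens = set(title_tokens)
    vector += [1.0 if word in tokens else 0.0 for word in vocabulary]
    return vector


vector = build_feature_vector(
    portrait=[1, 0, 3],
    target_behavior={"purchase": 2, "click": 5},
    similar_behavior={"add_to_cart": 1},
    title_tokens=["wireless", "headphones"],
    vocabulary=["headphones", "laptop", "wireless"],
)
```

Because the vector layout is fixed by `BEHAVIORS` and `vocabulary`, vectors computed in a daily batch job stay comparable with those read back when a crowd expansion task is triggered.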

A screening module 405, configured to screen out an expanded user set from the candidate user set according to the trained first machine learning model.

For example, the screening module 405 may determine, according to the trained first machine learning model, the preference degree of each user in the candidate user set for a target commodity; the screening module 405 then takes either all users whose preference degree is greater than a preset threshold, or a preset number of users with the highest preference degrees, as the expanded user set. In a specific implementation, the preset threshold and the preset number can be set by the demander, or flexibly set by the party executing the crowd expansion task according to specific business requirements.

Fig. 5 shows an exemplary system architecture 500 of a data processing method or data processing apparatus to which embodiments of the present invention may be applied.

As shown in fig. 5, the system architecture 500 may include terminal devices 501, 502, 503, a network 504, and a server 505. The network 504 serves to provide a medium for communication links between the terminal devices 501, 502, 503 and the server 505. Network 504 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.

The user may use the terminal devices 501, 502, 503 to interact with a server 505 over a network 504 to receive or send messages or the like. The terminal devices 501, 502, 503 may have various communication client applications installed thereon, such as a shopping application, a web browser application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like.

The terminal devices 501, 502, 503 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.

The server 505 may be a server providing various services, such as a background management server providing support for crowd expansion tasks submitted by users using the terminal devices 501, 502, 503. The background management server can perform analysis and other processing after receiving the crowd expansion task, and feed back a processing result (such as an expanded user set) to the terminal device.

It should be noted that the data processing method provided by the embodiment of the present invention is generally executed by the server 505, and accordingly, the data processing apparatus is generally disposed in the server 505.

It should be understood that the number of terminal devices, networks, and servers in fig. 5 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

Referring now to FIG. 6, shown is a block diagram of a computer system 600 suitable for use with the electronic device implementing an embodiment of the present invention. The computer system illustrated in FIG. 6 is only one example and should not impose any limitations on the scope of use or functionality of embodiments of the invention.

As shown in fig. 6, the computer system 600 includes a Central Processing Unit (CPU) 601 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the system 600 are also stored. The CPU 601, ROM 602, and RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.

The following components are connected to the I/O interface 605: an input portion 606 including a keyboard, a mouse, and the like; an output portion 607 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card, a modem, or the like. The communication section 609 performs communication processing via a network such as the internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 610 as necessary, so that a computer program read out therefrom is installed into the storage section 608 as necessary.

In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 609, and/or installed from the removable medium 611. The computer program performs the above-described functions defined in the system of the present invention when executed by the Central Processing Unit (CPU) 601.

It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor includes a determination module, an extraction module, a training module, and a screening module. Where the names of these modules do not in some cases constitute a limitation of the module itself, for example, a determination module may also be described as a "module that determines a set of candidate users".

As another aspect, the present invention also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments; or may be separate and not incorporated into the device. The computer readable medium carries one or more programs which, when executed by a device, cause the device to perform the following: in response to triggering of a crowd expansion task, determining a set of candidate users for crowd expansion; extracting partial users from the candidate user set according to a first extraction rule, and then taking the extracted partial users and a seed user set as positive sample users; extracting part of users as negative sample users according to a second extraction rule; training a first machine learning model according to the user characteristic data of the positive sample user and the negative sample user to obtain a trained first machine learning model; and screening out an expanded user set from the candidate user set according to the trained first machine learning model.

The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method of data processing, the method comprising:

in response to triggering of a crowd expansion task, determining a set of candidate users for crowd expansion;

extracting partial users from the candidate user set according to a first extraction rule, and then taking the extracted partial users and a seed user set as positive sample users; extracting part of users as negative sample users according to a second extraction rule;

training a first machine learning model according to the user characteristic data of the positive sample user and the negative sample user to obtain a trained first machine learning model;

and screening out an expanded user set from the candidate user set according to the trained first machine learning model.

2. The method of claim 1, wherein determining the set of candidate users for population expansion comprises:

acquiring business activity information needing population expansion; inquiring a database table according to the business activity information to obtain a candidate user set corresponding to the business activity information; the business activity information comprises at least one of brand identification of target commodities related to business activities, category identification of target commodities related to business activities and shop identification related to business activities.

3. The method of claim 2, wherein the set of candidate users comprises: a short-term interest user set and a medium-long term interest user set; the short-term interest user set is a user set which is screened out based on short-term behavior characteristic data of users and is interested in the target commodity; the medium-long term interest user set is a user set which is screened out based on medium-long term behavior characteristic data of users and is interested in the target commodity.

4. The method of claim 3, wherein the set of short-term interest users comprises: a first short-term interest user set, a second short-term interest user set, and a third short-term interest user set; the method further comprises the following steps:

screening a first short-term interest user set from a first user set which has a first type of operation behaviors on a target commodity recently; determining similar commodities of the target commodity, and screening a second short-term interest user set from a second user set which has a first type of operation behavior on the similar commodities recently; and screening out a third short-term interest user set from a third user set which has a second type of operation behaviors on the target commodity or the similar commodity in the near future.

5. The method of claim 4, wherein filtering out a first set of short-term interest users from a first set of users who have a first type of operational behavior recently on a target good comprises:

acquiring a first user set which has a first type of operation behavior on a target commodity recently; determining the preference degree of each user in the first user set to the target commodity according to the trained second machine learning model; and taking all users with the preference degrees larger than a preset threshold value or taking a preset number of users with the maximum preference degrees as a first short-term interest user set.

6. The method of claim 2, wherein the medium-long term interest user set comprises: a first medium-long term interest user set and a second medium-long term interest user set; the method further comprises the following steps:

counting the value distribution of the portrait labels corresponding to each user in the seed user set to determine a group portrait corresponding to the seed user set; constructing the first medium-long term interest user set according to users similar to the group portrait; and constructing the second medium-long term interest user set according to the users who have had no purchasing behavior on the target commodity recently but have purchased the target commodity before.

7. The method of claim 1, wherein the filtering out a set of expanded users from the set of candidate users according to the trained first machine learning model comprises:

determining the preference degree of each user in the candidate user set to the target commodity according to the trained first machine learning model; and taking all users with the preference degrees larger than a preset threshold value or the users with the maximum preference degrees in a preset number as an expanded user set.

8. A data processing apparatus, characterized in that the apparatus comprises:

a determining module, configured to determine a candidate user set for crowd expansion in response to triggering of a crowd expansion task;

an extraction module, configured to extract partial users from the candidate user set according to a first extraction rule and then take the extracted partial users and a seed user set as positive sample users, and further configured to extract part of users as negative sample users according to a second extraction rule;

a training module, configured to train a first machine learning model according to the user characteristic data of the positive sample users and the negative sample users to obtain a trained first machine learning model;

and a screening module, configured to screen out an expanded user set from the candidate user set according to the trained first machine learning model.

9. An electronic device, comprising:

one or more processors;

a storage device for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.

10. A computer-readable medium, on which a computer program is stored, which program, when being executed by a processor, is adapted to carry out the method of any one of claims 1 to 7.