CN113312552A - Data processing method, device, electronic equipment and medium - Google Patents
- Publication number: CN113312552A
- Application number: CN202110647282.4A
- Authority: CN (China)
- Prior art keywords: user, decision tree, nodes, users, recall
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F16/9535 — Search customisation based on user profiles and personalisation (G06F: electric digital data processing; G06F16/00: information retrieval; G06F16/95: retrieval from the web; G06F16/953: querying, e.g. by the use of web search engines)
- G06F18/214 — Generating training patterns; bootstrap methods, e.g. bagging or boosting (G06F18/00: pattern recognition; G06F18/21: design or setup of recognition systems or techniques)
- G06F18/24323 — Tree-organised classifiers (G06F18/24: classification techniques; G06F18/243: classification techniques relating to the number of classes)
Abstract
The disclosure provides a data processing method, a data processing apparatus, an electronic device, and a medium, relating to the field of artificial intelligence, and in particular to the technical fields of deep learning and intelligent recommendation. The implementation scheme is as follows: obtain sample user data; perform supervised training on the sample user data based on a binary classification algorithm to generate a decision tree; and generate a recall rule associated with a user attribute based on one or more nodes in the decision tree.
Description
Technical Field
The present disclosure relates to the field of artificial intelligence, in particular to the technical field of deep learning and intelligent recommendation, and in particular to a data processing method, apparatus, electronic device, computer-readable storage medium, and computer program product.
Background
Artificial intelligence is the discipline that studies how to make computers simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), covering both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, and knowledge graph technologies.
Determining target users through data processing is helpful for making personalized recommendations. For example, if data processing determines that a user is a target user, content or services related to the target may be recommended to that user.
The approaches described in this section are not necessarily approaches that have been previously conceived or pursued. Unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, unless otherwise indicated, the problems mentioned in this section should not be considered as having been acknowledged in any prior art.
Disclosure of Invention
The present disclosure provides a data processing method, an apparatus, an electronic device, a computer-readable storage medium, and a computer program product.
According to an aspect of the present disclosure, there is provided a data processing method including: obtaining sample user data; carrying out supervised training on the sample user data based on a binary classification algorithm and generating a decision tree; and generating a recall rule associated with a user attribute based on one or more nodes in the decision tree.
According to an aspect of the present disclosure, there is provided a data processing apparatus including: an acquisition module configured to acquire sample user data; a training module configured to perform supervised training on the sample user data based on a binary classification algorithm and generate a decision tree; and a recall rule generation module configured to generate a recall rule associated with a user attribute based on one or more nodes in the decision tree.
According to an aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, which when executed by the at least one processor, enable the at least one processor to perform a data processing method.
According to an aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute a data processing method.
According to an aspect of the disclosure, a computer program product is provided, comprising a computer program, wherein the computer program realizes the data processing method when executed by a processor.
According to one or more embodiments of the disclosure, the target user is determined through supervised training, which expands the mining range and improves the mining accuracy for target users. It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the disclosure and, together with the description, serve to explain their exemplary implementations. The illustrated embodiments are for purposes of illustration only and do not limit the scope of the claims. Throughout the drawings, identical reference numbers designate similar, but not necessarily identical, elements.
Fig. 1 shows a flow chart of a data processing method according to an exemplary embodiment of the present disclosure;
FIG. 2 shows a schematic diagram of generating recall rules according to an example embodiment of the present disclosure;
FIG. 3 shows a schematic diagram of a ranking model workflow according to an exemplary embodiment of the present disclosure;
FIG. 4 illustrates a schematic diagram of generating a recall rule based on a decision tree in accordance with an exemplary embodiment of the present disclosure;
fig. 5 shows a block diagram of a data processing apparatus according to an exemplary embodiment of the present disclosure; and
FIG. 6 illustrates a block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the present disclosure, unless otherwise specified, the use of the terms "first", "second", etc. to describe various elements is not intended to limit the positional relationship, the timing relationship, or the importance relationship of the elements, and such terms are used only to distinguish one element from another. In some examples, a first element and a second element may refer to the same instance of the element, and in some cases, based on the context, they may also refer to different instances.
The terminology used in the description of the various described examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, if the number of elements is not specifically limited, the elements may be one or more. Furthermore, the term "and/or" as used in this disclosure is intended to encompass any and all possible combinations of the listed items.
In the related art, target users are generally mined and determined according to business rules based on attributes such as age, gender, region, education level, or income. Formulating such business rules requires extensive accumulated business experience and thorough analysis of the relevant domain knowledge. This approach places high demands on domain expertise and is difficult to implement; moreover, rules obtained by analysis and enumeration may still be incomplete, limited in coverage, and low in accuracy.
To address one or more of the above issues, the present disclosure generates recall rules for recalling users through supervised training based on a binary classification algorithm, effectively extending the recall rules and enabling users to be recalled over a larger range.
In the technical solution of the present disclosure, the acquisition, storage, and use of the personal information of the related users (e.g., historical behavior information and geographic location information) all comply with relevant laws and regulations and do not violate public order and good customs. The personal information of users is desensitized (i.e., anonymized) during acquisition, storage, and use.
The data processing method of the present disclosure will be further described below with reference to the accompanying drawings.
Fig. 1 shows a flowchart of a data processing method according to an exemplary embodiment of the present disclosure.
As shown in FIG. 1, the method includes: S101, obtaining sample user data; S102, performing supervised training on the sample user data based on a binary classification algorithm and generating a decision tree; and S103, generating a recall rule related to a user attribute based on one or more nodes in the decision tree.
According to some embodiments, the binary classification algorithm may be one or more of Logistic Regression (LR), Gradient Boosting Decision Tree (GBDT), and eXtreme Gradient Boosting (XGBoost), or any combination thereof; the disclosure does not limit the type of binary classification algorithm.
For convenience of description and understanding, the following takes XGBoost as the binary classification algorithm to describe the technical solution of the present disclosure. It should be noted that the technical solution of the present disclosure is not limited to XGBoost and is also applicable to other binary classification algorithms.
According to some embodiments, sample users may be determined according to the definition of the target user. For example, the target user may be a potential user of a certain product or service. In this case, users who purchased the product or service and users who did not may serve as the positive and negative samples, respectively, for supervised training. Instances of the sample users' attributes can be used as the feature input. For example, the user attribute may be the user's search terms; correspondingly, the word set obtained by word-segmenting the sample users' search histories can be used as the feature input for supervised training. It is understood that the nodes of the decision tree finally generated by supervised training may be all or some of the words in this word set.
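The sample-construction step above can be sketched as follows. This is a minimal, illustrative pure-Python sketch (the user records, field names, and tokenization by whitespace are assumptions for demonstration, not the patent's implementation):

```python
# Sketch: assembling labeled samples from hypothetical user records.
# Positive samples are users who purchased the product; negatives did not.
# Each user's search history is reduced to a set of terms, which serves as
# the feature input for supervised training.

def build_samples(users):
    """users: list of dicts with 'search_history' (list of queries,
    here word-segmented naively by whitespace) and 'purchased' (bool)."""
    samples = []
    for u in users:
        terms = set()
        for query in u["search_history"]:
            terms.update(query.split())  # stand-in for real word segmentation
        label = 1 if u["purchased"] else 0  # positive / negative sample
        samples.append((terms, label))
    return samples

users = [
    {"search_history": ["term life insurance", "insurance quote"], "purchased": True},
    {"search_history": ["weather today"], "purchased": False},
]
samples = build_samples(users)
```

In practice the word segmentation would be done by a proper tokenizer and the term sets would be encoded (e.g., one-hot) before being fed to the binary classifier.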
According to some embodiments, step S103 further comprises: performing a weighted calculation on one or more nodes in the decision tree to generate the recall rule associated with a user attribute. This process is equivalent to screening, among all nodes of the decision tree and according to a preset calculation rule, the features most correlated with a target behavior of the target user (for example, purchasing a certain product or service), and using the screened features as the recall rule to improve the accuracy of target-user mining. The specific manner of the weighted calculation of nodes is described in detail in the following embodiments.
According to some embodiments, performing a weighted calculation on one or more nodes in the decision tree to generate the recall rule associated with a user attribute comprises: generating a feature value list based on respective feature values of one or more nodes in the decision tree, wherein a feature value is a product of the number of times that the respective node is used for splitting and a gain when the node is split, and the feature value list comprises the one or more nodes in the decision tree and the corresponding feature values thereof; and sorting the list of feature values and generating the recall rule related to the user attribute.
It can be understood that the feature value of a node equals the number of times the node is used for splitting multiplied by the gain at splitting; the larger the feature value, the more strongly the feature correlates with the target behavior of the target user, so the feature value calculation provides a way to filter features. A feature value list is generated for all nodes in the decision tree, and the nodes are sorted by feature value according to this formula, so that features with high feature values can be selected as the recall rule for recalling users.
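A minimal sketch of this feature-value ranking, assuming the trained trees have been flattened into (feature, gain) records, one per split (the split data below is illustrative):

```python
from collections import defaultdict

# Sketch: feature value = (number of splits using the feature) x (mean gain
# of those splits), which equals the feature's total gain. Features are
# ranked by this value to pick recall rules.

def rank_features(split_records, top_k=2):
    counts = defaultdict(int)
    gain_sums = defaultdict(float)
    for feature, gain in split_records:
        counts[feature] += 1
        gain_sums[feature] += gain
    # count x mean gain (= total gain per feature)
    scores = {f: counts[f] * (gain_sums[f] / counts[f]) for f in counts}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]

splits = [("loan", 8.0), ("loan", 6.0), ("weather", 1.0), ("repay", 5.0)]
top = rank_features(splits)  # highest-value features first
```

Note that count multiplied by mean gain is simply the total gain; when XGBoost itself is used, a comparable ranking can likely be obtained from `Booster.get_score(importance_type="total_gain")`, though the patent describes the calculation in its own terms.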
According to some embodiments, the data processing method further comprises: after the recall rule is generated through the steps, the users can be matched in the data source according to the recall rule, wherein the matched users are recallable users.
According to some embodiments, the user attribute may correspond to the user's search terms and, accordingly, the data source may be the search logs of the full set of users. It is understood that a user's search terms may reflect the user's intent to some extent; for example, when a user searches for a certain product, it may be presumed that the user intends to purchase it. Therefore, users recalled according to search terms can be identified with high accuracy.
In one example, the method may include: taking users who purchased a certain product and users who did not as positive and negative samples, respectively; performing word segmentation on the sample users' search histories to obtain a set of search terms; performing multiple rounds of XGBoost training with the positive samples, the negative samples, and the set of search terms as features to obtain decision trees; performing weighted calculation on all nodes in the decision trees and sorting them to obtain a feature value list, where search terms corresponding to higher-weight nodes can serve as the recall rule ultimately used for recalling users; and matching the search terms obtained in supervised training against the search logs of the full set of users, where users corresponding to matched results are the recallable users. When matching search terms in the search logs, the time range of the data source can be set as required; for example, the data source may be the search logs of the full set of users within one year.
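The final matching step can be sketched as a set intersection between each user's logged terms and the recall-rule terms (the log contents and matching criterion below are illustrative assumptions):

```python
# Sketch: matching recall-rule terms against a hypothetical search log to
# find recallable users. Here a user is recallable if any term of any
# logged query matches a recall term.

def recall_users(search_log, recall_terms):
    """search_log: mapping user_id -> list of query strings."""
    recalled = set()
    rule = set(recall_terms)
    for user_id, queries in search_log.items():
        terms = set()
        for q in queries:
            terms.update(q.split())
        if terms & rule:  # non-empty intersection -> user matches the rule
            recalled.add(user_id)
    return recalled

log = {
    "u1": ["cheap loan rates", "news"],
    "u2": ["football scores"],
    "u3": ["how to repay loan"],
}
matched = recall_users(log, ["loan", "repay"])  # recalls u1 and u3
```

A production system would run this match over log storage (e.g., a MapReduce or SQL job) rather than in memory, but the matching logic is the same.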
It should be noted that, in the embodiment of the present disclosure, the search log of the user is acquired, stored, and used based on the authorization and approval of the user. And the search logs are desensitized (i.e., anonymized) during acquisition, storage and use.
According to some embodiments, the search term may be subjected to word expansion processing by a word vector algorithm to further expand the recall rule.
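One way the word-expansion step might look, sketched with toy vectors (the vector values are illustrative placeholders, not a trained word2vec model; a real system would use trained embeddings):

```python
import math

# Sketch: expand a seed recall term by cosine similarity over word vectors,
# adding near-synonyms to the recall rule.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def expand_term(seed, vectors, threshold=0.9):
    """Return terms whose vector is within `threshold` cosine similarity
    of the seed term's vector."""
    return [w for w, v in vectors.items()
            if w != seed and cosine(vectors[seed], v) >= threshold]

vectors = {  # toy 3-d "embeddings" for illustration only
    "loan":   [0.9, 0.1, 0.0],
    "credit": [0.8, 0.2, 0.1],
    "soccer": [0.0, 0.1, 0.9],
}
expanded = expand_term("loan", vectors)  # picks up "credit", not "soccer"
```

With a real embedding library this step would likely be a nearest-neighbor query (e.g., gensim's `KeyedVectors.most_similar`), with the similarity threshold tuned on held-out data.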
According to other embodiments, the user attribute may further correspond to the Applications (APPs) installed by the user, and accordingly the data source is an APP installation log. Similar to search terms, the APPs installed by a user can also serve as a user attribute for determining the target user, further expanding the range of user recall beyond search terms.
FIG. 2 shows a schematic diagram of generating recall rules according to an exemplary embodiment of the present disclosure. As shown in FIG. 2, the process of generating recall rules from the user's installed APPs is similar to the workflow for search terms and is not repeated here. Furthermore, after obtaining the list of high-feature-value APPs, the positive sample rate of each APP may be calculated and the list re-ranked by positive sample rate, so as to obtain a high-intent APP cluster as the recall rule.
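The positive-sample-rate re-ranking can be sketched as below; the install data and app names are illustrative assumptions:

```python
# Sketch: for each high-feature-value APP, the positive sample rate is the
# fraction of its installers that are positive samples; APPs are then
# re-ranked by this rate to form a high-intent cluster.

def positive_sample_rate(installs, positives):
    """installs: app -> set of user ids; positives: set of positive users."""
    rates = {}
    for app, users in installs.items():
        rates[app] = len(users & positives) / len(users)
    return sorted(rates.items(), key=lambda kv: kv[1], reverse=True)

installs = {
    "finance_app": {"u1", "u2", "u3", "u4"},
    "game_app": {"u2", "u5"},
}
positives = {"u1", "u3", "u4"}
rates = positive_sample_rate(installs, positives)  # finance_app ranks first
```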
According to some embodiments, the user attribute may also correspond to a user's post visit history, geographic location information, and so on, so that users can be recalled from different data sources, expanding the recall range. The specific user attributes and their corresponding data sources can be determined according to the definition of the target user, or set according to the specific application scenario. It is understood that, for different user attributes, the process of generating the recall rule is similar to that for search terms, and the details are not repeated here.
It should be noted that, in the embodiment of the present disclosure, similar to the search log of the user, the historical behavior information such as the post access history of the user and the geographic location information are acquired, stored and used based on the authorization and the consent of the user. The information is desensitized (i.e., anonymized) during acquisition, storage, and use.
In one example, the recallable users obtained by the method may be a collection of users recalled from the various data sources described above using different recall rules.
According to some embodiments, the data processing method further comprises: inputting user characteristics of the recallable users into a preset ranking model; and determining whether the recallable user is a target user according to an output result of the ranking model. The preset ranking model can be a model obtained by supervised training based on a binary classification algorithm. The ranking model further screens the recallable users, and the mining accuracy of the target users is improved.
FIG. 3 shows a schematic diagram of a ranking model workflow according to an exemplary embodiment of the present disclosure. As shown in FIG. 3, the workflow may be divided into two parts: 301, training the ranking model; and 302, scoring and ranking users with the ranking model. In 301, users who purchased a product and users who did not may be used as positive and negative samples, respectively, and supervised training based on a binary classification algorithm is performed with the samples, the sample users' characteristics (e.g., user portrait, internet behavior, derivative features), and the product's characteristics as input, to obtain the ranking model. In 302, the user characteristics of the recallable users are input into the trained ranking model, which scores and ranks the recallable users and outputs a score ranking, where a score represents the probability that the corresponding recallable user purchases the product.
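Step 302 can be sketched with a toy logistic scorer standing in for the trained ranking model (the weights and feature names below are illustrative, not learned parameters):

```python
import math

# Sketch: score each recallable user with a sigmoid over weighted features
# (a stand-in for the trained binary-classification ranking model), then
# sort users by score, i.e., by estimated purchase probability.

def score(features, weights, bias=0.0):
    z = bias + sum(weights.get(k, 0.0) * v for k, v in features.items())
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid -> probability in (0, 1)

def rank_users(user_features, weights):
    scored = [(uid, score(f, weights)) for uid, f in user_features.items()]
    return sorted(scored, key=lambda kv: kv[1], reverse=True)

weights = {"searched_product": 2.0, "visited_post": 1.0}  # illustrative
user_features = {
    "u1": {"searched_product": 1.0, "visited_post": 1.0},
    "u2": {"visited_post": 1.0},
}
ranking = rank_users(user_features, weights)  # u1 scores higher than u2
```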
According to some embodiments, the data processing method may further include: setting a push rule based on the score ranking output by the ranking model, and pushing related products or services to corresponding target users according to the push rule. For example, the precision and recall at each candidate score threshold may be examined, and when they satisfy preset criteria, related products or services are pushed to all recallable users above that threshold. The specific push rules and criteria may be set according to the application scenario, which is not limited in this disclosure.
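The threshold check described above amounts to computing precision and recall over the score-sorted users; a minimal sketch with illustrative data:

```python
# Sketch: precision and recall for a candidate score threshold. "Pushed"
# users are those at or above the threshold; precision is the fraction of
# pushed users who are true buyers, recall the fraction of true buyers
# who are pushed.

def precision_recall(scored_users, true_buyers, threshold):
    pushed = {u for u, s in scored_users if s >= threshold}
    hits = pushed & true_buyers
    precision = len(hits) / len(pushed) if pushed else 0.0
    recall = len(hits) / len(true_buyers) if true_buyers else 0.0
    return precision, recall

scored = [("u1", 0.9), ("u2", 0.8), ("u3", 0.4), ("u4", 0.2)]
buyers = {"u1", "u3"}
p, r = precision_recall(scored, buyers, threshold=0.5)
```

Sweeping the threshold over the sorted scores yields a precision/recall trade-off curve from which a push threshold meeting the preset criteria can be chosen.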
FIG. 4 shows a schematic diagram of generating a recall rule based on a decision tree according to an exemplary embodiment of the present disclosure. It should be noted that this embodiment adopts the application scenario of recommending to potential insurance users only for the purpose of illustrating the disclosed solution; those skilled in the art will understand that the solution of the present disclosure is not limited to this application scenario, and other application scenarios are also contemplated based on the teachings of the present disclosure. In this embodiment, the target user is a credit high-risk user, the user attribute is the user's search terms, and the decision tree shown in FIG. 4 can be obtained through XGBoost training on sample user data, where each node of the tree is a search term from the users and corresponds to a gain. Through multiple rounds of XGBoost training, multiple decision trees like that in FIG. 4 can be obtained; here, only FIG. 4 is taken as an example to describe the process of generating a recall rule from the decision trees. According to the feature value formula introduced in the present disclosure, i.e., the product of the number of times a search term appears across all decision tree nodes and the gain corresponding to the node, the feature value of each node can be calculated. The larger the feature value, the more strongly the corresponding search term correlates with the behavior of credit high-risk users, and the search terms corresponding to high-feature-value nodes constitute the recall rule. Generating recall rules and recalling users in this way offers a wide recall range and high accuracy.
According to another aspect of the present disclosure, there is provided a data processing apparatus for performing any of the above methods. As shown in FIG. 5, the data processing apparatus 500 includes: an obtaining module 501 configured to obtain sample user data; a training module 502 configured to perform supervised training on the sample user data based on a binary classification algorithm and generate a decision tree; and a recall rule generation module 503 configured to generate a recall rule associated with a user attribute based on one or more nodes in the decision tree.
The operations of the modules 501-503 of the data processing apparatus 500 are similar to the operations of steps S101-S103 described above and are not repeated here.
According to some embodiments, the recall rule generation module 503 is further configured to: perform a weighted calculation on one or more nodes in the decision tree to generate the recall rule associated with a user attribute. This process is equivalent to filtering out, via the calculation, the features most relevant to the target behavior of the target user (e.g., purchasing a certain product or service); the specific manner of the weighted calculation of nodes is as described in the foregoing embodiments.
According to some embodiments, the recall rule generation module 503 comprises: a list generating unit configured to generate a feature value list based on respective feature values of one or more nodes in the decision tree, wherein a feature value is a product of a number of times that the respective node is used for splitting and a gain when the node is split, and the feature value list includes the one or more nodes in the decision tree and their corresponding feature values; and a sorting unit that sorts the feature value list and generates the recall rule related to the user attribute.
It is understood that the feature value of a node equals the number of times the node is used for splitting multiplied by the gain at splitting, and a larger feature value indicates that the feature is more relevant to the target behavior of the target user. The list generating unit generates a feature value list from all nodes in the decision tree, and the sorting unit sorts the nodes by feature value according to this formula, so as to select the features with larger feature values as the recall rule for recalling users.
According to some embodiments, the data processing apparatus further comprises: and the matching module is configured to match users in the data source according to the recall rule, wherein the matched users are recallable users.
According to some embodiments, the user attribute may correspond to the user's search terms and, accordingly, the data source may be the search logs of the full set of users. It is understood that a user's search terms may reflect the user's intent to some extent; for example, when a user searches for a certain product, it may be presumed that the user intends to purchase it.
According to some embodiments, the search term may be subjected to word expansion processing by a word vector algorithm to further expand the recall rule.
According to other embodiments, the user attribute may further correspond to an Application (APP) installed by the user, and accordingly, the data source is an APP installation log. Similar to the search term, the APP installed by the user may also be used as a user attribute for determining the target user.
It should be noted that, in the embodiment of the present disclosure, the historical behavior information and the like of the user are obtained, stored and used based on the authorization and the consent of the user. The information is desensitized (i.e., anonymized) during acquisition, storage, and use.
In one example, the recallable users obtained by the matching module may be a collection of users recalled from different data sources applying different recall rules.
According to some embodiments, the data processing apparatus further comprises: an input module: configured to input user characteristics of the recallable users into a preset ranking model; and a determination module configured to determine whether the recallable user is a target user according to an output result of the ranking model. The preset ranking model can be a model obtained by supervised training based on a binary classification algorithm.
According to some embodiments, the data processing apparatus may further comprise: a pushing module configured to set a push rule based on the score ranking output by the ranking model and push related products or services to corresponding target users according to the push rule. For example, the precision and recall at each candidate score threshold may be examined, and when they satisfy preset criteria, related products or services are pushed to all recallable users above that threshold. The specific push rules and criteria may be set according to the application scenario, which is not limited in this disclosure.
According to another aspect of the present disclosure, there is also provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the data processing method described above.
According to another aspect of the present disclosure, there is also provided a non-transitory computer readable storage medium storing computer instructions for causing the computer to execute the above-described data processing method.
According to another aspect of the present disclosure, there is also provided a computer program product comprising a computer program, wherein the computer program realizes the above-mentioned data processing method when executed by a processor.
Referring to fig. 6, a block diagram of an electronic device 600 will now be described. The electronic device 600, which may be a server of the present disclosure, is an example of a hardware device that may be applied to aspects of the present disclosure. The electronic device is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The components shown herein, their connections and relationships, and their functions are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the device 600 includes a computing unit 601, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 602 or a computer program loaded from a storage unit 608 into a random access memory (RAM) 603. The RAM 603 can also store various programs and data required for the operation of the device 600. The computing unit 601, the ROM 602, and the RAM 603 are connected to one another via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
A number of components in the device 600 are connected to the I/O interface 605, including: an input unit 606, an output unit 607, a storage unit 608, and a communication unit 609. The input unit 606 may be any type of device capable of inputting information to the device 600; it may receive input numeric or character information, generate key signal inputs related to user settings and/or function control of the electronic device, and may include, but is not limited to, a mouse, a keyboard, a touch screen, a trackpad, a trackball, a joystick, a microphone, and/or a remote control. The output unit 607 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, a video/audio output terminal, a vibrator, and/or a printer. The storage unit 608 may include, but is not limited to, a magnetic disk or an optical disk. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network, such as the Internet, and/or various telecommunications networks, and may include, but is not limited to, a modem, a network card, an infrared communication device, a wireless communication transceiver, and/or a chipset, such as a Bluetooth(TM) device, an 802.11 device, a WiFi device, a WiMax device, a cellular communication device, and/or the like.
The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 601 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, or microcontroller. The computing unit 601 executes the respective methods and processes described above, such as the data processing method. For example, in some embodiments, the data processing method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into the RAM 603 and executed by the computing unit 601, one or more steps of the data processing method described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform the data processing method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be performed in parallel, sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
Although embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it is to be understood that the above-described methods, systems, and apparatus are merely exemplary embodiments or examples, and that the scope of the present invention is not limited by these embodiments or examples but only by the claims as issued and their equivalents. Various elements in the embodiments or examples may be omitted or replaced with equivalents thereof. Further, the steps may be performed in an order different from that described in the present disclosure, and various elements in the embodiments or examples may be combined in various ways. Importantly, as technology evolves, many of the elements described herein may be replaced by equivalent elements that appear after the present disclosure.
Claims (17)
1. A method of data processing, comprising:
obtaining sample user data;
carrying out supervised training on the sample user data based on a binary classification algorithm and generating a decision tree; and
generating a recall rule associated with a user attribute based on one or more nodes in the decision tree.
2. The method of claim 1, wherein generating the recall rule associated with a user attribute based on one or more nodes in the decision tree comprises:
performing a weighted calculation on one or more nodes in the decision tree to generate the recall rule associated with the user attribute.
3. The method of claim 1 or 2, further comprising:
matching users in the data source according to the recall rule, wherein the users obtained through matching are recallable users.
4. The method of claim 3, wherein the user attributes correspond to search terms from users, and wherein the data source is a search log of all users.
5. The method of claim 2, wherein performing the weighted calculation on one or more nodes in the decision tree to generate the recall rule associated with a user attribute comprises:
generating a feature value list based on respective feature values of one or more nodes in the decision tree, wherein a feature value is a product of the number of times that the respective node is used for splitting and a gain when the node is split, and the feature value list comprises the one or more nodes in the decision tree and their corresponding feature values; and
ranking the feature value list and generating the recall rule associated with the user attribute.
6. The method of claim 3, wherein the user attributes correspond to applications installed by a user, and wherein the data source is an application installation log.
7. The method of any of claims 3 to 6, further comprising:
inputting user characteristics of the recallable users into a preset ranking model; and
determining whether the recallable user is a target user according to the output result of the ranking model.
8. A data processing apparatus comprising:
an acquisition module configured to acquire sample user data;
a training module configured to perform supervised training on the sample user data based on a binary classification algorithm and generate a decision tree; and
a recall rule generation module configured to generate a recall rule associated with a user attribute based on one or more nodes in the decision tree.
9. The apparatus of claim 8, wherein the recall rule generation module is further configured to perform a weighted computation on one or more nodes in the decision tree to generate the recall rule associated with a user attribute.
10. The apparatus of claim 8 or 9, further comprising:
a matching module configured to match users in the data source according to the recall rule, wherein the matched users are recallable users.
11. The apparatus of claim 10, wherein the user attributes correspond to search terms from users, and wherein the data source is a search log of all users.
12. The apparatus of claim 9, wherein the recall rule generation module comprises:
a list generating unit configured to generate a feature value list based on respective feature values of one or more nodes in the decision tree, wherein a feature value is a product of a number of times that the respective node is used for splitting and a gain when the node is split, and the feature value list includes the one or more nodes in the decision tree and their corresponding feature values; and
a ranking unit configured to rank the feature value list and generate the recall rule associated with the user attribute.
13. The apparatus of claim 10, wherein the user attributes correspond to applications installed by a user, and wherein the data source is an application installation log.
14. The apparatus of any of claims 10 to 13, further comprising:
an input module configured to input user characteristics of the recallable users into a preset ranking model; and
a determination module configured to determine whether the recallable user is a target user according to an output result of the ranking model.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the method of any one of claims 1 to 7.
16. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1 to 7.
17. A computer program product comprising a computer program, wherein the computer program realizes the method of any one of claims 1 to 7 when executed by a processor.
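As an illustrative sketch of the weighted calculation recited in claims 2, 5, 9, and 12 (not the claimed implementation; it reads a node's feature value as the product of its split count and its accumulated split gain, and all feature names and gains are hypothetical):

```python
from collections import defaultdict

def feature_value_list(splits):
    """splits: one (feature, gain) pair per split in the trained decision
    tree. A feature's value is taken as (number of times it is used for
    splitting) * (total gain over those splits), one plausible reading of
    the claim language. Returns features ranked by value, highest first."""
    counts = defaultdict(int)
    total_gain = defaultdict(float)
    for feature, gain in splits:
        counts[feature] += 1
        total_gain[feature] += gain
    values = {f: counts[f] * total_gain[f] for f in counts}
    return sorted(values.items(), key=lambda item: item[1], reverse=True)

splits = [("searched_loan", 0.30), ("searched_loan", 0.20),
          ("installed_bank_app", 0.40), ("age_over_30", 0.05)]
ranked = feature_value_list(splits)
# Top-ranked features become candidate recall rules, e.g. "user searched
# for 'loan'" would be matched against the search log.
```

The top of the ranked list identifies the user attributes most discriminative in the decision tree, which the recall rule generation module can turn into matching conditions.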
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110647282.4A CN113312552A (en) | 2021-06-10 | 2021-06-10 | Data processing method, device, electronic equipment and medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113312552A true CN113312552A (en) | 2021-08-27 |
Family
ID=77378019
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110647282.4A Pending CN113312552A (en) | 2021-06-10 | 2021-06-10 | Data processing method, device, electronic equipment and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113312552A (en) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106846061A (en) * | 2017-01-25 | 2017-06-13 | 百度在线网络技术(北京)有限公司 | Potential user's method for digging and device |
US20190188007A1 (en) * | 2017-12-20 | 2019-06-20 | Guangdong Oppo Mobile Telecommunications Corp., Ltd. | Method for Preloading Application, Storage Medium, and Terminal Device |
CN110992097A (en) * | 2019-12-03 | 2020-04-10 | 上海钧正网络科技有限公司 | Processing method and device for revenue product price, computer equipment and storage medium |
CN111860858A (en) * | 2020-04-15 | 2020-10-30 | 北京嘀嘀无限科技发展有限公司 | Method and device for determining model updating parameters and readable storage medium |
CN112035453A (en) * | 2020-08-27 | 2020-12-04 | 平安科技(深圳)有限公司 | GBDT high-order feature combination-based recommendation method and device and storage medium |
CN112036476A (en) * | 2020-08-28 | 2020-12-04 | 上海冰鉴信息科技有限公司 | Data feature selection method and device based on two-classification service and computer equipment |
CN112149807A (en) * | 2020-09-28 | 2020-12-29 | 北京百度网讯科技有限公司 | Method and device for processing user characteristic information |
CN112559900A (en) * | 2021-02-26 | 2021-03-26 | 深圳索信达数据技术有限公司 | Product recommendation method and device, computer equipment and storage medium |
CN112785397A (en) * | 2021-03-09 | 2021-05-11 | 中国工商银行股份有限公司 | Product recommendation method, device and storage medium |
Non-Patent Citations (1)
Title |
---|
DU, YIBING; SUN, JINGYU: "Recommendation algorithm fusing item embedding representation and attention mechanism", Computer Engineering and Design (计算机工程与设计), no. 03, 16 March 2020 (2020-03-16) *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112651790A (en) * | 2021-01-19 | 2021-04-13 | 恩亿科(北京)数据科技有限公司 | OCPX self-adaptive learning method and system based on user reach in fast-moving industry |
CN112651790B (en) * | 2021-01-19 | 2024-04-12 | 恩亿科(北京)数据科技有限公司 | OCPX self-adaptive learning method and system based on user touch in quick-elimination industry |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10990901B2 (en) | Training, validating, and monitoring artificial intelligence and machine learning models | |
US11397772B2 (en) | Information search method, apparatus, and system | |
EP3483797A1 (en) | Training, validating, and monitoring artificial intelligence and machine learning models | |
CN111639516B (en) | Analysis platform based on machine learning | |
CN112579909A (en) | Object recommendation method and device, computer equipment and medium | |
CN110705255B (en) | Method and device for detecting association relation between sentences | |
CN113190702B (en) | Method and device for generating information | |
CN112328909A (en) | Information recommendation method and device, computer equipment and medium | |
AU2018205089A1 (en) | Determination of task automation using natural language processing | |
CN113961765B (en) | Searching method, searching device, searching equipment and searching medium based on neural network model | |
US20240112229A1 (en) | Facilitating responding to multiple product or service reviews associated with multiple sources | |
CN113656587A (en) | Text classification method and device, electronic equipment and storage medium | |
CN112926308A (en) | Method, apparatus, device, storage medium and program product for matching text | |
CN113392920B (en) | Method, apparatus, device, medium, and program product for generating cheating prediction model | |
US20220129754A1 (en) | Utilizing machine learning to perform a merger and optimization operation | |
CN113761379B (en) | Commodity recommendation method and device, electronic equipment and medium | |
CN113312552A (en) | Data processing method, device, electronic equipment and medium | |
US20230316301A1 (en) | System and method for proactive customer support | |
CN114492370B (en) | Webpage identification method, webpage identification device, electronic equipment and medium | |
KR20230049486A (en) | Political tendency analysis device and service providing method using the same | |
CN113961811A (en) | Conversational recommendation method, device, equipment and medium based on event map | |
CN112905743A (en) | Text object detection method and device, electronic equipment and storage medium | |
CN112541705A (en) | Method, device, equipment and storage medium for generating user behavior evaluation model | |
CN112950392A (en) | Information display method, posterior information determination method and device and related equipment | |
CN113240304B (en) | Feature construction method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||