CN113312552B - Data processing method, device, electronic device and medium - Google Patents

Data processing method, device, electronic device and medium

Info

Publication number
CN113312552B
CN113312552B (application CN202110647282.4A)
Authority
CN
China
Prior art keywords
user
decision tree
users
recall
user attribute
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110647282.4A
Other languages
Chinese (zh)
Other versions
CN113312552A (en)
Inventor
刘昊骋
陈奇石
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110647282.4A priority Critical patent/CN113312552B/en
Publication of CN113312552A publication Critical patent/CN113312552A/en
Application granted granted Critical
Publication of CN113312552B publication Critical patent/CN113312552B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure provides a data processing method, a data processing apparatus, an electronic device, and a medium, relating to the field of artificial intelligence and, in particular, to deep learning and intelligent recommendation technologies. The scheme is implemented as follows: acquiring sample user data; performing supervised training on the sample user data based on a classification algorithm to generate a decision tree; and generating recall rules related to user attributes based on one or more nodes in the decision tree.

Description

Data processing method, device, electronic device and medium
Technical Field
The present disclosure relates to the field of artificial intelligence, in particular to deep learning and intelligent recommendation technologies, and more particularly to a data processing method, apparatus, electronic device, computer-readable storage medium, and computer program product.
Background
Artificial intelligence is the discipline of making a computer mimic certain human mental processes and intelligent behaviors (e.g., learning, reasoning, thinking, and planning), and encompasses both hardware-level and software-level techniques. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing. Artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, and knowledge graph technologies.
Determining target users through data processing facilitates personalized recommendation. For example, if data processing determines that a user is a target user, content or services related to the target may be recommended to that user.
The approaches described in this section are not necessarily approaches that have been previously conceived or pursued. Unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, the problems mentioned in this section should not be considered as having been recognized in any prior art unless otherwise indicated.
Disclosure of Invention
The present disclosure provides a data processing method, apparatus, electronic device, computer readable storage medium, and computer program product.
According to one aspect of the present disclosure, there is provided a data processing method including: acquiring sample user data; performing supervised training on the sample user data based on a classification algorithm and generating a decision tree; and generating recall rules related to user attributes based on one or more nodes in the decision tree.
According to one aspect of the present disclosure, there is provided a data processing apparatus comprising: an acquisition module configured to acquire sample user data; a training module configured to supervise training the sample user data based on a classification algorithm and generate a decision tree; and a recall rule generation module configured to generate recall rules related to user attributes based on one or more nodes in the decision tree.
According to one aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, which when executed by the at least one processor, enable the at least one processor to perform a data processing method.
According to one aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform a data processing method.
According to one aspect of the present disclosure, a computer program product is provided, comprising a computer program, wherein the computer program, when executed by a processor, implements a data processing method.
According to one or more embodiments of the present disclosure, a target user is determined through supervised training, which both enlarges the mining range for target users and improves the mining accuracy. It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The accompanying drawings illustrate exemplary embodiments and, together with the description, serve to explain exemplary implementations of the embodiments. The illustrated embodiments are for exemplary purposes only and do not limit the scope of the claims. Throughout the drawings, identical reference numerals designate similar, but not necessarily identical, elements.
FIG. 1 illustrates a flow chart of a data processing method according to an exemplary embodiment of the present disclosure;
FIG. 2 illustrates a schematic diagram of generating recall rules in accordance with an exemplary embodiment of the present disclosure;
FIG. 3 illustrates a schematic diagram of a ranking model workflow in accordance with an exemplary embodiment of the present disclosure;
FIG. 4 illustrates a schematic diagram of generating recall rules based on a decision tree in accordance with an exemplary embodiment of the present disclosure;
FIG. 5 shows a block diagram of a data processing apparatus according to an exemplary embodiment of the present disclosure; and
Fig. 6 illustrates a block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the present disclosure, the use of the terms "first," "second," and the like to describe various elements is not intended to limit the positional relationship, timing relationship, or importance relationship of the elements, unless otherwise indicated, and such terms are merely used to distinguish one element from another. In some examples, a first element and a second element may refer to the same instance of the element, and in some cases, they may also refer to different instances based on the description of the context.
The terminology used in the description of the various illustrated examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, the elements may be one or more if the number of the elements is not specifically limited. Furthermore, the term "and/or" as used in this disclosure encompasses any and all possible combinations of the listed items.
In the related art, mining and determination of target users is generally performed according to business rules, which may concern, for example, age, gender, region, education, or income. Establishing such business rules requires extensive accumulated business-domain experience and thorough analysis of related domain knowledge. This approach places high demands on domain expertise and is difficult to implement; moreover, rules obtained through analysis and enumeration may still be incomplete, have limited coverage, and yield low accuracy.
To address one or more of the above problems, the present disclosure generates recall rules for recalling users based on supervised training with a classification algorithm, thereby effectively expanding the recall rules and enabling user recall over a wider range.
It should be noted that, in the technical solution of the present disclosure, the acquisition, storage, and application of the users' personal information involved (such as historical behavior information and geographic location information) comply with relevant laws and regulations and do not violate the public interest. Moreover, the personal information is desensitized (i.e., anonymized) during acquisition, storage, and application.
The data processing method of the present disclosure will be further described below with reference to the accompanying drawings.
Fig. 1 shows a flowchart of a data processing method according to an exemplary embodiment of the present disclosure.
As shown in fig. 1, the method includes: step S101, acquiring sample user data; step S102, performing supervised training on the sample user data based on a classification algorithm to generate a decision tree; and step S103, generating recall rules related to user attributes based on one or more nodes in the decision tree.
According to some embodiments, the classification algorithm may be one of, or any combination of, algorithms including, but not limited to, logistic regression (LR), gradient boosting decision tree (GBDT), and extreme gradient boosting (XGBoost); the type of classification algorithm is not limited by the present disclosure.
For convenience of description and understanding, the following takes the XGBoost algorithm as the classification algorithm to describe the technical solution of the present disclosure. It should be noted that the technical solution of the present disclosure is not limited to XGBoost and applies equally to other classification algorithms.
According to some embodiments, the sample users may be determined according to the definition of the target user. For example, the target user may be a potential user of a certain type of product or service. In this case, users who purchased the product or service and users who did not may serve as the positive and negative samples, respectively, for the supervised training. Concrete instances of the sample users' attributes can serve as the input features for the supervised training. For example, the user attribute may be the user's search terms; accordingly, the set of terms obtained by word-segmenting the sample users' search histories can serve as the input features. It will be appreciated that the nodes in the decision tree ultimately generated by the supervised training may be all or some of the terms in that set.
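As an illustration of this setup, the following minimal Python sketch builds positive/negative labels and one-hot term features from sample users' search histories. All names and data here are fabricated for illustration; this is not the patent's implementation.

```python
# Hypothetical sketch (not the patent's implementation): build supervised
# training data from sample users. Positive samples are purchasers, negative
# samples are non-purchasers; the features are the terms obtained from each
# user's (anonymized) search history. All names and data are illustrative.

def build_training_data(purchasers, non_purchasers):
    """Return (one-hot feature matrix, labels, vocabulary of terms)."""
    rows, labels = [], []
    for user in purchasers:
        rows.append(set(user["search_terms"]))
        labels.append(1)  # positive sample: purchased the product/service
    for user in non_purchasers:
        rows.append(set(user["search_terms"]))
        labels.append(0)  # negative sample
    vocabulary = sorted(set().union(*rows))  # candidate feature terms
    # One-hot encode: 1 if the user ever searched the term, else 0.
    matrix = [[int(term in row) for term in vocabulary] for row in rows]
    return matrix, labels, vocabulary

purchasers = [{"search_terms": ["insurance quote", "term life"]}]
non_purchasers = [{"search_terms": ["weather", "news"]}]
X, y, vocab = build_training_data(purchasers, non_purchasers)
```

The matrix X and labels y would then be fed to a classifier such as XGBoost to grow the decision trees whose nodes are search terms.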
According to some embodiments, step S103 further comprises: weighting one or more nodes in the decision tree to generate the recall rules related to user attributes. This process amounts to screening, from all nodes of the decision tree and according to a preset calculation rule, the features most relevant to the target behavior of the target user (e.g., purchasing a certain product or service), and using the screened features as recall rules, thereby improving the accuracy of mining target users. The specific weighting calculation for the nodes is described in detail in the embodiments below.
According to some embodiments, weighting one or more nodes in the decision tree to generate the recall rule related to a user attribute comprises: generating a feature value list based on the feature values of one or more nodes in the decision tree, wherein a node's feature value is the product of the number of times the node is split on and the gain at each split, and the feature value list contains the one or more nodes in the decision tree and their corresponding feature values; and sorting the feature value list to generate the recall rule related to the user attribute.
It will be appreciated that the feature value of a node equals the number of times the node is split on multiplied by the gain at splitting; the larger the feature value, the higher the correlation between that feature and the target behavior of the target user. The feature-value calculation therefore provides a way to screen features: a feature value list is generated for all nodes in the decision tree, the nodes are sorted by the feature values obtained from the above formula, and the high-feature-value features are screened out as recall rules for recalling users.
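The feature-value ranking above can be sketched as follows. The split counts and per-split gains below are fabricated stand-ins, not real training output from XGBoost.

```python
# Illustrative sketch of the feature-value ranking described above:
# feature_value = (number of times the node/feature is split on) * (gain per
# split). The split counts and gains below are fabricated, not real output
# from a trained model.

def rank_features(split_stats):
    """split_stats: {feature: (split_count, gain_per_split)} -> sorted list."""
    scored = {f: count * gain for f, (count, gain) in split_stats.items()}
    # Highest feature value first; the top entries become recall rules.
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

stats = {"insurance quote": (4, 0.5), "weather": (5, 0.125), "term life": (2, 0.75)}
ranking = rank_features(stats)
```

Here "insurance quote" (feature value 4 × 0.5 = 2.0) outranks the frequently-split but low-gain "weather", matching the intuition that high feature value, not raw split count, indicates correlation with the target behavior.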
According to some embodiments, the data processing method further comprises: after the recall rules are generated by the above steps, matching users in a data source according to the recall rules, wherein the matched users are recallable users.
According to some embodiments, the user attribute may correspond to the user's search terms, and the data source may accordingly be the search logs of the full set of users. It will be appreciated that a user's search terms may reflect the user's intent to some extent; for example, when a user searches for a product, it may be presumed that the user has an intent to purchase it. Therefore, recalling users according to search terms has high accuracy.
In one example, the method may include: taking users who purchased a certain product and users who did not as positive and negative samples, respectively; word-segmenting the sample users' search histories to obtain a set of search terms; performing multiple rounds of XGBoost training on the positive and negative samples, with the set of search terms as features, to obtain decision trees; weighting all nodes in the decision trees and sorting them to obtain a feature value list, where the search terms corresponding to the higher-weighted nodes can serve as the recall rules for finally recalling users; and matching the search terms obtained in the supervised training against the search logs of the full set of users, where the users corresponding to the matched log entries are the recallable users. When matching search terms against the search logs, the time range of the data source can be set as needed; for example, the data source may be the full set of users' search logs over one year.
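The final matching step of the example can be sketched as below: scan an anonymized search log and collect the users whose queries contain any recall-rule term. The (user_id, query) log format is a hypothetical simplification.

```python
# Hedged sketch of the matching step: scan an anonymized search log and
# collect the users whose queries contain any recall-rule term. The
# (user_id, query) log format is a hypothetical simplification.

def recall_users(search_log, recall_terms):
    recalled = set()
    for user_id, query in search_log:
        if any(term in query for term in recall_terms):
            recalled.add(user_id)
    return recalled

log = [("u1", "cheap insurance quote"), ("u2", "weather today"),
       ("u3", "term life policy")]
users = recall_users(log, {"insurance quote", "term life"})
```

In practice the scan would run over a bounded time window of the full-user search logs (e.g., one year), as noted above.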
It should be noted that, in the embodiments of the present disclosure, the users' search logs are acquired, stored, and used with the authorization and consent of the users, and the search logs are desensitized (i.e., anonymized) during acquisition, storage, and use.
According to some embodiments, a word-vector algorithm may be used to expand the above search terms with similar terms, thereby further expanding the recall rules.
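A toy illustration of such term expansion: terms whose vectors are close (by cosine similarity) to an existing recall term are added to the rule set. A real system would use vectors from a trained word2vec-style model; the 2-d vectors below are fabricated for the example.

```python
import math

# Toy illustration of word-vector term expansion: terms whose vectors are
# close (by cosine similarity) to an existing recall term are added to the
# rule set. A real system would use vectors from a trained word2vec-style
# model; the 2-d vectors below are fabricated.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def expand_terms(recall_terms, vectors, threshold=0.9):
    expanded = set(recall_terms)
    for term, vec in vectors.items():
        for seed in recall_terms:
            if term != seed and cosine(vec, vectors[seed]) >= threshold:
                expanded.add(term)
    return expanded

vectors = {"insurance": (1.0, 0.1), "coverage": (0.9, 0.15), "weather": (0.0, 1.0)}
rules = expand_terms({"insurance"}, vectors)
```

With these fabricated vectors, "coverage" is close enough to "insurance" to join the rule set, while "weather" is not.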
According to other embodiments, the user attribute may also correspond to the applications (APPs) installed by the user, and the data source is accordingly an application installation log. Similar to search terms, the APPs installed by a user can also serve as a user attribute for determining target users, further enlarging the recall range beyond search terms.
Fig. 2 shows a schematic diagram of generating recall rules according to an exemplary embodiment of the present disclosure. As shown in fig. 2, the process of generating recall rules from user-installed APPs is similar to the workflow for search terms described above and is not detailed here. Further, after obtaining the list of high-feature-value APPs, the positive sample rate of each APP can be calculated and the APPs sorted by positive sample rate, thereby further obtaining a high-intention APP cluster as a recall rule.
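The positive-sample-rate filter mentioned above can be sketched as follows: among the users who installed a given APP, compute the fraction that are positive samples (purchasers), then sort. The install data is illustrative.

```python
# Sketch of the positive-sample-rate filter: among the users who installed a
# given APP, what fraction are positive samples (purchasers)? APPs with high
# feature values and high positive rates form the "high-intention" cluster
# used as a recall rule. The data below is illustrative.

def positive_rate_ranking(installs, positive_users):
    """installs: {app: set of user_ids}; returns apps sorted by positive rate."""
    rates = {app: len(users & positive_users) / len(users)
             for app, users in installs.items() if users}
    return sorted(rates.items(), key=lambda kv: kv[1], reverse=True)

installs = {"finance_app": {"u1", "u2"}, "game_app": {"u2", "u3", "u4"}}
ranking = positive_rate_ranking(installs, {"u1", "u2"})
```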
According to some embodiments, the user attribute may also correspond to the user's forum access history, the user's geographic location information, etc., so as to recall users from different data sources and expand the recall range. The specific instances of user attributes and the corresponding data sources can be determined according to the definition of the target user, or set according to the specific application scenario. It will be appreciated that the process of generating recall rules for other user attributes is similar to the process described above for search terms, and the present disclosure does not repeat it here.
It should be noted that, in the embodiments of the present disclosure, similar to the users' search logs, historical behavior information such as the users' forum access histories and geographic location information is acquired, stored, and used with the authorization and consent of the users, and is desensitized (i.e., anonymized) during acquisition, storage, and use.
In one example, the recallable users resulting from the method may be the union of the users recalled from the various data sources described above using the different recall rules.
According to some embodiments, the data processing method further comprises: inputting the user characteristics of the recallable user into a preset sequencing model; and determining whether the recallable user is a target user according to the output result of the ranking model. The preset sorting model can be a model obtained through supervised training based on a classification algorithm. The sorting model carries out further screening on the recall users, and improves the accuracy rate of mining the target users.
Fig. 3 shows a schematic diagram of a ranking model workflow in accordance with an exemplary embodiment of the present disclosure. As shown in fig. 3, the workflow may be divided into two parts: 301, training the ranking model; and 302, scoring and ranking users with the ranking model. In 301, users who purchased a product and users who did not may be taken as positive and negative samples, respectively, and the positive and negative samples, the sample users' features (e.g., user portraits, internet behavior, derived features), and the product's features are taken as inputs for supervised training based on a classification algorithm to obtain the ranking model. In 302, the user features of the recallable users are input into the ranking model; the trained model scores each recallable user and outputs a score ranking, where a score indicates the probability that the corresponding user will purchase the product.
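Step 302 can be sketched as below: a trained model scores each recallable user and the list is sorted descending, so higher scores indicate higher estimated purchase probability. The simple linear scorer is a stand-in for the supervised classifier described in the text; the feature names and weights are fabricated.

```python
# Hypothetical sketch of step 302: score each recallable user with a trained
# model and sort descending, so higher scores indicate a higher estimated
# purchase probability. The linear scorer is a stand-in for the supervised
# classifier described in the text; features and weights are fabricated.

def score_users(user_features, weights):
    """user_features: {user_id: {feature: value}} -> (user, score) sorted desc."""
    scores = {uid: sum(weights.get(f, 0.0) * v for f, v in feats.items())
              for uid, feats in user_features.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

features = {"u1": {"visits": 3, "installs": 1}, "u2": {"visits": 1}}
weights = {"visits": 0.5, "installs": 1.0}
ranking = score_users(features, weights)
```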
According to some embodiments, the data processing method may further comprise: setting push rules based on the score ranking output by the ranking model, and pushing related products or services to the corresponding target users according to the push rules. For example, precision and recall may be checked at each candidate score threshold, and when they meet preset criteria, related products or services may be pushed to all recallable users above that threshold. Specific push rules and criteria can be set according to the application scenario, which the present disclosure does not limit.
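The threshold check described above can be sketched as follows: at a candidate score threshold, compute precision and recall against held-out labels and push only when both meet preset criteria. Scores and labels below are fabricated.

```python
# Illustrative threshold check for the push rule: at a candidate score
# threshold, compute precision and recall against held-out labels and push
# only when both meet preset criteria. Scores and labels are fabricated.

def precision_recall_at(scored, labels, threshold):
    pushed = {u for u, s in scored.items() if s >= threshold}
    positives = {u for u, y in labels.items() if y == 1}
    if not pushed or not positives:
        return 0.0, 0.0
    tp = len(pushed & positives)  # true positives among the pushed users
    return tp / len(pushed), tp / len(positives)

scored = {"u1": 0.9, "u2": 0.6, "u3": 0.2}
labels = {"u1": 1, "u2": 0, "u3": 1}
precision, recall = precision_recall_at(scored, labels, threshold=0.5)
```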
Fig. 4 illustrates a schematic diagram of generating recall rules based on a decision tree in accordance with an exemplary embodiment of the present disclosure. It should be noted that this embodiment adopts an application scenario of recommending potential insurance users only for purposes of illustrating the disclosed solution. Those skilled in the art will understand that the scheme of the present disclosure is not limited to this application scenario; on the contrary, other application scenarios are also contemplated based on the teachings of the present disclosure. In this embodiment, the target user is a high-credit-risk user, and the user attribute is the user's search terms; the decision tree shown in fig. 4 can be obtained by XGBoost training on sample user data, where each node of the tree is a search term from a user. Multiple rounds of XGBoost training yield a plurality of decision trees like the one shown in fig. 4; fig. 4 alone is taken as an example to describe the process of generating recall rules from a decision tree. Each node in the decision tree corresponds both to a search term and to a gain. According to the feature-value formula introduced in this disclosure, i.e., the product of the number of times a given search term appears among the nodes of the decision trees and the gain corresponding to those nodes, the feature value of each node can be calculated. The larger the feature value, the higher the correlation between the node's search term and the behavior of high-credit-risk users; the search terms corresponding to high-feature-value nodes are the recall rules. Generating recall rules and recalling users in this way offers both a wide recall range and high accuracy.
According to another aspect of the present disclosure, there is provided a data processing apparatus for performing any of the methods described above. As shown in fig. 5, the data processing apparatus 500 includes: an acquisition module 501 configured to acquire sample user data; a training module 502 configured to perform supervised training on the sample user data based on a classification algorithm and generate a decision tree; and a recall rule generation module 503 configured to generate recall rules related to user attributes based on one or more nodes in the decision tree.
The operations of the modules 501-503 of the data processing apparatus 500 are similar to those of steps S101-S103 described above and are not detailed here.
According to some embodiments, the recall rule generation module 503 is further configured to weight one or more nodes in the decision tree to generate the recall rules related to user attributes. This process amounts to screening out the features most relevant to the target behavior of the target user (e.g., purchasing a certain product or service); the specific weighting of the nodes is described in detail in the embodiments below.
According to some embodiments, the recall rule generation module 503 includes: a list generation unit configured to generate a feature value list based on the feature values of one or more nodes in the decision tree, wherein a node's feature value is the product of the number of times the node is split on and the gain at each split, and the feature value list contains the one or more nodes in the decision tree and their corresponding feature values; and a sorting unit configured to sort the feature value list and generate the recall rule related to the user attribute.
It will be appreciated that a node's feature value equals the number of times the node is used for splitting multiplied by the gain at splitting; the larger the feature value, the higher the correlation between the feature and the target behavior of the target user. The list generation unit generates a feature value list from all nodes in the decision tree, and the sorting unit sorts the nodes by the feature values obtained from the above formula so as to screen out the higher-weighted features as recall rules for recalling users.
According to some embodiments, the data processing apparatus further comprises: a matching module configured to match users in a data source according to the recall rules, wherein the matched users are recallable users.
According to some embodiments, the user attribute may correspond to a term from a user, and the data source may be a search log of a full number of users, accordingly. It will be appreciated that the user's terms may reflect the user's intent to some extent, for example, when a user searches for a product, it may be presumed that the user may have an intent to purchase the product.
According to some embodiments, the term vector algorithm may be used to perform word expansion processing on the above-mentioned term, so as to further expand the recall rule.
According to other embodiments, the user attributes may also correspond to Applications (APP) installed by the user, and the data source is an application installation log, accordingly. Similar to the search term, the APP installed by the user may also be used as a user attribute for determining the target user.
It should be noted that, in the embodiments of the present disclosure, the historical behavior information of the user and the like are all acquired, stored and used based on the authorization and consent of the user. The information is subjected to desensitization (namely anonymization) during the process of being acquired, stored and used.
In one example, the recallable users that the matching module obtains may be a collection of users recalled from different data sources applying different recall rules.
According to some embodiments, the data processing apparatus further comprises: an input module: configured to input user features of the recallable user into a preset ranking model; and a determining module configured to determine whether the recall user is a target user based on an output result of the ranking model. The preset sorting model can be a model obtained through supervised training based on a classification algorithm.
According to some embodiments, the data processing apparatus may further comprise: a push module configured to set push rules based on the score ranking output by the ranking model and push related products or services to the corresponding target users according to the push rules. For example, precision and recall may be checked at each candidate score threshold, and when they meet preset criteria, related products or services may be pushed to all recallable users above that threshold. Specific push rules and criteria can be set according to the application scenario, which the present disclosure does not limit.
According to another aspect of the present disclosure, there is also provided an electronic apparatus including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the data processing method described above.
According to another aspect of the present disclosure, there is also provided a non-transitory computer-readable storage medium storing computer instructions for causing the computer to execute the above-described data processing method.
According to another aspect of the present disclosure, there is also provided a computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the data processing method described above.
Referring to fig. 6, a block diagram of an electronic device 600, which may serve as a server of the present disclosure and is an example of a hardware device applicable to aspects of the present disclosure, will now be described. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, and mainframes, as well as other suitable computers. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the apparatus 600 includes a computing unit 601 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 602 or a computer program loaded from a storage unit 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the device 600 may also be stored. The computing unit 601, ROM 602, and RAM 603 are connected to each other by a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
Various components in the device 600 are connected to the I/O interface 605, including: an input unit 606, an output unit 607, a storage unit 608, and a communication unit 609. The input unit 606 may be any type of device capable of inputting information to the device 600; it may receive input numeric or character information and generate key signal inputs related to user settings and/or function control of the electronic device, and may include, but is not limited to, a mouse, a keyboard, a touch screen, a trackpad, a trackball, a joystick, a microphone, and/or a remote control. The output unit 607 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, video/audio output terminals, vibrators, and/or printers. The storage unit 608 may include, but is not limited to, magnetic disks and optical disks. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers and/or chipsets, such as Bluetooth (TM) devices, 802.11 devices, WiFi devices, WiMax devices, cellular communication devices, and/or the like.
The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 601 performs the respective methods and processes described above, such as a data processing method. For example, in some embodiments, the data processing method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into RAM 603 and executed by the computing unit 601, one or more steps of the method described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform the data processing method by any other suitable means (e.g. by means of firmware).
Various implementations of the systems and techniques described herein above may be implemented in digital electronic circuitry, integrated circuit systems, Field-Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solutions of the disclosure can be achieved; no limitation is imposed herein.
Although embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it is to be understood that the foregoing methods, systems, and apparatuses are merely exemplary embodiments or examples, and that the scope of the present invention is not limited by these embodiments or examples but only by the granted claims and their equivalents. Various elements of the embodiments or examples may be omitted or replaced with equivalent elements thereof. Furthermore, the steps may be performed in a different order than described in the present disclosure. Further, various elements of the embodiments or examples may be combined in various ways. It should be understood that, as technology evolves, many of the elements described herein may be replaced by equivalent elements that appear after the disclosure.

Claims (13)

1. A data processing method, comprising: acquiring sample user data; performing supervised training on the sample user data based on a binary classification algorithm and generating a decision tree; and generating, based on a plurality of nodes in the decision tree, a recall rule related to a user attribute, comprising: performing weighted calculation on the plurality of nodes in the decision tree to generate the recall rule related to the user attribute, comprising: generating a feature value list based on corresponding feature values of the plurality of nodes in the decision tree, wherein a feature value is the product of the number of times the corresponding node is used for splitting and the gain of the node when splitting, and the feature value list comprises the plurality of nodes in the decision tree and their corresponding feature values; and sorting the feature value list and generating the recall rule related to the user attribute, comprising: sorting the feature value list to screen out a high-feature-value list; calculating the positive sample rate corresponding to each node in the high-feature-value list; and generating, based on the positive sample rate corresponding to each node, the recall rule related to the user attribute, wherein the user attribute corresponds to a search term from a user and/or the user attribute corresponds to an application installed by the user.

2. The method according to claim 1, further comprising: matching users in a data source according to the recall rule, wherein the matched users are recallable users.

3. The method according to claim 2, wherein, when the user attribute corresponds to a search term from a user, the data source is a search log of all users.

4. The method according to claim 2, wherein, when the user attribute corresponds to an application installed by the user, the data source is an application installation log.

5. The method according to any one of claims 2 to 4, further comprising: inputting user characteristics of the recallable users into a preset ranking model; and determining, according to an output result of the ranking model, whether the recallable users are target users.

6. A data processing apparatus, comprising: an acquisition module configured to acquire sample user data; a training module configured to perform supervised training on the sample user data based on a binary classification algorithm and generate a decision tree; and a recall rule generation module configured to generate, based on a plurality of nodes in the decision tree, a recall rule related to a user attribute, wherein the recall rule generation module is further configured to perform weighted calculation on the plurality of nodes in the decision tree to generate the recall rule related to the user attribute, and wherein the recall rule generation module comprises: a list generation unit configured to generate a feature value list based on corresponding feature values of the plurality of nodes in the decision tree, wherein a feature value is the product of the number of times the corresponding node is used for splitting and the gain of the node when splitting, and the feature value list comprises the plurality of nodes in the decision tree and their corresponding feature values; and a sorting unit configured to sort the feature value list and generate the recall rule related to the user attribute, wherein the sorting unit is further configured to: sort the feature value list to screen out a high-feature-value list; calculate the positive sample rate corresponding to each node in the high-feature-value list; and generate, based on the positive sample rate corresponding to each node, the recall rule related to the user attribute, wherein the user attribute corresponds to a search term from a user and/or the user attribute corresponds to an application installed by the user.

7. The apparatus according to claim 6, further comprising: a matching module configured to match users in a data source according to the recall rule, wherein the matched users are recallable users.

8. The apparatus according to claim 7, wherein, when the user attribute corresponds to a search term from a user, the data source is a search log of all users.

9. The apparatus according to claim 7, wherein, when the user attribute corresponds to an application installed by the user, the data source is an application installation log.

10. The apparatus according to any one of claims 7 to 9, further comprising: an input module configured to input user characteristics of the recallable users into a preset ranking model; and a determination module configured to determine, according to an output result of the ranking model, whether the recallable users are target users.

11. An electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the method according to any one of claims 1 to 5.

12. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the method according to any one of claims 1 to 5.

13. A computer program product, comprising a computer program, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1 to 5.
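The recall-rule pipeline of claim 1 — per-node feature value as (split count × split gain), sorting to keep high-value nodes, then filtering by positive sample rate — can be sketched in plain Python. The node statistics, the top-k cutoff, the rate threshold, and the rule format below are all illustrative assumptions; the patent does not fix any of them:

```python
# Sketch of the recall-rule generation in claim 1: for each decision-tree node
# (here identified by its user-attribute feature), feature value =
# (times the node is used for splitting) * (gain of the node when splitting);
# sort the feature value list, keep the high-feature-value entries, compute
# each remaining node's positive sample rate, and emit a recall rule for nodes
# whose rate clears a threshold. All inputs and thresholds are hypothetical.

def generate_recall_rules(node_stats, top_k, min_positive_rate):
    """node_stats: {feature: (split_count, gain, positives, samples)}."""
    # Step 1: feature value list — node mapped to split_count * gain.
    feature_values = {
        feat: count * gain for feat, (count, gain, _, _) in node_stats.items()
    }
    # Step 2: sort and screen out the top_k high-feature-value nodes.
    high_value = sorted(feature_values, key=feature_values.get, reverse=True)[:top_k]
    # Step 3: positive sample rate per remaining node, then threshold into rules.
    rules = []
    for feat in high_value:
        _, _, positives, samples = node_stats[feat]
        if positives / samples >= min_positive_rate:
            rules.append(f"user attribute contains '{feat}'")
    return rules

# Hypothetical node statistics: split_count, gain, positive samples, total samples.
stats = {
    "loan_app_installed": (5, 2.0, 80, 100),
    "searched_interest_rate": (4, 1.5, 30, 100),
    "weather_app_installed": (9, 0.1, 10, 100),
}
rules = generate_recall_rules(stats, top_k=2, min_positive_rate=0.5)
```

With these assumed numbers, only the first attribute both survives the feature-value cut and clears the positive-sample-rate threshold, so a single recall rule is emitted; matching that rule against a data source (a search log or application installation log, per claims 3 and 4) would then yield the recallable users.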
CN202110647282.4A 2021-06-10 2021-06-10 Data processing method, device, electronic device and medium Active CN113312552B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110647282.4A CN113312552B (en) 2021-06-10 2021-06-10 Data processing method, device, electronic device and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110647282.4A CN113312552B (en) 2021-06-10 2021-06-10 Data processing method, device, electronic device and medium

Publications (2)

Publication Number Publication Date
CN113312552A CN113312552A (en) 2021-08-27
CN113312552B true CN113312552B (en) 2024-11-08

Family

ID=77378019

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110647282.4A Active CN113312552B (en) 2021-06-10 2021-06-10 Data processing method, device, electronic device and medium

Country Status (1)

Country Link
CN (1) CN113312552B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112651790B (en) * 2021-01-19 2024-04-12 恩亿科(北京)数据科技有限公司 OCPX self-adaptive learning method and system based on user touch in quick-elimination industry
CN114021650A (en) * 2021-11-04 2022-02-08 北京百度网讯科技有限公司 Data processing method, device, electronic equipment and medium
CN116050543A (en) * 2023-02-09 2023-05-02 北京百度网讯科技有限公司 Data processing method, device, electronic device, medium and chip
CN116821495A (en) * 2023-06-28 2023-09-29 中国第一汽车股份有限公司 A Tieba management method, device, equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106846061A (en) * 2017-01-25 2017-06-13 百度在线网络技术(北京)有限公司 Potential user's method for digging and device
CN111860858A (en) * 2020-04-15 2020-10-30 北京嘀嘀无限科技发展有限公司 Method and device for determining model updating parameters and readable storage medium
CN112559900A (en) * 2021-02-26 2021-03-26 深圳索信达数据技术有限公司 Product recommendation method and device, computer equipment and storage medium
CN112785397A (en) * 2021-03-09 2021-05-11 中国工商银行股份有限公司 Product recommendation method, device and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109947498B (en) * 2017-12-20 2021-06-29 Oppo广东移动通信有限公司 Application preloading method, device, storage medium and mobile terminal
CN110992097B (en) * 2019-12-03 2024-02-13 上海钧正网络科技有限公司 Processing method and device for price of revenue product, computer equipment and storage medium
CN112035453B (en) * 2020-08-27 2024-03-12 平安科技(深圳)有限公司 Recommendation method, device and storage medium based on GBDT high-order feature combination
CN112036476A (en) * 2020-08-28 2020-12-04 上海冰鉴信息科技有限公司 Data feature selection method and device based on two-classification service and computer equipment
CN112149807B (en) * 2020-09-28 2024-06-28 北京百度网讯科技有限公司 Method and device for processing user characteristic information

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106846061A (en) * 2017-01-25 2017-06-13 百度在线网络技术(北京)有限公司 Potential user's method for digging and device
CN111860858A (en) * 2020-04-15 2020-10-30 北京嘀嘀无限科技发展有限公司 Method and device for determining model updating parameters and readable storage medium
CN112559900A (en) * 2021-02-26 2021-03-26 深圳索信达数据技术有限公司 Product recommendation method and device, computer equipment and storage medium
CN112785397A (en) * 2021-03-09 2021-05-11 中国工商银行股份有限公司 Product recommendation method, device and storage medium

Also Published As

Publication number Publication date
CN113312552A (en) 2021-08-27

Similar Documents

Publication Publication Date Title
CN113312552B (en) Data processing method, device, electronic device and medium
CN111639516B (en) Analysis platform based on machine learning
CN112579909A (en) Object recommendation method and device, computer equipment and medium
CN113961765B (en) Search methods, devices, equipment and media based on neural network models
CN113656587B (en) Text classification method, device, electronic equipment and storage medium
CN117931991B (en) Training sample acquisition and large model optimization training method and device
CN113761379B (en) Commodity recommendation method and device, electronic device and medium
CN113392920B (en) Method, apparatus, device, medium, and program product for generating cheating prediction model
CN114443964B (en) Information recommendation methods, devices, electronic devices and media
US11869050B2 (en) Facilitating responding to multiple product or service reviews associated with multiple sources
CN114120416A (en) Model training method and device, electronic equipment and medium
CN118278534A (en) A method and device for generating a model
CN113420227B (en) Training method of click rate estimation model, click rate estimation method and device
CN114021650A (en) Data processing method, device, electronic equipment and medium
WO2024102232A1 (en) Automatic mapping of a question or compliance controls associated with a compliance standard to compliance controls associated with another compliance standard
CN114547340B (en) Hotspot event mining methods, devices, equipment and media
CN114219079B (en) Feature selection methods and apparatus, model training methods and apparatus, equipment and media
CN116342164A (en) Target user group positioning method, device, electronic equipment and storage medium
JP6680472B2 (en) Information processing apparatus, information processing method, and information processing program
CN120372310A (en) User sample data processing method and device, electronic equipment and medium
CN114492370A (en) Webpage identification method and device, electronic equipment and medium
CN113391989A (en) Program evaluation method, device, equipment, medium and program product
CN115965817B (en) Image classification model training method, device and electronic equipment
CN113240304B (en) Feature construction method, device, equipment and storage medium
JP6990757B1 (en) Generator, generation method and generation program

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant