WO2022227176A1

WO2022227176A1 - Drug information pushing method and apparatus, computer device, and storage medium

Info

Publication number: WO2022227176A1
Application number: PCT/CN2021/096712
Authority: WO
Inventors: 徐卓扬; 孙行智; 胡岗
Original assignee: 平安科技（深圳）有限公司
Priority date: 2021-04-29
Filing date: 2021-05-28
Publication date: 2022-11-03
Also published as: CN113076486B; CN113076486A

Abstract

Disclosed in embodiments of the present application are a drug information pushing method and apparatus, a computer device, and a storage medium. The method is applicable to the field of digital medicine, and comprises: obtaining target user attribute information of a target user, and inputting the target user attribute information into a drug reward prediction model; by means of the drug reward prediction model, outputting first target reward parameters and second target reward parameters of the target user under the action of drugs; on the basis of the first target reward parameters of the target user and/or the second target reward parameters of the target user, determining user reward parameters of the target user under the action of the drugs; determining the maximum user reward parameter from among the user reward parameters, and outputting drug information of the target drug having the maximum user reward parameter to a user interface to display the target drug to the target user. By using the embodiments of the present application, the scalability of the drug reward prediction model can be enhanced, thereby improving the accuracy of drug information pushing.

Description

Drug information push method, device, computer equipment and storage medium

This application claims priority to Chinese patent application No. 202110473086.X, filed on April 29, 2021 and entitled "Drug information push method, device, computer equipment and storage medium", the entire contents of which are incorporated herein by reference.

Technical Field

The present application relates to the technical field of artificial intelligence, and in particular, to a method, device, computer equipment and storage medium for pushing drug information.

Background

At present, deep reinforcement learning (DRL) models are used to solve a growing number of practical problems. The inventors found that, when a DRL model is run, a patient's sample data can be input into the model to output a Q value, which evaluates the expected reward (for example, the degree of influence of a drug) of different actions (for example, a doctor's prescription). Because a DRL model tends to consider both short-term and long-term outcomes but has only one return factor, the single Q value evaluates the expected reward for the short-term outcome and the expected reward for the long-term outcome together, so the two expected rewards are treated as essentially the same. However, the inventors realized that long-term outcomes and short-term outcomes differ in essence, mainly in the temporal distance over which an action takes effect (for example, short-term outcomes are mainly affected by the most recent drug, whereas long-term outcomes are mainly affected by drugs taken over a longer preceding period). This results in poor scalability of the DRL model.

SUMMARY OF THE INVENTION

The embodiments of the present application provide a drug information push method, device, computer equipment and storage medium, which can enhance the scalability of a drug reward prediction model, thereby improving the accuracy of drug information push.

In a first aspect, the application provides a method for pushing drug information, the method comprising:

Obtain target user attribute information of the target user, input the target user attribute information into the drug reward prediction model, and the target user attribute information includes at least one of demographic information, health indicators for drug use for the target disease, and historical drug use information;

Each first target reward parameter and each second target reward parameter of the target user under the action of each drug are output through the drug reward prediction model, wherein the drug reward prediction model includes a first network parameter and a second network parameter, the first network parameter is used to determine the first reward parameter of any user with any user attribute information under the action of each drug, the second network parameter is used to determine the second reward parameter of any user under the action of each drug, any user corresponds to one first reward parameter and one second reward parameter under the action of one drug, and the drug action duration corresponding to the first reward parameter is greater than the drug action duration corresponding to the second reward parameter;

Based on each first target reward parameter of the target user and/or each second target reward parameter of the target user, each user reward parameter of the target user under the action of each drug is determined, wherein the target user corresponds to one user reward parameter under the action of one drug;

The maximum user reward parameter is determined from each user reward parameter, and the drug information of the target drug with the maximum user reward parameter is output to the user interface to display the target drug to the target user.

In combination with the second aspect, in a possible implementation manner, the above-mentioned device further includes:

a data acquisition module, configured to acquire sample data of at least two users, and the sample data of one user includes user attribute information and sample drug information of the user;

The sample input module is used to obtain each first sample reward parameter and each second sample reward parameter of each user under the action of the sample drug indicated by the sample drug information, and to input the sample data of the at least two users, each first sample reward parameter and each second sample reward parameter into the drug reward prediction model;

The parameter training module is used to train the first network parameter and the second network parameter of the drug reward prediction model based on the user attribute information of the at least two users, each first sample reward parameter and each second sample reward parameter, so as to obtain the ability to predict, based on the user attribute information of any user, the first reward parameter and the second reward parameter of that user under the action of each drug.

In a third aspect, the present application provides a computer device, including: a processor, a memory, and a network interface;

The processor is connected to the memory and the network interface, wherein the network interface is used to provide a data communication function, the memory is used to store a computer program, and the processor is used to call the computer program to execute the drug information push method of the first aspect in the embodiments of the present application, the drug information push method including:

In a fourth aspect, the present application provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and the computer program includes program instructions that, when executed by a processor, execute the drug information push method of the first aspect of the present application, the drug information push method including:

The embodiment of the present application enhances the scalability of the drug reward prediction model, and improves the interpretability, security, selectivity and traceability of the model, thereby improving the accuracy of drug information push and having strong applicability.

Description of Drawings

In order to illustrate the technical solutions in the embodiments of the present application more clearly, the following briefly introduces the drawings that are used in the description of the embodiments. Obviously, the drawings in the following description are only some embodiments of the present application. For those of ordinary skill in the art, other drawings can also be obtained from these drawings without creative effort.

FIG. 1 is a schematic structural diagram of a network architecture provided by the present application;

FIG. 2 is a schematic flowchart of the drug information push method provided by the present application;

FIG. 3 is a schematic structural diagram of the drug reward prediction model provided by the present application;

FIG. 4 is a schematic structural diagram of a drug information push device provided by the present application;

FIG. 5 is a schematic structural diagram of a computer device provided by the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only a part of the embodiments of the present application, but not all of the embodiments. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative efforts shall fall within the protection scope of the present application.

The technical solutions of the present application may relate to the technical field of artificial intelligence, and may be applied to scenarios such as smart medical treatment such as medical information push, so as to realize digital medical treatment and promote the construction of smart cities. Optionally, the data involved in this application, such as attribute information and/or target drug information, may be stored in a database, or may be stored in a blockchain, such as distributed storage through a blockchain, which is not limited in this application.

Please refer to FIG. 1, which is a schematic structural diagram of a network architecture provided by the present application. As shown in FIG. 1, the network architecture may include a server 10 and a user terminal cluster, and the user terminal cluster may include multiple user terminals, shown in FIG. 1 as the user terminal 100a, the user terminal 100b, the user terminal 100c, ..., and the user terminal 100n.

The server 10 may be an independent physical server, or may be a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (CDN), and big data and artificial intelligence platforms. Each user terminal in the user terminal cluster may include, but is not limited to, smart terminals such as smart phones, tablet computers, notebook computers, desktop computers, smart speakers, and smart watches.

It can be understood that the computer device in this application may be an entity terminal with a drug information push function, and the entity terminal may be the server 10 as shown in FIG. 1 or a user terminal, which is not limited herein.

As shown in FIG. 1, the user terminal 100a, the user terminal 100b, the user terminal 100c, ..., and the user terminal 100n can each be connected to the above-mentioned server 10 through a network, so that each user terminal can exchange data with the server 10 through the network connection. For example, the server 10 may output the drug information of the target drug to the user interface corresponding to the user terminal of the target user, so that the target user can view the target drug on the user interface, wherein the user terminal of the target user may be any one of the user terminals in the user terminal cluster (e.g., the user terminal 100a). In this application, the drugs determined based on the drug reward prediction model and pushed to target users may be collectively referred to as target drugs, and the model used to determine them is referred to as the drug reward prediction model.

The drug information push method provided in this application can be applied to a drug information push scenario for any disease, such as a diabetes drug information push scenario, a hypertension drug information push scenario, or a drug information push scenario for other diseases. Assuming that the target user is a doctor, the doctor can input the patient's basic information into the drug reward prediction model, and the drug information of the target drug determined from the patient's basic information can be output to the user interface. The doctor can then view the target drug on the user interface (here the target drug can serve as a preliminary diagnosis result), and combine it with further diagnosis results of the patient to determine a drug suitable for the patient (such as the above-mentioned target drug). Assuming that the target user is a patient, the patient can input their basic information into a self-service terminal (also called a self-service machine) provided by a medical institution such as a hospital, health station or social health institution. The self-service machine contains the above-mentioned drug reward prediction model, which can output the drug information of the recommended target drug to the user interface of the self-service machine based on the patient's basic information. The patient can view the target drug in the user interface of the self-service machine, and can either purchase the target drug directly or have a doctor make a further diagnosis to determine the drug suitable for the patient (such as the above-mentioned target drug).

For the convenience of description, the following description will take the scenario of diabetes drug information push as an example, which will not be repeated below. The medicine information pushing method, medicine information pushing device and computer equipment of the present application will be described below in conjunction with Fig. 2 to Fig. 5 .

Please refer to FIG. 2 , which is a schematic flowchart of a method for pushing drug information provided by an embodiment of the present application. As shown in FIG. 2, the method may include the following steps S101-S104:

In step S101, target user attribute information of the target user is acquired, and the target user attribute information is input into a drug reward prediction model.

It can be understood that, before step S101 is executed, the computer device can first train the model parameters of the drug reward prediction model using the sample data of at least two users and the actual reward parameters of each user, so as to obtain a drug reward prediction model that outputs the first reward parameter and the second reward parameter of any user under the action of each drug. The drug reward prediction model here can be a deep Q-network (DQN) model, which is a deep reinforcement learning model. Reinforcement learning, as used by the DQN model, is an artificial intelligence method that optimizes a policy through the expected reward obtained after taking actions (such as a doctor's prescription). Here, the parameter value corresponding to the expected reward may be the expected reward parameter (such as the first expected reward parameter and the second expected reward parameter described below); in other words, the value of the expected reward parameter represents the expected reward. The policy refers to the rule that decides which action should be taken in a specific state to maximize the expected reward.

In some feasible implementations, the computer device may acquire sample data of at least two users, wherein the sample data of the at least two users may be used to train the drug reward prediction model, one user corresponds to one piece of sample data, and one piece of sample data may include the user attribute information and the sample drug information of the user. The user attribute information here may include at least one of demographic information, health indicators of medication for the target disease, and historical medication information (i.e., medication history), and the drug indicated by the sample drug information is the sample drug. The demographic information may include gender, age, health status, occupation, marital status, education level, income, and other information, and a health indicator may be understood as an examination indicator corresponding to the target disease. The sample drugs used by different users for the target disease can be the same or different.

Further, the computer device can obtain each first sample reward parameter and each second sample reward parameter of each user under the action of the sample drug, and input the sample data of the at least two users, each first sample reward parameter, and each second sample reward parameter into the drug reward prediction model. In this application, the actual long-term reward parameter of the user under the action of the sample drug may be referred to as the first sample reward parameter, and the actual short-term reward parameter of the user under the action of the sample drug may be referred to as the second sample reward parameter. The drug action duration corresponding to the first sample reward parameter is greater than the drug action duration corresponding to the second sample reward parameter. The reward here can be understood as the degree of influence on the user's own health indicators after the user has taken the sample drug for a period of time, and the value of the reward parameter represents this degree of influence. For example, reward parameter 1 represents influence degree 1, and reward parameter 2 represents influence degree 2. If reward parameter 1 is greater than reward parameter 2, influence degree 1 is greater than influence degree 2.

Further, the computer device can train the first network parameter and the second network parameter of the drug reward prediction model based on the user attribute information of the at least two users, each first sample reward parameter and each second sample reward parameter, so as to obtain the ability to predict, based on the user attribute information (e.g., target user attribute information) of any user (e.g., the target user), the first reward parameter and the second reward parameter of that user under the action of each drug. The first network parameter can be used to determine the first reward parameter (also called the long-term reward parameter) of any user with any user attribute information under the action of each drug, and the second network parameter can be used to determine the second reward parameter (also called the short-term reward parameter) of any user under the action of each drug; the drug action duration corresponding to the first reward parameter is greater than the drug action duration corresponding to the second reward parameter. The first network parameter here may include a first model parameter and a first return parameter, and the second network parameter may include a second model parameter and a second return parameter. In the present application, the parameters in the drug reward prediction model that are iteratively updated based on the loss value may be collectively referred to as model parameters (e.g., the first model parameter and the second model parameter). The return parameter in the first network parameter that corresponds to the first reward parameter is referred to as the first return parameter (also called the first return factor), and the return parameter in the second network parameter that corresponds to the second reward parameter is referred to as the second return parameter (also called the second return factor). The return parameters here can be understood as parameters that remain unchanged during the training of the drug reward prediction model. Since the drug action duration corresponding to the first reward parameter is greater than the drug action duration corresponding to the second reward parameter, the first return parameter is greater than the second return parameter; for example, the first return parameter is 0.9 or another value, and the second return parameter is 0.2 or another value.
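To illustrate why a larger return parameter corresponds to a longer drug action duration, the following minimal sketch assumes that the return parameters play the role of the discount factors γ in formula (1), so that a reward received k steps in the future is weighted by γ^k. The function names are hypothetical and are not part of the application; 0.9 and 0.2 are the example values given above.

```python
from __future__ import annotations


def discount_weights(gamma: float, steps: int) -> list[float]:
    """Weight gamma**k applied to a reward received k steps in the future."""
    return [gamma ** k for k in range(steps)]


def effective_horizon(gamma: float) -> float:
    """Sum of the geometric series 1 + gamma + gamma**2 + ... = 1 / (1 - gamma)."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    return 1.0 / (1.0 - gamma)


GAMMA_LONG = 0.9   # example first return parameter from the text
GAMMA_SHORT = 0.2  # example second return parameter from the text

if __name__ == "__main__":
    for name, gamma in (("long", GAMMA_LONG), ("short", GAMMA_SHORT)):
        weights = [round(w, 4) for w in discount_weights(gamma, 4)]
        print(name, weights, round(effective_horizon(gamma), 2))
```

Under this assumption, a return parameter of 0.9 spreads weight over roughly ten future steps, whereas 0.2 concentrates almost all weight on the immediate next step, which matches the long-term and short-term roles described above.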

In some feasible embodiments, the computer device may determine each first expected reward parameter of each user under the action of the sample drug based on the first model parameter and the first return parameter, and determine each second expected reward parameter of each user under the action of the sample drug based on the second model parameter and the second return parameter. One user corresponds to one first expected reward parameter and one second expected reward parameter under the action of one sample drug. At this time, the computer device can use the loss function to determine, from the first return parameter, the second return parameter, each first sample reward parameter, each second sample reward parameter, each first expected reward parameter, and each second expected reward parameter, each loss value corresponding to each user's sample data. One first sample reward parameter, one second sample reward parameter, one first expected reward parameter, and one second expected reward parameter together correspond to the loss value of one user's sample data. The computer device can determine the loss value $l_{loss}$ corresponding to a user's sample data according to the following formula (1):

$$l_{loss} = \Big(Q_{short}(s_t, a_t) + Q_{long}(s_t, a_t) - \big(r_{short} + r_{long} + \max_{a}\big(\gamma_{short}\, Q_{short}(s_{t+1}, a) + \gamma_{long}\, Q_{long}(s_{t+1}, a)\big)\big)\Big)^2 \tag{1}$$

Here, $a_t$ represents the sample drug input into the drug reward prediction model at the current time $t$ (that is, the sample drug in the sample data); $s_t$ represents the user attribute information input into the drug reward prediction model at the current time $t$ (that is, the user attribute information of the user in the sample data); $s_{t+1}$ represents the user attribute information input into the drug reward prediction model at the next time $t+1$; $Q_{long}(s_t, a_t)$ represents the first expected reward parameter of the user at the current time $t$; $Q_{short}(s_t, a_t)$ represents the second expected reward parameter of the user at the current time $t$; $r_{long}$ represents the first sample reward parameter of the user at the current time $t$; $r_{short}$ represents the second sample reward parameter of the user at the current time $t$; $\gamma_{long}$ represents the first return parameter; $\gamma_{short}$ represents the second return parameter; $Q_{long}(s_{t+1}, a)$ represents the first expected reward parameter of the user at the next time $t+1$; and $Q_{short}(s_{t+1}, a)$ represents the second expected reward parameter of the user at the next time $t+1$.
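As an illustration of formula (1), the following minimal sketch computes the loss value for a single sample from already-predicted Q values. The function name, argument names, and the numeric values in the example are hypothetical; the default return parameters 0.9 and 0.2 are the example values mentioned above.

```python
from __future__ import annotations

from typing import Sequence


def dual_reward_loss(
    q_short_now: float,
    q_long_now: float,
    r_short: float,
    r_long: float,
    q_short_next: Sequence[float],
    q_long_next: Sequence[float],
    gamma_short: float = 0.2,
    gamma_long: float = 0.9,
) -> float:
    """Squared error of formula (1) for one sample.

    q_short_now / q_long_now: Q_short(s_t, a_t) and Q_long(s_t, a_t).
    q_short_next / q_long_next: Q values at s_{t+1}, one entry per candidate drug a.
    """
    if not q_short_next or len(q_short_next) != len(q_long_next):
        raise ValueError("next-step Q lists must be non-empty and of equal length")
    # max over a of (gamma_short * Q_short(s_{t+1}, a) + gamma_long * Q_long(s_{t+1}, a))
    bootstrap = max(
        gamma_short * qs + gamma_long * ql
        for qs, ql in zip(q_short_next, q_long_next)
    )
    target = r_short + r_long + bootstrap
    prediction = q_short_now + q_long_now
    return (prediction - target) ** 2


if __name__ == "__main__":
    # Hypothetical values for two candidate drugs at the next follow-up.
    loss = dual_reward_loss(
        q_short_now=0.5, q_long_now=0.8,
        r_short=1.0, r_long=0.0,
        q_short_next=[0.4, 0.6], q_long_next=[0.7, 0.5],
    )
    print(round(loss, 4))
```

Note that the maximum is taken over the combined discounted value of each candidate drug, so the same drug index is used for both the short-term and the long-term term, as in formula (1).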

After obtaining each loss value based on the above formula (1), the computer device may iteratively update the parameter value of the first model parameter and the parameter value of the second model parameter based on each loss value until the loss value remains unchanged, and then stop training the drug reward prediction model. The iteratively updated first model parameter is used as the final first model parameter of the drug reward prediction model, and the iteratively updated second model parameter is used as the final second model parameter of the drug reward prediction model. At this point, the drug reward prediction model has the ability to predict the first reward parameter and the second reward parameter of any user under the action of each drug based on the user attribute information of that user.
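The stopping rule described above (update until the loss value no longer changes) can be sketched as follows. This is only an illustrative helper under stated assumptions: the name `train_until_stable`, the tolerance, and the iteration cap are hypothetical, and the toy `step` function minimizes a simple quadratic rather than the loss of formula (1).

```python
from __future__ import annotations

from typing import Callable, Optional, Tuple


def train_until_stable(
    step: Callable[[], float],
    tolerance: float = 1e-6,
    max_iterations: int = 10_000,
) -> Tuple[int, Optional[float]]:
    """Call step() (one parameter update that returns the current loss)
    until two consecutive loss values differ by at most `tolerance`."""
    previous: Optional[float] = None
    for iteration in range(1, max_iterations + 1):
        loss = step()
        if previous is not None and abs(previous - loss) <= tolerance:
            return iteration, loss
        previous = loss
    return max_iterations, previous


if __name__ == "__main__":
    # Toy stand-in for the model update: gradient descent on (w - 3)^2.
    state = {"w": 0.0}

    def step() -> float:
        gradient = 2.0 * (state["w"] - 3.0)
        state["w"] -= 0.1 * gradient
        return (state["w"] - 3.0) ** 2

    iterations, final_loss = train_until_stable(step)
    print(iterations, final_loss)
```

In practice, `step` would perform one update of the first and second model parameters and return the loss over the sample data; the return parameters stay fixed throughout.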

Please refer to FIG. 3, which is a schematic structural diagram of the drug reward prediction model of the present application. As shown in FIG. 3, the drug reward prediction model may include multiple convolutional layers (e.g., convolutional layers 10a to 10c) and multiple fully connected layers (e.g., fully connected layer 20a and fully connected layer 20b). The input of the drug reward prediction model is the user attribute information of a user, and the output of the drug reward prediction model is the first reward parameter (e.g., $Q_{long}$) and the second reward parameter (e.g., $Q_{short}$) of that user under the action of each drug. When the feature vector corresponding to the user attribute information is a one-dimensional vector (for example, the user attribute information is the follow-up information of a patient), the drug reward prediction model may include the fully connected layer 20a and the fully connected layer 20b, but not the convolutional layers 10a to 10c. The drug reward prediction model here includes a first network parameter and a second network parameter, and the fully connected layer 20b (i.e., the second fully connected layer) can be configured with both. As shown in FIG. 3, the fully connected layer 20b may include two fully connected layers (the fully connected layer 200b and the fully connected layer 201b). The fully connected layer 200b is configured with the first network parameter and is used to process the user attribute information based on the first network parameter to output the first reward parameter $Q_{long}$ of any user under the action of each drug; the fully connected layer 201b is configured with the second network parameter and is used to process the user attribute information based on the second network parameter to output the second reward parameter $Q_{short}$ of any user under the action of each drug.
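The two-output structure described above can be sketched for the one-dimensional case (no convolutional layers) as a shared fully connected layer followed by two parallel heads. This is a minimal illustration, not the application's implementation: the ReLU activation, the layer sizes, and all weight values are assumptions, and the function names are hypothetical.

```python
from __future__ import annotations

from typing import List, Sequence, Tuple

Layer = Tuple[Sequence[Sequence[float]], Sequence[float]]  # (weights, bias)


def linear(x: Sequence[float], weights: Sequence[Sequence[float]], bias: Sequence[float]) -> List[float]:
    """Fully connected layer: y[j] = sum_i weights[j][i] * x[i] + bias[j]."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(values: Sequence[float]) -> List[float]:
    """Element-wise ReLU (an assumed activation; the text does not name one)."""
    return [max(0.0, v) for v in values]


def predict_rewards(features: Sequence[float], shared: Layer, long_head: Layer, short_head: Layer):
    """Return (Q_long, Q_short), each with one value per candidate drug.

    shared     ~ fully connected layer 20a
    long_head  ~ fully connected layer 200b (first network parameter)
    short_head ~ fully connected layer 201b (second network parameter)
    """
    hidden = relu(linear(features, *shared))
    q_long = linear(hidden, *long_head)
    q_short = linear(hidden, *short_head)
    return q_long, q_short


if __name__ == "__main__":
    # Hypothetical weights: 2 input features, 2 hidden units, 2 candidate drugs.
    shared = ([[1.0, 0.0], [0.0, -1.0]], [0.0, 0.0])
    long_head = ([[0.5, 1.0], [0.2, 0.3]], [0.0, 0.1])
    short_head = ([[1.0, 0.0], [-1.0, 0.0]], [0.0, 0.0])
    print(predict_rewards([1.0, 2.0], shared, long_head, short_head))
```

Both heads read the same hidden representation, so the long-term and short-term reward parameters for a given drug are produced in a single forward pass.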

For the convenience of description, in the scenario of diabetes drug information push (also called the scenario of formulating prescriptions for diabetes patients, where formulating a prescription refers to a doctor prescribing drugs), the computer device can obtain sample data of at least two users. The sample data of the at least two users can be long-term follow-up data of a large number of diabetic patients, and one piece of sample data can include the data of one follow-up of one patient. The sample data here may include user attribute information, and the user attribute information may include, but is not limited to, age, gender, medication history, sample drugs (that is, drugs prescribed by doctors or drugs actually taken by patients, such as biguanides or sulfonylureas), HbA1c value, creatinine value, and other health indicators for diabetes. At this time, the computer device can obtain each first sample reward parameter and each second sample reward parameter of each user under the action of the sample drug, and input the user attribute information of each user, each first sample reward parameter and each second sample reward parameter into the above drug reward prediction model. For example, the first sample reward parameter may indicate whether diabetic complications had occurred by the last follow-up after taking the drug: the first sample reward parameter is 0 when the diabetic patient develops diabetic complications, and 1 when the diabetic patient does not. As another example, the second sample reward parameter can indicate whether the glycated hemoglobin value of the diabetic patient reaches the target at the next follow-up after taking the drug, with the second sample reward parameter being 0 when the target is not reached.

Further, the computer device may output each first expected reward parameter of each user under the action of the sample drug based on the above-mentioned fully connected layer 200b, and output each second expected reward parameter of each user under the action of the sample drug based on the above-mentioned fully connected layer 201b. The computer device can then use the above-mentioned loss function to calculate, from the first return parameter, the second return parameter, each first sample reward parameter, each second sample reward parameter, each first expected reward parameter, and each second expected reward parameter, each loss value corresponding to the sample data of each user. At this time, the computer device can iteratively update the parameter value of the first model parameter and the parameter value of the second model parameter according to the loss values corresponding to all the sample data until the loss value is basically unchanged (for example, the loss value reaches its minimum), indicating that training of the drug reward prediction model has been completed (i.e., the drug reward prediction model has converged). At this time, the first network parameter configured in the fully connected layer 200b includes the first return parameter and the iteratively updated first model parameter, and the second network parameter configured in the fully connected layer 201b includes the second return parameter and the iteratively updated second model parameter. The first return parameter and the iteratively updated first model parameter in the fully connected layer 200b can be used to predict the first reward parameter of any user under the action of each drug, and the second return parameter and the iteratively updated second model parameter in the fully connected layer 201b can be used to predict the second reward parameter of any user under the action of each drug. It can be seen that the drug reward prediction model at this time has the ability to predict the first reward parameter and the second reward parameter of any user under the action of each drug based on the user attribute information of that user.

After training the drug reward prediction model, when detecting an input instruction on the attribute information input area in the user interface, the computer device can acquire the target user attribute information of the target user based on the input instruction, and input the target user attribute information into the drug reward prediction model. For example, the target user can enter the target user attribute information in the above attribute information input area, and click the OK button in the user interface after the input is completed. At this time, the computer device can detect the input instruction on the attribute information input area, so as to obtain the target user attribute information of the target user. The target user attribute information may include at least one of demographic information, health indicators of medication for the target disease, and historical medication information.

Step S102, outputting each first target reward parameter and each second target reward parameter of the target user under the action of each drug through the drug reward prediction model.

In some feasible implementations, the computer device may determine each first target reward parameter of the target user under the action of each drug based on the first network parameter (i.e., the first return parameter and the iteratively updated first model parameter); for example, the first network parameter may be the first network parameter in the fully connected layer 200b after the drug reward prediction model converges. The target user corresponds to one first target reward parameter under the action of one drug. Further, the computer device may determine each second target reward parameter of the target user under the action of each drug based on the second network parameter (i.e., the second return parameter and the iteratively updated second model parameter); for example, the second network parameter may be the second network parameter in the fully connected layer 201b after the drug reward prediction model converges. The target user corresponds to one second target reward parameter under the action of one drug.

Step S103: Determine each user reward parameter of the target user under the action of each drug based on each first target reward parameter of the target user and/or each second target reward parameter of the target user.

In some possible implementations, when the target user has both long-term and short-term drug action needs, the computer device may determine a first weighting coefficient for the first target reward parameter and a second weighting coefficient for the second target reward parameter. The first weighting coefficient (e.g., 1 or another value) and the second weighting coefficient (e.g., 1 or another value) may be set by the user or configured by default in the drug reward prediction model. The computer device may then determine each first weighted reward parameter corresponding to each first target reward parameter based on the first weighting coefficient and each first target reward parameter of the target user, and determine each second weighted reward parameter corresponding to each second target reward parameter based on the second weighting coefficient and each second target reward parameter of the target user. Further, the computer device can sum each first weighted reward parameter and the corresponding second weighted reward parameter to obtain each user reward parameter of the target user under the action of each drug, where one first weighted reward parameter and one second weighted reward parameter correspond to one user reward parameter. Optionally, the computer device can also directly sum each first target reward parameter and the corresponding second target reward parameter to obtain each user reward parameter of the target user under the action of each drug, where one first target reward parameter and one second target reward parameter correspond to one user reward parameter.

Optionally, in some feasible implementations, when the target user has only a long-term drug action need, the computer device may determine each first target reward parameter of the target user as each user reward parameter of the target user under the action of each drug. Optionally, when the target user has only a short-term drug action need, the computer device may determine each second target reward parameter of the target user as each user reward parameter of the target user under the action of each drug. The choice can be made according to the actual application scenario and is not limited here.
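The three cases of step S103 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function name `user_rewards`, the `need` labels ("long", "short", "both") and the default coefficients of 1 are illustrative choices, not names defined by the present application.

```python
from typing import List, Sequence


def user_rewards(first: Sequence[float],
                 second: Sequence[float],
                 need: str = "both",
                 w1: float = 1.0,
                 w2: float = 1.0) -> List[float]:
    """Combine per-drug reward parameters into one user reward parameter per drug.

    first  -- first target reward parameters (long-term), one per drug
    second -- second target reward parameters (short-term), one per drug
    need   -- hypothetical label for the user's drug action need
    w1, w2 -- first and second weighting coefficients (default 1, i.e. a plain sum)
    """
    if need == "long":
        return list(first)
    if need == "short":
        return list(second)
    if need == "both":
        if len(first) != len(second):
            raise ValueError("both reward lists must cover the same drugs")
        # One weighted first reward plus one weighted second reward per drug.
        return [w1 * a + w2 * b for a, b in zip(first, second)]
    raise ValueError(f"unknown need: {need!r}")


combined = user_rewards([1.0, 2.0, 3.0], [4.0, 2.0, 0.0])
weighted = user_rewards([1.0, 2.0, 3.0], [4.0, 2.0, 0.0], w1=2.0, w2=0.5)
```

With both coefficients left at 1 the weighted branch reduces to the direct sum described above, so a single code path covers both variants.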

In step S104, the maximum user reward parameter is determined from the user reward parameters, and the drug information of the target drug with the maximum user reward parameter is output to the user interface to display the target drug to the target user.

In some feasible implementations, the computer device may sort the user reward parameters (for example, in descending or ascending order) to obtain a sequence of user reward parameters, and take the first user reward parameter (for a descending sequence) or the last user reward parameter (for an ascending sequence) as the maximum user reward parameter. Further, the computer device may output the drug information of the target drug with the maximum user reward parameter to the user interface to present the target drug to the target user. Taking the scenario of diabetes drug information push as an example, when the drug action need of the target user is that no diabetes complications occur in the long term, the maximum user reward parameter may be the largest first target reward parameter among the first target reward parameters, and the computer device may output the drug information of the target drug with the largest first target reward parameter to the user interface. When the drug action need of the target user is that the glycated hemoglobin value reaches the standard in the short term, the maximum user reward parameter may be the largest second target reward parameter among the second target reward parameters, and the computer device may output the drug information of the target drug with the largest second target reward parameter to the user interface. When the drug action need of the target user is both that no diabetes complications occur in the long term and that the glycated hemoglobin value reaches the standard in the short term, each user reward parameter may be determined from each first weighted reward parameter and each second weighted reward parameter, and the computer device may output the drug information of the target drug with the maximum user reward parameter to the user interface.
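The selection in step S104 can be sketched as a sort followed by taking the head of the sequence. The function name `select_target_drug` and the drug labels below are hypothetical placeholders used only for illustration.

```python
from typing import List, Sequence, Tuple


def select_target_drug(drug_names: Sequence[str],
                       rewards: Sequence[float]) -> Tuple[str, float]:
    """Return the drug with the maximum user reward parameter and that reward."""
    if not rewards or len(drug_names) != len(rewards):
        raise ValueError("need one reward per drug and at least one drug")
    # Sort from large to small; the first entry then holds the maximum reward.
    ranked: List[Tuple[float, str]] = sorted(
        zip(rewards, drug_names), key=lambda pair: pair[0], reverse=True
    )
    best_reward, best_drug = ranked[0]
    return best_drug, best_reward


# Hypothetical drug labels paired with example user reward parameters.
target_drug, target_reward = select_target_drug(
    ["drug_a", "drug_b", "drug_c"], [5.0, 4.0, 3.0]
)
```

Sorting also yields a full ranking, which could be useful if more than one candidate is to be displayed; when only the top drug is needed, a single pass with `max` would serve equally well.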

In some possible implementations, the target user can then view the target drug on the user interface and send feedback information about the target drug to the computer device. For example, the feedback information may indicate that the target drug differs from the historical drug previously taken by the target user, or that the effect of the target drug on the target user is not as good as the effect of the historical drug. Further, after receiving such feedback information, the computer device can adjust the first network parameter and the second network parameter of the drug reward prediction model so that the model better predicts the first reward parameter and the second reward parameter of any user (such as the target user) under the action of each drug, and then push appropriate drug information to the target user.

In the embodiment of the present application, the computer device may input the attribute information of the target user into the drug reward prediction model, and output each first target reward parameter and each second target reward parameter of the target user under the action of each drug through the drug reward prediction model. The drug reward prediction model can thus output the first target reward parameter and the second target reward parameter at the same time, with the reward for the long-term outcome evaluated by the first target reward parameter and the reward for the short-term outcome evaluated by the second target reward parameter. This enhances the scalability of the drug reward prediction model and improves the interpretability, safety, selectivity and traceability of the model. Further, the computer device may determine each user reward parameter of the target user under the action of each drug based on each first target reward parameter of the target user and/or each second target reward parameter of the target user. The computer device can then determine the maximum user reward parameter from the user reward parameters, and output the drug information of the target drug with the maximum user reward parameter to the user interface so as to display the target drug to the target user, thereby improving the accuracy of drug information pushing and providing strong applicability.

Further, please refer to FIG. 4 , which is a schematic structural diagram of a drug information push device provided by an embodiment of the present application. The drug information push device may be a computer program (including program code) running in a computer device; for example, the drug information push device is application software. The drug information push device may be used to execute the corresponding steps in the method provided by the embodiments of the present application. As shown in FIG. 4 , the drug information pushing apparatus 1 may run on a computer device, and the computer device may be the server 10 in the embodiment corresponding to FIG. 1 above. The drug information pushing device 1 may include: a data acquisition module 10 , a sample input module 20 , a parameter training module 30 , an information input module 40 , a parameter output module 50 , a parameter determination module 60 and an information display module 70 .

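The information input module 40 is used to obtain the target user attribute information of the target user, and input the target user attribute information into the drug reward prediction model, wherein the target user attribute information includes at least one of demographic information, health indicators of medication for the target disease, and historical medication information.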

In some possible implementations, the user interface includes an attribute information input area;

The above-mentioned information input module 40 includes: an information acquisition unit 401 .

The information acquisition unit 401 is configured to acquire target user attribute information of the target user based on the input instruction when an input instruction on the attribute information input area is detected.

For the specific implementation manner of the information obtaining unit 401, reference may be made to the description of step S101 in the embodiment corresponding to FIG. 2, which will not be repeated here.

The parameter output module 50 is configured to output each first target reward parameter and each second target reward parameter of the target user under the action of each drug through the drug reward prediction model, wherein the drug reward prediction model includes a first network parameter and a second network parameter, the first network parameter is used to determine the first reward parameter, under the action of each drug, of any user having any user attribute information, and the second network parameter is used to determine the second reward parameter of that user under the action of each drug. Any user corresponds to one first reward parameter and one second reward parameter under the action of one drug, and the drug action duration corresponding to the first reward parameter is greater than the drug action duration corresponding to the second reward parameter.

The parameter determination module 60 is configured to determine each user reward parameter of the target user under the action of each drug based on each first target reward parameter of the target user and/or each second target reward parameter of the target user, wherein the target user corresponds to one user reward parameter under the action of one drug.

In some feasible implementations, the parameter determination module 60 includes: a weighting coefficient determination unit 601 , a first reward parameter determination unit 602 and a second reward parameter determination unit 603 .

The weighting coefficient determination unit 601 is configured to determine the first weighting coefficient of the first target reward parameter and the second weighting coefficient of the second target reward parameter;

The first reward parameter determination unit 602 is configured to determine each first weighted reward parameter corresponding to each first target reward parameter based on the first weighting coefficient and each first target reward parameter of the target user, and to determine each second weighted reward parameter corresponding to each second target reward parameter based on the second weighting coefficient and each second target reward parameter of the target user;

The second reward parameter determining unit 603 is configured to determine, based on each first weighted reward parameter and each second weighted reward parameter, each user reward parameter of the target user under the action of each drug, wherein one first weighted reward parameter and one second weighted reward parameter correspond to one user reward parameter.

For the specific implementation of the weighting coefficient determination unit 601, the first reward parameter determination unit 602, and the second reward parameter determination unit 603, reference may be made to the description of step S103 in the embodiment corresponding to FIG. 2, which will not be repeated here.

In some feasible implementations, the above parameter determination module 60 further includes: a third reward parameter determination unit 604 .

The third reward parameter determining unit 604 is configured to determine each first target reward parameter of the target user as each user reward parameter of the target user under the action of each drug;

The maximum user reward parameter is the maximum first target reward parameter among the first target reward parameters.

The specific implementation of the third reward parameter determining unit 604 may refer to the description of step S103 in the above-mentioned embodiment corresponding to FIG. 2 , which will not be repeated here.

In some feasible implementations, the above parameter determination module 60 further includes: a fourth reward parameter determination unit 605 .

The fourth reward parameter determination unit 605 is configured to determine each second target reward parameter of the target user as each user reward parameter of the target user under the action of each drug;

The maximum user reward parameter is the maximum second target reward parameter among the second target reward parameters.

The specific implementation of the fourth reward parameter determination unit 605 may refer to the description of step S103 in the above-mentioned embodiment corresponding to FIG. 2 , which will not be repeated here.

The information display module 70 is used for determining the maximum user reward parameter from each user reward parameter, and outputting the drug information of the target drug with the maximum user reward parameter to the user interface to display the target drug to the target user.

In some feasible embodiments, the above-mentioned drug information push device 1 further includes:

The data acquisition module 10 is used for acquiring sample data of at least two users, and the sample data of one user includes user attribute information and sample drug information of the user;

The sample input module 20 is used to obtain each first sample reward parameter and each second sample reward parameter of each user under the action of the sample drug indicated by the sample drug information, and to input the sample data of the at least two users, each first sample reward parameter and each second sample reward parameter into the drug reward prediction model;

The parameter training module 30 is used for training the first network parameter and the second network parameter of the drug reward prediction model based on the user attribute information of the at least two users, each first sample reward parameter and each second sample reward parameter, so as to obtain the ability to predict, based on the user attribute information of any user, the first reward parameter and the second reward parameter of that user under the action of each drug.

In some feasible implementations, the first network parameter includes a first model parameter and a first return parameter, and the second network parameter includes a second model parameter and a second return parameter;

The above-mentioned parameter training module 30 includes: an expected parameter determination unit 301 , a loss value determination unit 302 and a parameter update unit 303 .

The expected parameter determination unit 301 is configured to determine each first expected reward parameter of each user under the action of the sample drug based on the first model parameter and the first return parameter, and to determine each second expected reward parameter of each user under the action of the sample drug based on the second model parameter and the second return parameter;

The loss value determination unit 302 is configured to determine each loss value corresponding to each user's sample data based on the first return parameter, the second return parameter, each first sample reward parameter, each second sample reward parameter, each first expected reward parameter and each second expected reward parameter;

The parameter updating unit 303 is configured to iteratively update the parameter value of the first model parameter and the parameter value of the second model parameter based on each loss value until the loss value remains unchanged, so as to obtain the ability to predict, based on the user attribute information of any user, the first reward parameter and the second reward parameter of that user under the action of each drug.

For the specific implementation of the expected parameter determination unit 301, the loss value determination unit 302 and the parameter update unit 303, reference may be made to the description of the model training of the drug reward prediction model in step S101 of the above-mentioned embodiment corresponding to FIG. 2, which will not be repeated here.
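The stopping rule used by the parameter update unit 303 (iterate until the loss value remains unchanged) can be illustrated with the following Python sketch. It is deliberately simplified and rests on several assumptions that are not taken from the present application: a single scalar model parameter, a mean squared error loss between expected and sample rewards, plain gradient descent, and the hypothetical names `train_until_stable`, `lr`, `tol` and `max_iters`. It does not reproduce the loss of the drug reward prediction model, which also depends on the return parameters.

```python
from typing import Sequence, Tuple


def train_until_stable(samples: Sequence[Tuple[float, float]],
                       lr: float = 0.1,
                       tol: float = 1e-9,
                       max_iters: int = 10_000) -> Tuple[float, float]:
    """Fit expected_reward = w * feature and stop once the loss stops changing.

    samples -- (feature, sample_reward) pairs; an assumed toy data layout
    Returns the trained parameter w and the last computed loss value.
    """
    if not samples:
        raise ValueError("at least one sample is required")
    n = len(samples)
    w = 0.0
    prev_loss = float("inf")
    loss = prev_loss
    for _ in range(max_iters):
        # Assumed loss: mean squared error between expected and sample rewards.
        loss = sum((w * x - r) ** 2 for x, r in samples) / n
        if abs(prev_loss - loss) < tol:  # loss value "remains unchanged"
            break
        grad = sum(2.0 * (w * x - r) * x for x, r in samples) / n
        w -= lr * grad
        prev_loss = loss
    return w, loss


# Toy data generated from reward = 2 * feature.
trained_w, final_loss = train_until_stable([(1.0, 2.0), (2.0, 4.0)])
```

In practice "unchanged" has to be read as "changed by less than a small tolerance", since floating-point loss values rarely repeat exactly; the `max_iters` cap guards against runs that never settle.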

For the specific implementation of the data acquisition module 10 , the sample input module 20 , the parameter training module 30 , the information input module 40 , the parameter output module 50 , the parameter determination module 60 and the information display module 70 , reference may be made to the description of step S101 to step S104 in the embodiment corresponding to FIG. 2 above, which will not be repeated here. In addition, the description of the beneficial effects of using the same method will not be repeated.

Further, please refer to FIG. 5 , which is a schematic structural diagram of a computer device provided by an embodiment of the present application. The computer device may include a processor, memory, and a network interface. Optionally, the computer device may also include a user interface. For example, as shown in FIG. 5 , the computer device 1000 may be the server 10 in the above-mentioned embodiment corresponding to FIG. 1 , and the computer device 1000 may include: at least one processor 1001 , such as a CPU, at least one network interface 1004 , and user interface 1003 , memory 1005 , at least one communication bus 1002 . Among them, the communication bus 1002 is used to realize the connection and communication between these components. Wherein, the user interface 1003 may include a display screen (display) and a keyboard (keyboard), and the network interface 1004 may optionally include a standard wired interface and a wireless interface (eg, a WI-FI interface). The memory 1005 may be high-speed RAM memory or non-volatile memory, such as at least one disk memory. The memory 1005 may optionally also be at least one storage device located remotely from the aforementioned processor 1001 . As shown in FIG. 5 , the memory 1005 as a computer storage medium may include an operating system, a network communication module, a user interface module, and a device control application program.

In the computer device 1000 shown in FIG. 5 , the network interface 1004 is mainly used for network communication with the user terminal; the user interface 1003 is mainly used to provide an input interface for the user; and the processor 1001 may be used to invoke the device control application program stored in the memory 1005 to implement the drug information push method.

It should be understood that the computer device 1000 described in the embodiment of the present application can perform the drug information pushing method described in the embodiment corresponding to FIG. 2 above, and can also implement the drug information pushing device 1 described in the embodiment corresponding to FIG. 4 above, which will not be repeated here. In addition, the description of the beneficial effects of using the same method will not be repeated.

In addition, it should be pointed out that the embodiments of the present application further provide a computer-readable storage medium that stores the computer program executed by the aforementioned drug information pushing device 1. The computer program includes program instructions; when the processor executes the program instructions, it can perform the drug information pushing method described in the embodiment corresponding to FIG. 2 above, which will therefore not be repeated here. In addition, the description of the beneficial effects of using the same method will not be repeated.

Optionally, the storage medium involved in this application, such as a computer-readable storage medium, may be non-volatile or volatile.

For technical details not disclosed in the computer-readable storage medium embodiments involved in the present application, please refer to the description of the method embodiments of the present application. By way of example, program instructions may be deployed to execute on one computing device, or on multiple computing devices located at one site, or on multiple computing devices distributed across multiple sites and interconnected by a communication network. Multiple computing devices distributed across multiple sites and interconnected by a communication network can form a blockchain system.

In one aspect of the present application, there is provided a computer program product or computer program, the computer program product or computer program including computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the method for pushing drug information provided in the embodiments of the present application.

Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, may include the processes of the embodiments of the above-mentioned methods. The above-mentioned storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM) or the like.

The above-mentioned computer-readable storage medium may be an internal storage unit of the drug information pushing apparatus provided in any of the foregoing embodiments or of the above-mentioned device, such as a hard disk or a memory of an electronic device. The computer-readable storage medium can also be an external storage device of the electronic device, such as a pluggable hard disk, a smart media card (SMC), a secure digital (SD) card or a flash card equipped on the electronic device. The above-mentioned computer-readable storage medium may also include a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), and the like. Further, the computer-readable storage medium may also include both an internal storage unit of the electronic device and an external storage device. The computer-readable storage medium is used to store the computer program and other programs and data required by the electronic device. The computer-readable storage medium can also be used to temporarily store data that has been or will be output.

The terms "first", "second" and the like in the claims and description of the present application and the drawings are used to distinguish different objects, rather than to describe a specific order. Furthermore, the terms "comprising" and "having" and any variations thereof are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device comprising a series of steps or units is not limited to the listed steps or units, but optionally also includes unlisted steps or units, or optionally also includes other steps or units inherent to these processes, methods, products or devices. Reference herein to an "embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the present application. The appearances of this phrase in various places in the specification do not necessarily all refer to the same embodiment, nor to separate or alternative embodiments that are mutually exclusive with other embodiments. Those skilled in the art understand, both explicitly and implicitly, that the embodiments described herein may be combined with other embodiments. As used in this specification and the appended claims, the term "and/or" refers to and includes any and all possible combinations of one or more of the associated listed items.

Those of ordinary skill in the art can realize that the units and algorithm steps of each example described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the above description has generally described the components and steps of each example in terms of function. Skilled artisans may implement the described functionality using different methods for each particular application, but such implementations should not be considered beyond the scope of this application.

The above disclosures are only the preferred embodiments of the present application, and of course, the scope of the rights of the present application cannot be limited by this. Therefore, equivalent changes made according to the claims of the present application are still within the scope of the present application.

Claims

A method for pushing drug information, comprising:

Obtain target user attribute information of the target user, and input the target user attribute information into the drug reward prediction model, where the target user attribute information includes at least one of demographic information, health indicators for drug use for the target disease, and historical drug use information;

Output each first target reward parameter and each second target reward parameter of the target user under the action of each drug through the drug reward prediction model, wherein the drug reward prediction model includes a first network parameter and a second network parameter, the first network parameter is used to determine the first reward parameter, under the action of each drug, of any user having any user attribute information, the second network parameter is used to determine the second reward parameter of said any user under the action of each drug, said any user corresponds to one first reward parameter and one second reward parameter under the action of one drug, and the drug action duration corresponding to the first reward parameter is longer than the drug action duration corresponding to the second reward parameter;

Based on each first target reward parameter of the target user and/or each second target reward parameter of the target user, each user reward parameter of the target user under the action of each drug is determined, wherein the target user corresponds to one user reward parameter under the action of one drug;

A maximum user reward parameter is determined from the user reward parameters, and drug information of the target drug with the maximum user reward parameter is output to the user interface, so as to display the target drug to the target user.
The method of claim 1, wherein the method further comprises:

Obtain sample data of at least two users, and the sample data of one user includes user attribute information and sample drug information of the user;

Obtain each first sample reward parameter and each second sample reward parameter of each user under the action of the sample drug indicated by the sample drug information, and input the sample data of the at least two users, each of the first sample reward parameters and each of the second sample reward parameters into the drug reward prediction model;

Based on the user attribute information of the at least two users, the first sample reward parameters and the second sample reward parameters, the first network parameter and the second network parameter of the drug reward prediction model are trained to obtain the ability to predict, based on the user attribute information of any user, the first reward parameter and the second reward parameter of said any user under the action of each drug.
The method according to claim 2, wherein the first network parameter comprises a first model parameter and a first return parameter, and the second network parameter comprises a second model parameter and a second return parameter;

the training of the first network parameter and the second network parameter of the drug reward prediction model based on the user attribute information of the at least two users, the first sample reward parameters and the second sample reward parameters comprises:

Each first expected reward parameter of each user under the action of the sample drug is determined based on the first model parameter and the first return parameter, and each second expected reward parameter of each user under the action of the sample drug is determined based on the second model parameter and the second return parameter;

Each loss value corresponding to the sample data of each user is determined based on the first return parameter, the second return parameter, the first sample reward parameters, the second sample reward parameters, the first expected reward parameters, and the second expected reward parameters;

Iteratively update the parameter value of the first model parameter and the parameter value of the second model parameter based on the loss values until the loss value remains unchanged, so as to obtain the ability to predict, based on the user attribute information of any user, the first reward parameter and the second reward parameter of said any user under the action of each drug.
The method according to claim 3, wherein the determining each user reward parameter of the target user under the action of each drug based on each first target reward parameter of the target user and each second target reward parameter of the target user comprises:

determining a first weighting coefficient of the first target reward parameter and a second weighting coefficient of the second target reward parameter;

Each first weighted reward parameter corresponding to each of the first target reward parameters is determined based on the first weighting coefficient and each first target reward parameter of the target user, and each second weighted reward parameter corresponding to each of the second target reward parameters is determined based on the second weighting coefficient and each second target reward parameter of the target user;

Each user reward parameter of the target user under the action of each drug is determined based on each first weighted reward parameter and each second weighted reward parameter, wherein one first weighted reward parameter and one second weighted reward parameter correspond to one user reward parameter.
The method according to claim 3, wherein the determining each user reward parameter of the target user under the action of each drug based on each first target reward parameter of the target user comprises:

Determining each first target reward parameter of the target user as each user reward parameter of the target user under the action of each drug;

Wherein, the maximum user reward parameter is the maximum first target reward parameter among the first target reward parameters.
The method according to claim 3, wherein the determining each user reward parameter of the target user under the action of each drug based on each second target reward parameter of the target user comprises:

Determining each second target reward parameter of the target user as each user reward parameter of the target user under the action of each drug;

Wherein, the maximum user reward parameter is the maximum second target reward parameter among the second target reward parameters.
The method according to any one of claims 1-6, wherein the user interface includes an attribute information input area;

The acquiring target user attribute information of the target user includes:

When an input instruction on the attribute information input area is detected, target user attribute information of the target user is acquired based on the input instruction.
A drug information push device, comprising:

The information input module is used to obtain the target user attribute information of the target user, and input the target user attribute information into the drug reward prediction model, where the target user attribute information includes at least one of demographic information, health indicators for drug use for the target disease, and historical drug use information;

A parameter output module, configured to output each first target reward parameter and each second target reward parameter of the target user under the action of each drug through the drug reward prediction model, wherein the drug reward prediction model includes a first network parameter and a second network parameter, the first network parameter is used to determine the first reward parameter of any user with any user attribute information under the action of various drugs, and the second network parameter is used to determine the second reward parameter of the any user under the action of various drugs, the any user corresponds to a first reward parameter and a second reward parameter under the action of a drug, and the drug action duration corresponding to the first reward parameter is greater than the drug action duration corresponding to the second reward parameter;

A parameter determination module, configured to determine each user reward parameter of the target user under the action of each drug based on each first target reward parameter of the target user and/or each second target reward parameter of the target user, wherein the target user corresponds to a user reward parameter under the action of a drug;

The information display module is used to determine the maximum user reward parameter from the user reward parameters, and output the drug information of the target drug with the maximum user reward parameter to the user interface, so as to display the target drug to the target user.
A computer device, comprising: a processor, a memory and a network interface;

The processor is connected to the memory and the network interface, wherein the network interface is used to provide a data communication function, the memory is used to store program code, and the processor is used to call the program code to execute a drug information push method, the drug information push method including:

Obtain target user attribute information of the target user, and input the target user attribute information into the drug reward prediction model, where the target user attribute information includes at least one of demographic information, health indicators for drug use for the target disease, and historical drug use information;

Output each first target reward parameter and each second target reward parameter of the target user under the action of each drug through the drug reward prediction model, wherein the drug reward prediction model includes a first network parameter and a second network parameter, the first network parameter is used to determine the first reward parameter of any user with any user attribute information under the action of various drugs, and the second network parameter is used to determine the second reward parameter of the any user under the action of various drugs, the any user corresponds to a first reward parameter and a second reward parameter under the action of a drug, and the drug action duration corresponding to the first reward parameter is longer than the drug action duration corresponding to the second reward parameter;

Based on each first target reward parameter of the target user and/or each second target reward parameter of the target user, each user reward parameter of the target user under the action of each drug is determined, wherein the target user corresponds to a user reward parameter under the action of a drug;

A maximum user reward parameter is determined from the user reward parameters, and drug information of the target drug with the maximum user reward parameter is output to the user interface, so as to display the target drug to the target user.
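
By way of non-limiting illustration, the final selection step might be sketched as follows, assuming the user reward parameters are already available as a mapping from a drug identifier to a number. The function name `select_target_drug` and the drug identifiers are hypothetical and do not appear in the application.

```python
from typing import Dict


def select_target_drug(user_rewards: Dict[str, float]) -> str:
    """Return the identifier of the drug with the maximum user reward parameter.

    user_rewards maps each candidate drug to the target user's user reward
    parameter under the action of that drug (one value per drug).
    """
    if not user_rewards:
        raise ValueError("at least one candidate drug is required")
    # max() with key=... compares drugs by their reward value.
    return max(user_rewards, key=lambda drug: user_rewards[drug])
```

The returned identifier would then be used to look up the drug information that is output to the user interface.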
The computer device according to claim 9, wherein when the processor executes the drug information push method, the method further comprises:

Obtain sample data of at least two users, and the sample data of one user includes user attribute information and sample drug information of the user;

Obtain each first sample reward parameter and each second sample reward parameter of each user under the action of the sample drug indicated by the sample drug information, and input the sample data of the at least two users, each first sample reward parameter and each second sample reward parameter into the drug reward prediction model;

Based on the user attribute information of the at least two users, the first sample reward parameters and the second sample reward parameters, the first network parameter and the second network parameter of the drug reward prediction model are trained to obtain the ability to predict, based on the user attribute information of any user, the first reward parameter and the second reward parameter of that user under the action of each drug.
The computer device of claim 10, wherein the first network parameter includes a first model parameter and a first return parameter, and the second network parameter includes a second model parameter and a second return parameter;

Performing the training of the first network parameter and the second network parameter of the drug reward prediction model based on the user attribute information of the at least two users, the first sample reward parameters and the second sample reward parameters includes:

Each first expected reward parameter of each user under the action of the sample drug is determined based on the first model parameter and the first return parameter, and each second expected reward parameter of each user under the action of the sample drug is determined based on the second model parameter and the second return parameter;

Each loss value corresponding to the sample data of each user is determined based on the first return parameter, the second return parameter, the first sample reward parameters, the second sample reward parameters, the first expected reward parameters, and the second expected reward parameters;

Iteratively update the parameter value of the first model parameter and the parameter value of the second model parameter based on the loss values until the loss values remain unchanged, so as to obtain the ability to predict, based on the user attribute information of any user, the first reward parameter and the second reward parameter of that user under the action of each drug.
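
By way of non-limiting illustration, the iterative update for one of the two reward heads might be sketched as follows. The claim does not specify the model form, the loss, or the role of the return parameter; this sketch assumes a linear model, a mean squared error loss, plain gradient descent, and a return parameter that is a fixed scalar scaling the model output. All names (`train_head`, `return_param`, `lr`, `tol`) are hypothetical.

```python
from typing import List, Sequence, Tuple


def train_head(
    features: Sequence[Sequence[float]],
    sample_rewards: Sequence[float],
    return_param: float = 1.0,
    lr: float = 0.05,
    tol: float = 1e-12,
    max_iters: int = 10_000,
) -> Tuple[List[float], float]:
    """Fit the model parameter of one head until the loss stops changing."""
    n_features = len(features[0])
    weights = [0.0] * n_features  # model parameter: the only value updated
    prev_loss = float("inf")
    loss = prev_loss
    for _ in range(max_iters):
        # Expected reward from the model parameter and the (fixed) return parameter.
        expected = [
            return_param * sum(w * x for w, x in zip(weights, row))
            for row in features
        ]
        errors = [e - r for e, r in zip(expected, sample_rewards)]
        loss = sum(err * err for err in errors) / len(errors)
        # Stop once the loss value remains (numerically) unchanged.
        if abs(prev_loss - loss) <= tol:
            break
        prev_loss = loss
        # Gradient of the mean squared error with respect to each weight.
        grads = [
            2.0 * return_param
            * sum(err * row[j] for err, row in zip(errors, features))
            / len(errors)
            for j in range(n_features)
        ]
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights, loss
```

Under these assumptions the same routine would be run once with the first sample reward parameters and once with the second sample reward parameters, giving the first and second model parameters respectively.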
12. The computer device according to claim 11, wherein performing the determining of each user reward parameter of the target user under the action of each drug based on each first target reward parameter of the target user and each second target reward parameter of the target user comprises:

determining a first weighting coefficient of the first target reward parameter and a second weighting coefficient of the second target reward parameter;

Each first weighted reward parameter corresponding to each of the first target reward parameters is determined based on the first weighting coefficient and each first target reward parameter of the target user, and each second weighted reward parameter corresponding to each of the second target reward parameters is determined based on the second weighting coefficient and each second target reward parameter of the target user;

Each user reward parameter of the target user under the action of each drug is determined based on each first weighted reward parameter and each second weighted reward parameter, wherein one first weighted reward parameter and one second weighted reward parameter correspond to one user reward parameter.
The computer device according to claim 11, wherein performing the determining of each user reward parameter of the target user under the action of each drug based on each first target reward parameter of the target user comprises:

Determining each first target reward parameter of the target user as each user reward parameter of the target user under the action of each drug;

Wherein, the maximum user reward parameter is the maximum first target reward parameter among the first target reward parameters.
The computer device according to claim 11, wherein performing the determining of each user reward parameter of the target user under the action of each drug based on each second target reward parameter of the target user comprises:

Determining each second target reward parameter of the target user as each user reward parameter of the target user under the action of each drug;

Wherein, the maximum user reward parameter is the maximum second target reward parameter among the second target reward parameters.
A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, the computer program includes program instructions, and when the program instructions are executed by a processor, a drug information push method is executed, the drug information push method including:

Obtain target user attribute information of the target user, and input the target user attribute information into the drug reward prediction model, where the target user attribute information includes at least one of demographic information, health indicators for drug use for the target disease, and historical drug use information;

Output each first target reward parameter and each second target reward parameter of the target user under the action of each drug through the drug reward prediction model, wherein the drug reward prediction model includes a first network parameter and a second network parameter, the first network parameter is used to determine the first reward parameter of any user with any user attribute information under the action of various drugs, and the second network parameter is used to determine the second reward parameter of the any user under the action of various drugs, the any user corresponds to a first reward parameter and a second reward parameter under the action of a drug, and the drug action duration corresponding to the first reward parameter is longer than the drug action duration corresponding to the second reward parameter;

Based on each first target reward parameter of the target user and/or each second target reward parameter of the target user, each user reward parameter of the target user under the action of each drug is determined, wherein the target user corresponds to a user reward parameter under the action of a drug;

A maximum user reward parameter is determined from the user reward parameters, and drug information of the target drug with the maximum user reward parameter is output to the user interface, so as to display the target drug to the target user.
The computer-readable storage medium according to claim 15, wherein when the processor executes the method for pushing drug information, the method further comprises:

Obtain sample data of at least two users, and the sample data of one user includes user attribute information and sample drug information of the user;

Obtain each first sample reward parameter and each second sample reward parameter of each user under the action of the sample drug indicated by the sample drug information, and input the sample data of the at least two users, each first sample reward parameter and each second sample reward parameter into the drug reward prediction model;

Based on the user attribute information of the at least two users, the first sample reward parameters and the second sample reward parameters, the first network parameter and the second network parameter of the drug reward prediction model are trained to obtain the ability to predict, based on the user attribute information of any user, the first reward parameter and the second reward parameter of that user under the action of each drug.
17. The computer-readable storage medium of claim 16, wherein the first network parameter includes a first model parameter and a first return parameter, and the second network parameter includes a second model parameter and a second return parameter;

Performing the training of the first network parameter and the second network parameter of the drug reward prediction model based on the user attribute information of the at least two users, the first sample reward parameters and the second sample reward parameters includes:

Each first expected reward parameter of each user under the action of the sample drug is determined based on the first model parameter and the first return parameter, and each second expected reward parameter of each user under the action of the sample drug is determined based on the second model parameter and the second return parameter;

Each loss value corresponding to the sample data of each user is determined based on the first return parameter, the second return parameter, the first sample reward parameters, the second sample reward parameters, the first expected reward parameters, and the second expected reward parameters;

Iteratively update the parameter value of the first model parameter and the parameter value of the second model parameter based on the loss values until the loss values remain unchanged, so as to obtain the ability to predict, based on the user attribute information of any user, the first reward parameter and the second reward parameter of that user under the action of each drug.
18. The computer-readable storage medium of claim 17, wherein performing the determining of each user reward parameter of the target user under the action of each drug based on each first target reward parameter of the target user and each second target reward parameter of the target user comprises:

determining a first weighting coefficient of the first target reward parameter and a second weighting coefficient of the second target reward parameter;

Each first weighted reward parameter corresponding to each of the first target reward parameters is determined based on the first weighting coefficient and each first target reward parameter of the target user, and each second weighted reward parameter corresponding to each of the second target reward parameters is determined based on the second weighting coefficient and each second target reward parameter of the target user;

Each user reward parameter of the target user under the action of each drug is determined based on each first weighted reward parameter and each second weighted reward parameter, wherein one first weighted reward parameter and one second weighted reward parameter correspond to one user reward parameter.
The computer-readable storage medium according to claim 17, wherein performing the step of determining each user reward parameter of the target user under the action of each drug based on each first target reward parameter of the target user comprises:

Determining each first target reward parameter of the target user as each user reward parameter of the target user under the action of each drug;

Wherein, the maximum user reward parameter is the maximum first target reward parameter among the first target reward parameters.
The computer-readable storage medium according to claim 17, wherein performing the step of determining each user reward parameter of the target user under the action of each drug based on each second target reward parameter of the target user comprises:

Determining each second target reward parameter of the target user as each user reward parameter of the target user under the action of each drug;

Wherein, the maximum user reward parameter is the maximum second target reward parameter among the second target reward parameters.