CN103150595B - Automatic pairing selection method and device in a data processing system - Google Patents

Automatic pairing selection method and device in a data processing system

Info

Publication number
CN103150595B
CN103150595B CN201110400345.2A CN201110400345A
Authority
CN
China
Prior art keywords
pairing
user
value
type
selection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201110400345.2A
Other languages
Chinese (zh)
Other versions
CN103150595A (en)
Inventor
佘锡伟
谭志远
杜嘉辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201110400345.2A priority Critical patent/CN103150595B/en
Publication of CN103150595A publication Critical patent/CN103150595A/en
Application granted granted Critical
Publication of CN103150595B publication Critical patent/CN103150595B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an automatic pairing selection method and apparatus. The method comprises: A. accepting a pairing request for a user; B. detecting the state information of the user at the current moment, and calculating the selection probabilities of the different pairing object types corresponding to that state information by using the mapping relation between states in a preset reinforcement learning function and the pairing object type selection probabilities; C. selecting a pairing object according to the selection probabilities; D. detecting the reaction action information of the user to the selected pairing object, and performing feedback correction on the corresponding state and the corresponding pairing object type selection probability in the reinforcement learning function according to the obtained reaction action information. The apparatus comprises: a request receiving module, a state information detection module, a reinforcement learning module, an object selection module, a reaction information detection module and a correction module. With the present invention, the relevance between the selected pairing object and the pairing requester can be improved, and the final pairing success rate improved accordingly.

Description

Automatic pairing selection method and device in data processing system
Technical Field
The present invention relates to network data processing technologies, and in particular, to a method and an apparatus for automatic pairing selection in a network data processing system.
Background
At present, with the development of internet technology, network data processing systems for various vertical fields have emerged and developed in succession, meeting the specialized requirements of users in those fields. For example, communication network systems meet users' real-name or anonymous communication requirements, social network systems meet their social requirements, e-commerce platform systems meet their commodity purchasing requirements, web blog systems meet their log display requirements, literature network systems meet their reading requirements, and so on.
In a network data processing system, the system must in many cases perform a selective pairing process according to a user's request, for example: selecting an anonymous communication object for a user in an anonymous communication system; selecting and recommending friends for a user in other communication networks such as an instant communication network or a social network system; recommending specific commodities for a user in an e-commerce platform system; recommending specific logs for a user in a web blog system; recommending articles for a user in a literature network system; and so on.
In the current network data processing system, the following two methods are generally used for the background service system to select one of a plurality of candidate pairing objects for pairing according to the pairing request of the requesting party.
(I) Adopting a completely random pairing mode.
In this mode, after a pairing request from a user is received, a pairing object is randomly selected for the user. For example, in an anonymous communication system, a communication request sent by an information receiving party is a pairing request; after receiving the communication request, the system selects one of a plurality of pairing objects (information dissemination units) for the information receiving party, and then establishes communication between the initiator of that information dissemination unit and the receiving party.
The disadvantage of this random pairing approach is that the relevance between the randomly selected pairing object and the pairing requester is extremely low; users are often not satisfied with the randomly selected pairing object, so the final pairing success rate is extremely low.
(II) Setting a static pairing mode according to manual experience.
For example, in an existing anonymous communication system called "drift bottle", the pairing probabilities of the "directional bottle" and the "communication bottle" are both set manually according to the region or gender information of the user. Compared with a completely random pairing strategy, setting a fixed pairing strategy according to users' attribute characteristics somewhat improves the relevance between the selected information dissemination unit and the receiving party's state information, but the method still has the following defects:
1) The pairing strategy is designed by hand according to the preference of most users and applied to all users globally, ignoring the personalized requirements of different users, so the relevance between some users' state information and the paired objects is not high.
2) A user's preference for the anonymous object to be communicated with may change with the date (e.g., working day versus holiday) and the time period; the manual static pairing method cannot follow such dynamic changes in user state, so under certain dynamic conditions the relevance between the user's state information and the paired object is not high.
3) Since pairing is set by a human based on experience, the pairing probabilities are generally estimates, and it is difficult to produce a pairing object highly related to the receiving party.
4) The response to user feedback is slow. Although a manually set static pairing strategy can be adjusted by observing and analyzing users' usage over a period of time, this feedback mechanism has a long cycle and cannot quickly adjust the strategy to users' actual usage.
In summary, in the existing method for selecting a pairing object for a pairing request party by a network data processing system, the relevance between the selected pairing object and the state (including a static state and a dynamic state) of the pairing request party is not high, and the pairing request party is often not satisfied with a pairing result, which results in a low final pairing success rate.
Disclosure of Invention
In view of the above, the present invention provides an automatic pairing selection method and apparatus in a data processing system, so as to improve the correlation between a selected pairing object and a pairing request party.
The technical scheme of the invention is realized as follows:
an automatic pairing selection method in a data processing system, comprising:
A. accepting a pairing request for a user;
B. detecting the state information of the user at the current moment, and calculating the selection probabilities of the different pairing object types corresponding to that state information by using the mapping relation between states in a preset reinforcement learning function and the pairing object type selection probabilities;
C. selecting a pairing object according to the selection probabilities;
D. detecting the reaction action information of the user to the selected pairing object, and performing feedback correction on the corresponding state and the corresponding pairing object type selection probability in the reinforcement learning function according to the obtained reaction action information.
An automatic pairing selection apparatus in a data processing system, comprising:
the request receiving module is used for receiving a pairing request aiming at a user and triggering the state information detection module after receiving the pairing request;
the state information detection module is used for detecting the state information of the user at the current moment and inputting the state information into the reinforcement learning module;
the reinforcement learning module is used for storing the mapping relation between states in the reinforcement learning function and the pairing object type selection probabilities, and for calculating the selection probabilities of the different pairing object types corresponding to the state information of the user at the current moment by using that mapping relation;
the object selection module is used for selecting a pairing object according to the selection probability calculated by the reinforcement learning module;
the response information detection module is used for detecting the response action information of the user to the selected pairing object;
and the correction module is used for carrying out feedback correction on the selection probability of the corresponding state and the corresponding pairing object type in the reinforcement learning function according to the reaction action information detected by the reaction information detection module.
Compared with the prior art, the method inputs the state information of the user into the reinforcement learning function, calculates the selection probability of different pairing object types corresponding to the state information of the user by utilizing the mapping relation between the state in the reinforcement learning function and the pairing object type selection probability, selects the pairing object according to the selection probability, and carries out feedback correction on the reinforcement learning function according to the reaction of the user. Therefore, the pairing object can be selected according to the user state, the correlation degree of the selected pairing object and the pairing requester is improved, and the final pairing success rate is further improved.
Drawings
FIG. 1 is a flow chart of an automatic pair selection method according to the present invention;
FIG. 2 is a schematic diagram of an embodiment of the automatic pairing-selection apparatus according to the present invention;
FIG. 3 is a schematic diagram of an embodiment of a method for automatically pairing and selecting a communication object in an anonymous communication system according to the present invention;
FIG. 4 is a schematic diagram of the types and distribution of automatic pairing selection devices according to the present invention;
FIG. 5 is a detailed flow chart of initializing an automatic pairing selection device according to the invention;
FIG. 6 is a diagram illustrating an embodiment of external intervention in a selection policy through threshold shifting according to the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
FIG. 1 is a flow chart of a method for automatic pair selection in a data processing system according to the present invention. Referring to fig. 1, the method of the present invention mainly comprises:
step 101, accepting a pairing request for a user;
step 102, detecting the state information of the user at the current moment, and calculating the selection probabilities of the different pairing object types corresponding to that state information by using the mapping relation between states in a preset reinforcement learning function and the pairing object type selection probabilities;
step 103, selecting a pairing object according to the selection probabilities;
step 104, detecting the reaction action information of the user to the selected pairing object, and performing feedback correction on the corresponding state and the corresponding pairing object type selection probability in the reinforcement learning function according to the obtained reaction action information.
The pairing request for the user according to the present invention may be: a pairing request initiated online by an online user, the pairing request being directed to the user initiating the request, such as a pairing request initiated by a user in an anonymous communication system; or, the system side may initiate a pairing request for a certain user or each user in the system in an offline condition of the user, such as a request of pairing the user and a commodity initiated in an e-commerce platform system for recommending a specific commodity for the user, a request of pairing the user and a specific log initiated in a web blog system for recommending a specific log for the user, and so on.
And then, if the user initiates a new pairing request, returning to the step 101 to perform the selection and corresponding correction processes again, and enabling the selection probabilities of different states and corresponding pairing object types to approach the real requirement of the user through a large number of selection and correction processes. Therefore, the pairing object can be selected according to the user state, the correlation degree of the selected pairing object and the pairing requester is improved, and the final pairing success rate is further improved.
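For illustration only, the loop of steps 101 to 104 can be sketched as follows; this is a minimal Python sketch, and all names, the toy state features, and the table-based Q storage are assumptions made here rather than details fixed by the invention:

```python
import math
import random

def current_state(user):
    # Step 102: assemble a toy state key from static and dynamic information.
    return (user["gender"], user["city"], user["hour"] // 6)

def selection_probabilities(q_table, state, types, tau=0.5):
    # Step 102: map stored Q values to per-type selection probabilities
    # (a Boltzmann mapping, anticipating formula (6) below).
    weights = {t: math.exp(q_table.get((state, t), 0.0) / tau) for t in types}
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}

def handle_pairing_request(user, q_table, types, alpha=0.1):
    state = current_state(user)
    probs = selection_probabilities(q_table, state, types)
    chosen = max(probs, key=probs.get)   # step 103: pick a pairing object type
    reward = random.uniform(-1, 1)       # step 104: stand-in for the user's reaction
    q = q_table.get((state, chosen), 0.0)
    q_table[(state, chosen)] = q + alpha * (reward - q)  # step 104: feedback correction
    return chosen
```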
FIG. 2 is a block diagram of an automatic pair selection apparatus in a data processing system according to the present invention. Referring to fig. 2, the apparatus 200 includes:
a request receiving module 201, configured to accept a pairing request for a user, and trigger the state information detecting module 202 after receiving the pairing request.
A state information detecting module 202, configured to detect state information of the user at the current time, and input the state information to the reinforcement learning module 203.
The reinforcement learning module 203 is used for storing the mapping relation between states in the reinforcement learning function and the pairing object type selection probabilities, and for calculating the selection probabilities of the different pairing object types corresponding to the state information of the user at the current moment by using that mapping relation. The reinforcement learning module 203 includes an approximation function learner, and the mapping relation between states in the reinforcement learning function and the pairing object type selection probabilities is stored by the approximation function learner.
An object selection module 204, configured to select a pairing object according to the selection probability calculated by the reinforcement learning module 203.
And a reaction information detection module 205, configured to detect reaction action information of the user on the selected pairing object.
And a correcting module 206, configured to perform feedback correction on the selection probabilities of the corresponding states and the corresponding pairing object types in the reinforcement learning function according to the reaction action information detected by the reaction information detecting module.
The correcting module 206 may further be configured to notify the state information detecting module 202 to detect the latest user state information, and perform feedback correction on the selection probability of the corresponding state and the corresponding pairing object type in the reinforcement learning function according to the latest user state information and the reaction action information detected by the reaction information detecting module.
The pairing selection scheme described in the present invention is based on a reinforcement learning function. Reinforcement learning is an important online strategy learning method in artificial intelligence. It treats behavior learning as a trial-and-error process, mapping dynamic environment states to corresponding actions. In a reinforcement learning problem, when the controlled system transfers from one state to another, a value called a reward (payoff) is obtained; this reward value represents the reward or punishment for the state transfer and is used to adjust subsequent state-transfer actions. The control aim of the system is to find an action control strategy that maximizes the discounted sum of future rewards. The functional expression for this value is a prediction of the return for each state, as in equation (1) below:
$$V(s_t) = E\left\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k} \right\} \qquad (1)$$
where $r_t$ is the immediate reward obtained when the state vector $s_t$ transfers to $s_{t+1}$, and $\gamma$ is a discount factor ($0 < \gamma < 1$). $V(s_t)$ represents the discounted sum of rewards from time t onward, which depends on the subsequently selected actions. The system control needs to find the corresponding actions so that the value of $V(s_t)$ at each state is maximal.
The reinforcement learning function is a "state-action" control method, in which mapping relations between different states and different pairing object type selection probabilities are set. The mapping relation can be stored in an ordinary mapping table or by an approximation function learner. The approximation function learner may be, for example, a back propagation neural network (BPNN, BP neural network for short), or another approximation function learner such as a support vector machine or a radial basis function neural network.
In the reinforcement learning function, the initial values of the selection probabilities of the different pairing object types corresponding to a given state can be set empirically or randomly, with values in the range [-1, 1]. The reinforcement learning function then runs a continuous select-and-correct process: pairing object types are selected according to the current selection probabilities of the different pairing object types in different states, and the selection probability of the selected pairing object type in the corresponding state is feedback-corrected according to the user's reaction to the selected pairing object, so that through a large number of selection-correction rounds the final selection probability of each pairing object type in each state approaches the user's real pairing requirement. This select-and-correct data processing is therefore figuratively referred to as a "learning" process.
In one embodiment of the invention, the reinforcement learning function is a Q learning function. The Q learning function is one kind of reinforcement learning function, based on the Markov Decision Process (MDP). Compared with other "state-action" control methods for reinforcement learning (such as dynamic programming methods), the Q learning function needs no prior environment model for action selection; it learns the relations among states, actions and reward values through interaction with the environment. Q learning is an unsupervised learning method that requires no pre-existing training samples; it relies only on exploring the optional actions for whatever states are encountered through dynamic trial and error, and on observing and feeding back the corresponding results.
In the Q learning function, the estimated reward value of each selection action in each state is generally referred to as a Q value. Let $a$ denote a selection action and $A = \{a_1, a_2, \dots, a_m\}$ the set of selectable actions. In the present invention, the estimated reward value of a given selection action in a given state in the Q learning function is the selection probability of the corresponding pairing object type in that state, referred to as the Q value of that pairing object type in that state. Different pairing object types in different states have corresponding Q values, and the mapping relation between states and the Q values of the different pairing object types is stored in the Q learning function.
In the Q learning function, initial Q values may be set in advance. For example, for a Q learning function that does not use an approximation function learner to store the mapping relation between states and the Q values of the different pairing object types, the initial Q values of the pairing object types may be set equal: if there are m pairing object types, each initial Q value may be 1/m. For a Q learning function that uses an approximation function learner (e.g., a BP neural network) to store the mapping relation, the initial connection weights of the learner corresponding to each pairing object type are randomly assigned, so the initial Q values are also random, within the range [-1, 1].
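The two initialization styles just described can be illustrated as follows (a sketch; the number of types m and the labels are arbitrary assumptions):

```python
import random

m = 4                                   # assumed number of pairing object types
types = ["a1", "a2", "a3", "a4"]

# Table-based learner: equal initial Q values of 1/m per pairing object type.
q_equal = {t: 1.0 / m for t in types}

# Approximation-learner style: random initial Q values in [-1, 1].
q_rand = {t: random.uniform(-1.0, 1.0) for t in types}
```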
In the Q learning function, the correction of the Q value can be accomplished by the following formula (2):
$$Q'(s_t, a_t) = Q(s_t, a_t) + \alpha\left[R(s_t, a_t) - Q(s_t, a_t)\right] \qquad (2)$$
where $s_t$ is the state information at time t; $a_t$ is the pairing object type selected at time t; $Q(s_t, a_t)$ is the Q value (i.e., selection probability) of selecting pairing object type $a_t$ in state $s_t$, and $Q'(s_t, a_t)$ is its corrected value; $R(s_t, a_t)$ is the immediate reward and punishment value calculated from the user's reaction information to the result of selecting pairing object type $a_t$ in state $s_t$; and $\alpha$ is a preset learning rate.
In another Q value correction method, the user state information $s_{t+1}$ at the latest moment can be further introduced, and $Q(s_t, a_t)$ corrected by the following formula (3):
$$Q'(s_t, a_t) = Q(s_t, a_t) + \alpha\left[R(s_t, a_t) + \gamma \max_{a \in A} Q(s_{t+1}, a) - Q(s_t, a_t)\right] \qquad (3)$$
where $s_{t+1}$ is the user state information at the latest moment, i.e., time t+1; $Q(s_{t+1}, a)$ is the Q value corresponding to each pairing object type in state $s_{t+1}$; $Q'(s_t, a_t)$ is the corrected value; $\max_{a \in A} Q(s_{t+1}, a)$ is the maximum among the Q values of all pairing object types in state $s_{t+1}$; and $\gamma$ is a preset discount rate.
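Formulas (2) and (3) reduce to one-line updates; the following sketch (with hypothetical helper names) renders both:

```python
def q_update(q, r, alpha):
    # Formula (2): Q'(s_t, a_t) = Q + alpha * (R - Q)
    return q + alpha * (r - q)

def q_update_next(q, r, next_qs, alpha, gamma):
    # Formula (3): Q'(s_t, a_t) = Q + alpha * (R + gamma * max_a Q(s_{t+1}, a) - Q)
    return q + alpha * (r + gamma * max(next_qs) - q)
```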
The mapping relation between states in the Q learning function and the Q values of the different pairing object types can be stored as a mapping table, or by an approximation function learner. The approximation function learner may be, for example, a BP neural network, or another approximation function learner such as a support vector machine or a radial basis function neural network.
The status information of a user is affected by many factors and varies with time, so the user's state information can be considered to live in a high-dimensional continuous state space. A Q learning method in the form of a "state-action" look-up table must store a current Q value estimate for every state-action pair and update them as new feedback is obtained. For user states that are determined jointly by many factors, storing a value for each state-action pair individually is inefficient. Thus, in one embodiment of the present invention, the correspondence may be stored using a function approximation method capable of generalization and prediction (even for user states not previously encountered); for example, one embodiment uses a function approximation technique such as a BP neural network to store the Q value of each pairing object type for a given user state. The feedback error in the Q value correction phase (i.e., the learning phase) of the BP neural network is shown in equation (4):
$$\nabla = \alpha\left[R(s_t, a_t) - Q(s_t, a_t)\right] \quad \text{or} \quad \nabla = \alpha\left[R(s_t, a_t) + \gamma \max_{a \in A} Q(s_{t+1}, a) - Q(s_t, a_t)\right] \qquad (4)$$
$\nabla$ denotes the feedback error of the BP neural network; the network's weights are adjusted to make this error as small as possible, so that the Q value corresponding to the optimal strategy is finally obtained. Thus, in embodiments where the BP neural network is combined with the Q learning function, equations (2) and (3) above can also be written as equation (5) below:
$$Q'(s_t, a_t) = Q(s_t, a_t) + \nabla \qquad (5)$$
the Q learning function updates the Q value of the type of the pairing object corresponding to the state in a short time interval, so that the method is very suitable for real-time online learning. The following embodiment describes the scheme of the present invention by taking the Q learning function in combination with the BP neural network to store the mapping relationship as an example.
The method and the device of the invention are suitable for matching requirements in various network data processing systems, such as:
selecting an anonymous communication object for a user in an anonymous communication system, where the pairing request is an anonymous communication request and the pairing object type is a communication object type in the anonymous communication system; step 103 then specifically includes: selecting a communication object type according to the selection probability, selecting a communication object from that type to pair with the user initiating the pairing request, and establishing communication between the communication object and the user;
or the pairing request is a friend recommendation request in an instant messaging system or a social networking system, and the pairing object type is a user type in that system; step 103 then specifically includes: selecting a user type according to the selection probability, and selecting a user from that type to recommend as a friend to the user initiating the pairing request;
or the pairing request is a commodity recommendation request in an e-commerce platform system, and the pairing object type is a commodity type in that system; step 103 then specifically includes: selecting a commodity type according to the selection probability, and selecting a commodity from that type to recommend to the user initiating the pairing request;
or the pairing request is an article (including blog log) recommendation request in a web blog system or a literature network system, and the pairing object type is an article type; step 103 then specifically includes: selecting an article type according to the selection probability, and selecting an article from that type to recommend to the user initiating the pairing request.
In the following embodiments, the present invention is described in terms of a method and apparatus for selecting an anonymous communication object for a user in an anonymous communication system.
In such an anonymous communication system, the information sender may send out information dissemination units of different kinds and contents; these units do not specify a receiver, but are sent directly to the background service system of the anonymous communication system. After logging in, a receiving party sends the background service system a pairing request for receiving an information dissemination unit; the background service system selects among a plurality of information dissemination units according to the pairing request, sends one to the receiving party, and establishes communication between the sender and the receiver of that unit. In such a system, a user can put blessings, wishes, a personal introduction, or private matters inconvenient to share with acquaintances into the information dissemination unit to be sent; after the receiving party receives a unit paired by the system, the receiving party may choose to reply to it or discard it. Since the sender does not designate a recipient when sending an information dissemination unit, in some anonymous communication systems these units are figuratively called "drift bottles". In the anonymous communication system, the pairing request is an anonymous communication request, and the pairing object type is a communication object type in the anonymous communication system.
Fig. 3 is a schematic diagram of an implementation method for automatically pairing and selecting a communication object in an anonymous communication system by using the automatic pairing selection device according to the present invention. Referring to fig. 3, the implementation method includes:
step 301: at time t, a new pairing request initiated by the user, for example, a request for receiving information from the receiving party, is received.
Step 302: detect and extract the state information of the user at the current moment using a user state detector, i.e., a state vector $S_t$.
The user state information includes static and dynamic information. Static information includes the gender, age, city, hobbies and similar fields set in the user's personal profile; dynamic information includes the current date attribute (working day or holiday), time period, login duration, user behavior characteristics obtained by other statistical and analytical methods, and so on. The specific state information to be extracted for a user is determined by the personal information content available to the specific application and system. Here it is assumed that the user state information is obtained by a user state detector (whose concrete implementation is determined by the application). The user state detector yields a user state vector $S = \{s_1, s_2, \dots, s_n\}$.
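A toy state detector might look like the sketch below; every field name and scaling is an assumption for illustration, since the patent leaves the detector implementation to the application:

```python
from datetime import datetime

def user_state_vector(profile, now=None):
    # Static fields from the profile plus dynamic time features; all field
    # names and scalings here are assumptions for illustration.
    now = now or datetime.now()
    return [
        1.0 if profile.get("gender") == "female" else 0.0,   # static: gender
        min(profile.get("age", 0), 100) / 100.0,             # static: age, scaled
        1.0 if now.weekday() >= 5 else 0.0,                  # dynamic: weekend flag
        now.hour / 24.0,                                     # dynamic: time period
        min(profile.get("login_minutes", 0), 120) / 120.0,   # dynamic: login duration
    ]
```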
Step 303: in the present embodiment, a BP neural network stores the mapping relation between states in the Q learning function and the Q values of the pairing object types, and each pairing object type has its own corresponding BP neural network; for example, with m pairing object types $A = \{a_1, a_2, \dots, a_m\}$, there are m corresponding BP neural networks. The state information $S_t$ is therefore used as the input of the BP neural networks of all pairing object types; each BP neural network obtains, from its stored mapping relation, the dynamic Q value $Q(s_t, a)$ of its pairing object type in state $s_t$, and outputs this Q value as the network's output value.
In this scenario, the pairing object type may differ in different anonymous communication systems. For example, for some instant anonymous chat tools, the user may be classified according to the user attribute information, and the paired object types may be young active users, mature and steady users, and the like. For another example, for some anonymous communication systems, such as a "mailbox drift bottle" system, the paired object types may be different types of drift bottles (e.g., a correspondence bottle, a mood bottle, an orientation bottle, a truthful bottle, etc.).
Step 304: calculating a selection probability of each pairing object type according to the following formula (6);
$$p(s_t, a_i) = \frac{e^{Q(s_t, a_i)/\tau}}{\sum_{a \in A} e^{Q(s_t, a)/\tau}} \qquad (6)$$
where $a_i$ is the i-th pairing object type, $i = 1, 2, \dots, m$; $p(s_t, a_i)$ is the selection probability of pairing object type $a_i$ in state $s_t$; $Q(s_t, a_i)$ is the Q value of pairing object type $a_i$ in state $s_t$; $\tau$ is a simulated annealing factor, and as $\tau \to 0$ the action selection strategy approaches a greedy strategy; and $A$ is the set of all pairing object types.
The way pairing object types are selected directly affects the rate at which the Q learning function converges to an optimal strategy. If, during selection, the pairing object type with the maximum Q value were chosen every time, the Q learning function might not learn effectively, because other actions with lower predicted rewards may actually be better during training. Therefore, in this embodiment the selection probability of each pairing object type is determined according to formula (6), using the approximately greedy, continuously differentiable action selection strategy of the Boltzmann distribution from statistical physics. By introducing the simulated annealing factor $\tau$, the selection probabilities are insensitive to the Q values in the early stage of learning; as learning progresses, the influence of the Q values on the selection probabilities gradually increases, so that the pairing object type with the maximum reward value is selected from the set with increasingly high probability.
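A direct rendering of formula (6) as a sampling routine might look as follows (a sketch; the default temperature value is an assumption):

```python
import math
import random

def boltzmann_select(q_values, tau=0.5):
    # Formula (6): p(s_t, a_i) = exp(Q(s_t,a_i)/tau) / sum_a exp(Q(s_t,a)/tau).
    weights = [math.exp(q / tau) for q in q_values]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Sample one pairing object type index according to the probabilities.
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i, probs
    return len(probs) - 1, probs
```

As $\tau \to 0$ the exponentials amplify the largest Q value and the sampling approaches the greedy choice, matching the annealing behavior described above.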
Step 305: the obtained selection probability of each pairing object type is transmitted to a server of the anonymous communication system together with the pairing request. The server receives the pairing request and selects a pairing object type according to the selection probabilities; generally, the pairing object type with the highest selection probability is chosen, denoted here $a_t$. In this embodiment, for example, a communication object type is selected, a communication object (i.e., an anonymous chat object) is selected from that type to pair with the user initiating the pairing request, and communication is established between the communication object and the user.
Step 306: detect the user's reaction action information to the selected pairing object, and determine an immediate reward and punishment value $R(s_t, a_t)$ from it, where $a_t$ is the pairing object type selected at time t and $R(s_t, a_t)$ represents the reward value for pairing the user with pairing object type $a_t$ selected in state $s_t$.
In the Q learning function, the immediate reward and punishment value is an evaluation of how good a decision's effect was; the learning system guides the learning process through immediate reward and punishment feedback, and this signal influences the next pairing decision. The calculation method of the immediate reward and punishment value therefore determines the performance of the learning system and is key to the Q learning system.
In this scheme, the immediate reward and punishment value is determined by the user's satisfaction with the pairing result. The user's satisfaction with the pairing result can be measured by various indexes, such as whether the user initiates a conversation with the paired object (ST), the number of messages the user sends to the paired object (SN), the number of messages the paired object sends to the user (GN), and the communication duration between the user and the paired object (T). The immediate reward and punishment value can be computed with a simple explicit formula, or with multi-parameter nonlinear methods such as expert models or fuzzy logic.
In this embodiment, taking the above 4 indexes as an example and adopting a simple explicit calculation, the immediate reward and punishment value $R(s_t, a_t)$ can be computed by the following feedback formula (7):
$$R(s_t, a_t) = \alpha \cdot ST + \beta \cdot SN + \chi \cdot GN + \kappa \cdot T \qquad (7)$$
where ST, SN, GN and T must all first be made dimensionless; $\alpha$, $\beta$, $\chi$ and $\kappa$ are the coefficients of the respective indexes; and the finally computed $R(s_t, a_t)$ is normalized into $(-1, 1)$.
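As one way formula (7) could be evaluated in code, the sketch below makes the four indexes dimensionless with arbitrary caps and rescales the result; the coefficient values and caps are assumptions of this sketch:

```python
def immediate_reward(started_talk, sent_n, got_n, minutes,
                     a=0.3, b=0.2, c=0.2, k=0.3):
    # Formula (7): R = a*ST + b*SN + c*GN + k*T, after making each index
    # dimensionless (caps chosen arbitrarily for this sketch).
    st = 1.0 if started_talk else 0.0
    sn = min(sent_n, 20) / 20.0
    gn = min(got_n, 20) / 20.0
    t = min(minutes, 60) / 60.0
    r = a * st + b * sn + c * gn + k * t   # r falls in [0, 1]
    return 2.0 * r - 1.0                   # rescale into [-1, 1]
```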
Step 307: according to the above formula (2), i.e. $Q'(s_t, a_t) = Q(s_t, a_t) + \alpha[R(s_t, a_t) - Q(s_t, a_t)]$, correct the Q value $Q(s_t, a_t)$ corresponding to pairing object type $a_t$ in state $s_t$ in the Q learning function. In this example, the Q value stored in the BP neural network corresponding to $a_t$ may be modified by adjusting the feedback error $\nabla$ of that network, i.e., $\nabla = \alpha[R(s_t, a_t) - Q(s_t, a_t)]$; where $s_t$ is the state information at time t; $a_t$ is the pairing object type selected at time t; $Q(s_t, a_t)$ is the Q value (i.e., selection probability) of selecting pairing object type $a_t$ in state $s_t$; $R(s_t, a_t)$ is the immediate reward and punishment value calculated from the user's reaction information to the selection result; and $\alpha$ is a preset learning rate.
Of course, in another embodiment, the method may further include: when the user ends the session, assuming the time is t+1, the user state detector detects and extracts the user state information $s_{t+1}$ at time t+1 and inputs $s_{t+1}$ into the BP neural network corresponding to each pairing object type; according to the mapping relation between states and the Q values of pairing object types stored in the BP neural networks, the Q value $Q(s_{t+1}, a)$ corresponding to each pairing object type in state $s_{t+1}$ is obtained. Then $Q(s_t, a_t)$ is corrected through the above formula (3), i.e.: $Q'(s_t, a_t) = Q(s_t, a_t) + \alpha[R(s_t, a_t) + \gamma \max_{a \in A} Q(s_{t+1}, a) - Q(s_t, a_t)]$. In this example, the Q value stored in the BP neural network corresponding to $a_t$ may be modified by adjusting the feedback error $\nabla$ of that network, i.e., $\nabla = \alpha[R(s_t, a_t) + \gamma \max_{a \in A} Q(s_{t+1}, a) - Q(s_t, a_t)]$, where $s_{t+1}$ is the user state information at the latest moment, i.e., time t+1; $Q(s_{t+1}, a)$ is the Q value corresponding to each pairing object type in state $s_{t+1}$; $\max_{a \in A} Q(s_{t+1}, a)$ is the maximum among the Q values of all pairing object types in state $s_{t+1}$; and $\gamma$ is a preset discount rate.
And then, if the user initiates a new pairing request, returning to the step 301 to repeat the selection and corresponding Q value correction processes, and enabling the Q values in different states and corresponding pairing object types to approach the real requirement of the user through a large number of selection and correction processes. Therefore, the pairing object can be selected according to the user state, the correlation degree of the selected pairing object and the pairing requester is improved, and the final pairing success rate is further improved.
In the initial stage of starting the anonymous communication system, the system can establish an independent automatic pairing selection device for each registered user and execute the automatic pairing selection method. However, the initial values of the selection probabilities (Q values in the Q learning function) of the different pairing object types in different states are set empirically (e.g., when a mapping table rather than a BP neural network stores the mapping relation) or randomly (e.g., when a BP neural network stores it), so each user's initial pairing strategy is uncertain; the user must use the pairing selection device many times, training it with reactions to the pairing results, before it fits the relation between the user's state and the pairing object types well. The fitting speed depends on the user's usage frequency, the number of user state parameters, and the number of selectable pairing object types (the number of neural networks).
If the automatic pairing selection device using the random initial setting or the empirical setting is directly put into use, it may not be possible for each user to obtain a pairing strategy that meets the preference of the user for a long period of time from the beginning of use. This may cause some of the first-time users to quickly lose interest in the corresponding system (e.g., in an anonymous communication system).
Thus, in order for the system to achieve a reliable cold start, in one embodiment of the invention, there may be two automatic pairing selection devices. One is a global auto-pairing selection device owned by the server side (the number is 1), and the other is a private auto-pairing selection device owned by the user (the number is equal to the number of users), as shown in fig. 4. In the initial stage of system startup, the global automatic pairing selection device is used for making a selection decision of a pairing object for pairing requests of all users in a specified range, and the automatic pairing selection device is trained according to feedback of the users. And when the global automatic pairing selection device is fully trained, taking the global automatic pairing selection device as an initial private automatic pairing selection device of each user.
FIG. 5 is a detailed flowchart of initializing an automatic pairing selection device according to the present invention. Referring to fig. 4 and fig. 5, the process includes:
step 501: in the initial stage of system startup, a global automatic pairing selection device is established at the server side.
Step 502: and making a selection decision of a pairing object for pairing requests of all users in a specified range by using the global automatic pairing selection device, and training the automatic pairing selection device according to feedback of the users. Namely: the steps 301 to 307 are performed by using the same reinforcement learning function (the algorithm includes the state and the pairing object type selection probability) for the pairing request of all users within the specified range.
Step 503: determine whether the number of feedback corrections the global automatic pairing selection device has applied to the selection probabilities of all pairing object types exceeds a preset threshold (i.e., whether the training count of the neural network corresponding to every pairing object type exceeds the preset threshold); if yes, execute step 504; otherwise, return to step 502.
Step 504: and copying the global automatic pairing selection device to each user in the specified range respectively to be used as a private automatic pairing selection device of the user.
Step 505: after different users send pairing requests, a pairing object selection decision is made using each user's own private automatic pairing selection device.
The above steps 504 and 505 essentially correspond to: after the reinforcement learning function is trained through the global automatic pairing selection device, copying the reinforcement learning function to each user respectively; after different users send pairing requests, the step 301 to the step 307 are executed by using the reinforcement learning function corresponding to the user.
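Read as pseudocode, steps 501 to 505 amount to the sketch below; the class, `serve_and_learn` and `correction_counts` are hypothetical stand-ins for "execute steps 301 to 307" and "per-type feedback correction counters":

```python
import copy
import random

class PairingDevice:
    # Minimal stand-in for an automatic pairing selection device (assumption).
    def __init__(self, types):
        self.correction_counts = {t: 0 for t in types}

    def serve_and_learn(self):
        # Stand-in for steps 301-307: serve one request, apply one correction.
        t = random.choice(list(self.correction_counts))
        self.correction_counts[t] += 1

def cold_start(users, global_device, threshold):
    # Steps 501-503: the shared global device serves every request until each
    # pairing object type has been feedback-corrected more than `threshold` times.
    while min(global_device.correction_counts.values()) <= threshold:
        global_device.serve_and_learn()
    # Step 504: clone the trained device as each user's private device.
    return {u: copy.deepcopy(global_device) for u in users}
# Step 505: subsequent requests from user u use the returned dict's device for u.
```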
In the proposed solution, each user's selection probabilities for the different pairing object types in different states are determined by the user's own independent automatic pairing selection device, which fits the user's dynamically changing needs through continuous online learning, without manual intervention by the system operator. However, in some special cases a system operator may wish to artificially increase or decrease the probability of a certain pairing object type being chosen; or the system may add a new pairing object type whose selection probability should be raised in order to probe users' preference for it. Therefore, in one embodiment of the invention, an external intervention method for the pairing strategy based on threshold moving is designed.
FIG. 6 is a diagram illustrating an embodiment of external intervention in the selection policy through threshold shifting according to the present invention. Referring to fig. 6, this embodiment changes the selection probabilities of the different pairing object types by moving the output thresholds of the neural networks corresponding to the different pairing object types, without changing the internal structure of the automatic pairing selection device. Namely: after the Q value corresponding to a pairing object type is obtained, the method further includes: multiplying the Q value by an external intervention coefficient to obtain a threshold-moved Q value, and using the threshold-moved Q value as the Q value of the corresponding pairing object type when calculating its selection probability.
The concrete way is shown as formula (8):
$$O_i^* = O_i \cdot C_i \qquad (8)$$
where $O_i^*$ represents the Q value output by the output layer of the neural network corresponding to pairing object type i after threshold shifting, $O_i$ is the Q value before threshold shifting, and $C_i$ is the external intervention coefficient for pairing object type i. When calculating the selection probability of the pairing object type, the Q value finally output after threshold shifting replaces the previous Q value, so the selection probability of a pairing object type can be forcibly changed through the external intervention coefficient.
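Formula (8) as code is a one-liner; the dictionary inputs are assumptions of this sketch:

```python
def threshold_moved_q(raw_q, coeffs):
    # Formula (8): O*_i = O_i * C_i -- scale each type's network output Q by its
    # external intervention coefficient before computing the selection probability.
    return {t: q * coeffs.get(t, 1.0) for t, q in raw_q.items()}

# Example: boost a newly added type by 50% without retraining any network:
# threshold_moved_q({"new_type": 0.2, "mood": 0.4}, {"new_type": 1.5})
```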
By utilizing the method and the device, different pairing strategies can be adopted according to the state information of different users, the reasonability of the pairing strategy generated by the decision maker is evaluated according to the reaction of the users to the pairing result, and the mapping relation between the user state and the pairing strategy is dynamically adjusted according to the evaluation result, so that the automatic pairing selection device can better meet the requirements of the users when the next pairing decision is made. Therefore, the individual pairing requirements of the receiving party can be met, the relevance between the selected pairing object and the pairing requesting party is improved, and the final pairing success rate is further improved.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A method for automatic pair selection in a data processing system, comprising:
A. accepting a pairing request for a user;
B. detecting the state information $s_t$ of the user at the current moment;
inputting the state information of the user into a Q learning function, wherein the selection probability of a pairing object type in a state in the Q learning function is a corresponding Q value, and obtaining, according to the mapping relation between states in the Q learning function and the Q values of the pairing object types, the Q value $Q(s_t, a)$ corresponding to each pairing object type in state $s_t$;
calculating the selection probability of each pairing object type according to the formula $p(s_t, a_i) = \dfrac{e^{Q(s_t, a_i)/\tau}}{\sum_{a \in A} e^{Q(s_t, a)/\tau}}$;
wherein $a_i$ is the i-th pairing object type, $p(s_t, a_i)$ is the selection probability corresponding to pairing object type $a_i$ in state $s_t$, $Q(s_t, a_i)$ is the Q value corresponding to pairing object type $a_i$ in state $s_t$, $e$ is the base of the natural logarithm, $\tau$ is a simulated annealing factor, and $A$ is the set of all pairing object types;
specifically, the selection probability of the pairing object type is calculated by adopting an action selection strategy of Boltzmann distribution;
C. selecting a pairing object according to the selection probability;
D. detecting reaction action information of the user on the selected pairing object;
determining an immediate reward and punishment value $R(s_t, a_t)$ according to the reaction action information of the user to the selected pairing object, wherein $a_t$ is the pairing object type selected at time t;
correcting, according to the formula $Q'(s_t, a_t) = Q(s_t, a_t) + \alpha[R(s_t, a_t) - Q(s_t, a_t)]$, the Q value corresponding to pairing object type $a_t$ in state $s_t$ in the Q learning function, wherein $Q'(s_t, a_t)$ is the corrected value and $\alpha$ is a preset learning rate;
the method further comprises the following steps: firstly, aiming at pairing requests of all users in a specified range, the steps A to D are executed by using the same Q learning function;
after the feedback correction times of the selection probabilities of all the paired object types exceed a preset threshold value, copying the Q learning function to each user in the specified range respectively; and B, after different users send pairing requests, executing the steps A to D by using the Q learning function corresponding to the user.
2. The method according to claim 1, characterized in that the mapping relation between the state in the Q learning function and the Q value of the pairing object type is stored by an approximating function learner.
3. The method of claim 2, wherein the approximating function learner is a back-propagation neural network.
4. The method of claim 1, wherein the pairing request is an anonymous communication request in an anonymous communication system, and wherein the pairing object type is a communication object type in the anonymous communication system; the step C specifically comprises the following steps: selecting a communication object type according to the selection probability, selecting a communication object from the communication object type to be paired with the user initiating the pairing request, and establishing communication between the communication object and the user;
or the pairing request is a friend recommendation request in an instant messaging system or a social networking system, and the pairing object type is a user type in the instant messaging system or the social networking system; the step C specifically comprises the following steps: selecting a user type according to the selection probability, and selecting a user from the user type as a friend to recommend to the user initiating the pairing request;
or the pairing request is a commodity recommendation request in the electronic commerce platform system, and the pairing object type is a commodity type in the electronic commerce platform system; the step C specifically comprises the following steps: selecting a commodity type according to the selection probability, and selecting a commodity from the commodity type as a recommended commodity to be recommended to the user initiating the pairing request;
or the pairing request is an article recommendation request in a web blog system or a literature network system, and the pairing object type is an article type; the step C specifically comprises the following steps: selecting an article type according to the selection probability, and selecting an article from the article type as a recommended article to be recommended to the user initiating the pairing request.
5. The method according to claim 1, further comprising, after obtaining the Q value corresponding to the pairing object type: and multiplying the Q value by an external intervention coefficient to obtain a threshold moving Q value, and calculating the selection probability of the pairing object type by taking the threshold moving Q value as the Q value of the corresponding pairing object type.
6. The method according to claim 1, wherein the step D further comprises:
detecting reaction action information of the user on the selected pairing object;
determining an immediate reward and punishment value $R(s_t, a_t)$ according to the reaction action information of the user to the selected pairing object, wherein $a_t$ is the pairing object type selected at time t;
detecting the user state information $s_{t+1}$ at the latest moment; inputting $s_{t+1}$ into the Q learning function, and obtaining, according to the mapping relation between states in the Q learning function and the Q values of the pairing object types, the Q value $Q(s_{t+1}, a)$ corresponding to each pairing object type in state $s_{t+1}$;
correcting, according to the formula $Q'(s_t, a_t) = Q(s_t, a_t) + \alpha[R(s_t, a_t) + \gamma \max_{a \in A} Q(s_{t+1}, a) - Q(s_t, a_t)]$, the Q value corresponding to pairing object type $a_t$ in state $s_t$ in the Q learning function, wherein $Q'(s_t, a_t)$ is the corrected value, $\max_{a \in A} Q(s_{t+1}, a)$ is the maximum among the Q values of all pairing object types in state $s_{t+1}$, $\gamma$ is a preset discount rate, and $\alpha$ is a preset learning rate.
7. An automatic pairing selection apparatus in a data processing system, comprising:
the request receiving module is used for receiving a pairing request aiming at a user and triggering the state information detection module after receiving the pairing request;
a state information detection module, configured to detect the state information $s_t$ of the user at the current moment and input it to the reinforcement learning module;
the reinforcement learning module is used for storing the mapping relation between states in a reinforcement learning function and the pairing object type selection probabilities, wherein the reinforcement learning function is a Q learning function, and the selection probability of a pairing object type in a state in the Q learning function is a corresponding Q value; the reinforcement learning module is further specifically configured to: input the state information of the user into the Q learning function, and obtain, according to the mapping relation between states in the Q learning function and the Q values of the pairing object types, the Q value $Q(s_t, a)$ corresponding to each pairing object type in state $s_t$;
calculate the selection probability of each pairing object type according to the formula $p(s_t, a_i) = \dfrac{e^{Q(s_t, a_i)/\tau}}{\sum_{a \in A} e^{Q(s_t, a)/\tau}}$;
wherein $a_i$ is the i-th pairing object type, $p(s_t, a_i)$ is the selection probability corresponding to pairing object type $a_i$ in state $s_t$, $Q(s_t, a_i)$ is the Q value corresponding to pairing object type $a_i$ in state $s_t$, $e$ is the base of the natural logarithm, $\tau$ is a simulated annealing factor, and $A$ is the set of all pairing object types;
specifically, the selection probability of the pairing object type is calculated by adopting an action selection strategy of Boltzmann distribution;
the object selection module is used for selecting a pairing object according to the selection probability calculated by the reinforcement learning module;
the response information detection module is used for detecting the response action information of the user to the selected pairing object;
a correction module, configured to determine an immediate reward and punishment value $R(s_t, a_t)$ according to the reaction action information of the user to the selected pairing object, wherein $a_t$ is the pairing object type selected at time t; and to correct, according to the formula $Q'(s_t, a_t) = Q(s_t, a_t) + \alpha[R(s_t, a_t) - Q(s_t, a_t)]$, the Q value corresponding to pairing object type $a_t$ in state $s_t$ in the Q learning function, wherein $Q'(s_t, a_t)$ is the corrected value and $\alpha$ is a preset learning rate;
the automatic pairing selection device is used as a global automatic pairing selection device to make a selection decision of a pairing object for pairing requests of all users in a specified range at the initial starting stage of the system, and the automatic pairing selection device is trained according to feedback of the users; and when the feedback correction times of the global automatic pairing selection device on the selection probabilities of all the pairing object types exceed a preset threshold value, the automatic pairing selection device is used as an initial private automatic pairing selection device of each user.
8. The apparatus of claim 7, wherein the reinforcement learning module comprises an approximation function learner, and the mapping relation between the states in the reinforcement learning function and the pairing object type selection probabilities is stored by the approximation function learner.
9. The apparatus of claim 8, wherein the approximating function learner is a back-propagation neural network.
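Claims 8 and 9 allow the Q values to be stored by an approximation function learner rather than a table. The following is a minimal sketch of a back-propagation network playing that role, mapping a (state, pairing object type) feature vector to a Q estimate; the layer sizes, activation function, and learning rate are illustrative assumptions, not taken from the patent.

```python
import numpy as np

# Sketch of a back-propagation network as the approximation function learner:
# input is a feature vector for (state, pairing object type), output is a
# scalar Q estimate. Architecture and hyperparameters are assumptions.
class BPQApproximator:
    def __init__(self, n_inputs, n_hidden=16, lr=0.01, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0, 0.1, (n_inputs, n_hidden))
        self.w2 = rng.normal(0, 0.1, (n_hidden, 1))
        self.lr = lr

    def predict(self, x):
        self.h = np.tanh(x @ self.w1)      # hidden activations (cached)
        return (self.h @ self.w2).item()   # scalar Q estimate

    def train(self, x, target):
        q = self.predict(x)
        err = q - target                   # gradient of 0.5 * (q - target)^2
        # Back-propagate the error through both layers.
        grad_w2 = self.h[:, None] * err
        grad_h = self.w2[:, 0] * err
        grad_w1 = np.outer(x, grad_h * (1 - self.h ** 2))
        self.w2 -= self.lr * grad_w2
        self.w1 -= self.lr * grad_w1
        return q
```

Feeding the target R(s_t, a_t) + γ·max_{a∈A} Q(s_{t+1}, a) into train reproduces the correction of claim 6 with function approximation in place of a table.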
10. The apparatus according to claim 7, wherein the correction module is further used for notifying the state information detection module to detect the user state information at the latest moment, and for performing feedback correction on the corresponding state and the corresponding pairing object type selection probability in the reinforcement learning function according to the latest user state information and the response action information detected by the response information detection module.
CN201110400345.2A 2011-12-06 2011-12-06 Automatic matching system of selection in data handling system and device Active CN103150595B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110400345.2A CN103150595B (en) 2011-12-06 2011-12-06 Automatic matching system of selection in data handling system and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110400345.2A CN103150595B (en) 2011-12-06 2011-12-06 Automatic matching system of selection in data handling system and device

Publications (2)

Publication Number Publication Date
CN103150595A CN103150595A (en) 2013-06-12
CN103150595B true CN103150595B (en) 2016-03-09

Family

ID=48548656

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110400345.2A Active CN103150595B (en) 2011-12-06 2011-12-06 Automatic matching system of selection in data handling system and device

Country Status (1)

Country Link
CN (1) CN103150595B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104808977B (en) * 2014-01-27 2018-03-27 中国移动通信集团公司 A kind of method and device for the state for determining user
CN104601438B (en) * 2014-04-28 2017-12-01 腾讯科技(深圳)有限公司 A kind of friend recommendation method and apparatus
CN106161505B (en) * 2015-03-30 2020-01-10 重庆邮电大学 Method and device for pairing users to execute service
EP3384435B1 (en) * 2015-12-01 2023-07-19 Deepmind Technologies Limited Selecting action slates using reinforcement learning
JP6646763B2 (en) * 2016-05-09 2020-02-14 1キュービー インフォメーション テクノロジーズ インコーポレイテッド1Qb Information Technologies Inc. Method and system for improving strategies for stochastic control problems
EP3487128B1 (en) 2016-07-14 2021-06-16 Tencent Technology (Shenzhen) Company Limited Method of generating random interactive data, network server, and smart conversation system
CN107623620B (en) * 2016-07-14 2021-10-15 腾讯科技(深圳)有限公司 Processing method of random interaction data, network server and intelligent dialogue system
CN106502113B (en) 2016-11-09 2020-05-29 中磊电子(苏州)有限公司 Automatic pairing method and server
CN108230058B (en) * 2016-12-09 2022-05-13 阿里巴巴集团控股有限公司 Product recommendation method and system
CN111566708A (en) * 2018-02-11 2020-08-21 北京嘀嘀无限科技发展有限公司 Team system and method for service platform
US20190286988A1 (en) * 2018-03-15 2019-09-19 Ants Technology (Hk) Limited Feature-based selective control of a neural network
EP3725471A1 (en) * 2019-04-16 2020-10-21 Robert Bosch GmbH Configuring a system which interacts with an environment
CN110191494B (en) * 2019-04-17 2020-09-01 清华大学 Network selection method, device and equipment
CN110102055A (en) * 2019-05-14 2019-08-09 网易(杭州)网络有限公司 A kind of decision-making technique and device of virtual objects

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101071422B (en) * 2006-06-15 2010-10-13 腾讯科技(深圳)有限公司 Music file search processing system and method
CN100521611C (en) * 2006-12-13 2009-07-29 腾讯科技(深圳)有限公司 A method and system for recommending friends in SNS community
CN101183959B (en) * 2006-12-26 2011-05-11 腾讯科技(深圳)有限公司 Digital content recommending method and apparatus
CN101227433B (en) * 2008-02-04 2014-07-30 华为软件技术有限公司 Method and terminal for implementing information sharing in network television business system
CN101615961B (en) * 2008-06-24 2011-05-18 华为技术有限公司 Method and device for recommending medium content
CN101694652B (en) * 2009-09-30 2012-11-28 西安交通大学 Network resource personalized recommended method based on ultrafast neural network

Also Published As

Publication number Publication date
CN103150595A (en) 2013-06-12

Similar Documents

Publication Publication Date Title
CN103150595B (en) Automatic matching system of selection in data handling system and device
CN109767301B (en) Recommendation method and system, computer device and computer readable storage medium
EP3586277B1 (en) Training policy neural networks using path consistency learning
Montaner et al. Opinion-based filtering through trust
US10671680B2 (en) Content generation and targeting using machine learning
US11562209B1 (en) Recommending content using neural networks
US20160098644A1 (en) Inferred identity
US10785181B2 (en) Sharing content to multiple public and private targets in a social network
CN110111152B (en) Content recommendation method, device and server
CN109635206B (en) Personalized recommendation method and system integrating implicit feedback and user social status
CN107463701B (en) Method and device for pushing information stream based on artificial intelligence
CN104123284A (en) Recommendation method and server
CN112507245B (en) Social network friend recommendation method based on graph neural network
US20100217734A1 (en) Method and system for calculating value of website visitor
KR101790788B1 (en) Collaborative networking with optimized inter-domain information quality assessment
CN105824897A (en) Mixed recommendation system and method based on Kalman filtering
WO2020220757A1 (en) Method and device for pushing object to user based on reinforcement learning model
CN115270001A (en) Privacy protection recommendation method and system based on cloud collaborative learning
CN111445291A (en) Method for providing dynamic decision for social network influence maximization problem
TWI501096B (en) Ranking user generated web content
CN104519141B (en) Quantitative model and method based on relationship evaluation transmission in social relation network
CN114912984B (en) Self-attention-based time scoring perception context recommendation method and system
KR20200142871A (en) Method and apparatus for recommending items using explicit and implicit feedback
EP3323099A1 (en) Method for processing a recommendation request and recommendation engine
US11068848B2 (en) Estimating effects of courses

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant