CN115526338A - Reinforcement learning model construction method for information retrieval - Google Patents
- Publication number
- CN115526338A CN115526338A CN202211287916.0A CN202211287916A CN115526338A CN 115526338 A CN115526338 A CN 115526338A CN 202211287916 A CN202211287916 A CN 202211287916A CN 115526338 A CN115526338 A CN 115526338A
- Authority
- CN
- China
- Prior art keywords
- candidate
- word
- candidate document
- action
- documents
- Prior art date
- Legal status: Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3346—Query execution using probabilistic model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3347—Query execution using vector based model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/338—Presentation of query results
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Databases & Information Systems (AREA)
- Software Systems (AREA)
- Probability & Statistics with Applications (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- Medical Informatics (AREA)
- Artificial Intelligence (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The application relates to the field of information retrieval, and in particular to a reinforcement learning model construction method for information retrieval, comprising the following steps: S100, acquiring the feature code q of query information Q and the feature code of each candidate document in a candidate document set; S200, constructing an MDP model, wherein the initial state of the MDP model is s_0 = [0, q], and the agent of the MDP model selects an action a_0 in the initial state with probability distribution π(a_0|s_0; w); and S300, training the MDP model according to the long-term reward. The invention improves the accuracy of document ranking in information retrieval.
Description
Technical Field
The invention relates to the field of information retrieval, in particular to a reinforcement learning model construction method for information retrieval.
Background
With the rapid development of the internet, learning to rank (L2R), one of the common tasks of machine learning, is receiving increasing attention. In information retrieval, given a query target, the results that best meet the need must be computed and returned. The prior art discloses the use of a Markov decision process (MDP) to generate document rankings, which alleviates the ranking-complexity problem to some extent. However, prior-art reinforcement learning models based on an MDP are mostly built on a first-order Markov decision process, so the position of each document depends only on the immediately preceding document rather than on all previously ranked documents, which affects the accuracy of document ranking in information retrieval. How to improve the accuracy of document ranking during information retrieval is an urgent problem to be solved.
Disclosure of Invention
The invention aims to provide a reinforcement learning model construction method for information retrieval, so as to improve the accuracy of document ranking during information retrieval.
According to the invention, a reinforcement learning model construction method for information retrieval is provided, which comprises the following steps:
S100, acquiring the feature code q of the query information Q and the feature codes of the candidate documents in the candidate document set.
S200, constructing an MDP model, wherein: the initial state of the MDP model is s_0 = [0, q]; the agent of the MDP model selects an action a_0 in the initial state with probability distribution π(a_0|s_0; w); the action a_0 selects a candidate document d_{m(a_0)} from the candidate document set; w is a preset initialized trainable parameter; x_{m(a)} is the feature code of the candidate document d_{m(a)} selected from the candidate document set by action a; A(s_0) is the set of actions selectable in the initial state s_0; and ( )^H denotes the conjugate transpose. The initial reward function of the MDP model returns the preset relevance label of the selected candidate document d_{m(a_0)}. At step t, the agent of the MDP model selects an action a_t in the corresponding state s_t with probability distribution π(a_t|s_t; w), where x_{m(a_t)} is the feature code of the candidate document d_{m(a_t)} selected from the candidate document set by action a_t, and A(s_t) is the set of actions selectable in the state s_t corresponding to step t; ρ_t is a quantum probability distribution operator built from the agent's previous n-1 selected candidate documents, n being a preset value. The decision reward function of the MDP model returns the preset relevance label of the candidate document selected from the candidate document set by the action.
S300, performing model training on the MDP model according to the long-term reward; wherein the long-term reward is L = E[Σ_{k=1}^{M} λ^{k-1} · r_k], λ is a predetermined discount factor, r_k is the reward corresponding to the feature code of the k-th candidate document returned by the MDP model, k ranges from 1 to M, M is the number of candidate documents included in the candidate document set, and E denotes expectation.
Compared with the prior art, the reinforcement learning model construction method for information retrieval provided by the invention has obvious beneficial effects, achieves considerable technical progress and practicability, has wide industrial utilization value, and provides at least the following benefits:
The method considers the ranking dependency among multiple candidate documents and extends the first-order Markov decision process to an n-order Markov decision process.
Furthermore, the method constructs the features of the query and the candidate documents through a quantum language model, calculates the probability of the agent's possible actions through quantum probability theory, introduces longer candidate-document sequence information into the ranking process, and improves the accuracy of document ranking without increasing ranking complexity.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a flowchart of a reinforcement learning model construction method for information retrieval according to an embodiment of the present invention;
fig. 2 is a schematic diagram of an interaction process between an agent and an environment according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive step based on the embodiments of the present invention, are within the scope of protection of the present invention.
According to the invention, a reinforcement learning model construction method for information retrieval is provided, which comprises the following steps:
S100, acquiring the feature code q of the query information Q and the feature codes of the candidate documents in the candidate document set.
According to the invention, the number of independent latent semantics included in the candidate document set is denoted N, and each word is modeled as a quantum concept defined in an N-dimensional Hilbert space H^N, where the latent semantics form a set of basis vectors {ψ_1, ψ_2, …, ψ_N} of the space. Each word can then be represented as a superposition state of the basis vectors of H^N, i.e., a linear combination of the basis vectors with complex-valued weights. Accordingly, the feature codes of the candidate documents in the candidate document set are obtained as follows:
S110, performing word segmentation on the e-th candidate document doc_e in the candidate document set to obtain m words.
S120, obtaining the complex word vector of the l-th word, t_l = Σ_{j=1}^{N} w_j e^{iθ_j} ψ_j, where each coefficient w_j e^{iθ_j} is a complex number, {w_j}_{j=1}^{N} are non-negative real numbers satisfying Σ_{j=1}^{N} w_j^2 = 1, θ_j is the complex phase corresponding to the real number w_j and satisfies θ_j ∈ [-π, π], ψ_j is the j-th basis vector of the Hilbert space H^N, N is the number of independent latent semantics included in the candidate document set, and i is the imaginary unit.
According to the invention, each coefficient can also be rewritten according to Euler's formula as w_j e^{iθ_j} = w_j (cos θ_j + i sin θ_j); both w_j and θ_j are trainable parameters.
S130, obtaining the feature code of candidate document doc_e: x_m = Σ_{l=1}^{m} (u_l · t_l · (t_l)^H), where u_l is the importance of the l-th word in doc_e and Σ_{l=1}^{m} u_l = 1.
According to the invention, the candidate document doc_e can be represented as a sequence of m complex word vectors [t_1, t_2, …, t_m]. If the word features of a candidate document are used to form the ground states of a state space, the feature code x_m of the candidate document can be represented by a quantum language model.
Optionally, u_l is obtained according to the term frequency (tf) of the l-th word in doc_e, or according to the tf-idf of the l-th word.
According to the method of S110-S130, the feature codes of the candidate documents in the candidate document set can be obtained.
According to the invention, the method for obtaining q comprises the following steps:
S111, performing word segmentation on the query information Q to obtain c words.
S121, obtaining the complex word vector of the b-th word, t_b = Σ_{j=1}^{N} w_j e^{iθ_j} ψ_j, where each coefficient w_j e^{iθ_j} is a complex number, {w_j}_{j=1}^{N} are non-negative real numbers satisfying Σ_{j=1}^{N} w_j^2 = 1, θ_j is the complex phase corresponding to the real number w_j and satisfies θ_j ∈ [-π, π], ψ_j is the j-th basis vector of the Hilbert space H^N, N is the number of independent latent semantics included in the candidate document set, and i is the imaginary unit.
S131, obtaining the feature code of Q: q = Σ_{b=1}^{c} (u_b · t_b · (t_b)^H), where u_b is the importance of the b-th word in Q and Σ_{b=1}^{c} u_b = 1.
According to the invention, the query information can be represented as a sequence of c complex word vectors [t_1, t_2, …, t_c]. If the word features of the query are used to form the ground states of a state space, the feature code q of the query can be represented by a quantum language model.
Optionally, u_b = 1/c.
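Both the document code x_m and the query code q above are weighted sums of rank-one projectors u · t · (t)^H, i.e., density-matrix-style encodings. The following is a minimal numpy sketch under stated assumptions: the basis {ψ_j} is taken as the standard basis of C^N, and the amplitudes, phases, and importance weights are randomly generated purely for illustration.

```python
import numpy as np

def word_vector(w, theta):
    # t = sum_j w_j * exp(i*theta_j) * psi_j, with psi_j the standard basis of C^N
    return w * np.exp(1j * theta)

def encode(word_vectors, u):
    # density-matrix-style code: sum_l u_l * t_l * t_l^H (an N x N Hermitian matrix)
    return sum(ul * np.outer(t, np.conj(t)) for ul, t in zip(u, word_vectors))

rng = np.random.default_rng(0)

def random_word(n=3):
    w = np.abs(rng.normal(size=n))
    w /= np.linalg.norm(w)                      # enforce sum_j w_j^2 = 1
    theta = rng.uniform(-np.pi, np.pi, size=n)  # phases in [-pi, pi]
    return word_vector(w, theta)

# toy "document" of m = 2 words over N = 3 latent semantics, uniform importance u_l
x = encode([random_word(), random_word()], u=[0.5, 0.5])
assert np.allclose(x, x.conj().T)               # Hermitian, as (t)(t)^H requires
assert np.isclose(np.trace(x).real, 1.0)        # unit trace, since sum_l u_l = 1
```

The same `encode` function produces the query code q by using the query's c words with u_b = 1/c.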
S200, constructing an MDP model, wherein: the initial state of the MDP model is s_0 = [0, q]; the agent of the MDP model selects an action a_0 in the initial state with probability distribution π(a_0|s_0; w); the action a_0 selects a candidate document d_{m(a_0)} from the candidate document set; w is a preset initialized trainable parameter; x_{m(a)} is the feature code of the candidate document d_{m(a)} selected from the candidate document set by action a; A(s_0) is the set of actions selectable in the initial state s_0; and ( )^H denotes the conjugate transpose. The initial reward function of the MDP model returns the preset relevance label of the selected candidate document d_{m(a_0)}. At step t, the agent of the MDP model selects an action a_t in the corresponding state s_t with probability distribution π(a_t|s_t; w), where x_{m(a_t)} is the feature code of the candidate document d_{m(a_t)} selected from the candidate document set by action a_t, and A(s_t) is the set of actions selectable in the state s_t corresponding to step t; ρ_t is a quantum probability distribution operator built from the agent's previous n-1 selected candidate documents, n being a preset value. The decision reward function of the MDP model returns the preset relevance label of the candidate document selected from the candidate document set by the action.
In accordance with the present invention, the process of ranking candidate documents can be formulated as an MDP, in which constructing the candidate-document ranking is viewed as a sequential decision: each time step corresponds to a ranking position, and each action selects the candidate document for that position. The ranking method for candidate documents can be described by the tuple <S, A, T, R, π>.
S is the state set, representing the environment. During the ranking process, the agent should know the current ranking position and the remaining selectable candidate documents. At step t, the state can be defined as s_t = [t, X_t], where X_t is the set of feature codes of the remaining candidate documents.
A is the set of actions selectable by the agent. The selectable action set A(s_t) depends on the state s_t. At step t, the action a_t ∈ A(s_t) places the feature code x_{m(a_t)} of the selected candidate document at position t+1, where m(a_t) denotes the index of the candidate document selected by action a_t.
T (S, A): t is S × A → S, representing the pass through action a t Will state s t Transition to a new state s t+1 。
R (S, A) is an instant prize. This reward may be considered the quality of the selected candidate document during the ranking process.
π(a|s): A × S → [0, 1] represents the behavior of the agent, i.e., the probability distribution over selectable actions a_t; in the present invention this distribution is calculated by quantum probability.
According to the present invention, the initial-state setting of the MDP model includes: given the feature code q corresponding to query information Q, the feature-code set X_0 of the candidate document set (of length M), and the corresponding relevance-label set Y (of length M), the initial state is s_0 = [0, q]. The candidate document d_{m(a_t)} selected by action a_t has feature code x_{m(a_t)} and relevance label y_{m(a_t)}. In the invention, the relevance label of each candidate document is preset; optionally, it is the relevance between the query information and the corresponding candidate document as annotated by a user, expressed as a numeric label where 0 denotes irrelevant, 1 denotes weak relevance, and 2 denotes strong relevance.
According to the invention, in the initial state s_0 = [0, q] the agent selects an action a_0, with selection probability distribution π(a_0|s_0; w), where w is a preset initialized trainable parameter. Optionally, w is input by a user or read from a configuration file; those skilled in the art will appreciate that any prior-art means of obtaining w falls within the scope of the present invention.
At this point, the m(a_0)-th candidate document selected by action a_0 is obtained. An evaluation index under the information retrieval scenario is defined as the reward function.
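The text does not preserve the concrete evaluation index used as the reward; a common choice in MDP-based ranking work (e.g., MDPRank-style methods) is the per-position DCG gain, used below purely as an illustrative stand-in, with the 0/1/2 relevance labels defined above.

```python
import math

def dcg_reward(label, t):
    """Illustrative reward: DCG gain of placing a document with the given
    0/1/2 relevance label at ranking position t+1 (t is the MDP step, from 0).
    This specific formula is an assumption, not taken from the patent text."""
    return (2 ** label - 1) / math.log2(t + 2)

# a strongly relevant document (label 2) at the top position yields reward 3.0
assert dcg_reward(2, 0) == 3.0
# an irrelevant document (label 0) contributes nothing at any position
assert dcg_reward(0, 5) == 0.0
```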
accordingly, the state of the agent will go to s 1 =T([0,q],a 0 )=[1,X 1 ]WhereinIs to encode the characteristics of the selected documentRemove candidate set X 0 And obtaining the latest feature code set of the candidate document.
As a specific embodiment, taking a 3rd-order Markov decision process as an example, at step t the agent selects an action a_t in the state s_t = [t, X_t], with selection probability distribution π(a_t|s_t; w) computed via the quantum probability operator ρ_t.
where ρ is t A quantum probability distribution operator for the first 2 selected candidate documents containing the agent:
thus, an action a in consideration of candidate document information selected by the first two agents under 3-order Markov decision can be obtained t Selected m (a) t ) A candidate document.
The reward function likewise takes the 3rd-order Markov decision into account.
thus, the state of the agent will go to s t+1 =T([t,X t ],a t )=[t+1,X t+1 ]WhereinIs to encode the characteristics of the selected documentRemoving candidate set X t Resulting in the latest candidate set.
The above process is repeated until M candidate documents are ranked, and the process of generating an ordered document set by this ranking algorithm is shown in fig. 1 and 2.
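The repeated select-and-remove loop above can be sketched end to end. Two assumptions fill in formulas the text only names: the selection probability is taken to follow the quantum Born rule, p(a) ∝ tr(ρ_t x_a), and the operator ρ_t is taken as a uniform mixture of the query code and the codes of the last n-1 selected documents. Both are illustrative choices, not the patent's exact definitions.

```python
import numpy as np

def rank(q, X, n=3, rng=np.random.default_rng(0)):
    """Hedged sketch of the n-order ranking loop.
    q: N x N query density matrix; X: list of N x N candidate density matrices.
    Assumed: p(a) proportional to tr(rho_t @ x_a) (Born rule), with rho_t a
    uniform mixture of q and the last n-1 selected document codes."""
    remaining = dict(enumerate(X))        # candidate index -> feature code
    history = []                          # codes of previously selected documents
    order = []
    while remaining:
        mix = [q] + history[-(n - 1):]
        rho = sum(mix) / len(mix)         # quantum probability distribution operator
        idx = list(remaining)
        scores = np.array([np.trace(rho @ remaining[i]).real for i in idx])
        p = scores / scores.sum()         # Born-rule probabilities tr(rho x_a) >= 0
        a = rng.choice(idx, p=p)          # sample an action = pick a document
        order.append(a)
        history.append(remaining.pop(a))  # remove x_{m(a_t)} from the candidate set
    return order
```

Sampling a_t from p rather than taking the argmax keeps the policy stochastic, which is what the policy-gradient training in S300 requires.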
S300, performing model training on the MDP model according to the long-term reward; wherein the long-term reward is L = E[Σ_{k=1}^{M} λ^{k-1} · r_k], λ is a predetermined discount factor, r_k is the reward corresponding to the feature code of the k-th candidate document returned by the MDP model, k ranges from 1 to M, M is the number of candidate documents included in the candidate document set, and E denotes expectation.
It should be understood that the trained MDP model can be used for information retrieval. Because the feature codes and the candidate documents are in one-to-one correspondence, the ordered candidate document set can be obtained from the ordered feature-code set returned by the MDP model.
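The training signal of S300 can be sketched as follows: a Monte-Carlo estimate of the long-term reward L = E[Σ_{k=1}^{M} λ^{k-1} r_k] from one sampled ranking episode, plus REINFORCE-style returns-to-go as a plausible per-step weight for updating the trainable parameters (w, and the w_j, θ_j of the word vectors). The patent does not name its policy-gradient algorithm, so REINFORCE is an assumption here.

```python
def long_term_reward(rewards, lam=0.9):
    """Monte-Carlo estimate of L = E[sum_{k=1..M} lam^(k-1) * r_k]
    from the rewards r_1..r_M of one sampled ranking episode."""
    return sum(lam ** k * r for k, r in enumerate(rewards))  # enumerate starts at k-1 = 0

def returns_to_go(rewards, lam=0.9):
    """Discounted return G_t from each step onward, the usual REINFORCE weight
    multiplying grad log pi(a_t|s_t) in the policy-gradient update."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + lam * G
        out.append(G)
    return out[::-1]

assert long_term_reward([1.0, 1.0], lam=0.5) == 1.5       # 1 + 0.5*1
assert returns_to_go([1.0, 1.0], lam=0.5) == [1.5, 1.0]
```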
Although some specific embodiments of the present invention have been described in detail by way of example, it should be understood by those skilled in the art that the above examples are for illustration only and are not intended to limit the scope of the invention. It will also be appreciated by those skilled in the art that various modifications may be made to the embodiments without departing from the scope and spirit of the invention. The scope of the invention is defined by the appended claims.
Claims (8)
1. A reinforcement learning model construction method for information retrieval, characterized by comprising the following steps:
S100, acquiring a feature code q of query information Q and a feature code of each candidate document in a candidate document set;
S200, constructing an MDP model, wherein: the initial state of the MDP model is s_0 = [0, q]; the agent of the MDP model selects an action a_0 in the initial state with probability distribution π(a_0|s_0; w); the action a_0 selects a candidate document d_{m(a_0)} from the candidate document set; w is a preset initialized trainable parameter; x_{m(a)} is the feature code of the candidate document d_{m(a)} selected from the candidate document set by action a; A(s_0) is the set of actions selectable in the initial state s_0; ( )^H denotes the conjugate transpose; the initial reward function of the MDP model returns the preset relevance label of the selected candidate document d_{m(a_0)}; the agent of the MDP model selects an action a_t in the state s_t corresponding to step t with probability distribution π(a_t|s_t; w), x_{m(a_t)} being the feature code of the candidate document d_{m(a_t)} selected from the candidate document set by action a_t, and A(s_t) being the set of actions selectable in the state s_t corresponding to step t; ρ_t is a quantum probability distribution operator built from the agent's previous n-1 selected candidate documents, n being a preset value with n ≥ 2; the decision reward function of the MDP model returns the preset relevance label of the candidate document selected from the candidate document set by the action;
S300, performing model training on the MDP model according to the long-term reward; wherein the long-term reward is L = E[Σ_{k=1}^{M} λ^{k-1} · r_k], λ is a predetermined discount factor, r_k is the reward corresponding to the feature code of the k-th candidate document returned by the MDP model, k ranges from 1 to M, M is the number of candidate documents included in the candidate document set, and E denotes expectation.
2. The method according to claim 1, wherein in S100, the method for obtaining the feature code of each candidate document in the candidate document set comprises:
S110, performing word segmentation on the e-th candidate document doc_e in the candidate document set to obtain m words;
S120, obtaining the complex word vector of the l-th word, t_l = Σ_{j=1}^{N} w_j e^{iθ_j} ψ_j, where each coefficient w_j e^{iθ_j} is a complex number, {w_j}_{j=1}^{N} are non-negative real numbers satisfying Σ_{j=1}^{N} w_j^2 = 1, θ_j is the complex phase corresponding to the real number w_j and satisfies θ_j ∈ [-π, π], ψ_j is the j-th basis vector of the Hilbert space H^N, N is the number of independent latent semantics included in the candidate document set, and i is the imaginary unit;
S130, obtaining the feature code of candidate document doc_e: x_m = Σ_{l=1}^{m} (u_l · t_l · (t_l)^H), where u_l is the importance of the l-th word in doc_e and Σ_{l=1}^{m} u_l = 1.
3. The method of claim 2, wherein in S130, u_l is obtained according to the term frequency of the l-th word in doc_e.
4. The method of claim 2, wherein in S130, u_l is obtained according to the inverse document frequency index (tf-idf) of the l-th word.
5. The method of claim 1, wherein in S100, the method for obtaining q comprises:
S111, performing word segmentation on the query information Q to obtain c words;
S121, obtaining the complex word vector of the b-th word, t_b = Σ_{j=1}^{N} w_j e^{iθ_j} ψ_j, where each coefficient w_j e^{iθ_j} is a complex number, {w_j}_{j=1}^{N} are non-negative real numbers satisfying Σ_{j=1}^{N} w_j^2 = 1, θ_j is the complex phase corresponding to the real number w_j and satisfies θ_j ∈ [-π, π], ψ_j is the j-th basis vector of the Hilbert space H^N, N is the number of independent latent semantics included in the candidate document set, and i is the imaginary unit;
S131, obtaining the feature code of Q: q = Σ_{b=1}^{c} (u_b · t_b · (t_b)^H), where u_b is the importance of the b-th word in Q and Σ_{b=1}^{c} u_b = 1.
6. The method of claim 5, wherein u is b =1/c。
7. The method of claim 1, wherein n =3.
8. The method of claim 1, wherein the relevance label is 0, 1, or 2.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211287916.0A CN115526338B (en) | 2022-10-20 | 2022-10-20 | Reinforced learning model construction method for information retrieval |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115526338A true CN115526338A (en) | 2022-12-27 |
CN115526338B CN115526338B (en) | 2023-06-23 |
Family
ID=84706705
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211287916.0A Active CN115526338B (en) | 2022-10-20 | 2022-10-20 | Reinforced learning model construction method for information retrieval |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115526338B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109783709A (en) * | 2018-12-21 | 2019-05-21 | 昆明理工大学 | A kind of sort method based on Markovian decision process and k- arest neighbors intensified learning |
CN111241407A (en) * | 2020-01-21 | 2020-06-05 | 中国人民大学 | Personalized search method based on reinforcement learning |
US20210089868A1 (en) * | 2019-09-23 | 2021-03-25 | Adobe Inc. | Reinforcement learning with a stochastic action set |
CN114860893A (en) * | 2022-07-06 | 2022-08-05 | 中国人民解放军国防科技大学 | Intelligent decision-making method and device based on multi-mode data fusion and reinforcement learning |
Also Published As
Publication number | Publication date |
---|---|
CN115526338B (en) | 2023-06-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Wu et al. | Session-based recommendation with graph neural networks | |
CN111027327B (en) | Machine reading understanding method, device, storage medium and device | |
US20210182680A1 (en) | Processing sequential interaction data | |
Huang et al. | A novel two-step procedure for tourism demand forecasting | |
CN109446414B (en) | Software information site rapid label recommendation method based on neural network classification | |
CN113268609A (en) | Dialog content recommendation method, device, equipment and medium based on knowledge graph | |
CN110781409A (en) | Article recommendation method based on collaborative filtering | |
CN110263245B (en) | Method and device for pushing object to user based on reinforcement learning model | |
CN113011529B (en) | Training method, training device, training equipment and training equipment for text classification model and readable storage medium | |
CN114860915A (en) | Model prompt learning method and device, electronic equipment and storage medium | |
US20220067055A1 (en) | Methods and apparatuses for showing target object sequence to target user | |
CN109086463B (en) | Question-answering community label recommendation method based on regional convolutional neural network | |
CN113723115B (en) | Open domain question-answer prediction method based on pre-training model and related equipment | |
CN112000788A (en) | Data processing method and device and computer readable storage medium | |
CN111626827A (en) | Method, device, equipment and medium for recommending articles based on sequence recommendation model | |
CN114358023A (en) | Intelligent question-answer recall method and device, computer equipment and storage medium | |
CN105045827A (en) | Familiarity based information recommendation method and apparatus | |
Robles et al. | Learning to reinforcement learn for neural architecture search | |
CN115526338A (en) | Reinforced learning model construction method for information retrieval | |
US11983633B2 (en) | Machine learning predictions by generating condition data and determining correct answers | |
Gupta et al. | Forecasting through motifs discovered by genetic algorithms | |
CN115310449A (en) | Named entity identification method and device based on small sample and related medium | |
CN117938951B (en) | Information pushing method, device, computer equipment and storage medium | |
CN116684480B (en) | Method and device for determining information push model and method and device for information push | |
CN117875424B (en) | Knowledge graph completion method and system based on entity description and symmetry relation |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |