CN111782870B - Antagonistic video time retrieval method and device based on reinforcement learning, computer equipment and storage medium

Antagonistic video time retrieval method and device based on reinforcement learning, computer equipment and storage medium

Info

Publication number
CN111782870B
CN111782870B (application number CN202010557372.XA)
Authority
CN
China
Prior art keywords
video
time
reinforcement learning
query statement
boundary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010557372.XA
Other languages
Chinese (zh)
Other versions
CN111782870A (en)
Inventor
曹达
曾雅文
荣辉桂
朱宁波
陈浩
秦拯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Original Assignee
Hunan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University filed Critical Hunan University
Priority to CN202010557372.XA priority Critical patent/CN111782870B/en
Publication of CN111782870A publication Critical patent/CN111782870A/en
Application granted granted Critical
Publication of CN111782870B publication Critical patent/CN111782870B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73Querying
    • G06F16/732Query formulation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/24155Bayesian classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Library & Information Science (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an adversarial video moment retrieval method based on reinforcement learning, together with a corresponding device, computer equipment and storage medium. A complete video and a query sentence are input as the environment information of a reinforcement learning agent. Query sentence features, global video features, location features and local video features are extracted from the environment information to form the state of the current candidate video moment segment. Based on this state, the reinforcement learning agent takes an action that moves the temporal boundary, receives a reward for executing the action, and, according to the reward, outputs several updated temporal boundaries and local video features; the updated temporal boundaries are the updated candidate video moment segments. A Bayesian personalized ranking method matches the temporal boundaries against the query sentence, outputs a matching score, and returns the score to the reinforcement learning agent as the reward. The two components reinforce each other through adversarial learning until convergence, yielding the video moment segment corresponding to the query sentence.

Description

Antagonistic video time retrieval method and device based on reinforcement learning, computer equipment and storage medium
[ technical field ]
The present invention relates to the field of video moment retrieval, and in particular to an adversarial video moment retrieval method and device based on reinforcement learning, as well as computer equipment and a storage medium.
[ background of the invention ]
Video retrieval aims at retrieving, from a set of candidate videos, the video most relevant to the semantics described by a query sentence. Given the rapid pace of modern life and the ever-growing amount of information, there is an urgent need to quickly find the information that best matches a user's actual needs. In the video domain in particular, people increasingly prefer to browse a short video moment that matches their interests rather than the entire video. To meet this need, the task of video moment retrieval under a language query has emerged; its goal is to locate the start and end points of the video moment most relevant to the semantics of the query sentence.
An existing video moment retrieval method, such as "video moment localization via language query", mainly comprises the following steps: 1. extract the features of the video segments and of the query sentence; 2. perform multi-modal fusion of the video segment features and the query sentence to obtain richer semantic information; 3. use a multi-layer perceptron to predict the matching score between video and sentence and a temporal offset. Given the query sentence, this method selects the best-matching video segment from a candidate set and adds a temporal offset, where the candidate set is generated by segmenting the video with a sliding-window strategy. To reach acceptable localization accuracy, however, this strategy usually requires dense segmentation, which makes the method time-consuming and unable to satisfy dynamic queries; it also needs video moments of varying rather than fixed length. On the other hand, although the temporal offset frees the localization from the size of the window, the prediction of the offset is not stable enough and can instead degrade the quality of the video segment returned for the query.
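To make the cost of this strategy concrete, the following sketch (Python; not taken from the cited work, and the window lengths and stride are hypothetical) enumerates sliding-window candidates and shows how quickly the candidate count grows under dense segmentation.

    def sliding_window_candidates(num_frames, window_lengths=(64, 128, 256), stride=16):
        # Enumerate (start, end) frame pairs for every window length and stride step.
        candidates = []
        for w in window_lengths:
            for start in range(0, max(num_frames - w, 0) + 1, stride):
                candidates.append((start, start + w))
        return candidates

    # A 10-minute video at 16 fps (~9600 frames) already yields roughly 1800
    # candidates, each of which must be scored against the query sentence.
    print(len(sliding_window_candidates(9600)))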
Another approach, "Read, Watch, and Move: reinforcement learning for temporally grounding natural language descriptions in videos", mainly comprises the following steps: 1. input the complete video and the query sentence as the environment of the reinforcement learning agent; 2. extract the global video features, the video segment features, the localization information of the video segment and the query text features to form the state at the current step; 3. have the reinforcement learning agent output a movement action for the localization boundary according to the current state, and repeat this until the localization gradually converges. This was the first work to introduce reinforcement learning into video moment localization; it removes the dependence on sliding-window candidates and achieves more accurate localization. However, the design of the agent's reward has not been explored much. Existing reinforcement learning-based methods compute the reward from the intersection-over-union (IoU) before and after each boundary movement; this reward carries no semantic information, and its fixed values lead to slow and unstable convergence of the model.
In summary, existing methods for video moment retrieval fall mainly into two categories: ranking methods based on a sliding-window candidate set, and localization methods based on reinforcement learning. The ranking methods segment the video in advance with a sliding-window strategy to generate a candidate set, match the candidates against the query text, and rank them by matching score to obtain the result. This clearly produces too many segments and takes too long, so researchers introduced reinforcement learning and abstracted the problem as a sequential decision problem that directly localizes the start and end frames of the video moment. Although this also achieves good results, the reward design of the agent has not been explored much, and these methods are not stable.
Both categories have advantages and disadvantages: the ranking methods are good at ranking multiple candidate moments, but they cannot form a reasonably small candidate set and therefore cost too much time; the localization methods use a reinforcement learning agent to control the boundary localization, but they cannot be applied to large-scale retrieval scenarios and their efficiency is low.
Therefore, there is a need to provide an improved video moment retrieval method to solve the above problems.
[ summary of the invention ]
The present invention overcomes the deficiencies of the prior art and provides an adversarial video moment retrieval method and device based on reinforcement learning, as well as computer equipment and a storage medium.
To achieve this purpose, the invention adopts the following technical solution: an adversarial video moment retrieval method based on reinforcement learning is provided, comprising the following steps:
S1: inputting a complete video v and a query sentence q as the environment information of a reinforcement learning agent;
S2: extracting from the environment information the query sentence feature f_q, the global video feature f_g, the location feature l_t and the local video feature f_l^t corresponding to the location feature l_t, and forming the state of the current candidate video moment segment s_t = [f_q, f_g, l_t, f_l^t], where t is the time step and the location feature l_t is the initial temporal boundary;
S3: the reinforcement learning agent, according to the state s_t, taking an action a_t that moves the temporal boundary l_t, obtaining the reward r_t for executing the action a_t, and, according to the reward r_t, outputting several updated temporal boundaries l_{t+1} and the local video features f_l^{t+1} corresponding to the temporal boundaries l_{t+1}, thereby reconstructing the state s' of the current video moment segment; at this point the temporal boundaries l_{t+1} are the updated candidate video moment segments;
S4: matching the temporal boundary l_t against the query sentence q with a Bayesian personalized ranking method, outputting a matching score, and returning the matching score to the reinforcement learning agent as the reward r_t;
S5: the reinforcement learning agent and the Bayesian personalized ranking method reinforcing each other through adversarial learning until convergence, obtaining the video moment segment l = (l_s, l_e) corresponding to the query sentence q, where l_s is the start time and l_e is the end time of the video moment.
Preferably, step S3 further includes: updating the reinforcement learning agent with a deep deterministic policy gradient (DDPG) algorithm to output several updated temporal boundaries l_{t+1}. The deep deterministic policy gradient algorithm consists of a critic network, an actor network, a lagged (target) critic network and a lagged (target) actor network, wherein the critic network judges from the reward r_t whether the action a_t is the optimal action, the actor network performs the optimal action to obtain the updated temporal boundary l_{t+1}, and the lagged critic network and the lagged actor network update their parameters by a soft update method.
Preferably, the critic network learns the action-value function Q(s,a) corresponding to the optimal policy π by minimizing the loss function L:
L(ω) = E_{s,a,r,s'∼M} [ ( Q(s,a|ω) − ( r + γ max_{a'} Q*(s',a'|ω*) ) )² ]
where Q(s,a) is the action-value function of the critic network, ω is its trainable parameter, γ is the discount factor of the action-value function Q(s,a) that balances the reward r_t against the estimated value of Q(s,a), Q* is the preset lagged (target) network and ω* is its parameter. The tuples [s, a, r, s'] are sampled from the memory (replay) buffer M so that past experience can guide learning; s is the state of the video moment segment before the update, a is the action before the update, and a' is the updated action. The reinforcement learning agent obtains the maximum reward when the action-value function Q(s,a) best approximates the optimal policy π.
Preferably, the actor network performs the action a = π(s; θ) to update the temporal boundary l_t; a derivative along the increasing direction of the action-value function Q(s,a) is obtained through the objective J so that Q(s,a) is maximized, and the resulting policy gradient is:
∇_θ J ≈ E_{s∼M} [ ∇_a Q(s,a|ω) |_{a=μ(s;θ)} ∇_θ μ(s;θ) ]
where μ is the deterministic policy and θ is the parameter of the deterministic policy μ.
Preferably, step S4 includes:
S41: the query sentence q is annotated with a ground-truth video moment τ = (τ_s, τ_e); extract the features of the query sentence q, of the temporal boundary l_t and of the ground-truth video moment τ, where τ_s is the annotated start time and τ_e is the annotated end time of the ground-truth video moment;
S42: through a preset common space and the features of the query sentence q, of the temporal boundary l_t and of the ground-truth video moment τ, obtain the mapping functions of the query sentence q, of the temporal boundary l_t and of the ground-truth video moment τ;
S43: through element-wise multiplication, element-wise addition and concatenation, obtain the combination of the mapping function of the query sentence q with the mapping function of the temporal boundary l_t, and the combination of the mapping function of the query sentence q with the mapping function of the ground-truth video moment τ;
S44: according to the combination of the mapping function of the query sentence q with the mapping function of the temporal boundary l_t, and the combination of the mapping function of the query sentence q with the mapping function of the ground-truth video moment τ, output the matching score of the updated temporal boundary l_t with respect to the ground-truth video moment τ.
Preferably, step S5 includes:
S51: compute the intersection-over-union of the temporal boundary and the ground-truth video moment τ;
S52: obtain a joint loss function from the intersection-over-union and the combined mapping function of the query sentence q and the temporal boundary l_t;
S53: combine the loss of the Bayesian personalized ranking method with the joint loss function to obtain the maximum reward r;
S54: the reinforcement learning agent outputs the temporal boundary (l_s, l_e) that achieves the maximum reward.
Preferably, the parameter θ of the reinforcement learning agent and the parameters of the Bayesian personalized ranking method are updated according to the following formula:

(update formula given as an image in the original publication)

where K is the total number of updated temporal boundaries and L_sc is the loss combining the mapping function of the query sentence q and the mapping function of the ground-truth video moment τ.
The invention also provides an adversarial video moment retrieval device based on reinforcement learning, comprising:
an input module, configured to input a complete video v and a query sentence q as the environment information of a reinforcement learning agent;
a feature extraction module, configured to extract from the environment information the query sentence feature f_q, the global video feature f_g, the location feature l_t and the local video feature f_l^t corresponding to the location feature l_t, and to form the state of the current candidate video moment segment s_t = [f_q, f_g, l_t, f_l^t], where t is the time step and the location feature l_t is the initial temporal boundary;
a candidate set generation module, configured to take, according to the state s_t, an action a_t that moves the temporal boundary l_t, to obtain the reward r_t for executing the action a_t, and, according to the reward r_t, to output several updated temporal boundaries l_{t+1} and the corresponding local video features f_l^{t+1}, reconstructing the state s' of the current video moment segment; at this point the temporal boundaries l_{t+1} are the updated candidate video moment segments;
a Bayesian personalized ranking module, configured to match the temporal boundary l_t against the query sentence q, to output a matching score, and to return the matching score to the reinforcement learning agent as the reward r_t;
an adversarial learning module, configured to make the candidate set generation module and the Bayesian personalized ranking module reinforce each other through adversarial learning until convergence, obtaining the video moment segment l = (l_s, l_e) corresponding to the query sentence q, where l_s is the start time and l_e is the end time of the video moment.
A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor when executing the computer program implements the steps of:
inputting a complete video v and a query sentence q as the environment information of a reinforcement learning agent;
extracting from the environment information the query sentence feature f_q, the global video feature f_g, the location feature l_t and the local video feature f_l^t corresponding to the location feature l_t, and forming the state of the current candidate video moment segment s_t = [f_q, f_g, l_t, f_l^t], where t is the time step and the location feature l_t is the initial temporal boundary;
the reinforcement learning agent, according to the state s_t, taking an action a_t that moves the temporal boundary l_t, obtaining the reward r_t for executing the action a_t, and, according to the reward r_t, outputting several updated temporal boundaries l_{t+1} and the corresponding local video features f_l^{t+1}, thereby reconstructing the state s' of the current video moment segment, at which point the temporal boundaries l_{t+1} are the updated candidate video moment segments;
matching the temporal boundary l_t against the query sentence q with a Bayesian personalized ranking method, outputting a matching score, and returning the matching score to the reinforcement learning agent as the reward r_t;
the reinforcement learning agent and the Bayesian personalized ranking method reinforcing each other through adversarial learning until convergence, obtaining the video moment segment l = (l_s, l_e) corresponding to the query sentence q, where l_s is the start time and l_e is the end time of the video moment.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
inputting a complete video v and a query sentence q as the environment information of a reinforcement learning agent;
extracting from the environment information the query sentence feature f_q, the global video feature f_g, the location feature l_t and the local video feature f_l^t corresponding to the location feature l_t, and forming the state of the current candidate video moment segment s_t = [f_q, f_g, l_t, f_l^t], where t is the time step and the location feature l_t is the initial temporal boundary;
the reinforcement learning agent, according to the state s_t, taking an action a_t that moves the temporal boundary l_t, obtaining the reward r_t for executing the action a_t, and, according to the reward r_t, outputting several updated temporal boundaries l_{t+1} and the corresponding local video features f_l^{t+1}, thereby reconstructing the state s' of the current video moment segment, at which point the temporal boundaries l_{t+1} are the updated candidate video moment segments;
matching the temporal boundary l_t against the query sentence q with a Bayesian personalized ranking method, outputting a matching score, and returning the matching score to the reinforcement learning agent as the reward r_t;
the reinforcement learning agent and the Bayesian personalized ranking method reinforcing each other through adversarial learning until convergence, obtaining the video moment segment l = (l_s, l_e) corresponding to the query sentence q, where l_s is the start time and l_e is the end time of the video moment.
Compared with the prior art, the adversarial video moment retrieval method, device, computer equipment and storage medium based on reinforcement learning have the following beneficial effects: by combining the reinforcement learning localization method with the Bayesian personalized ranking method, the ranking-based component obtains a small but reasonable candidate set, while the reinforcement-learning localization component obtains a more flexible reward function and more stable convergence; the ranking and localization components then reinforce each other within an adversarial learning framework, returning more accurate video moment segments and effectively improving the accuracy and speed of a user's query and retrieval.
[ description of the drawings ]
FIG. 1 is a flowchart of the adversarial video moment retrieval method based on reinforcement learning provided by the present invention;
FIG. 2 is a schematic diagram illustrating the principle of the adversarial video moment retrieval method based on reinforcement learning provided by the present invention;
FIG. 3 is a sub-flowchart of step S4 in FIG. 1;
FIG. 4 is a sub-flowchart of step S5 in FIG. 1;
FIG. 5 is a functional block diagram of the adversarial video moment retrieval apparatus based on reinforcement learning provided by the present invention;
FIG. 6 is a diagram of the internal structure of the computer device provided by the present invention.
[ detailed description of the embodiments ]
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to FIG. 1 and FIG. 2, the present invention provides an adversarial video moment retrieval method based on reinforcement learning, which includes the following steps:
S1: input the complete video v and the query sentence q as the environment information of the reinforcement learning agent.
S2: extract from the environment information the query sentence feature f_q, the global video feature f_g, the location feature l_t and the local video feature f_l^t corresponding to the location feature l_t, and form the state of the current candidate video moment segment s_t = [f_q, f_g, l_t, f_l^t], where t is the time step and the location feature l_t is the initial temporal boundary.
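As a minimal sketch of step S2 (Python, with assumed feature shapes and a mean-pooling choice that the patent does not prescribe), the state can be assembled as follows.

    import numpy as np

    def build_state(f_q, f_g, frame_features, boundary, num_frames):
        # location feature l_t: the current boundary normalised by the video length
        l_t = np.array([boundary[0] / num_frames, boundary[1] / num_frames])
        # local video feature f_l^t: pool the frame features inside the boundary
        # (mean pooling is one possible choice, assumed here)
        f_l_t = frame_features[int(boundary[0]):int(boundary[1])].mean(axis=0)
        # state s_t = [f_q, f_g, l_t, f_l^t]
        return np.concatenate([f_q, f_g, l_t, f_l_t])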
S3: the reinforcement learning agent is based on the state stMaking at the timing boundary ItMovement action atObtaining to execute the action atIs awarded rtAnd according to the reward rtOutputting a plurality of updated timing boundaries It+1And the timing boundary It+1Corresponding local video feature fI t+1Reconstructing the state s' of the current video time slice, at which time the temporal boundary I is presentt+1And the current video time candidate segment is updated.
The action space A_e of the reinforcement learning agent consists of 7 predefined actions: moving both the start point and the end point forwards, moving both the start point and the end point backwards, moving only the start point or only the end point forwards or backwards, and stopping the movement.
Specifically, the initial position of the reinforcement learning agent is set to l_0 = [0.25*h, 0.75*h], where h is the total number of image frames in the complete video v. The per-step movement size of the action a_t is set to h/(2e), where e is a hyper-parameter that defines the maximum number of search steps of the reinforcement learning agent; this ensures that the complete video v can be traversed within the maximum number of steps.
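The following sketch (Python; the clamping of the boundary to [0, h] is an assumption) illustrates the 7-action space and the per-step boundary update of h/(2e) described above.

    ACTIONS = [
        "both_forward", "both_backward",
        "start_forward", "start_backward",
        "end_forward", "end_backward",
        "stop",
    ]

    def apply_action(boundary, action, h, e):
        start, end = boundary
        delta = h / (2 * e)  # per-step movement size
        if action == "both_forward":
            start, end = start + delta, end + delta
        elif action == "both_backward":
            start, end = start - delta, end - delta
        elif action == "start_forward":
            start += delta
        elif action == "start_backward":
            start -= delta
        elif action == "end_forward":
            end += delta
        elif action == "end_backward":
            end -= delta
        # "stop" leaves the boundary unchanged and ends the search episode
        start = min(max(start, 0.0), h)
        end = min(max(end, 0.0), h)
        return (start, end)

    initial_boundary = (0.25 * 1000, 0.75 * 1000)  # l_0 for a video of h = 1000 frames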
In this embodiment, the reinforcement learning agent is updated with a deep deterministic policy gradient (DDPG) algorithm to output several updated temporal boundaries l_{t+1}. The deep deterministic policy gradient algorithm consists of a critic network, an actor network, a lagged (target) critic network and a lagged (target) actor network, wherein the critic network judges from the reward r_t whether the action a_t is the optimal action, the actor network performs the optimal action to obtain the updated temporal boundary l_{t+1}, and the lagged critic network and the lagged actor network update their parameters by a soft update method.
It should be noted that the deep deterministic policy gradient algorithm uses deep neural networks for function approximation and makes effective use of experience replay and dual lagged target networks. The critic network learns the action-value function Q(s,a) corresponding to the optimal policy π by minimizing the loss function L:
L(ω) = E_{s,a,r,s'∼M} [ ( Q(s,a|ω) − ( r + γ max_{a'} Q*(s',a'|ω*) ) )² ]
where Q(s,a) is the action-value function of the critic network, ω is its trainable parameter, γ is the discount factor of the action-value function Q(s,a) that balances the reward r_t against the estimated value of Q(s,a), Q* is the preset lagged (target) network and ω* is its parameter. The tuples [s, a, r, s'] are sampled from the memory (replay) buffer M so that past experience can guide learning; s is the state of the video moment segment before the update, a is the action before the update, and a' is the updated action. The reinforcement learning agent obtains the maximum reward when the action-value function Q(s,a) best approximates the optimal policy π.
The actor network performs the action a = π(s; θ) to update the temporal boundary l_t; a derivative along the increasing direction of the action-value function Q(s,a) is obtained through the objective J so that Q(s,a) is maximized, and the resulting policy gradient is:
∇_θ J ≈ E_{s∼M} [ ∇_a Q(s,a|ω) |_{a=μ(s;θ)} ∇_θ μ(s;θ) ]
where μ is the deterministic policy and θ is its parameter; the actor network maximizes the action-value function Q(s,a) by directly adjusting θ.
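A condensed sketch of one DDPG update, assuming PyTorch and a critic that takes (state, action) pairs; in DDPG the maximum over a' in the critic target is realised by evaluating the lagged critic at the lagged actor's action.

    import torch
    import torch.nn as nn

    def ddpg_update(actor, critic, actor_target, critic_target,
                    actor_opt, critic_opt, batch, gamma=0.99, tau=0.005):
        s, a, r, s_next = batch  # tensors sampled from the replay memory M

        # critic: minimise (Q(s,a|w) - (r + gamma * Q*(s', pi*(s'))))^2
        with torch.no_grad():
            target = r + gamma * critic_target(s_next, actor_target(s_next))
        critic_loss = nn.functional.mse_loss(critic(s, a), target)
        critic_opt.zero_grad()
        critic_loss.backward()
        critic_opt.step()

        # actor: follow the gradient that increases Q(s, pi(s; theta))
        actor_loss = -critic(s, actor(s)).mean()
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()

        # soft update of the lagged (target) networks
        for net, lagged in ((actor, actor_target), (critic, critic_target)):
            for p, p_lag in zip(net.parameters(), lagged.parameters()):
                p_lag.data.mul_(1.0 - tau).add_(tau * p.data)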
S4: match the temporal boundary l_t against the query sentence q with a Bayesian personalized ranking method, output a matching score, and return the matching score to the reinforcement learning agent as the reward r_t.
Referring to FIG. 3, step S4 includes the following steps:
S41: the query sentence q is annotated with a ground-truth video moment τ = (τ_s, τ_e); extract the features of the query sentence q, of the temporal boundary l_t and of the ground-truth video moment τ, denoted f_q, f_l and f_τ respectively, where τ_s is the annotated start time and τ_e is the annotated end time of the ground-truth video moment;
S42: through a preset common space and the features of the query sentence q, of the temporal boundary l_t and of the ground-truth video moment τ, obtain the mapping functions of the query sentence q, of the temporal boundary l_t and of the ground-truth video moment τ.
Specifically, under the constraint of semantic consistency, f_q, f_l and f_τ are projected into the common space, so that the different modalities are regularized and the retrieval performance is effectively improved:

(projection formula given as an image in the original publication)

where o_v and o_l are projection functions approximated by multi-layer perceptrons, and the projected features have the same dimensions. In the common space, under the constraint of semantic consistency, the representations of the different modalities are forced to approach each other:

(semantic-consistency constraint given as an image in the original publication)
S43: through element-wise multiplication, element-wise addition and concatenation, the mapping function of the query sentence q is combined with the mapping function of the temporal boundary l_t, and the mapping function of the query sentence q is combined with the mapping function of the ground-truth video moment τ, as follows:

(combination formulas given as images in the original publication)
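A sketch of steps S42–S43 under assumed dimensions (PyTorch): two small MLPs project the language and video features into a common space, and each query/moment pair is then fused by element-wise multiplication, element-wise addition and concatenation.

    import torch
    import torch.nn as nn

    class CommonSpace(nn.Module):
        def __init__(self, dim_q, dim_v, dim_c=256):
            super().__init__()
            self.o_l = nn.Sequential(nn.Linear(dim_q, dim_c), nn.ReLU())  # language projection
            self.o_v = nn.Sequential(nn.Linear(dim_v, dim_c), nn.ReLU())  # video projection

        def fuse(self, q_proj, v_proj):
            # element-wise multiplication, element-wise addition, concatenation
            return torch.cat([q_proj * v_proj, q_proj + v_proj, q_proj, v_proj], dim=-1)

        def forward(self, f_q, f_l, f_tau):
            q_proj = self.o_l(f_q)
            cand_proj = self.o_v(f_l)    # candidate boundary features
            gt_proj = self.o_v(f_tau)    # ground-truth moment features
            return self.fuse(q_proj, cand_proj), self.fuse(q_proj, gt_proj)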
S44: according to the combination of the mapping function of the query sentence q with the mapping function of the temporal boundary l_t, and the combination of the mapping function of the query sentence q with the mapping function of the ground-truth video moment τ, output the matching score of the updated temporal boundary l_t with respect to the ground-truth video moment τ.
Here the matching degree between the ground-truth video moment τ and the query sentence q should be higher than the matching degree between the temporal boundary l_t and the query sentence q, and the optimization objective is:

(ranking objective given as an image in the original publication)

where σ is the sigmoid activation function, o_s is the score approximated by a multi-layer perceptron, and Δ is a hyper-parameter controlling the margin between the two. Through this objective the matching score of a positive pair becomes greater than that of a negative pair, effectively distinguishing the ground-truth video moment τ from the temporal boundary l_t; the positive pair refers to the ground-truth video moment τ and the query sentence q, and the negative pair refers to the temporal boundary l_t and the query sentence q.
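A minimal sketch of this ranking objective (PyTorch), assuming score_positive and score_negative are the scalar outputs of the scorer o_s for the positive and negative pairs.

    import torch

    def bpr_loss(score_positive, score_negative, delta=0.1):
        # -log sigmoid(pos - neg - delta): minimising this pushes the positive
        # pair's score above the negative pair's score by at least the margin delta
        return -torch.log(torch.sigmoid(score_positive - score_negative - delta)).mean()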
S5: the reinforcement learning agent and the Bayes individual ordering method mutually reinforce through counterwork learning until convergence to obtain the corresponding query languageVideo time segment I of sentence q ═ I (I)s,Ie)。
Referring to fig. 4, in step S5, the method includes the following steps:
S51: compute the intersection-over-union (IoU) of the temporal boundary l_t and the ground-truth video moment τ:

IoU(l_t, τ) = |l_t ∩ τ| / |l_t ∪ τ|
S52: obtain a joint loss function from the IoU and the combined mapping function of the query sentence q and the temporal boundary l_t;
S53: combine the loss of the Bayesian personalized ranking method with the joint loss function to obtain the maximum reward r:
r = −L_bpr − λ_s·L_sc − λ_j·L_joint

(the definition of the joint loss L_joint is given as an image in the original publication)
S54: the reinforcement learning agent outputs the temporal boundary (l_s, l_e) that achieves the maximum reward, where l_s is the start time and l_e is the end time of the video moment.
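A sketch of steps S51–S53 under the stated assumptions: the temporal IoU is computed directly from the boundaries, the joint loss is taken as 1 − IoU (an assumed form), and the weights lambda_s and lambda_j are assumed hyper-parameters.

    def temporal_iou(candidate, ground_truth):
        # candidate and ground_truth are (start, end) pairs in frames or seconds
        inter = max(0.0, min(candidate[1], ground_truth[1]) - max(candidate[0], ground_truth[0]))
        union = max(candidate[1], ground_truth[1]) - min(candidate[0], ground_truth[0])
        return inter / union if union > 0 else 0.0

    def reward(l_bpr, l_sc, candidate, ground_truth, lambda_s=1.0, lambda_j=1.0):
        l_joint = 1.0 - temporal_iou(candidate, ground_truth)  # assumed joint-loss form
        # the agent is rewarded for driving all three losses down
        return -(l_bpr + lambda_s * l_sc + lambda_j * l_joint)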
In this embodiment, the parameter θ of the reinforcement learning agent and the parameters of the Bayesian personalized ranking method are updated according to the following formula:

(update formula given as an image in the original publication)

where K is the total number of updated temporal boundaries l_t and L_sc is the loss combining the mapping function of the query sentence q and the mapping function of the ground-truth video moment τ.
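A high-level sketch of the adversarial training scheme in step S5; all component names (agent, ranker and their methods) are assumed placeholders standing in for the sketches above.

    def train_adversarially(agent, ranker, dataset, num_epochs, max_steps):
        for _ in range(num_epochs):
            for video, query, ground_truth in dataset:
                boundary = agent.initial_boundary(video)            # l_0 = [0.25h, 0.75h]
                for _ in range(max_steps):
                    action = agent.act(video, query, boundary)      # actor picks a move
                    boundary = agent.step(boundary, action)         # update the boundary
                    score = ranker.match(query, boundary, ground_truth)
                    agent.learn(score)                              # DDPG update with the ranker's score as reward
                    if action == "stop":
                        break
                ranker.update(query, boundary, ground_truth)        # BPR update of the discriminator
        return agent, ranker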
According to the adversarial video moment retrieval method based on reinforcement learning, by combining reinforcement learning localization with the Bayesian personalized ranking method, the ranking-based component obtains a small but reasonable candidate set, while the reinforcement-learning-based component obtains a more flexible reward function and more stable convergence; the ranking and localization components then reinforce each other within the adversarial learning framework, returning more accurate video moment segments and effectively improving the accuracy and speed of a user's query and retrieval.
It should be understood that although the steps in the flowcharts of FIG. 1, FIG. 3 and FIG. 4 are shown in an order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in FIG. 1, FIG. 3 and FIG. 4 may include multiple sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times; these sub-steps or stages are not necessarily performed sequentially, but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, referring to FIG. 5, an adversarial video moment retrieval apparatus based on reinforcement learning is provided, and the apparatus includes:
an input module 100, configured to input a complete video v and a query sentence q as the environment information of a reinforcement learning agent;
a feature extraction module 200, configured to extract from the environment information the query sentence feature f_q, the global video feature f_g, the location feature l_t and the local video feature f_l^t corresponding to the location feature l_t, and to form the state of the current candidate video moment segment s_t = [f_q, f_g, l_t, f_l^t], where t is the time step and the location feature l_t is the initial temporal boundary;
a candidate set generation module 300, configured to take, according to the state s_t, an action a_t that moves the temporal boundary l_t, to obtain the reward r_t for executing the action a_t, and, according to the reward r_t, to output several updated temporal boundaries l_{t+1} and the corresponding local video features f_l^{t+1}, reconstructing the state s' of the current video moment segment, at which point the temporal boundaries l_{t+1} are the updated candidate video moment segments;
a Bayesian personalized ranking module 400, configured to match the temporal boundary l_t against the query sentence q, to output a matching score, and to return the matching score to the reinforcement learning agent as the reward r_t;
an adversarial learning module 500, configured to make the candidate set generation module 300 and the Bayesian personalized ranking module 400 reinforce each other through adversarial learning until convergence, obtaining the video moment segment l = (l_s, l_e) corresponding to the query sentence q, where l_s is the start time and l_e is the end time of the video moment.
For the specific limitations of the adversarial video moment retrieval device, reference may be made to the limitations of the adversarial video moment retrieval method above, which are not repeated here. The modules in the adversarial video moment retrieval device may be implemented wholly or partially by software, hardware or a combination thereof. The modules may be embedded in or independent of a processor of the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke them and execute the operations corresponding to the modules.
In this embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in FIG. 6. The computer device includes a processor, a memory, a network interface and a database connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer program, when executed by the processor, implements the adversarial video moment retrieval method based on reinforcement learning.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware associated with instructions of a computer program, which can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, databases, or other media used in the embodiments provided herein may include non-volatile and/or volatile memory, among others. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDRSDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), direct bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above embodiments and drawings are not intended to limit the form and style of the present invention, and any suitable changes or modifications thereof by those skilled in the art should be considered as not departing from the scope of the present invention.

Claims (9)

1. An adversarial video moment retrieval method based on reinforcement learning, characterized by comprising the following steps:
S1: inputting a complete video v and a query sentence q as the environment information of a reinforcement learning agent;
S2: extracting from the environment information the query sentence feature f_q, the global video feature f_g, the location feature l_t and the local video feature f_l^t corresponding to the location feature l_t, and forming the state of the current candidate video moment segment s_t = [f_q, f_g, l_t, f_l^t], where t is the time step and the location feature l_t is the initial temporal boundary;
S3: the reinforcement learning agent, according to the state s_t, taking an action a_t that moves the temporal boundary l_t, obtaining the reward r_t for executing the action a_t, and, according to the reward r_t, outputting several updated temporal boundaries l_{t+1} and the local video features f_l^{t+1} corresponding to the temporal boundaries l_{t+1}, thereby reconstructing the state s' of the current video moment segment, at which point the temporal boundaries l_{t+1} are the updated candidate video moment segments;
S4: matching the temporal boundary l_t against the query sentence q with a Bayesian personalized ranking method, outputting a matching score, and returning the matching score to the reinforcement learning agent as the reward r_t;
S5: the reinforcement learning agent and the Bayesian personalized ranking method reinforcing each other through adversarial learning until convergence, obtaining the video moment segment l = (l_s, l_e) corresponding to the query sentence q, where l_s is the start time and l_e is the end time of the video moment;
wherein step S4 includes:
S41: the query sentence q being annotated with a ground-truth video moment τ = (τ_s, τ_e), extracting the features of the query sentence q, of the temporal boundary l_t and of the ground-truth video moment τ, where τ_s is the annotated start time and τ_e is the annotated end time of the ground-truth video moment;
S42: through a preset common space and the features of the query sentence q, of the temporal boundary l_t and of the ground-truth video moment τ, obtaining the mapping functions of the query sentence q, of the temporal boundary l_t and of the ground-truth video moment τ;
S43: obtaining, through element-wise multiplication, element-wise addition and concatenation, the combination of the mapping function of the query sentence q with the mapping function of the temporal boundary l_t, and the combination of the mapping function of the query sentence q with the mapping function of the ground-truth video moment τ;
S44: according to the combination of the mapping function of the query sentence q with the mapping function of the temporal boundary l_t, and the combination of the mapping function of the query sentence q with the mapping function of the ground-truth video moment τ, outputting the matching score of the updated temporal boundary l_t with respect to the ground-truth video moment τ.
2. The adversarial video moment retrieval method based on reinforcement learning of claim 1, wherein step S3 further includes: updating the reinforcement learning agent with a deep deterministic policy gradient algorithm to output several updated temporal boundaries l_{t+1}, the deep deterministic policy gradient algorithm consisting of a critic network, an actor network, a lagged (target) critic network and a lagged (target) actor network, wherein the critic network judges from the reward r_t whether the action a_t is the optimal action, the actor network performs the optimal action to obtain the updated temporal boundary l_{t+1}, and the lagged critic network and the lagged actor network update their parameters by a soft update method.
3. The adversarial video moment retrieval method based on reinforcement learning of claim 2, wherein the critic network learns the action-value function Q(s,a) corresponding to the optimal policy π by minimizing the loss function L:

L(ω) = E_{s,a,r,s'∼M} [ ( Q(s,a|ω) − ( r + γ max_{a'} Q*(s',a'|ω*) ) )² ]

where Q(s,a) is the action-value function of the critic network, ω is its trainable parameter, γ is the discount factor of the action-value function Q(s,a) that balances the reward r_t against the estimated value of Q(s,a), Q* is the preset lagged (target) network and ω* is its parameter; the tuples [s, a, r, s'] are sampled from the memory (replay) buffer M so that past experience can guide learning, s is the state of the video moment segment before the update, a is the action before the update, and a' is the updated action; the reinforcement learning agent obtains the maximum reward when the action-value function Q(s,a) best approximates the optimal policy π.
4. The adversarial video moment retrieval method based on reinforcement learning of claim 3, wherein the actor network performs the action a = π(s; θ) to update the temporal boundary l_t; a derivative along the increasing direction of the action-value function Q(s,a) is obtained through the objective J so that Q(s,a) is maximized, and the resulting policy gradient is:

∇_θ J ≈ E_{s∼M} [ ∇_a Q(s,a|ω) |_{a=μ(s;θ)} ∇_θ μ(s;θ) ]

where μ is the deterministic policy and θ is the parameter of the deterministic policy μ.
5. The adversarial video moment retrieval method based on reinforcement learning of claim 3, wherein step S5 includes:
S51: computing the intersection-over-union of the temporal boundary and the ground-truth video moment τ;
S52: obtaining a joint loss function from the intersection-over-union and the combined mapping function of the query sentence q and the temporal boundary l_t;
S53: combining the loss of the Bayesian personalized ranking method with the joint loss function to obtain the maximum reward r;
S54: the reinforcement learning agent outputting the temporal boundary (l_s, l_e) that achieves the maximum reward, where l_s is the start time and l_e is the end time of the video moment.
6. The adversarial video moment retrieval method based on reinforcement learning of claim 1, wherein the parameter θ of the reinforcement learning agent and the parameters of the Bayesian personalized ranking method are updated according to the following formula:

(update formula given as an image in the original publication)

where K is the total number of updated temporal boundaries and L_sc is the loss combining the mapping function of the query sentence q and the mapping function of the ground-truth video moment τ.
7. An adversarial video moment retrieval apparatus based on reinforcement learning, characterized in that the apparatus comprises:
an input module, configured to input a complete video v and a query sentence q as the environment information of a reinforcement learning agent;
a feature extraction module, configured to extract from the environment information the query sentence feature f_q, the global video feature f_g, the location feature l_t and the local video feature f_l^t corresponding to the location feature l_t, and to form the state of the current candidate video moment segment s_t = [f_q, f_g, l_t, f_l^t], where t is the time step and the location feature l_t is the initial temporal boundary;
a candidate set generation module, configured to take, according to the state s_t, an action a_t that moves the temporal boundary l_t, to obtain the reward r_t for executing the action a_t, and, according to the reward r_t, to output several updated temporal boundaries l_{t+1} and the corresponding local video features f_l^{t+1}, reconstructing the state s' of the current video moment segment, at which point the temporal boundaries l_{t+1} are the updated candidate video moment segments;
a Bayesian personalized ranking module, configured to match the temporal boundary l_t against the query sentence q, to output a matching score, and to return the matching score to the reinforcement learning agent as the reward r_t;
an adversarial learning module, configured to make the candidate set generation module and the Bayesian personalized ranking module reinforce each other through adversarial learning until convergence, obtaining the video moment segment l = (l_s, l_e) corresponding to the query sentence q, where l_s is the start time and l_e is the end time of the video moment;
wherein the Bayesian personalized ranking module is specifically configured to:
extract, the query sentence q being annotated with a ground-truth video moment τ = (τ_s, τ_e), the features of the query sentence q, of the temporal boundary l_t and of the ground-truth video moment τ, where τ_s is the annotated start time and τ_e is the annotated end time of the ground-truth video moment;
obtain, through a preset common space and the features of the query sentence q, of the temporal boundary l_t and of the ground-truth video moment τ, the mapping functions of the query sentence q, of the temporal boundary l_t and of the ground-truth video moment τ;
obtain, through element-wise multiplication, element-wise addition and concatenation, the combination of the mapping function of the query sentence q with the mapping function of the temporal boundary l_t, and the combination of the mapping function of the query sentence q with the mapping function of the ground-truth video moment τ;
output, according to the combination of the mapping function of the query sentence q with the mapping function of the temporal boundary l_t and the combination of the mapping function of the query sentence q with the mapping function of the ground-truth video moment τ, the matching score of the updated temporal boundary l_t with respect to the ground-truth video moment τ.
8. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 6 when executing the computer program.
9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
CN202010557372.XA 2020-06-18 2020-06-18 Antagonistic video time retrieval method and device based on reinforcement learning, computer equipment and storage medium Active CN111782870B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010557372.XA CN111782870B (en) 2020-06-18 2020-06-18 Antagonistic video time retrieval method and device based on reinforcement learning, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010557372.XA CN111782870B (en) 2020-06-18 2020-06-18 Antagonistic video time retrieval method and device based on reinforcement learning, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111782870A CN111782870A (en) 2020-10-16
CN111782870B true CN111782870B (en) 2021-11-30

Family

ID=72756759

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010557372.XA Active CN111782870B (en) 2020-06-18 2020-06-18 Antagonistic video time retrieval method and device based on reinforcement learning, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111782870B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112894809B (en) * 2021-01-18 2022-08-02 华中科技大学 Impedance controller design method and system based on reinforcement learning
CN113204674B (en) * 2021-07-05 2021-09-17 杭州一知智能科技有限公司 Video-paragraph retrieval method and system based on local-overall graph inference network
CN115757464B (en) * 2022-11-18 2023-07-25 中国科学院软件研究所 Intelligent materialized view query method based on deep reinforcement learning

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10698876B2 (en) * 2017-08-11 2020-06-30 Micro Focus Llc Distinguish phrases in displayed content
CN110751287B (en) * 2018-07-23 2024-02-20 第四范式(北京)技术有限公司 Training method and system and prediction method and system for neural network model
CN111241345A (en) * 2020-02-18 2020-06-05 腾讯科技(深圳)有限公司 Video retrieval method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN111782870A (en) 2020-10-16

Similar Documents

Publication Publication Date Title
CN111782870B (en) Antagonistic video time retrieval method and device based on reinforcement learning, computer equipment and storage medium
CN111581510B (en) Shared content processing method, device, computer equipment and storage medium
CN110866184B (en) Short video data label recommendation method and device, computer equipment and storage medium
US11144831B2 (en) Regularized neural network architecture search
CN109783655B (en) Cross-modal retrieval method and device, computer equipment and storage medium
US20210019599A1 (en) Adaptive neural architecture search
Yan et al. Video captioning using global-local representation
US20230316733A1 (en) Video behavior recognition method and apparatus, and computer device and storage medium
CN109919221B (en) Image description method based on bidirectional double-attention machine
CN111651671B (en) User object recommendation method, device, computer equipment and storage medium
CN112182154B (en) Personalized search model for eliminating keyword ambiguity by using personal word vector
US9189708B2 (en) Pruning and label selection in hidden markov model-based OCR
CN110929114A (en) Tracking digital dialog states and generating responses using dynamic memory networks
CN111782786B (en) Multi-model fusion question-answering method, system and medium for urban brain
CN112989212B (en) Media content recommendation method, device and equipment and computer storage medium
CN110750523A (en) Data annotation method, system, computer equipment and storage medium
WO2021030899A1 (en) Automated image retrieval with graph neural network
CN111512299A (en) Method for content search and electronic device thereof
Zhang et al. Emotion attention-aware collaborative deep reinforcement learning for image cropping
CN111783895A (en) Travel plan recommendation method and device based on neural network, computer equipment and storage medium
CN113051468B (en) Movie recommendation method and system based on knowledge graph and reinforcement learning
CN117170648A (en) Robot flow automation component recommendation method, device, equipment and storage medium
Ma et al. Deep unsupervised active learning on learnable graphs
CN113010717B (en) Image verse description generation method, device and equipment
US11902548B2 (en) Systems, methods and computer media for joint attention video processing

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant