CN111782870B - Antagonistic video time retrieval method and device based on reinforcement learning, computer equipment and storage medium

Antagonistic video time retrieval method and device based on reinforcement learning, computer equipment and storage medium

Info

Publication number
CN111782870B
CN111782870B (application number CN202010557372.XA)
Authority
CN
China
Prior art keywords
video
time
reinforcement learning
query statement
boundary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010557372.XA
Other languages
Chinese (zh)
Other versions
CN111782870A (en)
Inventor
曹达
曾雅文
荣辉桂
朱宁波
陈浩
秦拯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Original Assignee
Hunan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University filed Critical Hunan University
Priority to CN202010557372.XA priority Critical patent/CN111782870B/en
Publication of CN111782870A publication Critical patent/CN111782870A/en
Application granted granted Critical
Publication of CN111782870B publication Critical patent/CN111782870B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73Querying
    • G06F16/732Query formulation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/24155Bayesian classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Library & Information Science (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an adversarial video moment retrieval method based on reinforcement learning, together with a corresponding device, computer equipment and storage medium. A complete video and a query sentence are input as the environment information of a reinforcement learning agent. Query sentence features, global video features, location features and local video features are extracted from the environment information to form the state of the current candidate video moment segment. Based on this state, the reinforcement learning agent takes an action that moves the temporal boundary, receives a reward for executing the action, and, according to the reward, outputs several updated temporal boundaries and local video features; the updated temporal boundaries are the updated candidate video moment segments. A Bayesian personalized ranking method matches the temporal boundaries against the query sentence, outputs a matching score, and returns the score to the reinforcement learning agent as the reward. The two components reinforce each other through adversarial learning until convergence, yielding the video moment segment corresponding to the query sentence.

Description

Antagonistic video time retrieval method and device based on reinforcement learning, computer equipment and storage medium
[ technical field ]
The present invention relates to the field of video moment retrieval, and in particular to an adversarial video moment retrieval method and device based on reinforcement learning, as well as computer equipment and a storage medium.
[ background of the invention ]
Video retrieval aims at retrieving, from a set of candidate videos, the video most relevant to the semantics described by a query sentence. Given the rapid pace of modern life and the ever-growing amount of information, there is an urgent need to quickly find the information that best matches a user's actual needs. In the video domain in particular, people increasingly prefer to browse a short video moment that matches their interests rather than the entire video. To meet this need, the task of video moment retrieval under a language query has emerged; its goal is to locate the start and end points of the video moment most relevant to the semantics of the query sentence.
An existing video moment retrieval method, such as "video moment localization via language query", mainly comprises the following steps: 1. extract the features of the video segments and of the query sentence; 2. perform multi-modal fusion of the video segment features and the query sentence to obtain richer semantic information; 3. use a multi-layer perceptron to predict the matching score between video and sentence and a temporal offset. Given the query sentence, this method selects the best-matching video segment from a candidate set and adds a temporal offset, where the candidate set is generated by segmenting the video with a sliding-window strategy. To reach acceptable localization accuracy, however, this strategy usually requires dense segmentation, which makes the method time-consuming and unable to satisfy dynamic queries; it also needs video moments of varying rather than fixed length. On the other hand, although the temporal offset frees the localization from the size of the window, the prediction of the offset is not stable enough and can instead degrade the quality of the video segment returned for the query.
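To make the cost of this strategy concrete, the following sketch (Python; not taken from the cited work, and the window lengths and stride are hypothetical) enumerates sliding-window candidates and shows how quickly the candidate count grows under dense segmentation.

    def sliding_window_candidates(num_frames, window_lengths=(64, 128, 256), stride=16):
        # Enumerate (start, end) frame pairs for every window length and stride step.
        candidates = []
        for w in window_lengths:
            for start in range(0, max(num_frames - w, 0) + 1, stride):
                candidates.append((start, start + w))
        return candidates

    # A 10-minute video at 16 fps (~9600 frames) already yields roughly 1800
    # candidates, each of which must be scored against the query sentence.
    print(len(sliding_window_candidates(9600)))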
Another approach, "Read, Watch, and Move: reinforcement learning for temporally grounding natural language descriptions in videos", mainly comprises the following steps: 1. input the complete video and the query sentence as the environment of the reinforcement learning agent; 2. extract the global video features, the video segment features, the localization information of the video segment and the query text features to form the state at the current step; 3. have the reinforcement learning agent output a movement action for the localization boundary according to the current state, and repeat this until the localization gradually converges. This was the first work to introduce reinforcement learning into video moment localization; it removes the dependence on sliding-window candidates and achieves more accurate localization. However, the design of the agent's reward has not been explored much. Existing reinforcement learning-based methods compute the reward from the intersection-over-union (IoU) before and after each boundary movement; this reward carries no semantic information, and its fixed values lead to slow and unstable convergence of the model.
In summary, existing methods for video moment retrieval fall mainly into two categories: ranking methods based on a sliding-window candidate set, and localization methods based on reinforcement learning. The ranking methods segment the video in advance with a sliding-window strategy to generate a candidate set, match the candidates against the query text, and rank them by matching score to obtain the result. This clearly produces too many segments and takes too long, so researchers introduced reinforcement learning and abstracted the problem as a sequential decision problem that directly localizes the start and end frames of the video moment. Although this also achieves good results, the reward design of the agent has not been explored much, and these methods are not stable.
Both categories have advantages and disadvantages: the ranking methods are good at ranking multiple candidate moments, but they cannot form a reasonably small candidate set and therefore cost too much time; the localization methods use a reinforcement learning agent to control the boundary localization, but they cannot be applied to large-scale retrieval scenarios and their efficiency is low.
Therefore, there is a need to provide an improved video moment retrieval method to solve the above problems.
[ summary of the invention ]
The present invention overcomes the deficiencies of the prior art and provides an adversarial video moment retrieval method and device based on reinforcement learning, as well as computer equipment and a storage medium.
To achieve this purpose, the invention adopts the following technical solution: an adversarial video moment retrieval method based on reinforcement learning is provided, comprising the following steps:
S1: inputting a complete video v and a query sentence q as the environment information of a reinforcement learning agent;
S2: extracting from the environment information the query sentence feature f_q, the global video feature f_g, the location feature l_t and the local video feature f_l^t corresponding to the location feature l_t, and forming the state of the current candidate video moment segment s_t = [f_q, f_g, l_t, f_l^t], where t is the time step and the location feature l_t is the initial temporal boundary;
S3: the reinforcement learning agent, according to the state s_t, taking an action a_t that moves the temporal boundary l_t, obtaining the reward r_t for executing the action a_t, and, according to the reward r_t, outputting several updated temporal boundaries l_{t+1} and the local video features f_l^{t+1} corresponding to the temporal boundaries l_{t+1}, thereby reconstructing the state s' of the current video moment segment; at this point the temporal boundaries l_{t+1} are the updated candidate video moment segments;
S4: matching the temporal boundary l_t against the query sentence q with a Bayesian personalized ranking method, outputting a matching score, and returning the matching score to the reinforcement learning agent as the reward r_t;
S5: the reinforcement learning agent and the Bayesian personalized ranking method reinforcing each other through adversarial learning until convergence, obtaining the video moment segment l = (l_s, l_e) corresponding to the query sentence q, where l_s is the start time and l_e is the end time of the video moment.
Preferably, step S3 further includes: updating the reinforcement learning agent with a deep deterministic policy gradient (DDPG) algorithm to output several updated temporal boundaries l_{t+1}. The deep deterministic policy gradient algorithm consists of a critic network, an actor network, a lagged (target) critic network and a lagged (target) actor network, wherein the critic network judges from the reward r_t whether the action a_t is the optimal action, the actor network performs the optimal action to obtain the updated temporal boundary l_{t+1}, and the lagged critic network and the lagged actor network update their parameters by a soft update method.
Preferably, the critic network learns the action-value function Q(s,a) corresponding to the optimal policy π by minimizing the loss function L:
L(ω) = E_{s,a,r,s'∼M} [ ( Q(s,a|ω) − ( r + γ max_{a'} Q*(s',a'|ω*) ) )² ]
where Q(s,a) is the action-value function of the critic network, ω is its trainable parameter, γ is the discount factor of the action-value function Q(s,a) that balances the reward r_t against the estimated value of Q(s,a), Q* is the preset lagged (target) network and ω* is its parameter. The tuples [s, a, r, s'] are sampled from the memory (replay) buffer M so that past experience can guide learning; s is the state of the video moment segment before the update, a is the action before the update, and a' is the updated action. The reinforcement learning agent obtains the maximum reward when the action-value function Q(s,a) best approximates the optimal policy π.
Preferably, the actor network performs the action a = π(s; θ) to update the temporal boundary l_t; a derivative along the increasing direction of the action-value function Q(s,a) is obtained through the objective J so that Q(s,a) is maximized, and the resulting policy gradient is:
∇_θ J ≈ E_{s∼M} [ ∇_a Q(s,a|ω) |_{a=μ(s;θ)} ∇_θ μ(s;θ) ]
where μ is the deterministic policy and θ is the parameter of the deterministic policy μ.
Preferably, step S4 includes:
S41: the query sentence q is annotated with a ground-truth video moment τ = (τ_s, τ_e); extract the features of the query sentence q, of the temporal boundary l_t and of the ground-truth video moment τ, where τ_s is the annotated start time and τ_e is the annotated end time of the ground-truth video moment;
S42: through a preset common space and the features of the query sentence q, of the temporal boundary l_t and of the ground-truth video moment τ, obtain the mapping functions of the query sentence q, of the temporal boundary l_t and of the ground-truth video moment τ;
S43: through element-wise multiplication, element-wise addition and concatenation, obtain the combination of the mapping function of the query sentence q with the mapping function of the temporal boundary l_t, and the combination of the mapping function of the query sentence q with the mapping function of the ground-truth video moment τ;
S44: according to the combination of the mapping function of the query sentence q with the mapping function of the temporal boundary l_t, and the combination of the mapping function of the query sentence q with the mapping function of the ground-truth video moment τ, output the matching score of the updated temporal boundary l_t with respect to the ground-truth video moment τ.
Preferably, step S5 includes:
S51: compute the intersection-over-union of the temporal boundary and the ground-truth video moment τ;
S52: obtain a joint loss function from the intersection-over-union and the combined mapping function of the query sentence q and the temporal boundary l_t;
S53: combine the loss of the Bayesian personalized ranking method with the joint loss function to obtain the maximum reward r;
S54: the reinforcement learning agent outputs the temporal boundary (l_s, l_e) that achieves the maximum reward.
Preferably, the parameter θ of the reinforcement learning agent and the parameters of the Bayesian personalized ranking method are updated according to the following formula:

(update formula given as an image in the original publication)

where K is the total number of updated temporal boundaries and L_sc is the loss combining the mapping function of the query sentence q and the mapping function of the ground-truth video moment τ.
The invention also provides an adversarial video moment retrieval device based on reinforcement learning, comprising:
an input module, configured to input a complete video v and a query sentence q as the environment information of a reinforcement learning agent;
a feature extraction module, configured to extract from the environment information the query sentence feature f_q, the global video feature f_g, the location feature l_t and the local video feature f_l^t corresponding to the location feature l_t, and to form the state of the current candidate video moment segment s_t = [f_q, f_g, l_t, f_l^t], where t is the time step and the location feature l_t is the initial temporal boundary;
a candidate set generation module, configured to take, according to the state s_t, an action a_t that moves the temporal boundary l_t, to obtain the reward r_t for executing the action a_t, and, according to the reward r_t, to output several updated temporal boundaries l_{t+1} and the corresponding local video features f_l^{t+1}, reconstructing the state s' of the current video moment segment; at this point the temporal boundaries l_{t+1} are the updated candidate video moment segments;
a Bayesian personalized ranking module, configured to match the temporal boundary l_t against the query sentence q, to output a matching score, and to return the matching score to the reinforcement learning agent as the reward r_t;
an adversarial learning module, configured to make the candidate set generation module and the Bayesian personalized ranking module reinforce each other through adversarial learning until convergence, obtaining the video moment segment l = (l_s, l_e) corresponding to the query sentence q, where l_s is the start time and l_e is the end time of the video moment.
A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor when executing the computer program implements the steps of:
inputting a complete video v and a query sentence q as the environment information of a reinforcement learning agent;
extracting from the environment information the query sentence feature f_q, the global video feature f_g, the location feature l_t and the local video feature f_l^t corresponding to the location feature l_t, and forming the state of the current candidate video moment segment s_t = [f_q, f_g, l_t, f_l^t], where t is the time step and the location feature l_t is the initial temporal boundary;
the reinforcement learning agent, according to the state s_t, taking an action a_t that moves the temporal boundary l_t, obtaining the reward r_t for executing the action a_t, and, according to the reward r_t, outputting several updated temporal boundaries l_{t+1} and the corresponding local video features f_l^{t+1}, thereby reconstructing the state s' of the current video moment segment, at which point the temporal boundaries l_{t+1} are the updated candidate video moment segments;
matching the temporal boundary l_t against the query sentence q with a Bayesian personalized ranking method, outputting a matching score, and returning the matching score to the reinforcement learning agent as the reward r_t;
the reinforcement learning agent and the Bayesian personalized ranking method reinforcing each other through adversarial learning until convergence, obtaining the video moment segment l = (l_s, l_e) corresponding to the query sentence q, where l_s is the start time and l_e is the end time of the video moment.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
inputting a complete video v and a query sentence q as the environment information of a reinforcement learning agent;
extracting from the environment information the query sentence feature f_q, the global video feature f_g, the location feature l_t and the local video feature f_l^t corresponding to the location feature l_t, and forming the state of the current candidate video moment segment s_t = [f_q, f_g, l_t, f_l^t], where t is the time step and the location feature l_t is the initial temporal boundary;
the reinforcement learning agent, according to the state s_t, taking an action a_t that moves the temporal boundary l_t, obtaining the reward r_t for executing the action a_t, and, according to the reward r_t, outputting several updated temporal boundaries l_{t+1} and the corresponding local video features f_l^{t+1}, thereby reconstructing the state s' of the current video moment segment, at which point the temporal boundaries l_{t+1} are the updated candidate video moment segments;
matching the temporal boundary l_t against the query sentence q with a Bayesian personalized ranking method, outputting a matching score, and returning the matching score to the reinforcement learning agent as the reward r_t;
the reinforcement learning agent and the Bayesian personalized ranking method reinforcing each other through adversarial learning until convergence, obtaining the video moment segment l = (l_s, l_e) corresponding to the query sentence q, where l_s is the start time and l_e is the end time of the video moment.
Compared with the prior art, the adversarial video moment retrieval method, device, computer equipment and storage medium based on reinforcement learning have the following beneficial effects: by combining the reinforcement learning localization method with the Bayesian personalized ranking method, the ranking-based component obtains a small but reasonable candidate set, while the reinforcement-learning localization component obtains a more flexible reward function and more stable convergence; the ranking and localization components then reinforce each other within an adversarial learning framework, returning more accurate video moment segments and effectively improving the accuracy and speed of a user's query and retrieval.
[ description of the drawings ]
FIG. 1 is a flowchart of the adversarial video moment retrieval method based on reinforcement learning provided by the present invention;
FIG. 2 is a schematic diagram illustrating the principle of the adversarial video moment retrieval method based on reinforcement learning provided by the present invention;
FIG. 3 is a sub-flowchart of step S4 in FIG. 1;
FIG. 4 is a sub-flowchart of step S5 in FIG. 1;
FIG. 5 is a functional block diagram of the adversarial video moment retrieval apparatus based on reinforcement learning provided by the present invention;
FIG. 6 is a diagram of the internal structure of the computer device provided by the present invention.
[ detailed description of the embodiments ]
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to FIG. 1 and FIG. 2, the present invention provides an adversarial video moment retrieval method based on reinforcement learning, which includes the following steps:
S1: input the complete video v and the query sentence q as the environment information of the reinforcement learning agent.
S2: extract from the environment information the query sentence feature f_q, the global video feature f_g, the location feature l_t and the local video feature f_l^t corresponding to the location feature l_t, and form the state of the current candidate video moment segment s_t = [f_q, f_g, l_t, f_l^t], where t is the time step and the location feature l_t is the initial temporal boundary.
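As a minimal sketch of step S2 (Python, with assumed feature shapes and a mean-pooling choice that the patent does not prescribe), the state can be assembled as follows.

    import numpy as np

    def build_state(f_q, f_g, frame_features, boundary, num_frames):
        # location feature l_t: the current boundary normalised by the video length
        l_t = np.array([boundary[0] / num_frames, boundary[1] / num_frames])
        # local video feature f_l^t: pool the frame features inside the boundary
        # (mean pooling is one possible choice, assumed here)
        f_l_t = frame_features[int(boundary[0]):int(boundary[1])].mean(axis=0)
        # state s_t = [f_q, f_g, l_t, f_l^t]
        return np.concatenate([f_q, f_g, l_t, f_l_t])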
S3: the reinforcement learning agent is based on the state stMaking at the timing boundary ItMovement action atObtaining to execute the action atIs awarded rtAnd according to the reward rtOutputting a plurality of updated timing boundaries It+1And the timing boundary It+1Corresponding local video feature fI t+1Reconstructing the state s' of the current video time slice, at which time the temporal boundary I is presentt+1And the current video time candidate segment is updated.
The action space A_e of the reinforcement learning agent consists of 7 predefined actions: moving both the start point and the end point forwards, moving both the start point and the end point backwards, moving only the start point or only the end point forwards or backwards, and stopping the movement.
Specifically, the initial position of the reinforcement learning agent is set to l_0 = [0.25*h, 0.75*h], where h is the total number of image frames in the complete video v. The per-step movement size of the action a_t is set to h/(2e), where e is a hyper-parameter that defines the maximum number of search steps of the reinforcement learning agent; this ensures that the complete video v can be traversed within the maximum number of steps.
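The following sketch (Python; the clamping of the boundary to [0, h] is an assumption) illustrates the 7-action space and the per-step boundary update of h/(2e) described above.

    ACTIONS = [
        "both_forward", "both_backward",
        "start_forward", "start_backward",
        "end_forward", "end_backward",
        "stop",
    ]

    def apply_action(boundary, action, h, e):
        start, end = boundary
        delta = h / (2 * e)  # per-step movement size
        if action == "both_forward":
            start, end = start + delta, end + delta
        elif action == "both_backward":
            start, end = start - delta, end - delta
        elif action == "start_forward":
            start += delta
        elif action == "start_backward":
            start -= delta
        elif action == "end_forward":
            end += delta
        elif action == "end_backward":
            end -= delta
        # "stop" leaves the boundary unchanged and ends the search episode
        start = min(max(start, 0.0), h)
        end = min(max(end, 0.0), h)
        return (start, end)

    initial_boundary = (0.25 * 1000, 0.75 * 1000)  # l_0 for a video of h = 1000 frames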
In this embodiment, the reinforcement learning agent is updated with a deep deterministic policy gradient (DDPG) algorithm to output several updated temporal boundaries l_{t+1}. The deep deterministic policy gradient algorithm consists of a critic network, an actor network, a lagged (target) critic network and a lagged (target) actor network, wherein the critic network judges from the reward r_t whether the action a_t is the optimal action, the actor network performs the optimal action to obtain the updated temporal boundary l_{t+1}, and the lagged critic network and the lagged actor network update their parameters by a soft update method.
It should be noted that the deep deterministic policy gradient algorithm uses deep neural networks for function approximation and makes effective use of experience replay and dual lagged target networks. The critic network learns the action-value function Q(s,a) corresponding to the optimal policy π by minimizing the loss function L:
L(ω) = E_{s,a,r,s'∼M} [ ( Q(s,a|ω) − ( r + γ max_{a'} Q*(s',a'|ω*) ) )² ]
where Q(s,a) is the action-value function of the critic network, ω is its trainable parameter, γ is the discount factor of the action-value function Q(s,a) that balances the reward r_t against the estimated value of Q(s,a), Q* is the preset lagged (target) network and ω* is its parameter. The tuples [s, a, r, s'] are sampled from the memory (replay) buffer M so that past experience can guide learning; s is the state of the video moment segment before the update, a is the action before the update, and a' is the updated action. The reinforcement learning agent obtains the maximum reward when the action-value function Q(s,a) best approximates the optimal policy π.
The actor network performs the action a = π(s; θ) to update the temporal boundary l_t; a derivative along the increasing direction of the action-value function Q(s,a) is obtained through the objective J so that Q(s,a) is maximized, and the resulting policy gradient is:
∇_θ J ≈ E_{s∼M} [ ∇_a Q(s,a|ω) |_{a=μ(s;θ)} ∇_θ μ(s;θ) ]
where μ is the deterministic policy and θ is its parameter; the actor network maximizes the action-value function Q(s,a) by directly adjusting θ.
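A condensed sketch of one DDPG update, assuming PyTorch and a critic that takes (state, action) pairs; in DDPG the maximum over a' in the critic target is realised by evaluating the lagged critic at the lagged actor's action.

    import torch
    import torch.nn as nn

    def ddpg_update(actor, critic, actor_target, critic_target,
                    actor_opt, critic_opt, batch, gamma=0.99, tau=0.005):
        s, a, r, s_next = batch  # tensors sampled from the replay memory M

        # critic: minimise (Q(s,a|w) - (r + gamma * Q*(s', pi*(s'))))^2
        with torch.no_grad():
            target = r + gamma * critic_target(s_next, actor_target(s_next))
        critic_loss = nn.functional.mse_loss(critic(s, a), target)
        critic_opt.zero_grad()
        critic_loss.backward()
        critic_opt.step()

        # actor: follow the gradient that increases Q(s, pi(s; theta))
        actor_loss = -critic(s, actor(s)).mean()
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()

        # soft update of the lagged (target) networks
        for net, lagged in ((actor, actor_target), (critic, critic_target)):
            for p, p_lag in zip(net.parameters(), lagged.parameters()):
                p_lag.data.mul_(1.0 - tau).add_(tau * p.data)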
S4: match the temporal boundary l_t against the query sentence q with a Bayesian personalized ranking method, output a matching score, and return the matching score to the reinforcement learning agent as the reward r_t.
Referring to FIG. 3, step S4 includes the following steps:
S41: the query sentence q is annotated with a ground-truth video moment τ = (τ_s, τ_e); extract the features of the query sentence q, of the temporal boundary l_t and of the ground-truth video moment τ, denoted f_q, f_l and f_τ respectively, where τ_s is the annotated start time and τ_e is the annotated end time of the ground-truth video moment;
S42: through a preset common space and the features of the query sentence q, of the temporal boundary l_t and of the ground-truth video moment τ, obtain the mapping functions of the query sentence q, of the temporal boundary l_t and of the ground-truth video moment τ.
Specifically, under the constraint of semantic consistency, f_q, f_l and f_τ are projected into the common space, so that the different modalities are regularized and the retrieval performance is effectively improved:

(projection formula given as an image in the original publication)

where o_v and o_l are projection functions approximated by multi-layer perceptrons, and the projected features have the same dimensions. In the common space, under the constraint of semantic consistency, the representations of the different modalities are forced to approach each other:

(semantic-consistency constraint given as an image in the original publication)
S43: through element-wise multiplication, element-wise addition and concatenation, the mapping function of the query sentence q is combined with the mapping function of the temporal boundary l_t, and the mapping function of the query sentence q is combined with the mapping function of the ground-truth video moment τ, as follows:

(combination formulas given as images in the original publication)
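A sketch of steps S42–S43 under assumed dimensions (PyTorch): two small MLPs project the language and video features into a common space, and each query/moment pair is then fused by element-wise multiplication, element-wise addition and concatenation.

    import torch
    import torch.nn as nn

    class CommonSpace(nn.Module):
        def __init__(self, dim_q, dim_v, dim_c=256):
            super().__init__()
            self.o_l = nn.Sequential(nn.Linear(dim_q, dim_c), nn.ReLU())  # language projection
            self.o_v = nn.Sequential(nn.Linear(dim_v, dim_c), nn.ReLU())  # video projection

        def fuse(self, q_proj, v_proj):
            # element-wise multiplication, element-wise addition, concatenation
            return torch.cat([q_proj * v_proj, q_proj + v_proj, q_proj, v_proj], dim=-1)

        def forward(self, f_q, f_l, f_tau):
            q_proj = self.o_l(f_q)
            cand_proj = self.o_v(f_l)    # candidate boundary features
            gt_proj = self.o_v(f_tau)    # ground-truth moment features
            return self.fuse(q_proj, cand_proj), self.fuse(q_proj, gt_proj)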
S44: according to the combination of the mapping function of the query sentence q with the mapping function of the temporal boundary l_t, and the combination of the mapping function of the query sentence q with the mapping function of the ground-truth video moment τ, output the matching score of the updated temporal boundary l_t with respect to the ground-truth video moment τ.
Here the matching degree between the ground-truth video moment τ and the query sentence q should be higher than the matching degree between the temporal boundary l_t and the query sentence q, and the optimization objective is:

(ranking objective given as an image in the original publication)

where σ is the sigmoid activation function, o_s is the score approximated by a multi-layer perceptron, and Δ is a hyper-parameter controlling the margin between the two. Through this objective the matching score of a positive pair becomes greater than that of a negative pair, effectively distinguishing the ground-truth video moment τ from the temporal boundary l_t; the positive pair refers to the ground-truth video moment τ and the query sentence q, and the negative pair refers to the temporal boundary l_t and the query sentence q.
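A minimal sketch of this ranking objective (PyTorch), assuming score_positive and score_negative are the scalar outputs of the scorer o_s for the positive and negative pairs.

    import torch

    def bpr_loss(score_positive, score_negative, delta=0.1):
        # -log sigmoid(pos - neg - delta): minimising this pushes the positive
        # pair's score above the negative pair's score by at least the margin delta
        return -torch.log(torch.sigmoid(score_positive - score_negative - delta)).mean()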
S5: the reinforcement learning agent and the Bayes individual ordering method mutually reinforce through counterwork learning until convergence to obtain the corresponding query languageVideo time segment I of sentence q ═ I (I)s,Ie)。
Referring to fig. 4, in step S5, the method includes the following steps:
S51: compute the intersection-over-union (IoU) of the temporal boundary l_t and the ground-truth video moment τ:

IoU(l_t, τ) = |l_t ∩ τ| / |l_t ∪ τ|
S52: obtain a joint loss function from the IoU and the combined mapping function of the query sentence q and the temporal boundary l_t;
S53: combine the loss of the Bayesian personalized ranking method with the joint loss function to obtain the maximum reward r:
r = −L_bpr − λ_s·L_sc − λ_j·L_joint

(the definition of the joint loss L_joint is given as an image in the original publication)
S54: the reinforcement learning agent outputs the temporal boundary (l_s, l_e) that achieves the maximum reward, where l_s is the start time and l_e is the end time of the video moment.
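A sketch of steps S51–S53 under the stated assumptions: the temporal IoU is computed directly from the boundaries, the joint loss is taken as 1 − IoU (an assumed form), and the weights lambda_s and lambda_j are assumed hyper-parameters.

    def temporal_iou(candidate, ground_truth):
        # candidate and ground_truth are (start, end) pairs in frames or seconds
        inter = max(0.0, min(candidate[1], ground_truth[1]) - max(candidate[0], ground_truth[0]))
        union = max(candidate[1], ground_truth[1]) - min(candidate[0], ground_truth[0])
        return inter / union if union > 0 else 0.0

    def reward(l_bpr, l_sc, candidate, ground_truth, lambda_s=1.0, lambda_j=1.0):
        l_joint = 1.0 - temporal_iou(candidate, ground_truth)  # assumed joint-loss form
        # the agent is rewarded for driving all three losses down
        return -(l_bpr + lambda_s * l_sc + lambda_j * l_joint)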
In this embodiment, the parameter θ of the reinforcement learning agent and the parameters of the Bayesian personalized ranking method are updated according to the following formula:

(update formula given as an image in the original publication)

where K is the total number of updated temporal boundaries l_t and L_sc is the loss combining the mapping function of the query sentence q and the mapping function of the ground-truth video moment τ.
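A high-level sketch of the adversarial training scheme in step S5; all component names (agent, ranker and their methods) are assumed placeholders standing in for the sketches above.

    def train_adversarially(agent, ranker, dataset, num_epochs, max_steps):
        for _ in range(num_epochs):
            for video, query, ground_truth in dataset:
                boundary = agent.initial_boundary(video)            # l_0 = [0.25h, 0.75h]
                for _ in range(max_steps):
                    action = agent.act(video, query, boundary)      # actor picks a move
                    boundary = agent.step(boundary, action)         # update the boundary
                    score = ranker.match(query, boundary, ground_truth)
                    agent.learn(score)                              # DDPG update with the ranker's score as reward
                    if action == "stop":
                        break
                ranker.update(query, boundary, ground_truth)        # BPR update of the discriminator
        return agent, ranker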
According to the adversarial video moment retrieval method based on reinforcement learning, by combining reinforcement learning localization with the Bayesian personalized ranking method, the ranking-based component obtains a small but reasonable candidate set, while the reinforcement-learning-based component obtains a more flexible reward function and more stable convergence; the ranking and localization components then reinforce each other within the adversarial learning framework, returning more accurate video moment segments and effectively improving the accuracy and speed of a user's query and retrieval.
It should be understood that although the steps in the flowcharts of FIG. 1, FIG. 3 and FIG. 4 are shown in an order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in FIG. 1, FIG. 3 and FIG. 4 may include multiple sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times; these sub-steps or stages are not necessarily performed sequentially, but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, referring to FIG. 5, an adversarial video moment retrieval apparatus based on reinforcement learning is provided, and the apparatus includes:
an input module 100, configured to input a complete video v and a query sentence q as the environment information of a reinforcement learning agent;
a feature extraction module 200, configured to extract from the environment information the query sentence feature f_q, the global video feature f_g, the location feature l_t and the local video feature f_l^t corresponding to the location feature l_t, and to form the state of the current candidate video moment segment s_t = [f_q, f_g, l_t, f_l^t], where t is the time step and the location feature l_t is the initial temporal boundary;
a candidate set generation module 300, configured to take, according to the state s_t, an action a_t that moves the temporal boundary l_t, to obtain the reward r_t for executing the action a_t, and, according to the reward r_t, to output several updated temporal boundaries l_{t+1} and the corresponding local video features f_l^{t+1}, reconstructing the state s' of the current video moment segment, at which point the temporal boundaries l_{t+1} are the updated candidate video moment segments;
a Bayesian personalized ranking module 400, configured to match the temporal boundary l_t against the query sentence q, to output a matching score, and to return the matching score to the reinforcement learning agent as the reward r_t;
an adversarial learning module 500, configured to make the candidate set generation module 300 and the Bayesian personalized ranking module 400 reinforce each other through adversarial learning until convergence, obtaining the video moment segment l = (l_s, l_e) corresponding to the query sentence q, where l_s is the start time and l_e is the end time of the video moment.
For the specific limitations of the adversarial video moment retrieval device, reference may be made to the limitations of the adversarial video moment retrieval method above, which are not repeated here. The modules in the adversarial video moment retrieval device may be implemented wholly or partially by software, hardware or a combination thereof. The modules may be embedded in or independent of a processor of the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke them and execute the operations corresponding to the modules.
In this embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in FIG. 6. The computer device includes a processor, a memory, a network interface and a database connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer program, when executed by the processor, implements the adversarial video moment retrieval method based on reinforcement learning.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware associated with instructions of a computer program, which can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, databases, or other media used in the embodiments provided herein may include non-volatile and/or volatile memory, among others. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDRSDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), direct bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above embodiments and drawings are not intended to limit the form and style of the present invention, and any suitable changes or modifications thereof by those skilled in the art should be considered as not departing from the scope of the present invention.

Claims (9)

1. An adversarial video moment retrieval method based on reinforcement learning, characterized by comprising the following steps:
S1: inputting a complete video v and a query sentence q as the environment information of a reinforcement learning agent;
S2: extracting from the environment information the query sentence feature f_q, the global video feature f_g, the location feature l_t and the local video feature f_l^t corresponding to the location feature l_t, and forming the state of the current candidate video moment segment s_t = [f_q, f_g, l_t, f_l^t], where t is the time step and the location feature l_t is the initial temporal boundary;
S3: the reinforcement learning agent, according to the state s_t, taking an action a_t that moves the temporal boundary l_t, obtaining the reward r_t for executing the action a_t, and, according to the reward r_t, outputting several updated temporal boundaries l_{t+1} and the local video features f_l^{t+1} corresponding to the temporal boundaries l_{t+1}, thereby reconstructing the state s' of the current video moment segment, at which point the temporal boundaries l_{t+1} are the updated candidate video moment segments;
S4: matching the temporal boundary l_t against the query sentence q with a Bayesian personalized ranking method, outputting a matching score, and returning the matching score to the reinforcement learning agent as the reward r_t;
S5: the reinforcement learning agent and the Bayesian personalized ranking method reinforcing each other through adversarial learning until convergence, obtaining the video moment segment l = (l_s, l_e) corresponding to the query sentence q, where l_s is the start time and l_e is the end time of the video moment;
wherein step S4 includes:
S41: the query sentence q being annotated with a ground-truth video moment τ = (τ_s, τ_e), extracting the features of the query sentence q, of the temporal boundary l_t and of the ground-truth video moment τ, where τ_s is the annotated start time and τ_e is the annotated end time of the ground-truth video moment;
S42: through a preset common space and the features of the query sentence q, of the temporal boundary l_t and of the ground-truth video moment τ, obtaining the mapping functions of the query sentence q, of the temporal boundary l_t and of the ground-truth video moment τ;
S43: obtaining, through element-wise multiplication, element-wise addition and concatenation, the combination of the mapping function of the query sentence q with the mapping function of the temporal boundary l_t, and the combination of the mapping function of the query sentence q with the mapping function of the ground-truth video moment τ;
S44: according to the combination of the mapping function of the query sentence q with the mapping function of the temporal boundary l_t, and the combination of the mapping function of the query sentence q with the mapping function of the ground-truth video moment τ, outputting the matching score of the updated temporal boundary l_t with respect to the ground-truth video moment τ.
2. The adversarial video moment retrieval method based on reinforcement learning of claim 1, wherein step S3 further includes: updating the reinforcement learning agent with a deep deterministic policy gradient algorithm to output several updated temporal boundaries l_{t+1}, the deep deterministic policy gradient algorithm consisting of a critic network, an actor network, a lagged (target) critic network and a lagged (target) actor network, wherein the critic network judges from the reward r_t whether the action a_t is the optimal action, the actor network performs the optimal action to obtain the updated temporal boundary l_{t+1}, and the lagged critic network and the lagged actor network update their parameters by a soft update method.
3. The adversarial video moment retrieval method based on reinforcement learning of claim 2, wherein the critic network learns the action-value function Q(s,a) corresponding to the optimal policy π by minimizing the loss function L:

L(ω) = E_{s,a,r,s'∼M} [ ( Q(s,a|ω) − ( r + γ max_{a'} Q*(s',a'|ω*) ) )² ]

where Q(s,a) is the action-value function of the critic network, ω is its trainable parameter, γ is the discount factor of the action-value function Q(s,a) that balances the reward r_t against the estimated value of Q(s,a), Q* is the preset lagged (target) network and ω* is its parameter; the tuples [s, a, r, s'] are sampled from the memory (replay) buffer M so that past experience can guide learning, s is the state of the video moment segment before the update, a is the action before the update, and a' is the updated action; the reinforcement learning agent obtains the maximum reward when the action-value function Q(s,a) best approximates the optimal policy π.
4. The adversarial video moment retrieval method based on reinforcement learning of claim 3, wherein the actor network performs the action a = π(s; θ) to update the temporal boundary l_t; a derivative along the increasing direction of the action-value function Q(s,a) is obtained through the objective J so that Q(s,a) is maximized, and the resulting policy gradient is:

∇_θ J ≈ E_{s∼M} [ ∇_a Q(s,a|ω) |_{a=μ(s;θ)} ∇_θ μ(s;θ) ]

where μ is the deterministic policy and θ is the parameter of the deterministic policy μ.
5. The adversarial video moment retrieval method based on reinforcement learning of claim 3, wherein step S5 includes:
S51: computing the intersection-over-union of the temporal boundary and the ground-truth video moment τ;
S52: obtaining a joint loss function from the intersection-over-union and the combined mapping function of the query sentence q and the temporal boundary l_t;
S53: combining the loss of the Bayesian personalized ranking method with the joint loss function to obtain the maximum reward r;
S54: the reinforcement learning agent outputting the temporal boundary (l_s, l_e) that achieves the maximum reward, where l_s is the start time and l_e is the end time of the video moment.
6. The adversarial video moment retrieval method based on reinforcement learning of claim 1, wherein the parameter θ of the reinforcement learning agent and the parameters of the Bayesian personalized ranking method are updated according to the following formula:

(update formula given as an image in the original publication)

where K is the total number of updated temporal boundaries and L_sc is the loss combining the mapping function of the query sentence q and the mapping function of the ground-truth video moment τ.
7. An adversarial video moment retrieval apparatus based on reinforcement learning, characterized in that the apparatus comprises:
an input module, configured to input a complete video v and a query sentence q as the environment information of a reinforcement learning agent;
a feature extraction module, configured to extract from the environment information the query sentence feature f_q, the global video feature f_g, the location feature l_t and the local video feature f_l^t corresponding to the location feature l_t, and to form the state of the current candidate video moment segment s_t = [f_q, f_g, l_t, f_l^t], where t is the time step and the location feature l_t is the initial temporal boundary;
a candidate set generation module, configured to take, according to the state s_t, an action a_t that moves the temporal boundary l_t, to obtain the reward r_t for executing the action a_t, and, according to the reward r_t, to output several updated temporal boundaries l_{t+1} and the corresponding local video features f_l^{t+1}, reconstructing the state s' of the current video moment segment, at which point the temporal boundaries l_{t+1} are the updated candidate video moment segments;
a Bayesian personalized ranking module, configured to match the temporal boundary l_t against the query sentence q, to output a matching score, and to return the matching score to the reinforcement learning agent as the reward r_t;
an adversarial learning module, configured to make the candidate set generation module and the Bayesian personalized ranking module reinforce each other through adversarial learning until convergence, obtaining the video moment segment l = (l_s, l_e) corresponding to the query sentence q, where l_s is the start time and l_e is the end time of the video moment;
wherein the Bayesian personalized ranking module is specifically configured to:
extract, the query sentence q being annotated with a ground-truth video moment τ = (τ_s, τ_e), the features of the query sentence q, of the temporal boundary l_t and of the ground-truth video moment τ, where τ_s is the annotated start time and τ_e is the annotated end time of the ground-truth video moment;
obtain, through a preset common space and the features of the query sentence q, of the temporal boundary l_t and of the ground-truth video moment τ, the mapping functions of the query sentence q, of the temporal boundary l_t and of the ground-truth video moment τ;
obtain, through element-wise multiplication, element-wise addition and concatenation, the combination of the mapping function of the query sentence q with the mapping function of the temporal boundary l_t, and the combination of the mapping function of the query sentence q with the mapping function of the ground-truth video moment τ;
output, according to the combination of the mapping function of the query sentence q with the mapping function of the temporal boundary l_t and the combination of the mapping function of the query sentence q with the mapping function of the ground-truth video moment τ, the matching score of the updated temporal boundary l_t with respect to the ground-truth video moment τ.
8. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 6 when executing the computer program.
9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
CN202010557372.XA 2020-06-18 2020-06-18 Antagonistic video time retrieval method and device based on reinforcement learning, computer equipment and storage medium Active CN111782870B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010557372.XA CN111782870B (en) 2020-06-18 2020-06-18 Antagonistic video time retrieval method and device based on reinforcement learning, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010557372.XA CN111782870B (en) 2020-06-18 2020-06-18 Antagonistic video time retrieval method and device based on reinforcement learning, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111782870A CN111782870A (en) 2020-10-16
CN111782870B true CN111782870B (en) 2021-11-30

Family

ID=72756759

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010557372.XA Active CN111782870B (en) 2020-06-18 2020-06-18 Antagonistic video time retrieval method and device based on reinforcement learning, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111782870B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112894809B (en) * 2021-01-18 2022-08-02 华中科技大学 Impedance controller design method and system based on reinforcement learning
CN113204674B (en) * 2021-07-05 2021-09-17 杭州一知智能科技有限公司 Video-paragraph retrieval method and system based on local-overall graph inference network
CN115757464B (en) * 2022-11-18 2023-07-25 中国科学院软件研究所 Intelligent materialized view query method based on deep reinforcement learning

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10698876B2 (en) * 2017-08-11 2020-06-30 Micro Focus Llc Distinguish phrases in displayed content
CN110751287B (en) * 2018-07-23 2024-02-20 第四范式(北京)技术有限公司 Training method and system and prediction method and system for neural network model
CN111241345A (en) * 2020-02-18 2020-06-05 腾讯科技(深圳)有限公司 Video retrieval method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN111782870A (en) 2020-10-16

Similar Documents

Publication Publication Date Title
CN111782870B (en) Antagonistic video time retrieval method and device based on reinforcement learning, computer equipment and storage medium
CN111581510B (en) Shared content processing method, device, computer equipment and storage medium
CN110866184B (en) Short video data label recommendation method and device, computer equipment and storage medium
US11144831B2 (en) Regularized neural network architecture search
CN109783655B (en) Cross-modal retrieval method and device, computer equipment and storage medium
US20210019599A1 (en) Adaptive neural architecture search
Yan et al. Video captioning using global-local representation
US20230316733A1 (en) Video behavior recognition method and apparatus, and computer device and storage medium
CN109919221B (en) Image description method based on bidirectional double-attention machine
CN111651671B (en) User object recommendation method, device, computer equipment and storage medium
CN112182154B (en) Personalized search model for eliminating keyword ambiguity by using personal word vector
US9189708B2 (en) Pruning and label selection in hidden markov model-based OCR
CN110929114A (en) Tracking digital dialog states and generating responses using dynamic memory networks
CN111782786B (en) Multi-model fusion question-answering method, system and medium for urban brain
CN112989212B (en) Media content recommendation method, device and equipment and computer storage medium
CN110750523A (en) Data annotation method, system, computer equipment and storage medium
WO2021030899A1 (en) Automated image retrieval with graph neural network
CN111512299A (en) Method for content search and electronic device thereof
Zhang et al. Emotion attention-aware collaborative deep reinforcement learning for image cropping
CN111783895A (en) Travel plan recommendation method and device based on neural network, computer equipment and storage medium
CN113051468B (en) Movie recommendation method and system based on knowledge graph and reinforcement learning
CN117170648A (en) Robot flow automation component recommendation method, device, equipment and storage medium
Ma et al. Deep unsupervised active learning on learnable graphs
CN113010717B (en) Image verse description generation method, device and equipment
US11902548B2 (en) Systems, methods and computer media for joint attention video processing

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant