CN109905772B - Video clip query method, device, computer equipment and storage medium - Google Patents


Info

Publication number
CN109905772B
Authority
CN
China
Prior art keywords
video
anchor
information
target
segment
Prior art date
Legal status
Active
Application number
CN201910186054.4A
Other languages
Chinese (zh)
Other versions
CN109905772A (en)
Inventor
王景文
马林
刘威
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201910186054.4A
Publication of CN109905772A
Application granted
Publication of CN109905772B


Abstract

The application relates to a video clip query method, a video clip query device, computer equipment and a storage medium. The method comprises the following steps: acquiring text video interaction information according to the query text and the target video; acquiring context interaction information according to the text video interaction information; acquiring the matching probability information of the anchor point segment corresponding to each video frame and the boundary probability information of each video frame according to the context interaction information; according to the matching probability information of the anchor point segments corresponding to the video frames and the boundary probability information of the video frames, the target video segments matched with the query text in the target video are obtained, and a scheme for accurately querying the specific video segments in the target video through the text is provided.

Description

Video clip query method and device, computer equipment and storage medium
Technical Field
The embodiment of the application relates to the technical field of video processing, in particular to a video clip query method, a video clip query device, computer equipment and a storage medium.
Background
With the continuous development of computer and network technologies, video playing has become one of the longest-used and most frequently used functions when users carry out network activities on computer devices.
A user playing a video through a computer device may be interested only in a particular piece of content. In the related art, in order to facilitate the user in quickly finding a video clip of interest, a video player generally provides a play-progress adjusting function: during playback of the video, the user can adjust the current playing progress through operations such as dragging the progress bar, so as to reach the video clip of interest as soon as possible.
However, in the related art the user is required to manually adjust the playing progress to find the video segment of interest, and when the total playing time of the video is long and/or the video segment the user wants to find is short, the user may need to adjust the playing progress repeatedly, which makes querying a specific video segment in the video inefficient.
Disclosure of Invention
The embodiment of the application provides a video clip query method, a video clip query device, computer equipment and a storage medium, which can improve the efficiency of querying a specific video clip in a video, and the technical scheme is as follows:
in one aspect, a video segment query method is provided, and the method includes:
acquiring text video interaction information according to a query text and a target video, wherein the text video interaction information comprises related elements corresponding to video frames in the target video, and the related elements are used for indicating the correlation between the corresponding video frames and the query text;
obtaining context interaction information according to the text video interaction information, wherein the context interaction information is used for indicating the association relationship between the relevant elements corresponding to the video frames;
acquiring the matching probability information of anchor point segments corresponding to the video frames and the boundary probability information of the video frames according to the context interaction information; the anchor segment is a video segment ending with a corresponding video frame in the target video, the matching probability information indicates the probability that the corresponding anchor segment is matched with the query text, the boundary probability information indicates the probability that the corresponding video frame is a target boundary, and the target boundary is the boundary of the video segment matched with the query text;
and acquiring a target video clip matched with the query text in the target video according to the matching probability information of the anchor clip corresponding to each video frame and the boundary probability information of each video frame.
In one aspect, a video segment query method is provided, and the method includes:
displaying a video playing interface for playing a target video, wherein the video playing interface comprises a query control;
when the triggering operation of the query control is received, acquiring a query text input based on the query control;
sending a query request containing the query text to a server; the query request is used for triggering the server to acquire text video interaction information according to the query text and the target video, the text video interaction information includes related elements corresponding to video frames in the target video, the related elements are used for indicating the correlation between the corresponding video frames and the query text, context interaction information is acquired according to the text video interaction information, the context interaction information is used for indicating the association relationship between the related elements corresponding to the video frames, the matching probability information of anchor point segments corresponding to the video frames and the boundary probability information of the video frames are acquired according to the context interaction information, the anchor point segments are video segments ending with the corresponding video frames in the target video, and the matching probability information indicates the probability that the corresponding anchor point segments are matched with the query text, the boundary probability information indicates the probability that the corresponding video frame is a target boundary, the target boundary is the boundary of a video segment matched with the query text, and a target video segment matched with the query text in the target video is obtained according to the matching probability information of the anchor segment corresponding to each video frame and the boundary probability information of each video frame;
receiving a query result returned by the server, wherein the query result is used for indicating the target video clip;
and adjusting the playing progress of the target video to the initial position of the target video clip according to the query result.
In another aspect, an apparatus for querying a video segment is provided, the apparatus comprising:
the text video interaction information acquisition module is used for acquiring text video interaction information according to a query text and a target video, wherein the text video interaction information comprises relevant elements corresponding to all video frames in the target video, and the relevant elements are used for indicating the correlation between the corresponding video frames and the query text;
the context interaction information acquisition module is used for acquiring context interaction information according to the text video interaction information, and the context interaction information is used for indicating the association relationship between the relevant elements corresponding to each video frame;
a probability obtaining module, configured to obtain, according to the context interaction information, matching probability information of anchor segments corresponding to the video frames and boundary probability information of the video frames; the anchor segment is a video segment ending with a corresponding video frame in the target video, the matching probability information indicates the probability that the corresponding anchor segment is matched with the query text, the boundary probability information indicates the probability that the corresponding video frame is a target boundary, and the target boundary is the boundary of the video segment matched with the query text;
and the video clip acquisition module is used for acquiring a target video clip matched with the query text in the target video according to the matching probability information of the anchor clip corresponding to each video frame and the boundary probability information of each video frame.
Optionally, the text-to-video interaction information obtaining module is configured to,
obtaining the dependency relationship among all terms in the query text;
acquiring the dependency relationship among video frames;
and acquiring the text video interaction information according to the dependency relationship among the words and the dependency relationship among the video frames.
Optionally, when obtaining the dependency relationship between the terms in the query text, the text-to-video interaction information obtaining module is configured to,
sequentially inputting the embedding vectors of the words into a first long short-term memory network (LSTM), and acquiring a first hidden vector obtained by the first LSTM processing the words as the dependency relationship among the words;
when the dependency relationship between the video frames is obtained, the text-video interaction information obtaining module is configured to sequentially input the feature information of each video frame into a second LSTM, and obtain a second hidden vector obtained by processing the feature information of each video frame by the second LSTM as the dependency relationship between the video frames.
Optionally, when the text-video interaction information is obtained according to the dependency relationship between the words and the dependency relationship between the video frames, the text-video interaction information obtaining module is configured to,
performing attention mechanism-based weighting processing on the first hidden vector according to the second hidden vector to obtain a text feature hidden vector;
splicing the text feature hidden vector with the second hidden vector to obtain a first spliced vector;
and inputting the first spliced vector into a third LSTM, and acquiring a third hidden vector obtained by the third LSTM processing the first spliced vector as the text video interaction information.
Optionally, the context interaction information obtaining module is configured to,
obtaining respective relevance weights of the relevant elements corresponding to the video frames, wherein each relevance weight is used for indicating the relevance between the corresponding element and the elements within a preset range before and after it;
according to the respective relevance weight of the relevant element corresponding to each video frame, carrying out context fusion on the relevant element corresponding to each video frame to obtain context fusion information;
and acquiring the context interaction information according to the context fusion information.
Optionally, when the context interaction information is obtained according to the context fusion information, the context interaction information obtaining module is configured to,
and splicing the context fusion information and the text video interaction information in a residual connection mode to obtain the context interaction information.
Optionally, the probability obtaining module is configured to,
processing the context interaction information through a first classifier to obtain the matching probability information of anchor point segments corresponding to all the video frames;
and processing the context interaction information through a second classifier to obtain boundary probability information of each video frame.
Optionally, the video clip obtaining module is configured to,
correcting the matching probability information of the anchor point segment corresponding to each video frame according to the boundary probability information of each video frame to obtain the corrected matching probability information of the anchor point segment corresponding to each video frame;
and acquiring the target video clip from the anchor clip corresponding to each video frame according to the corrected matching probability information of the anchor clip corresponding to each video frame.
Optionally, when the matching probability information of the anchor segment corresponding to each video frame is corrected according to the boundary probability information of each video frame to obtain the corrected matching probability information of the anchor segment corresponding to each video frame, the video segment obtaining module is configured to,
acquiring boundary probability information of a first video frame of a target anchor point fragment and boundary probability information of a last video frame of the target anchor point fragment; the target anchor segment is any anchor segment in the anchor segments corresponding to the video frames;
and correcting the matching probability information of the target anchor segment according to the boundary probability information of the first video frame of the target anchor segment and the boundary probability information of the last video frame of the target anchor segment to obtain the corrected matching probability information of the target anchor segment.
Optionally, when the matching probability information of the target anchor segment is modified according to the boundary probability information of the first video frame of the target anchor segment and the boundary probability information of the last video frame of the target anchor segment to obtain the modified matching probability information of the anchor segment corresponding to each video frame, the video segment obtaining module is configured to,
acquiring average probability information of the boundary probability information of a first video frame of the target anchor segment and the boundary probability information of a last video frame of the target anchor segment;
and acquiring the sum of the average probability information and the matching probability information of the target anchor point segment as the corrected matching probability information of the target anchor point segment.
Optionally, when the target video clip is acquired from the anchor clips corresponding to the video frames according to the corrected matching probability information of the anchor clips corresponding to the video frames, the video clip acquisition module is configured to sort the anchor clips corresponding to the video frames according to a descending order of the probability values corresponding to the corrected matching probability information, so as to obtain the anchor clip queue;
extracting the first M anchor segments of the anchor segment queue, wherein M is an integer greater than or equal to 2;
and acquiring the target video clip according to the first M anchor clips.
Optionally, the text-to-video interaction information obtaining module is configured to,
when a query request sent by a terminal is received, acquiring the text video interaction information according to the query text and the target video; the query request comprises the query text;
the device further comprises:
and the result returning module is used for returning a query result to the terminal, and the query result is used for indicating the target video clip.
In another aspect, a computer device is provided, the computer device comprising a processor and a memory, the memory having stored therein at least one instruction, at least one program, set of codes, or set of instructions, which is loaded and executed by the processor to implement the video segment query method as described above.
In another aspect, there is provided a computer readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by a processor to implement the video segment query method as described above.
The technical scheme provided by the application can comprise the following beneficial effects:
the method comprises the steps of predicting the matching probability of each anchor segment matched with a query text in a target video through the correlation between the query text and each video frame in the target video, simultaneously predicting the probability that each video frame is the boundary of the video segment matched with the query text, and finally comprehensively determining the target video segment matched with the query text in the target video by combining the matching probability of each anchor segment and the probability that each video frame is the boundary of the video segment matched with the query text.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
FIG. 1 is a schematic diagram illustrating the structure of a video playback system in accordance with an exemplary embodiment;
FIG. 2 is a flow diagram illustrating a video clip querying method in accordance with an exemplary embodiment;
FIG. 3 is a schematic diagram of a model architecture of a video clip query to which the embodiment shown in FIG. 2 relates;
FIG. 4 is a flow diagram illustrating a method of video clip querying in accordance with an exemplary embodiment;
FIG. 5 is a schematic structural diagram of an LSTM unit according to the embodiment shown in FIG. 4;
FIG. 6 is a schematic diagram of a process for integrating text and video in accordance with the embodiment shown in FIG. 4;
FIG. 7 is an exemplary diagram of context fusion involved in the embodiment shown in FIG. 4;
FIG. 8 is a schematic diagram of local score fusion according to the embodiment shown in FIG. 4;
FIG. 9 is a block diagram illustrating a video clip query in accordance with an exemplary embodiment;
fig. 10 is a block diagram showing the construction of a video clip querying device according to an exemplary embodiment;
FIG. 11 is a block diagram of a computer device shown in accordance with an example embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. The following description refers to the accompanying drawings in which the same numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the application, as detailed in the appended claims.
The embodiment of the application provides an efficient and high-accuracy video clip query scheme, which can rapidly and accurately query a video clip matched with a query text in a video through the query text. To facilitate understanding, several terms referred to in this application are explained below.
1) Long Short Term Memory network (Long Short-Term Memory, LSTM)
LSTM is a time-recursive neural network suitable for processing and predicting significant events with relatively long intervals and delays in a time series. LSTM has found many applications in science and engineering. For example, LSTM-based systems can perform tasks such as learning to translate between languages, machine control, image analysis, document summarization, speech recognition, image recognition, handwriting recognition, chatbots, disease prediction, click-through-rate prediction, stock market prediction, and music composition, to name a few.
2) Anchor point (Anchor)
In this application, an anchor is a way of representing candidate bounding boxes of different scales that correspond to the same position. In the video behavior segment localization addressed by this application, the anchor method is used to discretize what is originally a continuous time quantity, thereby reducing the solution space of the problem. For example, as the LSTM aggregates video information along the time axis, at each time step there are candidate segments of different lengths ending at that position; these candidates are discretized into K different lengths, i.e., K anchors. In other words, at each time step only the K candidate segments of fixed lengths that end at that time step need to be predicted.
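By way of illustration only, the following minimal sketch (not part of the patented scheme; K and the anchor lengths are assumed values) enumerates the K candidate anchor segments that end at each time step:

```python
# Illustrative sketch only: enumerate the K candidate anchor segments ending at
# each time step t, where anchor i covers the last l_i frames (values assumed).
ANCHOR_LENGTHS = [8, 16, 32, 64]  # hypothetical K = 4 anchor lengths, in frames

def anchor_segments(num_frames):
    """Yield (start_frame, end_frame, anchor_index) for every anchor ending at every frame."""
    for t in range(num_frames):
        for i, length in enumerate(ANCHOR_LENGTHS):
            start = max(0, t - length + 1)  # clip anchors extending before frame 0
            yield start, t, i

# Example: the K candidate segments that end at frame 100 of a 200-frame video
print([seg for seg in anchor_segments(200) if seg[1] == 100])
```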
3) Attention mechanism and self-attention mechanism
The attention mechanism is a mechanism that simulates the internal process of biological observation behavior, aligning internal experience with external sensation so as to increase the fineness of observation of selected regions. The attention mechanism can quickly extract important features from sparse data and is therefore widely used in natural language processing tasks. The self-attention mechanism is an improvement on the attention mechanism that reduces the dependence on external information and is better at capturing the internal correlations of data or features.
Fig. 1 is a schematic diagram illustrating a video playback system according to an exemplary embodiment. The system comprises: a server 120 and a number of terminals 140.
The server 120 is a server, or a plurality of servers, or a virtualization platform, or a cloud computing service center.
The terminal 140 may be a terminal device having a video playing function, for example, a mobile phone, a tablet computer, an e-book reader, smart glasses, a smart watch, an MP3 player (Moving Picture Experts Group Audio Layer III), an MP4 player (Moving Picture Experts Group Audio Layer IV), a laptop computer, a desktop computer, and the like.
The terminal 140 and the server 120 are connected via a communication network. Optionally, the communication network is a wired network or a wireless network.
Optionally, the system may further include a management device (not shown in fig. 1), which is connected to the server 120 through a communication network. Optionally, the communication network is a wired network or a wireless network.
Optionally, the wireless network or wired network described above uses standard communication techniques and/or protocols. The Network is typically the Internet, but may be any Network including, but not limited to, a Local Area Network (LAN), a Metropolitan Area Network (MAN), a Wide Area Network (WAN), a mobile, wireline or wireless Network, a private Network, or any combination of virtual private networks. In some embodiments, data exchanged over a network is represented using techniques and/or formats including HyperText Mark-up Language (HTML), Extensible Mark-up Language (XML), and the like. All or some of the links may also be encrypted using conventional encryption techniques such as Secure Socket Layer (SSL), Transport Layer Security (TLS), Virtual Private Network (VPN), Internet Protocol Security (IPsec). In other embodiments, custom and/or dedicated data communication techniques may also be used in place of, or in addition to, the data communication techniques described above.
In the related art, in the process of playing a video by the terminal 140, if the user wants to find the video segment of interest in the currently played video, the playing progress of the video may be adjusted through operations such as dragging/clicking the progress bar, and the terminal skips to the adjusted playing progress. If the video data corresponding to the adjusted playing progress is stored in the terminal (for example, the currently played video is a complete video that has been downloaded or is stored in advance), the terminal directly starts playback from the adjusted playing progress; otherwise, if the video data corresponding to the adjusted playing progress is not stored in the terminal (for example, the currently played video is an online video and part of the video data has not been buffered), the terminal may request the server to acquire the video data corresponding to the adjusted playing progress. However, when the total playing time of the video is long and/or the video segment that the user wants to find is short, this approach may require the user to repeatedly adjust the playing progress, which results in a low efficiency of querying a specific video segment in the video.
In order to solve the technical problem in the related technical solutions, embodiments of the present application provide a scheme for directly querying, from a video, a video segment that matches a text. The scheme can be realized in two ways:
in one implementation, the matching degree can be calculated by matching the video clip information with the text information in a multi-modal manner by using a sliding window mode. For the sliding window method, since different time positions and scales are considered, each frame in the video is processed repeatedly many times, and the complexity is high. A more efficient approach is to use a sliding window on a fixed time scale and then adjust the positioning position using a regression model. However, this approach still requires computations in overlapping time windows, resulting in redundant computations.
In one implementation, a single-stream (single-stream) mode can be used, text and video information are fused, timing information is retained, a segment is predicted at each moment, and repeated calculation of a sliding window method is avoided.
In a single-stream method provided by the present application, a query device (such as a terminal or a server) may determine positive and negative samples according to the value of IoU (Intersection-over-Union, which measures the degree of overlap between two segments) in an anchor manner. Segments whose overlap with the ground-truth segment is large are marked as positive samples, and segments whose overlap is small are marked as negative samples.
In the training process of such a scheme, multiple anchors may be marked as positive classes for the same prediction time. Boundary information is therefore ignored during training, so the model cannot find the optimal anchor and cannot locate the most accurate segment. Furthermore, although the anchor-based approach may use boundary refinement to obtain a more accurate location, the boundary refinement is computed based on the high-scoring anchors, so the problem of how to select the optimal anchor remains difficult to solve.
On this basis, the embodiment of the application also provides a video clip query scheme based on the anchor, and the scheme can correct the probability score of the anchor according to the boundary information, so as to find the optimal anchor, and further improve the accuracy of video clip query.
Fig. 2 is a flowchart illustrating a video segment query method according to an exemplary embodiment, which may be used in a computer device, such as the server 120 or the terminal 140 of the system illustrated in fig. 1. As shown in fig. 2, the video segment query method may include the following steps:
and step 21, acquiring text video interaction information according to the query text and the target video.
The text video interaction information comprises related elements corresponding to video frames in the target video, and the related elements are used for indicating the correlation between the corresponding video frames and the query text.
Each related element in the text video interaction information corresponds to one video frame in the video, and the related elements in the text video interaction information are arranged according to the sequence of the corresponding video frames in the video.
Alternatively, the video frames in the target video may be a series of video frames sampled from the target video at a preset sampling frequency. The sampling frequency may be set in advance by a developer, for example, to one frame every 0.5 s or every 1 s, that is, the playing time interval between two adjacent sampled video frames is 0.5 s or 1 s.
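For illustration, a minimal sketch of such fixed-interval frame sampling follows; the frame rate and the 1 s interval are assumed values, not values specified by this application:

```python
# Illustrative sketch only: keep one frame per sampling interval (values assumed).
def sample_frames(frames, fps, interval_s=1.0):
    """Return frames spaced interval_s seconds apart in playing time."""
    step = max(1, int(round(fps * interval_s)))
    return frames[::step]

# e.g. a 25 fps video sampled once per second keeps frames 0, 25, 50, ...
sampled = sample_frames(list(range(250)), fps=25, interval_s=1.0)
print(sampled[:5])  # [0, 25, 50, 75, 100]
```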
And step 22, obtaining context interaction information according to the text video interaction information, wherein the context interaction information is used for indicating the association relationship between the relevant elements corresponding to each video frame.
And step 23, obtaining the matching probability information of the anchor point segment corresponding to each video frame and the boundary probability information of each video frame according to the context interaction information.
The anchor point segment is a video segment ending with a corresponding video frame in the target video, the matching probability information indicates the probability that the corresponding anchor point segment matches the query text, the boundary probability information indicates the probability that the corresponding video frame is a target boundary, and the target boundary is the boundary (i.e., a starting point or an ending point) of the video segment matching the query text.
In this embodiment of the present application, for each time point (i.e., the playing time corresponding to one video frame), K anchor segments may be preset, where the last frame of the K anchor segments is a video frame corresponding to the time point, and lengths of the K anchor segments are different from each other. Where K is a positive integer, and the value of K and the lengths of the K anchor segments may be preset by a developer.
And step 24, acquiring a target video clip matched with the query text in the target video according to the matching probability information of the anchor clip corresponding to each video frame and the boundary probability information of each video frame.
Please refer to fig. 3, which illustrates a schematic diagram of a model architecture of a video clip query according to an embodiment of the present application. As shown in fig. 3, the model for querying a video for a segment matching query text may include three major components, namely a text-to-video interaction component 31, a context interaction component 32, and an output component 33, where the output component 33 contains two subcomponents, namely an anchor (anchor) segment prediction component 33a and a boundary prediction component 33 b.
The text-video interaction component 31 is configured to capture a dependency relationship between text words and video frames according to the query text and the target video, obtain text-video interaction information, and output the text-video interaction information to the context interaction component 32.
The context interaction component 32 is configured to perform context fusion on each relevant element in the text-video interaction information according to the text-video interaction information input by the text-video interaction component 31, obtain context interaction information, and output the context interaction information to the output component 33. For example, the context interaction component 32 may output the context interaction information to the anchor segment prediction component 33a and the boundary prediction component 33b in the output component 33, respectively.
The anchor segment prediction component 33a in the output component 33 is configured to obtain probability information that K anchor segments corresponding to each video frame (time point) match the query text according to the context interaction information. And the boundary prediction component 33b in the output component 33 obtains probability information that each video frame (time point) is the boundary of the segment matching the query text according to the context interaction information.
In addition, the output component 33 determines the target video segment matching the query text in the target video in combination with the probability information predicted by the anchor segment prediction component 33a and the probability information predicted by the boundary prediction component 33 b.
In summary, according to the scheme provided by the embodiment of the application, the matching probability of each anchor segment matched with the query text in the target video is predicted through the correlation between the query text and each video frame in the target video, the probability that each video frame is the boundary of the video segment matched with the query text is predicted at the same time, and finally the target video segment matched with the query text in the target video is determined comprehensively according to the matching probability of each anchor segment and the probability that each video frame is the boundary of the video segment matched with the query text, so that the scheme for accurately querying the specific video segment in the target video through the text is provided.
The solution described above can be used to solve a novel task: given a video and an associated natural language description, locate in the video the video clip that corresponds to the given natural language description. The technique can support fine-grained, natural-language-based video behavior retrieval. For example, in a possible application scenario, when a video is played, the player automatically jumps to the start of the relevant video segment according to a sentence input by the user, so that browsing of uninteresting content is avoided and the user experience is effectively improved.
In order to improve the query efficiency of the video segments as much as possible, shorten the query duration, and improve the response speed of the video segment query, the steps in the embodiment shown in fig. 2 may be performed by the server. That is to say, the trained video segment query model (such as the model shown in fig. 3) may be stored in the server, and when the user searches for video content through the query text, the background server obtains the video features and the query text of the target video through the pre-trained model, and returns the optimal positioning segment obtained through calculation to the terminal, so that the player in the terminal can automatically skip irrelevant video content for the user.
Or, in another possible implementation manner, the trained model may also be set in the terminal, and when the terminal already stores a complete video, query and skip of the video clip may be implemented in the terminal.
Fig. 4 is a flowchart illustrating a video segment query method according to an exemplary embodiment, which may be implemented by the server 120 and the terminal 140 of the system illustrated in fig. 1, for example, the terminal initiates a video segment query to the server. As shown in fig. 4, the video segment query method may include the following steps:
step 401, a terminal sends a query request to a server, wherein the query request comprises a query text; the server receives the query request.
Optionally, the query request may further include an identifier of the target video.
In the embodiment of the present application, a video player is installed in a terminal, and the video player may be a video player application program preset in the terminal, or the video player may also be a video player application program installed by a user.
In a possible implementation manner, during the process of playing a video through the video player in the terminal, a user may input a sentence (i.e., a query text) related to the segment that the user wants to query. For example, when a user wants to query for a segment in which a female character falls down and then gets up, the user may enter the query text "that girl fell down and then jumped up" in the video segment query input box of the playing interface and then click the query button; the terminal obtains the query text input by the user and the identifier of the currently played target video, generates a query request containing the query text and the identifier of the target video, and sends the generated query request to the server, and correspondingly the server receives the query request. After receiving the query request, the server may obtain the target video according to the identifier of the target video in the query request.
In another possible implementation, the user may also query the server for videos that contain a particular video clip. For example, when a user wants to query for videos containing a segment in which a female character falls down and then gets up, the user may enter the query text "that girl fell down and then jumped up" in the video segment query input box of the playing interface and then click the query button; the terminal obtains the query text input by the user, generates a query request containing the query text, and sends the generated query request to the server, and correspondingly the server receives the query request. After the server receives the query request, any video among the videos stored in the database corresponding to the server can be taken as the target video.
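The format of the query request is not specified here; the following sketch only illustrates one possible shape of such a request, where the endpoint path and field names are invented for illustration:

```python
# Hypothetical client-side sketch; the endpoint path and field names are assumptions
# made for illustration and are not defined by this application.
import requests

def send_query_request(server_url, query_text, target_video_id=None):
    payload = {"query_text": query_text}
    if target_video_id is not None:      # present when querying within one known video
        payload["video_id"] = target_video_id
    response = requests.post(f"{server_url}/video_segment_query", json=payload)
    return response.json()               # expected to describe the located segment(s)

# e.g. send_query_request("http://example-server", "that girl fell down and then jumped up", "v123")
```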
Step 402, the server obtains the dependency relationship among all terms in the query text; and acquiring the dependency relationship among the video frames in the target video.
In the embodiment of the present application, a trained video segment query model (such as the model shown in fig. 3) is provided in the server, and the server executes the step of video segment query through the video segment query model.
Since both text and video are time sequence signals, the dependency relationship implied in them needs to be learned. The LSTM has the capability of learning the timing dependency relationship and can effectively model timing signals including texts, videos and the like. The scheme adopts matching LSTM (Match-LSTM) to interactively model the text and the video, and simultaneously learns the matching relation between the text and the video frames. Please refer to fig. 5, which illustrates a schematic structural diagram of a basic LSTM unit according to an embodiment of the present application. As shown in fig. 5, a basic LSTM unit consists of the following basic operations:
$$i_t = \mathrm{sigmoid}(W_{ix} x_t + W_{ih} h_{t-1} + b_i)$$
$$f_t = \mathrm{sigmoid}(W_{fx} x_t + W_{fh} h_{t-1} + b_f)$$
$$o_t = \mathrm{sigmoid}(W_{ox} x_t + W_{oh} h_{t-1} + b_o)$$
$$g_t = \tanh(W_{gx} x_t + W_{gh} h_{t-1} + b_g)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot g_t$$
$$h_t = o_t \odot \tanh(c_t)$$
In the above formulas, $i_t$, $f_t$, $o_t$, $g_t$, $c_t$, $h_t$ respectively denote the input gate, forget gate, output gate, input unit, memory cell and hidden state vector. Each gating operation is equivalent to a one-layer nonlinear network that integrates the input at the current time and the current hidden state (which captures the information of all previous times) and maps them into the same space; the input gating coefficient, the memory-cell gating coefficient and the output gating coefficient are obtained through sigmoid activation. $W$ and $b$ denote weight matrices and bias vectors, respectively, which can be learned through back propagation. The formulas are recursive, so the output at the next time can be calculated. All weight matrices are shared across all times, so the transformation and dependency relationships along the time sequence can be learned.
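A minimal numpy sketch of one step of the above LSTM unit follows; it stacks the four gate weight matrices into a single matrix, which is mathematically equivalent to the per-gate matrices above, and all shapes are assumed:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One step of the LSTM unit above; W stacks the four gate matrices over [x_t; h_prev]."""
    z = W @ np.concatenate([x_t, h_prev]) + b   # stacked pre-activations
    d = h_prev.shape[0]
    i_t = sigmoid(z[0:d])            # input gate
    f_t = sigmoid(z[d:2 * d])        # forget gate
    o_t = sigmoid(z[2 * d:3 * d])    # output gate
    g_t = np.tanh(z[3 * d:4 * d])    # input unit
    c_t = f_t * c_prev + i_t * g_t   # memory cell
    h_t = o_t * np.tanh(c_t)         # hidden state
    return h_t, c_t

# toy usage with random weights (dimensions assumed)
rng = np.random.default_rng(0)
d_in, d_h = 5, 4
W = rng.standard_normal((4 * d_h, d_in + d_h))
b = np.zeros(4 * d_h)
h, c = lstm_step(rng.standard_normal(d_in), np.zeros(d_h), np.zeros(d_h), W, b)
```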
In the model used in the embodiment of the present application, the dependency relationship between the terms in the query text and the dependency relationship between the video frames in the target video may be obtained through two LSTM sequences, respectively.
One of the LSTM sequences is used to capture the dependency relationship between text words; its input $x_t$ is the embedding vector (word vector) of each word in the query text.
The other LSTM sequence is used to capture the dependency between different frames in the target video; its input $x_t$ is the feature information of each video frame. The feature information of a video frame can be extracted through a preset feature extraction algorithm, or by a pre-trained feature extraction sub-model.
Optionally, when obtaining the dependency relationship between the terms in the query text, the server may sequentially input the embedding vectors of the terms into the first long short-term memory network LSTM, and obtain the first hidden vector obtained by the first LSTM processing the terms as the dependency relationship between the terms.
Correspondingly, when acquiring the dependency relationship between the video frames, the server may sequentially input the feature information of each video frame into the second LSTM, and acquire the second hidden vector obtained by processing the feature information of each video frame by the second LSTM as the dependency relationship between the video frames.
Step 403, the server obtains the text video interaction information according to the dependency relationship between the words and the dependency relationship between the video frames.
The text video interaction information comprises related elements corresponding to video frames in the target video, and the related elements are used for indicating the correlation between the corresponding video frames and the query text.
Optionally, when the text-video interaction information is obtained according to the dependency relationship between the words and the dependency relationship between the video frames, the server may perform attention-based weighting processing on the first hidden vector according to the second hidden vector to obtain a text feature hidden vector; splicing the text characteristic hidden vector with the second hidden vector to obtain a first spliced vector; and inputting the first splicing vector into a third LSTM, and acquiring a third hidden vector obtained by processing the first splicing vector by the third LSTM as the text video interaction information.
Refer to fig. 6, which shows a schematic diagram of a process for dynamically integrating text and video via Match-LSTM according to an embodiment of the present application. The hidden state vectors obtained by the text LSTM and the video LSTM are denoted, respectively, as

$$H^s = \{h^s_1, h^s_2, \ldots, h^s_N\}, \qquad H^v = \{h^v_1, h^v_2, \ldots, h^v_T\}.$$

Match-LSTM dynamically matches the relationship between the words in the text and the current video frame using a soft-attention mechanism (soft-attention):

$$\beta_{tj} = w^\top \tanh\!\left(W^s h^s_j + W^v h^v_t + W^m h^m_{t-1} + b\right),$$
$$\alpha_{tj} = \frac{\exp(\beta_{tj})}{\sum_{k=1}^{N} \exp(\beta_{tk})},$$
$$c_t = \sum_{j=1}^{N} \alpha_{tj}\, h^s_j,$$
$$h^m_t = \mathrm{LSTM}\!\left([\,h^v_t \,;\, c_t\,],\, h^m_{t-1}\right).$$

Here $c_t$ is the text hidden state (which can also be regarded as a feature of the text) weighted by the attention mechanism; it depends on the current video hidden state $h^v_t$ and on the state $h^m_{t-1}$ of the Match-LSTM. After being spliced with the video state $h^v_t$, it is sent to the Match-LSTM to calculate the next hidden state $h^m_t$. The calculated $\alpha_{tj}$ captures the degree of relevance of the different words in the query text to the current video frame at time $t$.

$N$ is the number of words contained in the sentence and is a preset value: if the length of a sentence exceeds this threshold, the remaining words are discarded; otherwise the sentence is padded with zero vectors. The $\alpha_{tj}$ above are in effect normalized response coefficients computed for the different words in the query text. In other words, different words may have different degrees of importance at different times, which must be computed and learned dynamically. The same steps are repeated at the next time step, and all time steps share the same set of network parameters. The attention weights calculated at each time step are different, because they depend on the state of the current video as well as the state of the Match-LSTM.

Through the above steps, the text and the video information are deeply fused together. Steps 402 and 403 above can be executed by the text-video interaction component in the model shown in fig. 3, and the output hidden state sequence $H^m = \{h^m_t\}$ will be the input to the context interaction component.
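A minimal sketch of one soft-attention step of the Match-LSTM described above follows; the weight names and shapes are assumptions for illustration, and the resulting spliced vector would then be fed to an LSTM cell such as the one sketched earlier:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def match_lstm_attention_step(H_s, h_v_t, h_m_prev, Ws, Wv, Wm, w, b):
    """One soft-attention step (weight names/shapes assumed).
    H_s: (N, d) text hidden states; h_v_t: (d,) current video hidden state;
    h_m_prev: (d,) previous Match-LSTM hidden state."""
    scores = np.array([w @ np.tanh(Ws @ h_s_j + Wv @ h_v_t + Wm @ h_m_prev + b)
                       for h_s_j in H_s])
    alpha_t = softmax(scores)            # relevance of each word to the frame at time t
    c_t = alpha_t @ H_s                  # attention-weighted text feature
    z_t = np.concatenate([h_v_t, c_t])   # spliced with the video state, then fed to the Match-LSTM cell
    return z_t, alpha_t

# toy usage with assumed dimensions
rng = np.random.default_rng(0)
N, d = 6, 8
H_s = rng.standard_normal((N, d))
Ws, Wv, Wm = (rng.standard_normal((d, d)) for _ in range(3))
w, b = rng.standard_normal(d), np.zeros(d)
z_t, alpha_t = match_lstm_attention_step(H_s, rng.standard_normal(d), np.zeros(d), Ws, Wv, Wm, w, b)
```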
In step 404, the server obtains a relevance weight for each of the relevant elements corresponding to the respective video frames, where the relevance weight is used to indicate the correlation between the corresponding element and the elements within a preset range before and after it.
In this embodiment, the server may obtain, by using a self-attention mechanism, respective correlation weights of the relevant elements corresponding to the respective video frames.
Step 405, the server performs context fusion on the relevant elements corresponding to each video frame according to the respective relevance weights of the relevant elements corresponding to each video frame, so as to obtain context fusion information.
In order to effectively capture the context information of the current prediction time, the scheme adds a self-attention modeling layer (self-attention) on top of the Match-LSTM layer. Unlike the context modeling methods in the related art, the present solution can explicitly calculate the contribution degree of different context units to the current time. The proposed context fusion strategy makes it possible to gather and enhance relevant localization cues. For example, the embodiment of the present application may use a standard scaled inner-product operation to calculate the weight between every two elements in a time region:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$

where $Q$, $K$, $V$ represent the query vectors, key vectors and value vectors, respectively, $\frac{1}{\sqrt{d_k}}$ is a scaling factor, and $d_k$ is the number of columns of $Q$. In the self-attention mechanism, $Q$, $K$, $V$ come from the same source. Here, taking the output hidden states $H^m$ of the text-video interaction component as the source, the context-fused representation is

$$H^c = H^m \cdot \left(\mathrm{Attention}(W_q H^m, W_k H^m, H^m)\right)^\top$$

where $W_q$, $W_k$ are mapping matrices.
In a possible implementation manner, the server may directly provide the context fusion information obtained in this step as the context interaction information to the output component. Alternatively, the server may process the context fusion information in a processing manner as shown in step 406 below to obtain the context interaction information.
In step 406, the server obtains the context interaction information according to the context fusion information.
Optionally, when the context interaction information is obtained according to the context fusion information, the server may splice the context fusion information and the text video interaction information in a residual connection manner to obtain the context interaction information.
The purpose of the scheme shown in this application is to enable each hidden state $h^c_t$ in $H^c$ to capture discriminative context information. To further improve computational efficiency, the solution shown in the embodiment of the present application may calculate the self-attention in a restricted way, i.e. for each time instant (i.e. each video frame) only a neighborhood of range size $D$ is considered. In order not to lose timing information, the embodiment of the present application may use a residual connection to splice the vectors before and after context fusion:

$$H^c = H^m \,\|\, \left(H^m \cdot \left(\mathrm{Attention}(W_q H^m, W_k H^m, H^m)\right)^\top\right)$$
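The following sketch illustrates the underlying idea: self-attention restricted to a neighborhood of size D, with the fused result concatenated to the original hidden states. It is not a literal transcription of the formula above, and the shapes and the value of D are assumed:

```python
import numpy as np

def restricted_self_attention(H_m, Wq, Wk, D=3):
    """Self-attention restricted to a +/-D neighborhood, with the fused result
    concatenated to the input (shapes and D assumed).
    H_m: (T, d) hidden states from the text-video interaction component."""
    T, d = H_m.shape
    Q, K, V = H_m @ Wq.T, H_m @ Wk.T, H_m
    fused = np.zeros_like(H_m)
    for t in range(T):
        lo, hi = max(0, t - D), min(T, t + D + 1)       # only the local neighborhood
        scores = K[lo:hi] @ Q[t] / np.sqrt(Q.shape[1])  # scaled dot-product weights
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        fused[t] = weights @ V[lo:hi]                   # context-fused element
    return np.concatenate([H_m, fused], axis=1)         # residual concatenation

# toy usage
rng = np.random.default_rng(0)
H_m = rng.standard_normal((10, 8))
H_c = restricted_self_attention(H_m, rng.standard_normal((8, 8)), rng.standard_normal((8, 8)))
```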
Refer to FIG. 7, which illustrates an example diagram of context fusion in accordance with an embodiment of the present application. As shown in fig. 7, the query text is: that girl fell and then jumped up. The action that needs to be located is "jump up", and the context that should be most concerned is "fall", since the action of "jumping up" follows the action of "falling". As can be seen from the example shown in FIG. 7, the context weights (Attention(Q, K, V)) learned by the context interaction component in the model have the greatest response (i.e., the image frame indicated by the dashed arrow in FIG. 7) when the act of falling occurs.
Step 407, the server obtains the matching probability information of the anchor segment corresponding to each video frame and the boundary probability information of each video frame according to the context interaction information.
The anchor segment is a video segment ending with a corresponding video frame in the target video, the matching probability information indicates the probability that the corresponding anchor segment is matched with the query text, the boundary probability information indicates the probability that the corresponding video frame is a target boundary, and the target boundary is the boundary of the video segment matched with the query text.
Optionally, when the matching probability information of the anchor segment corresponding to each video frame and the boundary probability information of each video frame are obtained according to the context interaction information, the server may process the context interaction information through the first classifier to obtain the matching probability information of the anchor segment corresponding to each video frame; and processing the context interaction information through a second classifier to obtain boundary probability information of each video frame.
In the embodiment of the application, K anchors can be designed to match segments of different lengths. This embodiment uses $l_i$ to denote the length of the $i$-th anchor. For the hidden state $h^c_t$ at each instant, the predicted start and end points of the corresponding anchors are $(t - l_i,\; t)$. In the embodiment of the present application, K classifiers are trained in advance on each hidden state to determine whether the corresponding $i$-th anchor segment successfully matches the behavior segment corresponding to the text (i.e., whether its IoU with the ground-truth segment exceeds a predetermined threshold). Classifiers at different times share one set of network parameters. The scores (i.e., the matching probability information) corresponding to the K anchors obtained at time $t$ may be

$$\mathrm{Conf}_t = \delta\!\left(W_c\, h^c_t + b_c\right)$$

where $\delta$ denotes the sigmoid activation function, and $W_c$, $b_c$ are the transformation matrix and offset shared at all times. The process of obtaining the scores corresponding to the anchor segments may be executed by the anchor segment prediction component in the output component of the model.
The idea of boundary prediction is to use binary classification to determine whether the video frame corresponding to time $t$ is a boundary point (starting point or ending point) of the behavior segment corresponding to the text query. The output boundary score (i.e., the boundary probability information) may be

$$\mathrm{Conf}^b_t = \delta\!\left(W_b\, h^c_t + b_b\right).$$

This formula measures the likelihood that the LSTM network is passing a boundary point at time $t$. Note that the input vector $h^c_t$ captures the historical information of the video, so whether the video frame corresponding to the current time is a boundary point can be better judged by comparing the current time with the historical information.
The above-mentioned process of obtaining the boundary score may be performed by a boundary prediction component in an output component of the video segment query model.
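A minimal sketch of the two prediction heads described above (anchor matching scores and boundary score at one time step) follows; the weight names and dimensions are assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict_scores(h_c_t, Wc, bc, Wb, bb):
    """Per-time-step prediction heads (weight names and dimensions assumed).
    Returns K anchor matching scores and one boundary score for time t."""
    anchor_scores = sigmoid(Wc @ h_c_t + bc)    # score of anchor i spanning (t - l_i, t)
    boundary_score = sigmoid(Wb @ h_c_t + bb)   # probability that frame t is a segment boundary
    return anchor_scores, boundary_score

# toy usage with K = 4 anchors and a 16-dimensional context vector
rng = np.random.default_rng(0)
K, d = 4, 16
anchor_scores, boundary_score = predict_scores(
    rng.standard_normal(d),
    rng.standard_normal((K, d)), np.zeros(K),   # anchor head
    rng.standard_normal(d), 0.0)                # boundary head
```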
Step 408, the server corrects the matching probability information of the anchor segment corresponding to each video frame according to the boundary probability information of each video frame, so as to obtain the corrected matching probability information of the anchor segment corresponding to each video frame.
Optionally, the server corrects the matching probability information of the anchor segment corresponding to each video frame according to the boundary probability information of each video frame, and when the corrected matching probability information of the anchor segment corresponding to each video frame is obtained, the boundary probability information of the first video frame of the target anchor segment and the boundary probability information of the last video frame of the target anchor segment may be obtained; the target anchor segment is any anchor segment in the anchor segments corresponding to the video frames; and correcting the matching probability information of the target anchor segment according to the boundary probability information of the first video frame of the target anchor segment and the boundary probability information of the last video frame of the target anchor segment to obtain the corrected matching probability information of the target anchor segment.
Optionally, when the matching probability information of the target anchor segment is modified according to the boundary probability information of the first video frame of the target anchor segment and the boundary probability information of the last video frame of the target anchor segment to obtain the modified matching probability information of the anchor segment corresponding to each video frame, the server may obtain the average of the boundary probability information of the first video frame of the target anchor segment and the boundary probability information of the last video frame of the target anchor segment as average probability information, and obtain the sum of the average probability information and the matching probability information of the target anchor segment as the corrected matching probability information of the target anchor segment.
In order to obtain the final behavior segment ordering, the embodiment of the present application may adopt a local to global prediction manner. Local means that the prediction score and the boundary score of the anchor segment are first fused at each time instant. Global refers to post-processing the output at all times to obtain the final sorting result.
Please refer to fig. 8, which shows a schematic diagram of local score fusion according to an embodiment of the present application, and as shown in fig. 8, the local score fusion manner is as follows: for the K anchors predicted at the time t, the starting and ending boundary positions of the K anchors are considered, the boundary prediction fraction corresponding to the starting and ending positions of each anchor is extracted, and the prediction fraction of each anchor is corrected, wherein the correction formula can be as follows:
$$\hat{c}_{t,k} = c_{t,k} + \frac{1}{2}\left(p^{b}_{s_{t,k}} + p^{b}_{e_{t,k}}\right)$$

wherein $c_{t,k}$ is the original prediction score of the k-th anchor at time t, $s_{t,k}$ and $e_{t,k}$ are the starting and ending positions of that anchor, $p^{b}_{s_{t,k}}$ and $p^{b}_{e_{t,k}}$ are the boundary prediction scores at those two positions, and $\hat{c}_{t,k}$ is the corrected prediction score.
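To make the local fusion step concrete, the following is a minimal sketch in Python; the array names, shapes, and the assumption that each anchor ends at frame t are illustrative rather than taken verbatim from the patent:

```python
import numpy as np

def fuse_local_scores(anchor_scores, boundary_scores, anchor_lengths):
    """Local score fusion: correct each anchor score with the boundary
    scores at that anchor's start and end frames.

    anchor_scores:   (T, K) score of the k-th anchor ending at frame t
    boundary_scores: (T,)   probability that frame t is a segment boundary
    anchor_lengths:  (K,)   length, in frames, of each predefined anchor scale
    """
    T, K = anchor_scores.shape
    fused = anchor_scores.copy()
    for t in range(T):
        for k in range(K):
            start = max(0, t - anchor_lengths[k] + 1)  # first frame of this anchor
            end = t                                    # anchors end at frame t
            # corrected score = original score + mean of the two boundary scores
            fused[t, k] += 0.5 * (boundary_scores[start] + boundary_scores[end])
    return fused
```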
in step 409, the server obtains the target video clip from the anchor clip corresponding to each video frame according to the modified matching probability information of the anchor clip corresponding to each video frame.
Optionally, when the target video segment is obtained from the anchor segment corresponding to each video frame according to the corrected matching probability information of the anchor segment corresponding to each video frame, the server may sort the anchor segments corresponding to each video frame according to a descending order of the probability values corresponding to the corrected matching probability information, so as to obtain the anchor segment queue; extracting the first M anchor segments of the anchor segment queue, wherein M is an integer greater than or equal to 2; and acquiring the target video clip according to the first M anchor clips.
Global score ordering: after local score fusion, the server can obtain the corrected scores $\hat{c}_{t,k}$ of all the candidate segments of one video, sort them, select the top M highest-ranked segments, and perform Non-Maximum Suppression (NMS) to eliminate redundant segments.
The non-maximum suppression process may be as follows: if two segments in the set overlap heavily (their IoU is large), the one with the smaller score is discarded. Through the sorting of the scores, the high-score segments are considered preferentially, and the procedure iterates continuously, so that a higher localization recall rate can finally be achieved with a smaller number of segments.
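A sketch of this temporal non-maximum suppression follows; the IoU threshold and the top-M truncation parameter below are assumed hyperparameters, not values fixed by the patent:

```python
def temporal_iou(a, b):
    """IoU of two segments given as (start, end) pairs."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def nms_segments(segments, iou_threshold=0.5, top_m=None):
    """segments: list of (start, end, score); keeps high-score, low-overlap ones."""
    segments = sorted(segments, key=lambda s: s[2], reverse=True)
    if top_m is not None:
        segments = segments[:top_m]          # keep only the top-M ranked candidates
    kept = []
    for seg in segments:
        if all(temporal_iou(seg[:2], k[:2]) < iou_threshold for k in kept):
            kept.append(seg)
    return kept
```

In the global step described above, the fused candidates of the whole video would be collected into such a (start, end, score) list, sorted, truncated to the first M, and passed through this routine to remove redundant overlapping segments.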
Wherein the above steps 408 and 409 can be performed by an output component of the video segment query model.
Step 410, the server returns a query result to the terminal, wherein the query result is used for indicating the target video segment; the terminal receives the query result.
Optionally, the terminal jumps to the target video segment according to the query result.
In a possible implementation manner, after the terminal receives the query result, if the target video segment is a single video segment, the terminal may directly play a video picture of the target video from a start position of the target video segment in a play window of the video player.
For example, if a user inputs a query text and clicks a query in the process of playing a target video by a video player, after receiving a query result returned by the server, if the query result only indicates a single target video segment, the terminal adjusts the progress of playing the target video by the video player to the start position of the target video segment.
Or, in another possible implementation manner, the terminal may present the query result, and jump to the target video segment when receiving the user click operation.
In another possible implementation manner, after receiving the query result, the terminal displays the query result, and when receiving a specified trigger operation on the query result, the terminal may start to play the video picture of the target video from the start position of the target video segment in a play window of the video player.
For example, if a user inputs a query text and clicks a query in a process of playing a target video by a video player, after the terminal receives a query result returned by the server, the terminal first displays segment information of a target video segment (such as a first video frame of the target video segment, a duration of the target video segment, and the like) according to the query result, when the target video segment has multiple segments, the terminal may display the segment information of the multiple segments in a list manner, and the user may select to click the segment information of one segment, thereby triggering the terminal to adjust a progress of playing the target video by the video player to a start position of the segment.
Or, if the user queries a video including a specific video segment from the server, after the terminal receives a query result returned by the server, the terminal first displays segment information of the target video segment (for example, a video name corresponding to the target video segment, a first video frame of the target video segment, a duration of the target video segment, and the like) according to the query result, and the user can select to click the segment information of one segment, so that the terminal is triggered to open the target video corresponding to the segment in the video player, and the progress of the video player in playing the target video is adjusted to the start position of the segment.
In summary, according to the scheme provided by the embodiment of the application, the matching probability of each anchor segment matched with the query text in the target video is predicted through the correlation between the query text and each video frame in the target video, the probability that each video frame is the boundary of the video segment matched with the query text is predicted at the same time, and finally the target video segment matched with the query text in the target video is determined comprehensively according to the matching probability of each anchor segment and the probability that each video frame is the boundary of the video segment matched with the query text, so that the scheme for accurately querying the specific video segment in the target video through the text is provided.
The above embodiments of the present application provide a scheme for locating, given a text description, the behavior segments in a video that match that description. The application has two main points:
1) a context fusion strategy using a self-attention mechanism.
The method adopts a matching LSTM model (Match-LSTM) as the backbone network, in which text information and video information are deeply fused. After a series of hidden state vectors that carry the temporal relationship and encode the video information is obtained, a component that effectively fuses context information using a self-attention mechanism (self-attention) is provided; this component explicitly computes the relation between the hidden state at the current prediction time and the hidden states at the other prediction times. Unlike the commonly used LSTM, the computation path length of the context interaction component in this application remains 1 even for far-away hidden state vectors. The new hidden state sequence is then sent to the output component to obtain the ranking scores of the candidate video segments.
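As a rough illustration of such a context interaction component, the sketch below applies single-head scaled dot-product self-attention to the (T, d) matrix of Match-LSTM hidden states and concatenates the result back onto the input; the single-head choice and the layer shapes are assumptions made for clarity, not the exact parameterization of the patent:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionContext(nn.Module):
    """Single-head self-attention over Match-LSTM hidden states,
    concatenated back onto the input as the context interaction information."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.query = nn.Linear(hidden_dim, hidden_dim)
        self.key = nn.Linear(hidden_dim, hidden_dim)
        self.value = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, h):                                   # h: (T, d) hidden states
        q, k, v = self.query(h), self.key(h), self.value(h)
        attn = F.softmax(q @ k.t() / k.shape[-1] ** 0.5, dim=-1)  # (T, T) pairwise relations
        context = attn @ v                                  # path length 1 to every other step
        return torch.cat([h, context], dim=-1)              # residual-style concatenation
```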
2) A boundary-sensitive positioning strategy.
Wherein the output component comprises two subcomponents. One is an anchor prediction subcomponent, which generates K anchor segments at each prediction instant; the second is a boundary classifier, which judges whether the current time is a boundary of the behavior, namely whether the current time is the starting time or the ending time of the behavior. In the model shown in fig. 3 of the application, multi-task training (anchors + boundaries) can be performed in the training stage; in the prediction stage, the confidence (i.e., the corrected matching probability information) of each anchor segment is corrected by querying the boundary scores at the starting position and the ending position of that anchor segment, so that a new boundary-sensitive behavior localization ranking is obtained.
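The two sub-components can be pictured roughly as two small heads on top of the fused hidden states; the sigmoid outputs and layer shapes below are assumptions for illustration rather than the patent's exact classifiers:

```python
import torch
import torch.nn as nn

class BoundarySensitiveOutput(nn.Module):
    """Output component: an anchor prediction head and a boundary classification head."""
    def __init__(self, hidden_dim, num_anchors):
        super().__init__()
        self.anchor_head = nn.Linear(hidden_dim, num_anchors)  # K anchor scores per time step
        self.boundary_head = nn.Linear(hidden_dim, 1)          # is this frame a start/end boundary?

    def forward(self, fused):                                  # fused: (T, hidden_dim)
        anchor_scores = torch.sigmoid(self.anchor_head(fused))                  # (T, K)
        boundary_scores = torch.sigmoid(self.boundary_head(fused)).squeeze(-1)  # (T,)
        return anchor_scores, boundary_scores
```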
For example, suppose a user inputs a query text and clicks query to search for a segment in the target video while a video player is playing that target video. Please refer to fig. 9, which is a structural diagram illustrating a video segment query according to an exemplary embodiment. As shown in fig. 9, the process by which the terminal queries the server for a specific video segment in the currently played target video is as follows:
1) and the terminal displays a video playing interface for playing the target video, wherein the video playing interface comprises an inquiry control.
As shown in fig. 9, in the embodiment of the present application, a video player is installed in a terminal 91, and the video player shows a video playing interface 91a that is playing a target video; the video playing interface 91a further includes a query control 91b.
In one possible implementation, the query control 91b may comprise a text input box 91b1 and a query button 91b2.
2) And when the terminal receives the trigger operation of the query control, acquiring a query text input based on the query control.
As shown in fig. 9, the video player receives the text input by the user in the text input box 91b1, and upon receiving the user's click operation on the query button 91b2, acquires the text in the text input box 91b1 as the query text.
3) The terminal sends an inquiry request containing the inquiry text to the server, and the server receives the inquiry request.
The server 92 is provided with a pre-trained video clip query model 92a, and the model 92a comprises a text-video interaction component 92a1, a context interaction component 92a2 and an output component 92a3; the output component 92a3 further includes an anchor segment prediction component 92a31 and a boundary prediction component 92a32.
Wherein the query request is used to trigger the server to perform the following steps 4) to 8) through the video segment query model 92a.
4) And the server acquires text video interaction information according to the query text and the target video.
The text video interaction information comprises related elements corresponding to the video frames in the target video, and the related elements are used for indicating the correlation between the corresponding video frames and the query text.
Wherein, the server can input the query text in the query request and each video frame in the target video into the text-video interaction component 92a1, and the text-video interaction component 92a1 processes them to obtain the text video interaction information. The execution of the text-video interaction component 92a1 can refer to the description under step 402 and step 403 in the embodiment shown in fig. 4.
5) And the server acquires the context interaction information according to the text video interaction information.
The context interaction information is used for indicating the association relationship between the relevant elements corresponding to the video frames.
As shown in FIG. 9, the text-to-video interaction component 92a1 outputs the processed text-to-video interaction information to the context interaction component 92a2, which is processed by the context interaction component 92a2 to obtain context interaction information. The execution of the context interaction component 92a2 can refer to the description given above under step 404, step 405 and step 406 in the embodiment shown in fig. 4.
6) And the server acquires the matching probability information of the anchor point segment corresponding to each video frame and the boundary probability information of each video frame according to the context interaction information.
The anchor point segment is a video segment ending with a corresponding video frame in the target video, the matching probability information indicates the probability that the corresponding anchor point segment is matched with the query text, the boundary probability information indicates the probability that the corresponding video frame is a target boundary, and the target boundary is the boundary of the video segment matched with the query text.
As shown in fig. 9, after obtaining the context interaction information, the context interaction component 92a2 provides the context interaction information to the anchor segment prediction component 92a31 and the boundary prediction component 92a32 in the output component 92a3, respectively, where the anchor segment prediction component 92a31 obtains the matching probability information of the anchor segment corresponding to each video frame according to the context interaction information, and the boundary prediction component 92a32 obtains the boundary probability information of each video frame according to the context interaction information. The execution of the anchor segment prediction component 92a31 and the boundary prediction component 92a32 may refer to the description given above under step 407 in the embodiment shown in fig. 4.
7) And the server acquires a target video clip matched with the query text in the target video according to the matching probability information of the anchor clip corresponding to each video frame and the boundary probability information of each video frame.
The output component 92a3 determines a target video segment of the target video in conjunction with the results obtained by the anchor segment prediction component 92a31 and the boundary prediction component 92a32, respectively, and outputs relevant information of the target video segment, such as start-stop time information of the target video segment. The process of the output component 92a3 determining the target video segment can be as described above under step 408.
8) And the server returns a query result to the terminal, and the terminal receives the query result returned by the server, wherein the query result is used for indicating the target video clip.
In this embodiment, the server may generate a query result according to the relevant information of the target video segment output by the video segment query model 92a, and return the query result to the video player in the terminal.
9) And the terminal adjusts the playing progress of the target video to the initial position of the target video clip according to the query result.
The video player in the terminal plays the target video segment according to the query result, and the process may refer to the description in step 411 in the embodiment shown in fig. 4.
By means of this scheme, a video segment search application based on Artificial Intelligence (AI) can be realized. For example, a developer may construct the video segment query model shown in fig. 3 or fig. 9 based on an artificial intelligence learning algorithm, and train the network parameters of the components in the video segment query model (including but not limited to the weight matrices, mapping matrices, transformation matrices, bias vectors, and the like involved in each component) with labeled training data. The training data may include a training video and a training text, and a technician labels in advance the video segments in the training video that match the training text (for example, the labeling information may be the start and stop time points of the video segments matching the training text, or the labeling information may be the start time points and durations of the video segments). The training data and the labeling information are then input into the video segment query model for machine learning training so as to adjust the network parameters of the model. Finally, the trained video segment query model is deployed to a server for online application, so that a user can directly query, through a text, the matched video segment within a certain video.
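A hedged sketch of what such multi-task training might look like, assuming binary labels for which anchors match the annotated segment and which frames are boundaries; the label construction and the loss weighting are assumptions, not details specified by the patent:

```python
import torch.nn.functional as F

def multitask_loss(anchor_scores, boundary_scores, anchor_labels, boundary_labels,
                   boundary_weight=1.0):
    """Joint objective for the training stage: anchor matching loss + boundary loss.

    anchor_scores / anchor_labels:     (T, K) predicted vs. labeled anchor matches
    boundary_scores / boundary_labels: (T,)   predicted vs. labeled boundary frames
    boundary_weight is an assumed balancing hyperparameter.
    """
    anchor_loss = F.binary_cross_entropy(anchor_scores, anchor_labels)
    boundary_loss = F.binary_cross_entropy(boundary_scores, boundary_labels)
    return anchor_loss + boundary_weight * boundary_loss
```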
Fig. 10 is a block diagram illustrating a configuration of a video clip querying device according to an exemplary embodiment. The video segment query device can be used in a computer device to execute all or part of the steps in the embodiments shown in fig. 2 or fig. 4. The video clip inquiring apparatus may include:
a text video interaction information obtaining module 1001, configured to obtain text video interaction information according to a query text and a target video, where the text video interaction information includes a relevant element corresponding to each video frame in the target video, and the relevant element is used to indicate a correlation between the corresponding video frame and the query text;
a context interaction information obtaining module 1002, configured to obtain context interaction information according to the text video interaction information, where the context interaction information is used to indicate an association relationship between related elements corresponding to the video frames;
a probability obtaining module 1003, configured to obtain, according to the context interaction information, matching probability information of anchor segments corresponding to the video frames and boundary probability information of the video frames; the anchor segment is a video segment ending with a corresponding video frame in the target video, the matching probability information indicates the probability that the corresponding anchor segment is matched with the query text, the boundary probability information indicates the probability that the corresponding video frame is a target boundary, and the target boundary is the boundary of the video segment matched with the query text;
a video clip obtaining module 1004, configured to obtain, according to the matching probability information of the anchor clip corresponding to each video frame and the boundary probability information of each video frame, a target video clip that matches the query text in the target video.
Optionally, the text-to-video interaction information obtaining module 1001 is configured to,
acquiring the dependency relationship among all terms in the query text;
acquiring the dependency relationship among video frames;
and acquiring the text video interaction information according to the dependency relationship among the words and the dependency relationship among the video frames.
Optionally, when obtaining the dependency relationship between the terms in the query text, the text-to-video interaction information obtaining module 1001 is configured to,
sequentially inputting the embedded characteristic vectors of all the words into a first long-short term memory network (LSTM), and acquiring a first implicit vector obtained by processing all the words by the first LSTM as a dependency relationship among all the words;
when obtaining the dependency relationship between the video frames, the text-to-video interaction information obtaining module 1001 is configured to sequentially input the feature information of each video frame into a second LSTM, and obtain a second hidden vector obtained by processing the feature information of each video frame by the second LSTM as the dependency relationship between the video frames.
Optionally, when the text video interaction information is obtained according to the dependency relationship between the words and the dependency relationship between the video frames, the text video interaction information obtaining module 1001 is configured to,
performing attention mechanism-based weighting processing on the first hidden vector according to the second hidden vector to obtain a text feature hidden vector;
splicing the text feature hidden vector with the second hidden vector to obtain a first spliced vector;
and inputting the first splicing vector into a third LSTM, and acquiring a third hidden vector obtained by processing the first splicing vector by the third LSTM as the text video interaction information.
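A minimal sketch of this interaction step, using plain dot-product attention over the word hidden states and a third LSTM over the concatenated vectors; the scoring function and the dimensions are assumptions made to keep the example short:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextVideoInteraction(nn.Module):
    """Attend over word hidden states for each frame, splice, and run a third LSTM."""
    def __init__(self, dim):
        super().__init__()
        self.third_lstm = nn.LSTM(2 * dim, dim)      # consumes the first spliced vector

    def forward(self, word_h, frame_h):              # word_h: (L, d), frame_h: (T, d)
        # attention weight of every word for every frame (dot-product scoring is an assumption)
        weights = F.softmax(frame_h @ word_h.t(), dim=-1)     # (T, L)
        text_feat = weights @ word_h                          # (T, d) text feature hidden vector
        spliced = torch.cat([text_feat, frame_h], dim=-1)     # first spliced vector, (T, 2d)
        out, _ = self.third_lstm(spliced.unsqueeze(1))        # third LSTM over time
        return out.squeeze(1)                                 # text video interaction information
```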
Optionally, the context interaction information obtaining module 1002 is configured to,
obtaining respective relevance weights of relevant elements corresponding to the video frames, wherein the relevance weights are used for indicating the relevance between the corresponding elements and the elements in a front preset range and a rear preset range;
according to the respective relevance weight of the relevant element corresponding to each video frame, carrying out context fusion on the relevant element corresponding to each video frame to obtain context fusion information;
and acquiring the context interaction information according to the context fusion information.
Optionally, when the context interaction information is obtained according to the context fusion information, the context interaction information obtaining module 1002 is configured to,
and splicing the context fusion information and the text video interaction information in a residual connection mode to obtain the context interaction information.
Optionally, the probability obtaining module 1003 is configured to,
processing the context interaction information through a first classifier to obtain the matching probability information of anchor point segments corresponding to all the video frames;
and processing the context interaction information through a second classifier to obtain boundary probability information of each video frame.
Optionally, the video clip obtaining module 1004 is configured to,
correcting the matching probability information of the anchor point segment corresponding to each video frame according to the boundary probability information of each video frame to obtain the corrected matching probability information of the anchor point segment corresponding to each video frame;
and acquiring the target video clip from the anchor clip corresponding to each video frame according to the corrected matching probability information of the anchor clip corresponding to each video frame.
Optionally, when the matching probability information of the anchor segment corresponding to each video frame is corrected according to the boundary probability information of each video frame, so as to obtain the corrected matching probability information of the anchor segment corresponding to each video frame, the video segment obtaining module 1004 is configured to,
acquiring boundary probability information of a first video frame of a target anchor point fragment and boundary probability information of a last video frame of the target anchor point fragment; the target anchor segment is any anchor segment in the anchor segments corresponding to the video frames;
and correcting the matching probability information of the target anchor segment according to the boundary probability information of the first video frame of the target anchor segment and the boundary probability information of the last video frame of the target anchor segment to obtain the corrected matching probability information of the target anchor segment.
Optionally, when the matching probability information of the target anchor segment is modified according to the boundary probability information of the first video frame of the target anchor segment and the boundary probability information of the last video frame of the target anchor segment, so as to obtain the modified matching probability information of the anchor segment corresponding to each video frame, the video segment obtaining module 1004 is configured to,
acquiring boundary probability information of a first video frame of the target anchor point segment and average probability information of the boundary probability information of a last video frame of the target anchor point segment;
and acquiring the sum of the average probability information and the matching probability information of the target anchor point segment as the corrected matching probability information of the target anchor point segment.
Optionally, when the target video clip is acquired from the anchor clips corresponding to the video frames according to the modified matching probability information of the anchor clips corresponding to the video frames, the video clip acquiring module 1004 is configured to sort the anchor clips corresponding to the video frames according to a descending order of the probability values corresponding to the modified matching probability information, so as to obtain the anchor clip queue;
extracting the first M anchor segments of the anchor segment queue, wherein M is an integer greater than or equal to 2;
and acquiring the target video clip according to the first M anchor clips.
Optionally, the text-to-video interaction information obtaining module 1001 is configured to,
when an inquiry request sent by a terminal is received, acquiring the text video interaction information according to the inquiry text and the target video; the query request comprises the query text;
the device further comprises:
and the result returning module is used for returning a query result to the terminal, and the query result is used for indicating the target video clip.
FIG. 11 is a block diagram of a computer device shown in accordance with an example embodiment. The computer device 1100 includes a Central Processing Unit (CPU) 1101, a system memory 1104 including a Random Access Memory (RAM) 1102 and a Read Only Memory (ROM) 1103, and a system bus 1105 connecting the system memory 1104 and the central processing unit 1101. The computer device 1100 also includes a basic input/output system (I/O system) 1106, which facilitates transfer of information between devices within the computer, and a mass storage device 1107 for storing an operating system 1113, application programs 1114, and other program modules 1115.
The basic input/output system 1106 includes a display 1108 for displaying information and an input device 1109 such as a mouse, keyboard, etc. for user input of information. Wherein the display 1108 and the input device 1109 are connected to the central processing unit 1101 through an input output controller 1110 connected to a system bus 1105. The basic input/output system 1106 may also include an input/output controller 1110 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, input-output controller 1110 also provides output to a display screen, a printer, or other type of output device.
The mass storage device 1107 is connected to the central processing unit 1101 through a mass storage controller (not shown) connected to the system bus 1105. The mass storage device 1107 and its associated computer-readable media provide non-volatile storage for the computer device 1100. That is, the mass storage device 1107 may include a computer-readable medium (not shown) such as a hard disk or CD-ROM drive.
Without loss of generality, the computer readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, EPROM, EEPROM, flash memory or other solid state memory technology, CD-ROM, DVD, or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will appreciate that the computer storage media is not limited to the foregoing. The system memory 1104 and mass storage device 1107 described above may collectively be referred to as memory.
The computer device 1100 may connect to the internet or other network devices through the network interface unit 1111 that is connected to the system bus 1105.
The memory further includes one or more programs, the one or more programs are stored in the memory, and the central processing unit 1101 implements all or part of the steps of the method shown in fig. 2 or fig. 4 by executing the one or more programs.
In an exemplary embodiment, a non-transitory computer readable storage medium comprising instructions, such as a memory including computer programs (instructions), executable by a processor of a computer device to perform all or part of the steps of the methods shown in the various embodiments of the present application, is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It will be understood that the present application is not limited to the precise arrangements that have been described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (12)

1. A method for querying a video segment, the method comprising:
acquiring text video interaction information according to a query text and a target video, wherein the text video interaction information comprises related elements corresponding to all video frames in the target video, and the related elements are used for indicating the correlation between the corresponding video frames and the query text;
acquiring context interaction information according to the text video interaction information, wherein the context interaction information is used for indicating the association relationship between the relevant elements corresponding to the video frames;
acquiring the matching probability information of anchor point segments corresponding to the video frames and the boundary probability information of the video frames according to the context interaction information; the anchor segment is a video segment ending with a corresponding video frame in the target video, the matching probability information indicates the probability that the corresponding anchor segment is matched with the query text, the boundary probability information indicates the probability that the corresponding video frame is a target boundary, and the target boundary is the boundary of the video segment matched with the query text;
correcting the matching probability information of the anchor segment corresponding to each video frame according to the boundary probability information of each video frame to obtain the corrected matching probability information of the anchor segment corresponding to each video frame;
sequencing the anchor segments corresponding to the video frames in descending order of the probability values corresponding to the corrected matching probability information of the anchor segments corresponding to the video frames, to obtain an anchor segment queue;
extracting the first M anchor segments of the anchor segment queue, wherein M is an integer greater than or equal to 2;
acquiring target video clips according to the first M anchor clips;
wherein, the correcting the matching probability information of the anchor segment corresponding to each video frame according to the boundary probability information of each video frame to obtain the corrected matching probability information of the anchor segment corresponding to each video frame includes:
acquiring boundary probability information of a first video frame of a target anchor point fragment and boundary probability information of a last video frame of the target anchor point fragment; the target anchor segment is any anchor segment in the anchor segments corresponding to the video frames;
acquiring boundary probability information of a first video frame of the target anchor point segment and average probability information of the boundary probability information of a last video frame of the target anchor point segment;
and acquiring the sum of the average probability information and the matching probability information of the target anchor point segment as the corrected matching probability information of the target anchor point segment.
2. The method of claim 1, wherein the obtaining text-video interaction information according to the query text and the target video comprises:
acquiring the dependency relationship among all terms in the query text;
acquiring the dependency relationship among video frames;
and acquiring the text video interaction information according to the dependency relationship among the words and the dependency relationship among the video frames.
3. The method of claim 2,
the obtaining of the dependency relationship among the terms in the query text includes:
sequentially inputting the embedded characteristic vectors of all the words into a first long-short term memory network (LSTM), and acquiring a first implicit vector obtained by processing all the words by the first LSTM as a dependency relationship among all the words;
the obtaining of the dependency relationship between the video frames includes:
and sequentially inputting the characteristic information of each video frame into a second LSTM, and acquiring a second implicit vector obtained by processing the characteristic information of each video frame by the second LSTM as the dependency relationship among the video frames.
4. The method according to claim 3, wherein the obtaining the text-video interaction information according to the dependency relationship between the words and the dependency relationship between the video frames comprises:
performing attention mechanism-based weighting processing on the first hidden vector according to the second hidden vector to obtain a text feature hidden vector;
splicing the text feature hidden vector with the second hidden vector to obtain a first spliced vector;
and inputting the first splicing vector into a third LSTM, and acquiring a third implicit vector obtained by processing the first splicing vector by the third LSTM as the text video interaction information.
5. The method of claim 1, wherein the obtaining contextual interaction information based on the textual video interaction information comprises:
obtaining respective relevance weights of relevant elements corresponding to the video frames, wherein the relevance weights are used for indicating the relevance between the corresponding elements and the elements in a front preset range and a rear preset range;
according to the respective relevance weight of the relevant element corresponding to each video frame, carrying out context fusion on the relevant element corresponding to each video frame to obtain context fusion information;
and acquiring the context interaction information according to the context fusion information.
6. The method according to claim 5, wherein said obtaining the context interaction information according to the context fusion information comprises:
and splicing the context fusion information and the text video interaction information in a residual connection mode to obtain the context interaction information.
7. The method according to any one of claims 1 to 6, wherein the obtaining, according to the context interaction information, matching probability information of anchor segments corresponding to the respective video frames and boundary probability information of the respective video frames includes:
processing the context interaction information through a first classifier to obtain the matching probability information of the anchor point segments corresponding to the video frames;
and processing the context interaction information through a second classifier to obtain boundary probability information of each video frame.
8. The method according to any one of claims 1 to 6, wherein the obtaining text-video interaction information according to the query text and the target video comprises:
when a query request sent by a terminal is received, acquiring the text video interaction information according to the query text and the target video; the query request comprises the query text;
the method further comprises the following steps:
and returning a query result to the terminal, wherein the query result is used for indicating the target video clip.
9. A method for querying a video clip, the method comprising:
displaying a video playing interface for playing a target video, wherein the video playing interface comprises a query control;
when receiving the trigger operation of the query control, acquiring a query text input based on the query control;
sending a query request containing the query text to a server; the query request is used for triggering the server to acquire text video interaction information according to the query text and the target video, the text video interaction information includes related elements corresponding to video frames in the target video, the related elements are used for indicating the correlation between the corresponding video frames and the query text, context interaction information is acquired according to the text video interaction information, the context interaction information is used for indicating the association relationship between the related elements corresponding to the video frames, the matching probability information of anchor point segments corresponding to the video frames and the boundary probability information of the video frames are acquired according to the context interaction information, the anchor point segments are video segments ending with the corresponding video frames in the target video, the matching probability information indicates the probability that the corresponding anchor point segment is matched with the query text, the boundary probability information indicates the probability that the corresponding video frame is a target boundary, the target boundary is the boundary of a video segment matched with the query text, the matching probability information of the anchor segment corresponding to each video frame is corrected through the boundary probability information of each video frame to obtain the corrected matching probability information of the anchor segment corresponding to each video frame, the anchor segments corresponding to each video frame are sequenced in descending order of the probability values corresponding to the corrected matching probability information of the anchor segment corresponding to each video frame to obtain an anchor segment queue, the first M anchor segments of the anchor segment queue are extracted, M is an integer greater than or equal to 2, and the target video segment is obtained according to the first M anchor segments, wherein correcting, through the boundary probability information of each video frame, the matching probability information of the anchor segment corresponding to each video frame to obtain the corrected matching probability information of the anchor segment corresponding to each video frame includes: acquiring boundary probability information of a first video frame of a target anchor segment and boundary probability information of a last video frame of the target anchor segment, wherein the target anchor segment is any anchor segment in the anchor segments corresponding to the video frames, acquiring average probability information of the boundary probability information of the first video frame of the target anchor segment and the boundary probability information of the last video frame of the target anchor segment, and acquiring the sum of the average probability information and the matching probability information of the target anchor segment as the corrected matching probability information of the target anchor segment;
receiving a query result returned by the server, wherein the query result is used for indicating the target video clip;
and adjusting the playing progress of the target video to the initial position of the target video clip according to the query result.
10. An apparatus for querying a video clip, the apparatus comprising:
the text video interaction information acquisition module is used for acquiring text video interaction information according to a query text and a target video, wherein the text video interaction information comprises relevant elements corresponding to all video frames in the target video, and the relevant elements are used for indicating the correlation between the corresponding video frames and the query text;
the context interaction information acquisition module is used for acquiring context interaction information according to the text video interaction information, and the context interaction information is used for indicating the association relationship between the relevant elements corresponding to each video frame;
a probability obtaining module, configured to obtain, according to the context interaction information, matching probability information of anchor segments corresponding to the video frames and boundary probability information of the video frames; the anchor segment is a video segment ending with a corresponding video frame in the target video, the matching probability information indicates the probability that the corresponding anchor segment is matched with the query text, the boundary probability information indicates the probability that the corresponding video frame is a target boundary, and the target boundary is the boundary of the video segment matched with the query text;
a video clip obtaining module, configured to modify the matching probability information of the anchor clip corresponding to each video frame according to the boundary probability information of each video frame, so as to obtain modified matching probability information of the anchor clip corresponding to each video frame;
the video clip acquisition module is further configured to sort the anchor clips corresponding to the video frames according to a descending order of the probability values corresponding to the modified matching probability information of the anchor clips corresponding to the video frames, so as to obtain an anchor clip queue;
the video clip acquisition module is further configured to extract the first M anchor clips of the anchor clip queue, where M is an integer greater than or equal to 2;
the video clip acquisition module is further configured to acquire a target video clip according to the first M anchor clips;
wherein, the video clip acquisition module includes:
acquiring boundary probability information of a first video frame of a target anchor point fragment and boundary probability information of a last video frame of the target anchor point fragment; the target anchor segment is any anchor segment in the anchor segments corresponding to the video frames;
acquiring boundary probability information of a first video frame of the target anchor point segment and average probability information of the boundary probability information of a last video frame of the target anchor point segment;
and acquiring the sum of the average probability information and the matching probability information of the target anchor point segment as the corrected matching probability information of the target anchor point segment.
11. A computer device comprising a processor and a memory, the memory having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, the at least one instruction, the at least one program, the set of codes, or the set of instructions being loaded and executed by the processor to implement the video clip querying method according to any one of claims 1 to 9.
12. A computer readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by a processor to implement the video segment query method of any of claims 1 to 9.
CN201910186054.4A 2019-03-12 2019-03-12 Video clip query method, device, computer equipment and storage medium Active CN109905772B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910186054.4A CN109905772B (en) 2019-03-12 2019-03-12 Video clip query method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910186054.4A CN109905772B (en) 2019-03-12 2019-03-12 Video clip query method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN109905772A CN109905772A (en) 2019-06-18
CN109905772B true CN109905772B (en) 2022-07-22

Family

ID=66946948

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910186054.4A Active CN109905772B (en) 2019-03-12 2019-03-12 Video clip query method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN109905772B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110689177B (en) * 2019-09-17 2020-11-20 北京三快在线科技有限公司 Method and device for predicting order preparation time, electronic equipment and storage medium
CN111860289B (en) * 2020-07-16 2024-04-02 北京思图场景数据科技服务有限公司 Time sequence action detection method and device and computer equipment
CN111930999B (en) * 2020-07-21 2022-09-30 山东省人工智能研究院 Method for implementing text query and positioning video clip by frame-by-frame cross-modal similarity correlation
CN112256916B (en) * 2020-11-12 2021-06-18 中国计量大学 Short video click rate prediction method based on graph capsule network
CN112261491B (en) * 2020-12-22 2021-04-16 北京达佳互联信息技术有限公司 Video time sequence marking method and device, electronic equipment and storage medium
CN112738556B (en) * 2020-12-22 2023-03-31 上海幻电信息科技有限公司 Video processing method and device
CN113590874B (en) * 2021-09-28 2022-02-11 山东力聚机器人科技股份有限公司 Video positioning method and device, and model training method and device
CN114245171B (en) * 2021-12-15 2023-08-29 百度在线网络技术(北京)有限公司 Video editing method and device, electronic equipment and medium
CN114390365B (en) * 2022-01-04 2024-04-26 京东科技信息技术有限公司 Method and apparatus for generating video information
CN116186329B (en) * 2023-02-10 2023-09-12 阿里巴巴(中国)有限公司 Video processing, searching and index constructing method, device, equipment and storage medium
CN116644212B (en) * 2023-07-24 2023-12-01 科大讯飞股份有限公司 Video detection method, device, equipment and readable storage medium
CN116886991B (en) * 2023-08-21 2024-05-03 珠海嘉立信发展有限公司 Method, apparatus, terminal device and readable storage medium for generating video data
CN117076712B (en) * 2023-10-16 2024-02-23 中国科学技术大学 Video retrieval method, system, device and storage medium
CN117668298B (en) * 2023-12-15 2024-05-07 青岛酒店管理职业技术学院 Artificial intelligence method and system for application data analysis

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104301771A (en) * 2013-07-15 2015-01-21 中兴通讯股份有限公司 Method and device for adjusting playing progress of video file
DK201670608A1 (en) * 2016-06-12 2018-01-02 Apple Inc User interfaces for retrieving contextually relevant media content
CN106792212A (en) * 2016-12-02 2017-05-31 乐视控股(北京)有限公司 A kind of video progress adjusting method, device and electronic equipment
CN108388583A (en) * 2018-01-26 2018-08-10 北京览科技有限公司 A kind of video searching method and video searching apparatus based on video content
CN108875610B (en) * 2018-06-05 2022-04-05 北京大学深圳研究生院 Method for positioning action time axis in video based on boundary search

Also Published As

Publication number Publication date
CN109905772A (en) 2019-06-18

Similar Documents

Publication Publication Date Title
CN109905772B (en) Video clip query method, device, computer equipment and storage medium
US10958748B2 (en) Resource push method and apparatus
CN108829822B (en) Media content recommendation method and device, storage medium and electronic device
WO2022022152A1 (en) Video clip positioning method and apparatus, and computer device and storage medium
JP7183385B2 (en) Node classification method, model training method, and its device, equipment and computer program
CN107357875B (en) Voice search method and device and electronic equipment
CN108563722B (en) Industry classification method, system, computer device and storage medium for text information
WO2023060795A1 (en) Automatic keyword extraction method and apparatus, and device and storage medium
CN110347872B (en) Video cover image extraction method and device, storage medium and electronic equipment
WO2020238353A1 (en) Data processing method and apparatus, storage medium, and electronic apparatus
US20130304468A1 (en) Contextual Voice Query Dilation
US11907821B2 (en) Population-based training of machine learning models
CN111563158B (en) Text ranking method, ranking apparatus, server and computer-readable storage medium
CN110263218B (en) Video description text generation method, device, equipment and medium
US20210248425A1 (en) Reinforced text representation learning
CN110147494A (en) Information search method, device, storage medium and electronic equipment
CN110717038B (en) Object classification method and device
US11941867B2 (en) Neural network training using the soft nearest neighbor loss
US20140257810A1 (en) Pattern classifier device, pattern classifying method, computer program product, learning device, and learning method
CN111444380B (en) Music search ordering method, device, equipment and storage medium
KR20190108958A (en) Method and Apparatus for Explicit Lyrics Classification Using Automated Explicit Lexicon Generation and Machine Learning
CN113836388A (en) Information recommendation method and device, server and storage medium
CN113010788B (en) Information pushing method and device, electronic equipment and computer readable storage medium
JP7121819B2 (en) Image processing method and apparatus, electronic device, computer-readable storage medium, and computer program
CN115063858A (en) Video facial expression recognition model training method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant