CN116186329A - Video processing, searching and index constructing method, device, equipment and storage medium - Google Patents


Info

Publication number
CN116186329A
Authority
CN
China
Prior art keywords
video
target
similarity
segment
representation
Prior art date
Legal status
Granted
Application number
CN202310147893.1A
Other languages
Chinese (zh)
Other versions
CN116186329B (en)
Inventor
潘玉霖
吕逸良
龚镖
Current Assignee
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba China Co Ltd
Priority to CN202310147893.1A
Publication of CN116186329A
Application granted
Publication of CN116186329B
Legal status: Active


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/71Indexing; Data structures therefor; Storage structures

Abstract

The disclosure relates to a video processing, searching and index construction method, apparatus, device and storage medium. By segmenting the target video into a plurality of mutually non-overlapping video segments, the method avoids both the redundant computation caused by temporal overlap and the generation of a large number of highly repetitive candidate segments, so no subsequent de-duplication is needed, computing power is saved, and the efficiency of multi-modal search is improved. At the same time, the embodiment avoids temporal downsampling of target videos such as long videos, so that the segmented, mutually non-overlapping video segments retain full temporal resolution and the video segment related to the text information can be located accurately.

Description

Video processing, searching and index constructing method, device, equipment and storage medium
Technical Field
The disclosure relates to the field of information technology, and in particular to a method, an apparatus, a device and a storage medium for video processing, searching and index construction.
Background
A multi-modal search method retrieves information in one modality from information in another modality; for example, given a piece of text, it locates the video segment associated with that text within a long video. In one approach, the long video is cut into a plurality of short videos with a sliding window, and the short videos are then searched for video segments that match the text information. In another approach, the long video is temporally downsampled into a short video, a target video frame related to the text information is determined from the short video, and the related video segment is then located in the long video according to the downsampling interval and the target video frame.
However, the short videos cut by the sliding window overlap one another, and the overlap introduces extra computation, which wastes computing power and lowers the efficiency of multi-modal search. In addition, temporal downsampling discards a large amount of temporal information, so the video segment related to the text information cannot be located accurately.
Disclosure of Invention
In order to solve, or at least partially solve, the above technical problems, the present disclosure provides a method, apparatus, device and storage medium for video processing, searching and index construction, so as to improve the efficiency of multi-modal search and accurately locate the video segments related to a given text.
In a first aspect, an embodiment of the present disclosure provides a video processing method, including:
dividing a target video into a plurality of mutually non-overlapping video clips;
calculating a first representation of the video clip according to the context information of the video clip, and calculating a first similarity between the video clip and a target text according to the first representation;
calculating a second similarity of the video clip to the target text according to a second representation of each video frame included in the video clip;
determining a target similarity between the video segment and the target text according to the first similarity and the second similarity;
and if the target similarity between the video segment and the target text meets a first preset condition, determining that the video segment is a target segment matched with the target text.
In a second aspect, an embodiment of the present disclosure provides a video processing method, including:
acquiring a target text sent by a terminal;
dividing a target video into a plurality of mutually non-overlapping video clips;
calculating a first representation of the video clip according to the context information of the video clip, and calculating a first similarity between the video clip and the target text according to the first representation;
calculating a second similarity of the video clip to the target text according to a second representation of each video frame included in the video clip;
determining target similarity between the video segment and the target text according to the first similarity and the second similarity;
if the target similarity between the video segment and the target text meets a first preset condition, determining that the video segment is a target segment matched with the target text;
and feeding back, to the terminal, at least one of the target segment and the time boundary of the target segment.
In a third aspect, an embodiment of the present disclosure provides a video searching method, including:
receiving a search request input by a user, wherein the search request comprises target text;
searching a target segment matched with the target text from a target video based on the search request, wherein the target similarity of the target segment and the target text comprises a first similarity and a second similarity, the first similarity is calculated according to a first representation of the target segment, the first representation is calculated according to the context information of the target segment, and the second similarity is calculated according to a second representation of each video frame included in the target segment;
and feeding back the target fragment to a user.
In a fourth aspect, an embodiment of the present disclosure provides an index building method, including:
acquiring a target video, and segmenting the target video into a plurality of mutually non-overlapping video segments;
creating an index record for each video segment, wherein the index record comprises a first representation of the video segment and a second representation of each video frame included in the video segment, and the first representation is calculated according to the context information of the video segment; and
forming the index records corresponding to the video segments into an index database.
In a fifth aspect, embodiments of the present disclosure provide a video searching method, the method including:
receiving a search request input by a user, wherein the search request comprises target text;
inputting the target text into a pre-constructed index database, searching to obtain a target segment matched with the target text, wherein the target similarity between the target segment and the target text comprises a first similarity and a second similarity, the first similarity is calculated according to a first representation of the target segment, the first representation is calculated according to context information of the target segment, and the second similarity is calculated according to a second representation of each video frame included in the target segment.
In a sixth aspect, an embodiment of the present disclosure provides a video processing apparatus, including:
the segmentation module is used for segmenting the target video into a plurality of mutually non-overlapping video clips;
the first computing module is used for computing a first representation of the video clip according to the context information of the video clip, and computing a first similarity between the video clip and a target text according to the first representation;
a second calculation module, configured to calculate a second similarity between the video segment and the target text according to a second representation of each video frame included in the video segment;
the first determining module is used for determining the target similarity between the video segment and the target text according to the first similarity and the second similarity;
and the second determining module is used for determining that the video fragment is a target fragment matched with the target text if the target similarity between the video fragment and the target text meets a first preset condition.
In a seventh aspect, embodiments of the present disclosure provide an electronic device, including:
a memory;
a processor; and
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor to implement the method of the first, second, third, fourth or fifth aspect.
In an eighth aspect, an embodiment of the present disclosure provides a computer readable storage medium having stored thereon a computer program for execution by a processor to implement the method of the first, second, third, fourth, or fifth aspects.
According to the video processing, searching and index construction method, apparatus, device and storage medium provided by the embodiments of the present disclosure, the target video is segmented into a plurality of mutually non-overlapping video segments, which avoids both the redundant computation caused by temporal overlap and the generation of a large number of highly repetitive candidate segments; no subsequent de-duplication is needed, computing power is saved, and the efficiency of multi-modal search is improved. In addition, a first representation of each video segment is calculated from the context information of the segment, and a first similarity between the segment and the target text is calculated from that first representation; a second similarity between the segment and the target text is calculated from the second representation of each video frame included in the segment; and the target similarity between the segment and the target text is then determined from the first similarity and the second similarity. By combining information external to the video segment with information internal to it, the embodiment improves the accuracy of the target similarity, so that the target segment selected according to the target similarity is the video segment that best matches, or is most relevant to, the target text, which improves the accuracy of segment localization. At the same time, the embodiment avoids temporal downsampling of target videos such as long videos, so that the segmented, mutually non-overlapping video segments retain full temporal resolution and the video segment related to the text information can be located accurately.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure.
In order to more clearly illustrate the embodiments of the present disclosure or the solutions in the prior art, the drawings that are required for the description of the embodiments or the prior art will be briefly described below, and it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.
Fig. 1 is a schematic diagram of an application scenario provided in an embodiment of the present disclosure;
fig. 2 is a schematic diagram of another application scenario provided in an embodiment of the present disclosure;
fig. 3 is a schematic diagram of still another application scenario provided in an embodiment of the present disclosure;
fig. 4 is a flowchart of a video processing method according to an embodiment of the present disclosure;
FIG. 5 is a flowchart of a video processing method according to another embodiment of the present disclosure;
FIG. 6 is a flowchart of a video processing method according to another embodiment of the present disclosure;
FIG. 7 is a schematic diagram of coarse ordering provided by another embodiment of the present disclosure;
FIG. 8 is a flowchart of a video processing method according to another embodiment of the present disclosure;
FIG. 9 is a schematic illustration of a fine ordering provided by another embodiment of the present disclosure;
FIG. 10 is a flowchart of a video processing method according to another embodiment of the present disclosure;
FIG. 11 is a flowchart of a video processing method according to an embodiment of the present disclosure;
FIG. 12 is a flowchart of a video processing method according to another embodiment of the present disclosure;
fig. 13 is a schematic structural diagram of a video processing apparatus according to another embodiment of the present disclosure;
fig. 14 is a schematic structural diagram of an embodiment of an electronic device provided in an embodiment of the disclosure;
fig. 15 is a flowchart of a video searching method provided in an embodiment of the present disclosure;
fig. 16 is a schematic diagram of another application scenario provided in an embodiment of the present disclosure;
FIG. 17 is a flowchart of an index building method provided by an embodiment of the present disclosure;
fig. 18 is a flowchart of a video searching method provided in an embodiment of the present disclosure.
Detailed Description
In order that the above objects, features and advantages of the present disclosure may be more clearly understood, a further description of aspects of the present disclosure will be provided below. It should be noted that, without conflict, the embodiments of the present disclosure and features in the embodiments may be combined with each other.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure, but the present disclosure may be practiced otherwise than as described herein; it will be apparent that the embodiments in the specification are only some, but not all, embodiments of the disclosure.
It should be noted that the target video (including but not limited to videos uploaded by a user to a cloud server, videos pre-stored on the cloud server, and the like) and the target text (including but not limited to text information sent by a user to the cloud server through a terminal, and the like) involved in this application are information and data authorized by the user or fully authorized by all parties; the collection, use and processing of the related data must comply with the relevant laws, regulations and standards of the relevant countries and regions, and corresponding operation entries are provided for the user to grant or refuse authorization.
In addition, the following terms are used in the description of the video processing method provided in this application:
anchor: a predefined segment interval.
frame: a video frame, i.e., the image in a video at a particular timestamp.
pre-ranking: coarse ranking.
re-ranking: fine ranking.
self-attention: the self-attention mechanism.
query: the input text.
In general, a multi-modal search method retrieves information in one modality from information in another modality, for example, searching for video segments with text information. At present there are two ways to search for video segments with text information. The first is to convert the text information into a predefined tag and then search the database for video segments carrying that tag. The second is to input the original video and the text information into a machine learning model trained by deep learning, so that the model outputs the start and end times of the video segment in the original video that matches the text information. Since the first way requires the tag to belong to a predefined tag set, its application is limited when the user cannot obtain the tag set; the second way has no such limitation and is therefore more user-friendly. With current deep-learning-based methods, the machine learning model can support segment localization well in short videos (e.g., tens of seconds to a few minutes long). If the start and end times of a video segment matching the text information are to be located in a long video (e.g., a two-hour full movie) by means of a machine learning model, some additional strategies need to be introduced. These additional strategies generally include the following two, each of which is described below.
One additional strategy is based on a sliding window. Specifically, a window covering a short time span is preset and slid over the long video from beginning to end, so that the long video is cut into a set of short videos that overlap in time; segment localization is then performed on each short video separately, yielding a large set of candidate segments; finally, all candidate segments are re-ranked globally, and the top-N candidate segments in the ranking are taken as the final search result, that is, the target segments.
The other additional strategy is based on temporal downsampling: the long video is temporally downsampled into a short video, and a target video frame related to the text information is determined from the short video. Further, the video segment associated with the text information, which may be denoted as the target segment, is determined from the long video according to the downsampling interval and the target video frame.
The short videos cut by the sliding window are made to overlap in time in order to guarantee the completeness of the candidate segments: for example, if the target segment lies exactly at the junction of two short videos and the two short videos do not intersect, the target segment cannot be localized, that is, its start and end times cannot be determined. However, this overlap introduces extra computation and also produces a large number of highly repetitive candidate segments, which not only wastes computing power but also lowers the efficiency of multi-modal search. In addition, temporal downsampling discards a large amount of temporal information, so the video segment related to the text information cannot be located accurately.
In view of these problems, the embodiments of the present disclosure provide a video processing method, which is applicable to the following application scenarios.
In one scenario, the video processing method serves a customer that owns a large amount of video data, such as long videos; the customer may be an individual user (e.g., a video blogger) or a company. The video data may be stored on a database server local to the customer (e.g., server 12 shown in Fig. 1), to which the video blogger or company staff can send text information through a terminal (e.g., terminal 11 shown in Fig. 1), such as a mobile phone, a computer or a tablet computer. The text information may be text entered by the video blogger or company staff on the user interface of the terminal, such as "please find a video clip of a snowfall". Alternatively, the terminal may collect a voice instruction from the video blogger or company staff and recognize it as text information through automatic speech recognition (ASR), or the terminal may send the collected voice instruction to the database server, which recognizes it as text information through ASR.
When the database server obtains the text information, it may perform the video processing method described in this embodiment on each long video and search some of the long videos for video segments matching the text information. It will be appreciated that a long video may contain one or more video segments matching the text information. Further, the database server may feed back the one or more video segments it finally finds to the terminal. That is, in this scenario, the customer can search the video resources it owns and quickly locate the target segment using the search function, thereby activating the value of the video resources and reducing their sunk cost. Alternatively, the customer may authorize other customers to search its video resources.
In addition, the large amount of video data owned by the customer may be stored on a cloud server, where the cloud server is a server cluster provided with a plurality of servers and, similar to a general computer architecture, includes processors, hard disks, memory, a system bus and the like. The cloud server 20 shown in Fig. 2 includes a server 21, a server 22 and a server 23, which constitute a server cluster. It will be appreciated that the number of servers in the cluster is not limited to that shown in Fig. 2, which is merely illustrative.
In addition, when the storage space of the terminal is sufficient to store the long video, or the processing performance of the terminal is good enough, the terminal itself may search the long video for video segments matching the text information.
In another application scenario, a user may search long videos uploaded by other users, or long videos provided by a third party (e.g., a video provider). As shown in Fig. 3, user A of terminal 31 uploads a long video shot by user A, for example a video of user A cooking a dish, to server 32. If user B of terminal 33, while downloading the long video from server 32 or watching it online, is interested in the seasoning user A used while cooking, user B may enter text information such as "what seasoning is he cooking with" on the user interface provided by terminal 33, or say it by voice, in which case terminal 33 recognizes the voice information of user B as text information. Terminal 33 may then search the long video for a video segment matching the text information and present it to user B. Alternatively, terminal 33 sends the text information to server 32, which searches the long video for a matching video segment and feeds it back to terminal 33. In addition, server 32 may be a single server or a server cluster, such as the cloud server described above.
It can be understood that the video processing method described in this embodiment is applicable not only to the above scenarios but also to other scenarios in which a target segment needs to be searched for, which are not described here again.
In addition, as can be seen from the above application scenarios, the video processing method of this embodiment may be executed not only on a terminal but also in the cloud (for example, on a cloud server).
The following describes the video processing method, by way of example, as executed on a cloud server, in connection with specific embodiments.
Fig. 4 is a flowchart of a video processing method according to an embodiment of the present disclosure. As shown in fig. 4, the method specifically comprises the following steps:
s401, segmenting the target video into a plurality of mutually non-overlapping video clips.
For example, in Fig. 1 the target video may be one of a plurality of long videos stored in advance on the server 12, and the server 12 processes the target video in the same way as it processes the other long videos, so this embodiment is described schematically with one long video, i.e., the target video, as an example. Specifically, a trained machine learning model may be deployed on the server 12; the training of the model may be performed on the server 12, on another server, or on another device, and in other embodiments the trained model may also be deployed on a terminal. The trained machine learning model realizes end-to-end segment localization in long videos: after a long video and text information are input into the model, the model directly predicts and outputs the video segment in the long video that matches the text information, or the start and end times of that segment. In other embodiments, the trained machine learning model may be a multi-modal pre-trained large model. The processing inside the trained model is shown in Fig. 5. The long video comprises a plurality of video frames forming a video frame sequence, and the model includes a visual feature encoder and a text feature encoder. When the long video and the text information (e.g., the query text shown in Fig. 5) are input into the model, the visual feature encoder encodes each video frame in the sequence; encoding a video frame amounts to extracting features from it, so each video frame corresponds to a feature vector, which may be recorded as the original representation of that frame, and the feature vectors of all frames in the sequence may be recorded as the video frame feature sequence. The text feature encoder encodes the query text to obtain its feature vector, i.e., its representation; for example, when the query text is a single piece of text, the whole text corresponds to one feature vector. Further, starting from the first frame of the long video, the model groups every M consecutive video frames into one video segment, which may be denoted as an anchor, according to a preset length M. There is no overlap between two adjacent video segments, so the long video is split according to the preset length M into a plurality of non-overlapping video segments, each containing M video frames.
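For illustration only, the following sketch (not taken from the patent) shows how a per-frame feature sequence could be grouped into non-overlapping anchors of length M; the tensor shapes, the frame rate and the handling of a partial last segment are assumptions.

```python
import torch

def split_into_anchors(frame_features: torch.Tensor, m: int) -> torch.Tensor:
    """Group consecutive frame features into non-overlapping segments (anchors).

    frame_features: (T, D) tensor, one D-dimensional feature per video frame.
    Returns a (T // m, m, D) tensor; trailing frames that do not fill a complete
    anchor are dropped here for simplicity (an assumption, the patent does not
    specify how a partial last segment is handled).
    """
    t, d = frame_features.shape
    num_anchors = t // m
    return frame_features[: num_anchors * m].reshape(num_anchors, m, d)

# Example: a 2-hour video sampled at 1 frame/s and encoded into 512-dim features,
# grouped into anchors of M = 8 frames.
frames = torch.randn(7200, 512)
anchors = split_into_anchors(frames, m=8)
print(anchors.shape)  # torch.Size([900, 8, 512])
```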
S402, calculating a first representation of the video clip according to the context information of the video clip, and calculating a first similarity between the video clip and a target text according to the first representation.
For example, for a video segment, compressing, e.g., averaging, the feature vectors of the M video frames it contains yields one feature vector, which may be recorded as the original representation of the video segment. The long video is split into a plurality of mutually non-overlapping video segments, and the context information of a given video segment can be obtained from its original representation and the original representations of the other video segments; for example, the context information may be the association between that video segment and the other video segments. Further, the context information of the video segment is encoded by a transformer module; that is, the original representation of the video segment is updated according to its context information, yielding the first representation of the video segment. In other words, the first representation of a video segment is a representation based on the segment's context information. The transformer may specifically be a temporal hierarchical vision transformer using shifted windows (a temporal Swin Transformer), where the hierarchical vision transformer using shifted windows, i.e., the Swin Transformer, is a recent Transformer architecture.
Further, a first similarity between the video segment and the target text is calculated based on the first representation of the video segment and the feature vector of the query text, i.e., the target text, described above. That is, the first similarity is calculated from information external to the video segment (for example, the association between the video segment and other video segments).
It will be appreciated that, assuming the long video is split into 1000 video segments, each of the 1000 video segments has a first similarity with the target text, giving 1000 first similarities; the values differ from segment to segment, but the calculation of each first similarity is analogous and is not described in detail here.
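A minimal sketch of how the first similarity could be computed, assuming a generic transformer encoder as a stand-in for the temporal Swin Transformer named above and cosine similarity as the similarity measure; the dimensions, module names and similarity choice are illustrative assumptions, not the patent's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextEncoder(nn.Module):
    """Stand-in for the temporal (Swin-style) transformer: updates each segment's
    original representation using the other segments as context."""

    def __init__(self, dim: int = 512, num_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, anchors: torch.Tensor) -> torch.Tensor:
        # anchors: (num_anchors, M, D). The original segment representation is the
        # average of the frame features inside each segment.
        original = anchors.mean(dim=1)                           # (num_anchors, D)
        # Self-attention over the segment sequence injects context information.
        return self.encoder(original.unsqueeze(0)).squeeze(0)    # first representations

def first_similarity(first_repr: torch.Tensor, text_repr: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between every segment's first representation and the
    (single) target-text representation."""
    return F.cosine_similarity(first_repr, text_repr.unsqueeze(0), dim=-1)

anchors = torch.randn(900, 8, 512)     # e.g. from the segmentation sketch above
text_repr = torch.randn(512)           # output of the text feature encoder
s1 = first_similarity(ContextEncoder()(anchors), text_repr)   # (900,) first similarities
```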
S403, calculating second similarity between the video clip and the target text according to the second representation of each video frame included in the video clip.
For example, a video segment includes M video frames, each corresponding to an original representation. From the original representations of the M video frames, the association between each video frame and the other video frames can be calculated; the original representation of the video frame is then updated according to this association, yielding the second representation of the video frame. The second representation of every video frame in the segment can be calculated in the same way. Further, from the second representation of each video frame in the segment and the representation of the target text, the similarity between each video frame and the target text is calculated, and the second similarity between the video segment and the target text is calculated from these per-frame similarities. That is, the second similarity is calculated from information internal to the video segment (for example, the associations between the video frames within the segment).
In one possible implementation, assuming the long video is split into 1000 video segments, each of them has a second similarity with the target text, giving 1000 second similarities; the values differ from segment to segment, but the calculation of each second similarity is analogous and is not described in detail here.
In another possible implementation, calculating the second similarity between the video segment and the target text according to the second representation of each video frame included in the video segment comprises: calculating the second similarity only when the first similarity between the video segment and the target text satisfies a second preset condition. In other words, before the second similarity between a video segment and the target text is calculated, the first similarity between that segment and the target text is required to satisfy the second preset condition.
Optionally, the first similarity between the video segment and the target text satisfying the second preset condition means that the first similarity places the video segment within the first preset number of positions at the top of a first ranking result, where the first ranking result is obtained by arranging the plurality of mutually non-overlapping video segments in a first descending order according to their first similarities with the target text.
Specifically, the plurality of mutually non-overlapping video segments are arranged in the first descending order according to their first similarities with the target text to obtain the first ranking result, and the first similarity between the video segment and the target text places the segment within the first preset number of positions at the top of the first ranking result.
For example, the 1000 video segments described above are arranged in descending order of their 1000 first similarities, that is, a segment ranked earlier has a larger first similarity than a segment ranked later. This descending arrangement is denoted as the first descending order; for example, it may be the coarse ranking shown in Fig. 5. The arranged result is recorded as the first ranking result. Further, the first K (Top-K) video segments in the first ranking result are taken, where K is recorded as the first preset number, and for each of these K video segments the second similarity with the target text is calculated.
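The coarse ranking and Top-K truncation described above can be sketched as follows; the value K = 100 only echoes the example given later in the text and is otherwise an assumption.

```python
import torch

def coarse_rank_top_k(first_sims: torch.Tensor, k: int) -> torch.Tensor:
    """Sort segments by first similarity in descending order (coarse ranking)
    and keep the Top-K indices for the subsequent fine ranking."""
    order = torch.argsort(first_sims, descending=True)   # first ranking result
    return order[:k]

first_sims = torch.rand(1000)                # first similarity of 1000 segments
top_k_idx = coarse_rank_top_k(first_sims, k=100)
```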
S404, determining the target similarity between the video segment and the target text according to the first similarity and the second similarity.
If a video segment has both a first similarity and a second similarity with the target text, the target similarity between the segment and the target text is determined from the first similarity and the second similarity; for example, the target similarity may be the sum of the two.
S405, if the target similarity between the video segment and the target text meets a first preset condition, determining that the video segment is a target segment matched with the target text.
That is, if the target similarity between the video segment and the target text satisfies a first preset condition, the video segment is determined to be a target segment matching the target text.
Optionally, the target similarity between the video segment and the target text satisfying the first preset condition means that the target similarity places the video segment within the second preset number of positions at the top of a second ranking result, where the second ranking result is obtained by arranging the video segments in the first ranking result in a second descending order according to their target similarities with the target text.
Specifically, the video segments in the first ranking result are arranged in the second descending order according to their target similarities with the target text to obtain the second ranking result, and the target similarity between the video segment and the target text places the segment within the second preset number of positions at the top of the second ranking result.
For example, each of the K video segments described above has both a first similarity and a second similarity with the target text, and therefore has a target similarity with the target text, giving K target similarities. The K video segments are then arranged in descending order of target similarity, that is, a segment ranked earlier has a larger target similarity than a segment ranked later. This descending arrangement is denoted as the second descending order; for example, it may be the fine ranking shown in Fig. 5. The arranged result is recorded as the second ranking result. Further, the first N (Top-N) video segments in the second ranking result are taken, where N is recorded as the second preset number and N is greater than or equal to 1. In this embodiment, the larger the target similarity, the higher the degree of matching or correlation between a video segment and the target text, so the N video segments can be taken as the target segments matching the target text. Further, the machine learning model may output the N video segments, or output their respective start and end times.
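A sketch of the fine ranking step, assuming the target similarity is the sum of the first and second similarities as in the example above; the values of K and N are illustrative assumptions.

```python
import torch

def fine_rank_top_n(first_sims_k: torch.Tensor, second_sims_k: torch.Tensor, n: int):
    """Target similarity = first + second similarity (as in S404); sort the K
    coarse-ranked candidates by it and keep the Top-N as target segments."""
    target_sims = first_sims_k + second_sims_k
    order = torch.argsort(target_sims, descending=True)     # second ranking result
    return order[:n], target_sims[order[:n]]

# e.g. K = 100 candidates surviving the coarse ranking, N = 5 target segments
first_sims_k = torch.rand(100)
second_sims_k = torch.rand(100)
top_n_idx, top_n_scores = fine_rank_top_n(first_sims_k, second_sims_k, n=5)
```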
According to the embodiment of the present disclosure, segmenting the target video into a plurality of mutually non-overlapping video segments avoids both the redundant computation caused by temporal overlap and the generation of a large number of highly repetitive candidate segments; no subsequent de-duplication is needed, computing power is saved, and the efficiency of multi-modal search is improved. In addition, a first representation of each video segment is calculated from its context information, and a first similarity between the segment and the target text is calculated from that first representation; a second similarity is calculated from the second representation of each video frame in the segment; and the target similarity is then determined from the first and second similarities. By combining information external and internal to the video segment, this embodiment improves the accuracy of the target similarity, so that the target segment selected according to it is the video segment that best matches, or is most relevant to, the target text, which improves the accuracy of segment localization. At the same time, the embodiment avoids temporal downsampling of target videos such as long videos, so that the segmented, mutually non-overlapping video segments retain full temporal resolution and the video segment related to the text information can be located accurately.
Fig. 6 is a flowchart of a video processing method according to another embodiment of the present disclosure. In this embodiment, segmenting the target video into a plurality of mutually non-overlapping video segments comprises: segmenting the target video, for each of a plurality of levels, into a plurality of mutually non-overlapping video segments at that level, where video segments at different levels have different time lengths.
As shown in Fig. 7, after each video frame in the long video is encoded, a video frame feature sequence is obtained; every M consecutive video frames are then grouped into one video segment, i.e., the video is divided into segments. Suppose this division yields 1000 video segments, which are taken as the video segments of the first level. The context information of each of these 1000 segments is then encoded by the Temporal Swin Transformer to obtain the first representation of each segment at the first level. At the same time, multiple pooling layers are used to obtain video segments at multiple levels. For example, the first pooling layer 71 may merge each pair of adjacent segments among the 1000 segments, e.g., the 1st and 2nd segments, the 3rd and 4th segments, and so on, yielding 500 segments, which serve as the video segments of the second level. The second pooling layer similarly merges adjacent pairs among the 500 segments, yielding 250 segments, which serve as the video segments of the third level. It can be seen that the time length of a video segment at each level is twice that of a segment at the previous level, and the merging performed by each pooling layer amounts to a re-division of the long video. In addition, as shown in Fig. 7, below each pooling layer there is a Temporal Swin Transformer module, which encodes the context information of the video segments at the level corresponding to that pooling layer to obtain the first representation of each segment at that level. In this embodiment there are L levels; assuming L = 4, the fourth level contains 125 video segments.
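A rough sketch of the multi-level construction, assuming average pooling as the merge operation; the text only states that adjacent segments are merged by a pooling layer, so the pooling type and dimensions are assumptions.

```python
import torch
import torch.nn.functional as F

def build_hierarchy(segment_feats: torch.Tensor, num_levels: int):
    """Build multiple levels of segment representations by repeatedly merging
    pairs of adjacent segments, so that a segment at level l+1 spans twice the
    time of a segment at level l."""
    levels = [segment_feats]
    feats = segment_feats
    for _ in range(num_levels - 1):
        # (N, D) -> (1, D, N) -> average adjacent pairs -> (N // 2, D)
        feats = F.avg_pool1d(feats.t().unsqueeze(0), kernel_size=2).squeeze(0).t()
        levels.append(feats)
    return levels

levels = build_hierarchy(torch.randn(1000, 512), num_levels=4)
print([lv.shape[0] for lv in levels])   # [1000, 500, 250, 125]
```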
Correspondingly, calculating a first representation of the video segment according to its context information, and calculating a first similarity between the video segment and the target text according to the first representation, includes the following steps:
s601, updating original characterization of the video clips according to association relations between the video clips and other video clips to obtain first characterization of the video clips, wherein the original characterization of the video clips is obtained by fusing original characterization of each video frame included in the video clips.
For example, taking the first of the L levels as an example, the context information of each video segment at the first level may be the association between that segment and the other segments at the first level. The original representation of the segment is updated according to this context information to obtain its first representation; the original representation itself may be calculated by fusing, e.g., averaging, the original representations of the M video frames in the segment. Following the same procedure, the first representation of each video segment at the first level, the second level, ..., and the L-th level can be calculated, and all of these first representations can be gathered into one set. For example, with L = 4 there are 1875 video segments across the 4 levels.
S602, calculating the first similarity between the video segment and the target text according to the first representation and the representation of the target text.
For example, the first similarity between each video segment in the set and the target text is calculated from the first representation of the segment and the representation of the target text. The set contains 1875 first representations, so 1875 first similarities are obtained, and the 1875 video segments are arranged in descending order of first similarity, i.e., coarse-ranked. The representation of the target text may also be referred to as the query feature. Further, the Top-K video segments can be taken from the coarse ranking result as the objects of the fine ranking; for example, with K = 100 the 100 retained segments become the objects of the fine ranking.
Alternatively, in other embodiments, the video segments of the 4 levels do not form one set; instead, the segments of each level compute their first similarities with the target text separately, and the segments within each level are arranged in descending order of these similarities, so that the L levels correspond to L coarse ranking results. Top-K video segments are then taken from each of the L coarse ranking results as the objects of the fine ranking; for example, with K = 100 and L = 4, the 400 retained segments become the objects of the fine ranking.
For each video segment that is an object of the fine ranking, its second similarity with the target text needs to be calculated.
Specifically, calculating the second similarity between the video segment and the target text according to the second representation of each video frame included in the video segment includes the following steps, as shown in Fig. 8:
s801, updating the original representation of the video frame according to the association relation between the video frame and other video frames in the video segment for each video frame included in the video segment, and obtaining a second representation of the video frame.
For example, consider one of the video segments among the objects of the fine ranking. The segment includes M video frames, each with an original representation; the original representations of the M video frames may be denoted as the intra-segment frame representations shown in Fig. 9. For each of the M video frames, the original representation of the frame is encoded with a multi-head self-attention mechanism: the original representation is updated according to the association between that frame and the other frames among the M frames, yielding the second representation of the frame. The second representation of each of the M frames can be calculated in the same way, and these may be denoted as the video segment content representations shown in Fig. 9.
S802, calculating the similarity between each video frame included in the video clip and the target text according to the second representation of each video frame included in the video clip.
As shown in Fig. 9, after the second representation of each of the M video frames is obtained, the similarity between each frame and the target text can be calculated from the second representation of the frame and the representation of the target text, yielding M similarities.
S803, calculating second similarity between the video clip and the target text according to the similarity between each video frame included in the video clip and the target text.
For example, the M similarities described above are fused to obtain the second similarity between the video segment and the target text, which may be recorded as the content-based similarity of the segment shown in Fig. 9; the first similarity between the segment and the target text may correspondingly be recorded as the context-based similarity shown in Fig. 9. Adding the first similarity and the second similarity gives the target similarity between the segment and the target text, i.e., the final similarity. Following the same procedure, the target similarity between each fine-ranking object and the target text can be obtained, and the fine-ranking objects are then fine-ranked according to these target similarities. The Top-N video segments in the fine ranking result can be taken as the target segments that are finally retrieved.
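A minimal sketch of steps S801 to S803, assuming a single multi-head self-attention layer for the frame-level encoding, cosine similarity per frame, and mean fusion of the per-frame similarities; all of these concrete choices are assumptions, the text only says the per-frame similarities are fused.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameContentEncoder(nn.Module):
    """Multi-head self-attention over the M frames inside one segment, producing
    a second representation per frame (a sketch of S801)."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (M, D) original frame representations of one segment.
        x = frame_feats.unsqueeze(0)          # (1, M, D)
        out, _ = self.attn(x, x, x)           # frames attend to each other
        return out.squeeze(0)                 # (M, D) second representations

def second_similarity(second_reprs: torch.Tensor, text_repr: torch.Tensor) -> torch.Tensor:
    """Per-frame similarity to the text (S802), fused by averaging into one
    content-based similarity for the segment (S803)."""
    per_frame = F.cosine_similarity(second_reprs, text_repr.unsqueeze(0), dim=-1)
    return per_frame.mean()

segment_frames = torch.randn(8, 512)      # M = 8 frames of one candidate segment
text_repr = torch.randn(512)
s2 = second_similarity(FrameContentEncoder()(segment_frames), text_repr)
```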
In this way, the first similarity and the second similarity between each video segment and the target text are calculated accurately, so the target similarity between each segment and the target text is calculated accurately, and the target segment that best matches the target text is then retrieved according to the target similarities.
Since the initial boundary of each video segment is preset, the start and end times of the target segment determined according to the above embodiment are fixed, for example, 3 seconds to 5 seconds. In practice, however, the video content before the 3-second mark or after the 5-second mark may also be related to or match the target text, so the start and end times of the target segment can be adjusted through the following embodiment, bringing them closer to the true target boundary and achieving more precise localization. The specific process is shown in Fig. 10.
Fig. 10 is a flowchart of a video processing method according to another embodiment of the present disclosure. In this embodiment, after determining that the video clip is a target clip matching the target text, the method further includes the following steps:
S1001, calculating a third representation of the video segment according to the second representation of each video frame included in the video segment.
For example, a target segment may include M video frames. According to the above embodiment, the second representation of each of the M video frames can be calculated, and averaging these second representations yields the third representation of the target segment. The third representation of the target segment is thus a content-based representation of the segment, such as the content-based representation of the video segment shown in Fig. 11.
S1002, adjusting the time boundary of the target segment according to the first representation of the video segment, the third representation of the video segment and the representation of the target text.
For example, the first representation of a video segment described above is a representation based on the segment's context information, such as the context-based representation of the video segment shown in Fig. 11. The time boundary of the target segment can therefore be adjusted based on the context-based representation of the target segment, the content-based representation of the target segment, and the representation of the target text.
Optionally, adjusting the time boundary of the target segment according to the first representation of the video segment, the third representation of the video segment and the representation of the target text includes: obtaining a first feature from the first representation of the video segment and the representation of the target text; obtaining a second feature from the third representation of the video segment and the representation of the target text; concatenating the first feature and the second feature to obtain a target feature; predicting the offset of the time boundary of the target segment from the target feature; and adjusting the time boundary of the target segment according to the offset.
For example, as shown in Fig. 11, the first feature is obtained from the context-based representation of the target segment and the representation of the target text, e.g., by taking the dot (element-wise) product of the two; the second feature is obtained analogously from the content-based representation of the target segment and the representation of the target text. The first feature and the second feature are then concatenated to obtain the target feature, and the offset of the time boundary of the target segment is predicted from the target feature. Optionally, since the time boundary includes a left boundary and a right boundary, the offset includes a left offset and a right offset, which may be signed, e.g., "-" and "+". Further, the left and right boundaries of the target segment are adjusted according to the left and right offsets, respectively: the left offset, a signed adjustment value, is added to the left boundary, and the right offset, likewise a signed adjustment value, is added to the right boundary.
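A sketch of the boundary adjustment, assuming element-wise products for the first and second features and a small MLP as the offset predictor; the fusion operator and the regressor architecture are assumptions, only the overall flow follows the description above.

```python
import torch
import torch.nn as nn

class BoundaryRegressor(nn.Module):
    """Fuse the context-based and content-based representations of a target
    segment with the text representation and predict signed left/right offsets."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 2))

    def forward(self, first_repr, third_repr, text_repr):
        feat1 = first_repr * text_repr          # first feature (context-based)
        feat2 = third_repr * text_repr          # second feature (content-based)
        target_feat = torch.cat([feat1, feat2], dim=-1)
        return self.mlp(target_feat)            # (left_offset, right_offset)

def adjust_boundary(start: float, end: float, offsets: torch.Tensor):
    """Add the signed offsets (here in seconds) to the preset segment boundary."""
    left, right = offsets.tolist()
    return start + left, end + right

first_repr = torch.randn(512)     # context-based representation of the target segment
third_repr = torch.randn(512)     # mean of the frame-level second representations
text_repr = torch.randn(512)
offsets = BoundaryRegressor()(first_repr, third_repr, text_repr)
new_start, new_end = adjust_boundary(3.0, 5.0, offsets)
```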
S1003, outputting the adjusted time boundary of the target segment.
For example, when the Top-N video segments in the fine ranking result are taken as the target segments and N is greater than 1, the time boundary of each of the N segments can be adjusted separately and the segment localization result output; the localization result may be the adjusted time boundary of each of the N video segments.
This embodiment makes full use of information external and internal to the target segment by combining its context-based representation with its content-based representation, so that the offset of the target segment's time boundary can be calculated accurately from these two representations and the time boundary can be adjusted accurately and flexibly. In addition, since the method described in this embodiment does not run model inference on short videos, the length of the finally predicted target segment is not restricted, which gives the prediction a greater degree of freedom. Moreover, performing fine ranking on top of coarse ranking gradually narrows the range of candidate target segments, which greatly improves search efficiency, reduces computational complexity and achieves precise segment localization.
On the basis of the above embodiment, the first descending order and the second descending order are each implemented by a machine learning model, and the machine learning model is trained with a dual approximate ranking loss function that includes a first loss term and a second loss term. The first loss term is used to calculate, with the sample text fixed, the difference between the machine learning model's descending-order arrangement of a plurality of video sample segments according to their similarities to the fixed sample text and the true ordering of those video sample segments. The second loss term is used to calculate, with the video sample segment fixed, the difference between the machine learning model's descending-order arrangement of a plurality of sample texts according to their similarities to the fixed video sample segment and the true ordering of those sample texts.
For example, the coarse sorting described above is denoted as the first descending order and the fine sorting as the second descending order, and both are implemented inside the machine learning model described above. During training, the machine learning model is trained with a dual approximate ranking loss function (Dual-form Approximate Rank Loss). Specifically, the dual approximate ranking loss function includes a first loss term and a second loss term, and each loss term may be an approximate (Approx) normalized discounted cumulative gain (Normalized Discounted Cumulative Gain, NDCG) loss term. For example, during training, the machine learning model ranks a plurality of video sample segments in descending order according to their respective similarities to a given sample text (a similarity analogous to the target similarity described above). The first loss term is used to calculate the difference between this descending order and the true ordering of the plurality of video sample segments when the sample text is given, i.e., fixed. The true ordering of the plurality of video sample segments may be a pre-annotated ordering. Optimizing the ordering of the plurality of video sample segments according to the first loss term causes video segments with high similarity to be ranked before video segments with low similarity.
In addition, the machine learning model may also rank a plurality of sample texts in descending order according to their respective similarities to a given video sample segment (again, a similarity analogous to the target similarity described above). The second loss term is used to calculate the difference between this descending order and the true ordering of the plurality of sample texts when the video sample segment is given, i.e., fixed. The true ordering of the sample texts may be a pre-annotated ordering. Optimizing the ordering of the plurality of sample texts according to the second loss term causes sample texts with high similarity to be ranked before sample texts with low similarity.
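The sketch below shows one possible form of such a dual approximate ranking loss, built from two ApproxNDCG-style terms (text-to-segment and segment-to-text). The sigmoid-based soft rank, the temperature value, and the function names are assumptions made for illustration, not the patent's exact formulation.

```python
import torch

def approx_ndcg_loss(scores, relevance, temperature=0.1):
    """Differentiable ApproxNDCG-style term: approximate each item's rank with
    sigmoids over pairwise score differences, then return 1 - NDCG."""
    # diff[..., i, j] = (s_i - s_j) / T; soft rank of item i counts how many
    # items score higher than it.
    diff = (scores.unsqueeze(-1) - scores.unsqueeze(-2)) / temperature
    approx_rank = 1.0 + torch.sigmoid(-diff).sum(-1) - 0.5  # subtract the i == j term
    gains = 2.0 ** relevance - 1.0
    dcg = (gains / torch.log2(approx_rank + 1.0)).sum(-1)
    ideal_rank = torch.arange(1, scores.size(-1) + 1, device=scores.device).float()
    sorted_gains, _ = gains.sort(dim=-1, descending=True)
    idcg = (sorted_gains / torch.log2(ideal_rank + 1.0)).sum(-1)
    return (1.0 - dcg / idcg.clamp(min=1e-6)).mean()

def dual_rank_loss(sim_matrix, relevance):
    """sim_matrix[i, j]: similarity of sample text i and video sample segment j;
    relevance holds the annotated (true-ordering) relevance labels.
    First term fixes each text and ranks segments; second term fixes each
    segment and ranks texts."""
    loss_text_to_segment = approx_ndcg_loss(sim_matrix, relevance)
    loss_segment_to_text = approx_ndcg_loss(sim_matrix.t(), relevance.t())
    return loss_text_to_segment + loss_segment_to_text
```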
In this embodiment, training the machine learning model with the dual approximate ranking loss function allows sample texts and video sample segments to be matched sufficiently, so that the machine learning model can fully learn cross-modal semantic alignment and thus achieve accurate cross-modal segment positioning. In addition, because the dual approximate ranking loss function optimizes the ordering from a global perspective, the semantic information of the entire long video can be fully exploited during parameter training, so that the trained machine learning model achieves the best results on multiple public datasets, i.e., state-of-the-art (SOTA) results.
Fig. 12 is a flowchart of a video processing method according to another embodiment of the present disclosure. The method may be performed by a cloud, for example, by a cloud server. The machine learning model as described above is deployed on the cloud server. In this embodiment, the method specifically includes the following steps:
S1201, acquiring a target text sent by a terminal.
S1202, segmenting the target video into a plurality of mutually non-overlapping video clips.
S1203, calculating a first representation of the video clip according to the context information of the video clip, and calculating a first similarity between the video clip and the target text according to the first representation.
And S1204, calculating the second similarity between the video clip and the target text according to the second representation of each video frame included in the video clip.
S1205, determining the target similarity between the video segment and the target text according to the first similarity and the second similarity.
S1206, if the target similarity between the video segment and the target text meets a first preset condition, determining that the video segment is a target segment matched with the target text.
S1207, feeding back at least one of the time boundary of the target segment and the target segment to the terminal.
It can be understood that the content of S1201-S1207 may refer to the implementation procedure of the foregoing embodiment, and will not be described herein.
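To tie S1201-S1207 together, the sketch below walks through the same flow over precomputed representations: a coarse stage using the first (context-based) similarity, a fine stage using the second (frame-level) similarity, and a fused target similarity used to pick the Top-N target segments. The Top-K/Top-N sizes and the weighted fusion `alpha` are illustrative assumptions, not values specified by the disclosure.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def locate_segments(first_reprs, frame_reprs_per_seg, text_repr,
                    top_k=100, top_n=1, alpha=0.5):
    """first_reprs[i]: context-based (first) representation of segment i;
    frame_reprs_per_seg[i]: second representations of that segment's frames;
    text_repr: representation of the target text (S1201)."""
    # Coarse stage (S1203): first similarity for every non-overlapping segment (S1202).
    first_sims = [cosine(r, text_repr) for r in first_reprs]
    coarse = sorted(range(len(first_reprs)), key=lambda i: -first_sims[i])[:top_k]

    # Fine stage (S1204-S1205): frame-level similarity, fused into the target similarity.
    scored = []
    for i in coarse:
        second_sim = max(cosine(f, text_repr) for f in frame_reprs_per_seg[i])
        target_sim = alpha * first_sims[i] + (1 - alpha) * second_sim
        scored.append((i, target_sim))

    # S1206-S1207: the Top-N segments by target similarity are the matched targets
    # fed back to the terminal.
    scored.sort(key=lambda x: -x[1])
    return [i for i, _ in scored[:top_n]]
```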
Fig. 13 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present disclosure. The video processing device provided by the embodiment of the present disclosure may execute the processing flow provided by the embodiment of the video processing method, and the device may be implemented in a software and/or hardware manner, and may be configured in an electronic device, for example, a server or a terminal, where the terminal specifically includes a mobile phone, a computer, a tablet computer, or the like. The server may be a cloud server. As shown in fig. 13, the video processing apparatus 130 includes:
the splitting module 131 is configured to split the target video into a plurality of mutually non-overlapping video segments;
a first calculation module 132, configured to calculate a first representation of the video segment according to the context information of the video segment, and calculate a first similarity between the video segment and a target text according to the first representation;
a second calculation module 133, configured to calculate a second similarity between the video segment and the target text according to a second representation of each video frame included in the video segment;
a first determining module 134, configured to determine a target similarity between the video segment and the target text according to the first similarity and the second similarity;
The second determining module 135 is configured to determine that the video segment is a target segment that matches the target text if the target similarity between the video segment and the target text meets a first preset condition.
Optionally, the splitting module 131 is specifically configured to: for each of a plurality of levels, segment the target video into a plurality of mutually non-overlapping video segments at that level, where the durations of the video segments at different levels are different.
Optionally, when calculating the first representation of the video segment according to the context information of the video segment and calculating the first similarity between the video segment and the target text according to the first representation, the first calculation module 132 is specifically configured to:
updating the original representation of the video clip according to the association relation between the video clip and other video clips to obtain a first representation of the video clip, wherein the original representation of the video clip is obtained by fusing the original representation of each video frame included in the video clip;
and calculating the first similarity between the video segment and the target text according to the first representation and the representation of the target text.
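One way to realize this contextual update, shown purely as an assumption (the disclosure does not mandate a particular mechanism), is self-attention over the sequence of segment representations followed by a cosine similarity against the text representation:

```python
import torch
import torch.nn as nn

class SegmentContextEncoder(nn.Module):
    """Updates each segment's original representation from its relations to the
    other segments via self-attention; an illustrative choice, not necessarily
    the patent's model."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, segment_reprs: torch.Tensor) -> torch.Tensor:
        # segment_reprs: (batch, num_segments, dim); each row is the original
        # segment representation obtained by fusing its frames' original representations.
        updated, _ = self.attn(segment_reprs, segment_reprs, segment_reprs)
        return updated  # first (context-aware) representation

def first_similarity(first_repr: torch.Tensor, text_repr: torch.Tensor) -> torch.Tensor:
    # Cosine similarity between each segment's first representation (batch, num_segments, dim)
    # and the text representation (batch, dim).
    return torch.nn.functional.cosine_similarity(first_repr, text_repr.unsqueeze(1), dim=-1)
```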
Optionally, the second calculating module 133 is specifically configured to, when calculating the second similarity between the video segment and the target text according to the second representation of each video frame included in the video segment:
and when the first similarity between the video segment and the target text meets a second preset condition, calculating the second similarity between the video segment and the target text according to the second representation of each video frame included in the video segment.
Optionally, the second calculating module 133 is specifically configured to, when calculating the second similarity between the video segment and the target text according to the second representation of each video frame included in the video segment:
updating the original representation of the video frame according to the association relation between the video frame and other video frames in the video clip aiming at each video frame included in the video clip to obtain a second representation of the video frame;
according to the second representation of each video frame included in the video clip, calculating the similarity between each video frame included in the video clip and the target text;
and calculating the second similarity between the video segment and the target text according to the similarity between each video frame included in the video segment and the target text.
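A corresponding sketch for the frame-level path is given below: per-frame similarities to the target text are computed from the second representations and then reduced to a single second similarity for the clip. The max/mean reduction is an assumption rather than an aggregation prescribed by the disclosure.

```python
import torch
import torch.nn.functional as F

def second_similarity(frame_reprs: torch.Tensor, text_repr: torch.Tensor,
                      reduce: str = "max") -> torch.Tensor:
    """frame_reprs: (num_frames, dim) second representations of one clip's frames,
    already updated from their relations to the other frames in the clip;
    text_repr: (dim,) representation of the target text."""
    per_frame = F.cosine_similarity(frame_reprs, text_repr.unsqueeze(0), dim=-1)
    return per_frame.max() if reduce == "max" else per_frame.mean()
```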
Optionally, the apparatus further comprises: a third calculation module 136, an adjustment module 137 and an output module 138, wherein after the second determination module 135 determines that the video segment is a target segment matching the target text, the third calculation module 136 is configured to calculate a third representation of the video segment according to the second representation of each video frame included in the video segment, and the adjustment module 137 is configured to adjust a time boundary of the target segment according to the first representation of the video segment, the third representation of the video segment and the representation of the target text; the output module 138 is configured to output the adjusted time boundary of the target segment.
Optionally, the adjusting module 137 is specifically configured to, when adjusting the time boundary of the target segment according to the first representation of the video segment, the third representation of the video segment, and the representation of the target text:
obtaining a first feature according to the first representation of the video clip and the representation of the target text;
obtaining a second feature according to the third representation of the video segment and the representation of the target text;
splicing the first feature and the second feature to obtain a target feature;
Predicting the bias of the time boundary of the target segment according to the target characteristics;
and adjusting the time boundary of the target segment according to the bias.
Optionally, the first similarity between the video segment and the target text meeting the second preset condition includes:
the first similarity between the video segment and the target text placing the video segment within the top first preset number of the first sorting result, where the first sorting result is obtained by arranging the plurality of mutually non-overlapping video segments in a first descending order according to their first similarities to the target text;
the target similarity between the video segment and the target text meeting the first preset condition includes:
the target similarity between the video segment and the target text placing the video segment within the top first preset number of the second sorting result, where the second sorting result is obtained by arranging the video segments in the first sorting result in a second descending order according to their target similarities to the target text.
Optionally, the first descending order and the second descending order are each implemented by a machine learning model, and the machine learning model is trained with a dual approximate ranking loss function that includes a first loss term and a second loss term;
the first loss term is used to calculate, with the sample text fixed, the difference between the machine learning model's descending-order arrangement of a plurality of video sample segments according to their similarities to the fixed sample text and the true ordering of those video sample segments;
the second loss term is used to calculate, with the video sample segment fixed, the difference between the machine learning model's descending-order arrangement of a plurality of sample texts according to their similarities to the fixed video sample segment and the true ordering of those sample texts.
The video processing apparatus of the embodiment shown in fig. 13 may be used to implement the technical solution of the above-mentioned method embodiment, and its implementation principle and technical effects are similar, and will not be described herein again.
The internal functions and structures of a video processing apparatus are described above, which may be implemented as an electronic device. Fig. 14 is a schematic structural diagram of an embodiment of an electronic device provided in an embodiment of the disclosure. As shown in fig. 14, the electronic device includes a memory 141 and a processor 142.
The memory 141 is used to store programs. In addition to the programs described above, the memory 141 may also be configured to store other various data to support operations on the electronic device. Examples of such data include instructions for any application or method operating on the electronic device, contact data, phonebook data, messages, pictures, videos, and the like.
The memory 141 may be implemented by any type or combination of volatile or nonvolatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
Processor 142 is coupled to memory 141, executing programs stored in memory 141 for:
dividing a target video into a plurality of mutually non-overlapping video clips;
calculating a first representation of the video clip according to the context information of the video clip, and calculating a first similarity between the video clip and a target text according to the first representation;
calculating a second similarity of the video clip to the target text according to a second representation of each video frame included in the video clip;
Determining target similarity between the video segment and the target text according to the first similarity and the second similarity;
and if the target similarity between the video segment and the target text meets a first preset condition, determining that the video segment is a target segment matched with the target text.
Further, as shown in fig. 14, the electronic device may further include: communication component 143, power supply component 144, audio component 145, display 146, and other components. Only some of the components are schematically shown in fig. 14, which does not mean that the electronic device only comprises the components shown in fig. 14.
The communication component 143 is configured to facilitate communication between the electronic device and other devices, either wired or wireless. The electronic device may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In one exemplary embodiment, the communication component 143 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 143 further includes a Near Field Communication (NFC) module to facilitate short range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
The power supply component 144 provides power to the various components of the electronic device. The power supply component 144 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the electronic device.
The audio component 145 is configured to output and/or input audio signals. For example, the audio component 145 includes a Microphone (MIC) configured to receive external audio signals when the electronic device is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may be further stored in the memory 141 or transmitted via the communication component 143. In some embodiments, audio component 145 further comprises a speaker for outputting audio signals.
The display 146 includes a screen, which may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or slide action, but also the duration and pressure associated with the touch or slide operation.
In addition, the embodiment of the present disclosure also provides a computer-readable storage medium having stored thereon a computer program that is executed by a processor to implement the video processing method described in the above embodiment.
In addition, the embodiment of the disclosure further provides a server, and the server can be the cloud server. Specifically, the server includes:
a memory;
a processor; and
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor to implement the method as described above.
Fig. 15 is a flowchart of a video searching method provided in an embodiment of the present disclosure. The method may be applied to the application scenario shown in fig. 16, where the application scenario includes a terminal 161 and a cloud server 162, and specifically, the method may be executed by the cloud server 162. Specifically, the method comprises the following steps:
S1501, a search request input by a user is received, where the search request includes a target text.
For example, a video application program is installed on the terminal 161, and the application program provides a user interface. The terminal 161 requests a video stream from the cloud server 162 according to a user operation on the user interface, such as a click on a certain video cover. When the terminal 161 receives the video stream from the cloud server 162, it processes the video stream, for example by decoding it and converting its format, and displays the video content on the user interface so that the user of the terminal 161 can view it. Assuming that the video content is a long video about cooking a dish, the user may, while watching, have questions about the video content, such as a query about the seasonings used in the dish, and may enter a search request on the user interface provided by the application. The specific input method is not limited in this embodiment and may be, for example, text input, voice input, or motion gesture input. Likewise, the form of the search request is not limited and may be, for example, text information, voice information, or video information. For example, the search request includes a target text such as "what seasoning is he using for cooking". Further, the terminal 161 may send the search request to the cloud server 162, so that the cloud server 162 receives the search request input by the user.
S1502, searching a target segment matched with the target text from a target video based on the search request, wherein the target similarity of the target segment and the target text comprises a first similarity and a second similarity, the first similarity is calculated according to a first representation of the target segment, the first representation is calculated according to context information of the target segment, and the second similarity is calculated according to a second representation of each video frame included in the target segment. Specifically, the cloud server 162 may segment a target video, such as a long video about cooking a dish, into a plurality of mutually non-overlapping video segments and create an index record for each video segment, where the index record includes the first representation of the video segment and the second representation of each video frame included in the video segment, and the first representation is calculated according to the context information of the target segment. The meanings of the first representation and the second representation are described above and are not repeated here. In addition, the index records corresponding to the video segments can form an index database.
When the cloud server 162 receives the search request, it parses the target text from the search request, inputs the target text into the index database described above, and searches, by means of a matching algorithm, for a target segment matched with the target text, i.e., a segment whose target similarity to the target text meets the first preset condition; the first preset condition is described in the calculation process above and is not repeated here. Specifically, the target similarity between the target segment and the target text includes a first similarity and a second similarity, the first similarity is calculated according to the first representation of the target segment, the first representation is calculated according to the context information of the target segment, and the second similarity is calculated according to the second representation of each video frame included in the target segment. The calculation processes of the first similarity and the second similarity are described in the above embodiments and are not repeated here.
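The sketch below illustrates what such an index record and the search over it might look like. The field names (`segment_id`, `first_repr`, `frame_reprs`), the cosine scoring, and the fusion weight are assumptions for illustration, not structures prescribed by the disclosure.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class IndexRecord:
    """One record per mutually non-overlapping video segment."""
    segment_id: str
    start: float
    end: float
    first_repr: np.ndarray                            # context-based segment representation
    frame_reprs: list = field(default_factory=list)   # second representations of its frames

def search(index_db, text_repr, top_n=1, alpha=0.5):
    """Score every index record against the query text representation and return
    the best-matching segments (a stand-in for the matching over the index database)."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    scored = []
    for rec in index_db:
        first_sim = cos(rec.first_repr, text_repr)
        second_sim = max(cos(f, text_repr) for f in rec.frame_reprs)
        scored.append((rec, alpha * first_sim + (1 - alpha) * second_sim))
    scored.sort(key=lambda x: -x[1])
    return [rec for rec, _ in scored[:top_n]]
```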
S1503, feeding the target segment back to the user.
For example, the cloud server 162 may send the target segment matching the target text to the terminal 161, such that the terminal 161 presents the target segment to the user.
It should be understood that the method described in this embodiment is not limited to being performed by the cloud server 162. For example, in some embodiments, the method may be performed by the terminal 161. Alternatively, some steps of the method may be performed by the terminal 161 and other steps by the cloud server 162; for example, S1501 and S1503 may be performed by the terminal 161, while S1502 is performed by the cloud server 162.
In the embodiment of the present disclosure, segmenting the target video into a plurality of mutually non-overlapping video segments avoids the redundant computation caused by temporal overlap and the appearance of a large number of highly repetitive candidate segments, so that no subsequent de-duplication is needed, computing power is saved, and the efficiency of multi-modal search is improved. In addition, the first representation of a video segment is calculated from the context information of the video segment, and the first similarity between the video segment and the target text is calculated from the first representation; the second similarity between the video segment and the target text is calculated from the second representation of each video frame included in the video segment; and the target similarity between the video segment and the target text is then determined from the first similarity and the second similarity. By combining the external information and the internal information of the video segment to determine the target similarity, this embodiment improves the accuracy of the target similarity and ensures that the target segment selected according to the target similarity is the video segment that best matches or is most relevant to the target text, thereby improving the accuracy of segment positioning. Meanwhile, this embodiment avoids temporal downsampling of target videos such as long videos, so that the plurality of mutually non-overlapping video segments obtained by segmentation retain full temporal resolution, and video segments related to the target text can therefore be located accurately.
Fig. 17 is a flowchart of an index construction method provided in an embodiment of the present disclosure. As shown in fig. 17, the method may be specifically performed by a cloud server, where the cloud server may include a search engine that may construct an index database so that the search engine can quickly and accurately retrieve target segments matching the target text from a large number of video segments. The method specifically comprises the following steps:
S1701, acquiring a target video, and segmenting the target video into a plurality of mutually non-overlapping video segments.
For example, the cloud server may obtain the target video; the source of the target video is not limited here. For example, the target video may be uploaded to the cloud server by a video blogger or may be crawled from the network by the cloud server. In addition, this embodiment does not limit the type, format, number, or the like of the target video; a single target video is described here for illustration. Specifically, the cloud server may segment the target video into a plurality of mutually non-overlapping video segments.
S1702, respectively creating an index record for each video segment, wherein the index record comprises a first representation of the video segment and a second representation of each video frame included in the video segment, and the first representation is calculated according to the context information of the target segment.
For example, a search engine in the cloud server may create an index record for each video segment. Specifically, the index record of each video segment may include a first representation of the video segment and a second representation of each video frame included in the video segment, where the first representation is calculated according to the context information of the target segment. The meanings of the first representation and the second representation are described above and are not repeated here.
S1703, forming an index database by the index records corresponding to each video clip.
For example, the search engine may construct an index database from index records corresponding to each video clip. The index database may be stored in the cloud server, or in another server.
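As a rough illustration of S1701-S1703, the sketch below builds such an index database, reusing the hypothetical IndexRecord structure from the search sketch earlier. The `split_non_overlapping` helper and the `encoder` interface (with `context_repr` and `frame_repr` methods, and segments carrying `id`, `start`, `end`, and `frames` attributes) are likewise assumptions, since the disclosure does not specify them.

```python
def build_index(target_video, encoder, split_non_overlapping):
    """Sketch only: segment the video (S1701), create one IndexRecord per
    segment (S1702), and collect the records into an index database (S1703)."""
    index_db = []
    for seg in split_non_overlapping(target_video):
        index_db.append(IndexRecord(
            segment_id=seg.id,
            start=seg.start,
            end=seg.end,
            first_repr=encoder.context_repr(seg),                    # context-based first representation
            frame_reprs=[encoder.frame_repr(f) for f in seg.frames], # per-frame second representations
        ))
    return index_db
```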
Fig. 18 is a flowchart of a video searching method provided in an embodiment of the present disclosure. For example, the method may be performed by a cloud server, and specifically, the method includes the following steps:
S1801, receiving a search request input by a user, where the search request includes a target text.
For example, the cloud server 162 receives a search request sent by the terminal 161, the search request being input by the user on the terminal 161. The search request includes target text.
S1802, inputting the target text into a pre-constructed index database, searching to obtain a target segment matched with the target text, wherein the target similarity between the target segment and the target text comprises a first similarity and a second similarity, the first similarity is calculated according to a first representation of the target segment, the first representation is calculated according to context information of the target segment, and the second similarity is calculated according to a second representation of each video frame included in the target segment.
For example, when the cloud server 162 receives the search request, it parses the target text from the search request, inputs the target text into the index database described above, and searches, by means of a matching algorithm, for a target segment matched with the target text, i.e., a segment whose target similarity to the target text meets the first preset condition; the first preset condition is described in the calculation process above and is not repeated here. Specifically, the target similarity between the target segment and the target text includes a first similarity and a second similarity, the first similarity is calculated according to the first representation of the target segment, the first representation is calculated according to the context information of the target segment, and the second similarity is calculated according to the second representation of each video frame included in the target segment. The calculation processes of the first similarity and the second similarity are described in the above embodiments and are not repeated here.
Further, the cloud server 162 may send the target segment matched with the target text to the terminal 161, so that the terminal 161 displays the target segment to the user.
In the embodiment of the present disclosure, segmenting the target video into a plurality of mutually non-overlapping video segments avoids the redundant computation caused by temporal overlap and the appearance of a large number of highly repetitive candidate segments, so that no subsequent de-duplication is needed, computing power is saved, and the efficiency of multi-modal search is improved. In addition, the first representation of a video segment is calculated from the context information of the video segment, and the first similarity between the video segment and the target text is calculated from the first representation; the second similarity between the video segment and the target text is calculated from the second representation of each video frame included in the video segment; and the target similarity between the video segment and the target text is then determined from the first similarity and the second similarity. By combining the external information and the internal information of the video segment to determine the target similarity, this embodiment improves the accuracy of the target similarity and ensures that the target segment selected according to the target similarity is the video segment that best matches or is most relevant to the target text, thereby improving the accuracy of segment positioning. Meanwhile, this embodiment avoids temporal downsampling of target videos such as long videos, so that the plurality of mutually non-overlapping video segments obtained by segmentation retain full temporal resolution, and video segments related to the target text can therefore be located accurately.
It should be noted that in this document, relational terms such as "first" and "second" and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The foregoing is merely a specific embodiment of the disclosure to enable one skilled in the art to understand or practice the disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown and described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (14)

1. A video processing method, wherein the method comprises:
dividing a target video into a plurality of mutually non-overlapping video clips;
calculating a first representation of the video clip according to the context information of the video clip, and calculating a first similarity between the video clip and a target text according to the first representation;
calculating a second similarity of the video clip to the target text according to a second representation of each video frame included in the video clip;
determining target similarity between the video segment and the target text according to the first similarity and the second similarity;
and if the target similarity between the video segment and the target text meets a first preset condition, determining that the video segment is a target segment matched with the target text.
2. The method of claim 1, wherein calculating a first representation of the video segment from the context information of the video segment and calculating a first similarity of the video segment to a target text from the first representation comprises:
updating the original representation of the video clip according to the association relation between the video clip and other video clips to obtain a first representation of the video clip, wherein the original representation of the video clip is obtained by fusing the original representation of each video frame included in the video clip;
And calculating the first similarity between the video segment and the target text according to the first representation and the representation of the target text.
3. The method of claim 1, wherein calculating a second similarity of the video clip to the target text based on the second representation of each video frame included in the video clip comprises:
and when the first similarity between the video segment and the target text meets a second preset condition, calculating the second similarity between the video segment and the target text according to the second representation of each video frame included in the video segment.
4. A method according to claim 1 or 3, wherein calculating a second similarity of the video clip to the target text from a second representation of each video frame comprised by the video clip comprises:
updating the original representation of the video frame according to the association relation between the video frame and other video frames in the video clip aiming at each video frame included in the video clip to obtain a second representation of the video frame;
according to the second representation of each video frame included in the video clip, calculating the similarity between each video frame included in the video clip and the target text;
And calculating the second similarity between the video segment and the target text according to the similarity between each video frame included in the video segment and the target text.
5. The method of claim 1, wherein after determining that the video clip is a target clip that matches the target text, the method further comprises:
calculating a third representation of the video clip from the second representation of each video frame comprised by the video clip;
adjusting the time boundary of the target segment according to the first representation of the video segment, the third representation of the video segment and the representation of the target text;
and outputting the adjusted time boundary of the target segment.
6. The method of claim 5, wherein adjusting the temporal boundary of the target segment based on the first representation of the video segment, the third representation of the video segment, and the representation of the target text comprises:
obtaining a first feature according to the first representation of the video clip and the representation of the target text;
obtaining a second feature according to the third representation of the video segment and the representation of the target text;
Splicing the first feature and the second feature to obtain a target feature;
predicting the bias of the time boundary of the target segment according to the target characteristics;
and adjusting the time boundary of the target segment according to the bias.
7. The method of claim 3, wherein the first similarity of the video clip to the target text satisfies a second preset condition, comprising:
the first similarity between the video clips and the target text enables the video clips to be located in a first preset number of first sequencing results, and the first sequencing results are obtained by performing first descending sequence arrangement on the plurality of mutually non-overlapping video clips according to the first similarity between the plurality of mutually non-overlapping video clips and the target text;
the target similarity between the video clip and the target text meets a first preset condition, and the method comprises the following steps:
the target similarity between the video segments and the target text enables the video segments to be located in a first second preset number of second sorting results, and the second sorting results are obtained after each video segment in the first sorting results is subjected to second descending order according to the target similarity between each video segment in the first sorting results and the target text.
8. A video processing method, wherein the method comprises:
acquiring a target text sent by a terminal;
dividing a target video into a plurality of mutually non-overlapping video clips;
calculating a first representation of the video clip according to the context information of the video clip, and calculating a first similarity between the video clip and the target text according to the first representation;
calculating a second similarity of the video clip to the target text according to a second representation of each video frame included in the video clip;
determining target similarity between the video segment and the target text according to the first similarity and the second similarity;
if the target similarity between the video segment and the target text meets a first preset condition, determining that the video segment is a target segment matched with the target text;
and feeding back at least one of the time boundary of the target segment and the target segment to the terminal.
9. A video search method, wherein the method comprises:
receiving a search request input by a user, wherein the search request comprises target text;
searching a target segment matched with the target text from a target video based on the search request, wherein the target similarity of the target segment and the target text comprises a first similarity and a second similarity, the first similarity is calculated according to a first representation of the target segment, the first representation is calculated according to the context information of the target segment, and the second similarity is calculated according to a second representation of each video frame included in the target segment;
And feeding back the target fragment to a user.
10. An index building method, wherein the method comprises:
acquiring a target video, and segmenting the target video into a plurality of mutually non-overlapping video segments;
creating an index record for each video segment, wherein the index record comprises a first representation of the video segment and a second representation of each video frame included in the video segment, and the first representation is calculated according to the context information of the target segment;
and forming an index database by the index records corresponding to each video segment respectively.
11. A video search method, wherein the method comprises:
receiving a search request input by a user, wherein the search request comprises target text;
inputting the target text into a pre-constructed index database, searching to obtain a target segment matched with the target text, wherein the target similarity between the target segment and the target text comprises a first similarity and a second similarity, the first similarity is calculated according to a first representation of the target segment, the first representation is calculated according to context information of the target segment, and the second similarity is calculated according to a second representation of each video frame included in the target segment.
12. A video processing apparatus, comprising:
the segmentation module is used for segmenting the target video into a plurality of mutually non-overlapping video clips;
the first computing module is used for computing a first representation of the video clip according to the context information of the video clip, and computing a first similarity between the video clip and a target text according to the first representation;
a second calculation module, configured to calculate a second similarity between the video segment and the target text according to a second representation of each video frame included in the video segment;
the first determining module is used for determining the target similarity between the video segment and the target text according to the first similarity and the second similarity;
and the second determining module is used for determining that the video fragment is a target fragment matched with the target text if the target similarity between the video fragment and the target text meets a first preset condition.
13. An electronic device, comprising:
a memory;
a processor; and
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor to implement the method of any one of claims 1-11.
14. A computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the method of any of claims 1-11.
CN202310147893.1A 2023-02-10 2023-02-10 Video processing, searching and index constructing method, device, equipment and storage medium Active CN116186329B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310147893.1A CN116186329B (en) 2023-02-10 2023-02-10 Video processing, searching and index constructing method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310147893.1A CN116186329B (en) 2023-02-10 2023-02-10 Video processing, searching and index constructing method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN116186329A true CN116186329A (en) 2023-05-30
CN116186329B CN116186329B (en) 2023-09-12

Family

ID=86436199

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310147893.1A Active CN116186329B (en) 2023-02-10 2023-02-10 Video processing, searching and index constructing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116186329B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109905772A (en) * 2019-03-12 2019-06-18 腾讯科技(深圳)有限公司 Video clip querying method, device, computer equipment and storage medium
CN110121118A (en) * 2019-06-17 2019-08-13 腾讯科技(深圳)有限公司 Video clip localization method, device, computer equipment and storage medium
CN111209439A (en) * 2020-01-10 2020-05-29 北京百度网讯科技有限公司 Video clip retrieval method, device, electronic equipment and storage medium
US20210319062A1 (en) * 2020-04-09 2021-10-14 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for searching video segment, device, and medium
CN113111836A (en) * 2021-04-25 2021-07-13 山东省人工智能研究院 Video analysis method based on cross-modal Hash learning
CN113590881A (en) * 2021-08-09 2021-11-02 北京达佳互联信息技术有限公司 Video clip retrieval method, and training method and device of video clip retrieval model
CN114612748A (en) * 2022-03-24 2022-06-10 北京工业大学 Cross-modal video clip retrieval method based on feature decoupling

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张天; 靳聪; 帖云; 李小兵: "Research on audio database content matching methods for cross-modal retrieval" (面向跨模态检索的音频数据库内容匹配方法研究), Signal Processing (信号处理), no. 06 *

Also Published As

Publication number Publication date
CN116186329B (en) 2023-09-12

Similar Documents

Publication Publication Date Title
US20220214775A1 (en) Method for extracting salient dialog usage from live data
US11281724B2 (en) Method and system for providing recommendation query using search context
CN107660284B (en) Search improvement based on machine learning
RU2614137C2 (en) Method and apparatus for obtaining information
WO2020029966A1 (en) Method and device for video processing, electronic device, and storage medium
CN109189987A (en) Video searching method and device
US9779163B2 (en) Selective invocation of playback content supplementation
KR102300415B1 (en) Event Practicing System based on Voice Memo on Mobile, Mobile Control Server and Mobile Control Method, Mobile and Application Practicing Method therefor
US8515990B2 (en) Mobile terminal and method of managing video using metadata therein
CN105335414B (en) Music recommendation method and device and terminal
CN112131410A (en) Multimedia resource display method, device, system and storage medium
US20080134038A1 (en) Interactive information providing service method and apparatus
CN108307207A (en) A kind of video pushing method and device
CN105551488A (en) Voice control method and system
US10674183B2 (en) System and method for perspective switching during video access
CN111339744A (en) Ticket information display method, device and storage medium
CN104281656A (en) Method and device for adding label information into application program
US9330647B1 (en) Digital audio services to augment broadcast radio
CN103593356A (en) Method and system for information searching on basis of multimedia information fingerprint technology and application
CN107515870B (en) Searching method and device and searching device
CN112784142A (en) Information recommendation method and device
CN110990598A (en) Resource retrieval method and device, electronic equipment and computer-readable storage medium
CN104853251A (en) Online collection method and device for multimedia data
JP6251637B2 (en) Information retrieval method, apparatus and program
CN116186329B (en) Video processing, searching and index constructing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant