WO2017114388A1 - Video search method and apparatus - Google Patents

Video search method and apparatus (一种视频搜索方法及装置)

Info

Publication number
WO2017114388A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
search
video frame
label
search request
Prior art date
Application number
PCT/CN2016/112390
Other languages
English (en)
French (fr)
Inventor
肖瑛
杨振宇
Original Assignee
腾讯科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2017114388A1
Priority to US15/712,316 (US10642892B2)

Classifications

    • G06V 20/70: Labelling scene content, e.g. deriving syntactic or semantic representations
    • G06F 16/7328: Query formulation: query by example, e.g. a complete video frame or video sequence
    • G06F 16/739: Presentation of query results in the form of a video summary, e.g. a video sequence, a composite still image or synthesized frames
    • G06F 16/743: Browsing or visualisation of a collection of video files or sequences
    • G06F 16/745: Browsing or visualisation of the internal structure of a single video sequence
    • G06F 16/71: Indexing; Data structures therefor; Storage structures
    • G06F 16/783: Retrieval of video data using metadata automatically derived from the content
    • G06F 16/7837: Retrieval using metadata automatically derived from objects detected or recognised in the video content
    • G06F 16/7867: Retrieval using manually generated information, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • G06V 10/82: Image or video recognition or understanding using pattern recognition or machine learning: neural networks
    • G06V 20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V 20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V 20/47: Detecting features for summarising video content
    • G06V 20/49: Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes

Definitions

  • The present invention belongs to the field of communications technologies, and in particular relates to a video search method and apparatus.
  • Take video resources as an example: in the prior art, a video is typically split and edited by manual operation into multiple clip videos, and related titles are added to the clips.
  • A user enters a search term in a unified comprehensive search box to perform a network-wide search, and a search result is displayed only if the search term exactly matches an added title.
  • To solve the above technical problem, the embodiments of the present invention provide the following technical solutions:
  • A video search method, including: acquiring a video to be labeled, and predicting video frame labels of the video frames in the video to be labeled by using a preset classification model; merging video frames that are temporally adjacent and have the same video frame label, and labeling the video to be labeled with the corresponding video frame labels; upon receiving a search request indicating a video search, determining the video frame label corresponding to the search request based on the labeling results of the video frame labels; and, among candidate videos, finding the target video labeled with that video frame label and displaying the target video.
  • The embodiments of the present invention further provide the following technical solutions:
  • A video search apparatus, including:
  • a label prediction unit, configured to acquire a video to be labeled and predict video frame labels of the video frames in the video to be labeled by using a preset classification model;
  • a first labeling unit, configured to merge video frames that are temporally adjacent and have the same video frame label, and to label the video to be labeled with the corresponding video frame labels;
  • a label determining unit, configured to, upon receiving a search request indicating a video search, determine the video frame label corresponding to the search request based on the labeling results of the video frame labels;
  • a searching unit, configured to find, among candidate videos, the target video labeled with that video frame label; and
  • a display unit, configured to display the target video.
  • Compared with the prior art, the embodiments of the present invention first predict the video frame labels of the video frames in a video to be labeled by using a preset classification model, merge video frames that are temporally adjacent and have the same video frame label, and label the video to be labeled with the corresponding video frame labels; thereafter, upon receiving a user's search request indicating a video search, the video frame label corresponding to the search request is determined based on the labeling results of the video frame labels, so that the target video labeled with that video frame label is found among candidate videos and displayed. By predicting, merging, and labeling video frame labels, and determining the video frame label corresponding to a search request from those labeling results, the content of videos is searched using pre-labeled video frame labels; compared with the approach based on manually added titles, this greatly improves the efficiency of video search and the accuracy of search results.
  • FIG. 1a is a schematic diagram of the scenario of a video search apparatus according to an embodiment of the present invention;
  • FIG. 1b is a schematic flowchart of a video search method according to a first embodiment of the present invention
  • FIGS. 2a to 2h are schematic diagrams of scenarios of a video search method according to a second embodiment of the present invention;
  • FIG. 3a is a schematic structural diagram of a video search apparatus according to a third embodiment of the present invention;
  • FIG. 3b is another schematic structural diagram of a video search apparatus according to a third embodiment of the present invention.
  • FIG. 4 is a structural block diagram of a terminal according to an embodiment of the present invention.
  • The principles of the present invention operate using many other general-purpose or special-purpose computing or communication environments or configurations.
  • Well-known examples of computing systems, environments, and configurations suitable for use with the present invention may include, but are not limited to, hand-held phones, personal computers, servers, multi-processor systems, microcomputer-based systems, mainframe computers, and distributed computing environments including any of the above systems or devices.
  • Embodiments of the present invention provide a video search method and apparatus.
  • FIG. 1a is a schematic diagram of the scenario of the system in which a video search apparatus according to an embodiment of the present invention resides.
  • The video search system may include a video search apparatus, mainly used to predict the video frame labels of the video frames in a video to be labeled by using a preset classification model, and to merge video frames that are temporally adjacent and have the same video frame label, thereby labeling the video to be labeled with the corresponding video frame labels. Thereafter, the apparatus receives a search request input by a user indicating a video search, for example a search for "kiss scenes in series A" or "funny scenes in series B"; based on the labeling results of the video frame labels, it determines the video frame label corresponding to the search request; then, among candidate videos, such as a specified video or the network-wide videos, it finds the target video labeled with that video frame label and finally displays the target video.
  • The video search system may further include a video library, mainly used to store videos to be labeled so that the video search apparatus can label them with the corresponding video frame labels. The video library also stores the search content involved in actual scenarios and the intent tags corresponding to that search content, so that the video search apparatus can train on them to generate a neural network model. In addition, the video library stores a large number of candidate videos from which the video search apparatus finds target videos, and so on.
  • The video search system may further include a user terminal, configured to receive a search request directly input by a user through an input device, such as a keyboard or a mouse, and, after the target video is determined, to play the target video through an output device, such as a terminal screen.
  • The following is described from the perspective of the video search apparatus, which may be integrated in a network device such as a server or a gateway.
  • A video search method includes: acquiring a video to be labeled, and predicting video frame labels of the video frames in the video to be labeled by using a preset classification model; merging video frames that are temporally adjacent and have the same video frame label, and labeling the video to be labeled with the corresponding video frame labels; upon receiving a search request indicating a video search, determining the video frame label corresponding to the search request based on the labeling results of the video frame labels; and, among candidate videos, finding the target video labeled with that video frame label and displaying the target video.
  • FIG. 1b is a schematic flowchart of the video search method according to the first embodiment of the present invention; the specific flow may include the following steps.
  • In step S101, a video to be labeled is acquired, and the video frame labels of the video frames in the video to be labeled are predicted by using a preset classification model.
  • In step S102, video frames that are temporally adjacent and have the same video frame label are merged, and the video to be labeled is labeled with the corresponding video frame labels.
  • In step S103, upon receiving a search request indicating a video search, the video frame label corresponding to the search request is determined based on the labeling results of the video frame labels.
  • Steps S101 to S103 may specifically be as follows:
  • It can be understood that, on the one hand, before a search request is processed, video frame labels need to be applied to the network-wide videos: the video frame labels of the video frames in a video to be labeled are predicted by using a preset classification model, video frames that are temporally adjacent and have the same video frame label are merged, and the video to be labeled is labeled with the corresponding video frame labels, thereby obtaining the labeling results of the video frame labels.
  • That is, the video search provided in this embodiment is essentially a content-based video search. First, a preset classification model is used to predict the video frame labels of the video frames in the video to be labeled; after prediction, the video frames are sorted, and video frames that are temporally adjacent and have the same video frame label are merged, so that a video clip is obtained; finally, the video clips are each labeled with the corresponding video frame label, thereby obtaining the labeling results of the video clips and completing the video frame labeling of the video to be labeled.
  • Further, "predicting the video frame labels of the video frames in the video to be labeled by using a preset classification model" may specifically be: (11) extracting the video frames of the video to be labeled by using a key frame extraction algorithm, the extracted frames being determined as first video frames; (12) acquiring an image feature of each first video frame, determined as a first image feature; and (13) predicting the video frame label of each first video frame by using the preset classification model according to the first image features.
  • It can be understood that, for ease of distinction, this embodiment refers to the video frames of the video to be labeled as first video frames and to their image features as first image features; this does not limit the implementation of the solution.
  • On this basis, "merging video frames that are temporally adjacent and have the same video frame label" may specifically be: merging the first video frames that are temporally adjacent and have the same video frame label, as sketched below.
  • The key frame extraction algorithm performs shot segmentation on the video sequence and then extracts, within each shot, key frames that represent the shot's content; low-level features (color, texture, shape, etc.) extracted from the key frames are used to index and retrieve the shots.
  • Key frame extraction may be performed based on shot boundaries, based on content analysis, based on clustering, and so on; it is not specifically limited here. One common shot-boundary heuristic is sketched below.
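The patent leaves the key frame algorithm open. As one hedged illustration of the shot-boundary variant, consecutive frames whose color histograms correlate poorly can be treated as a shot change; the threshold, bin counts, and the use of OpenCV are all our assumptions, not the patent's:

```python
import cv2

def extract_keyframes(path, threshold=0.5):
    """Return one key frame per detected shot, using color-histogram
    correlation between consecutive frames as a shot-boundary signal."""
    cap = cv2.VideoCapture(path)
    keyframes, prev_hist = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8],
                            [0, 256, 0, 256, 0, 256])
        cv2.normalize(hist, hist)
        # Low correlation with the previous frame suggests a new shot.
        if prev_hist is None or cv2.compareHist(prev_hist, hist,
                                                cv2.HISTCMP_CORREL) < threshold:
            keyframes.append(frame)
        prev_hist = hist
    cap.release()
    return keyframes
```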
  • For another example, after the video frames of the video to be labeled (i.e., the first video frames) have been extracted, a deep learning model (GoogLeNet) may be used to acquire the image feature of each first video frame (i.e., the first image feature), and a preset classification model, such as a Support Vector Machine (SVM) classification model, is then used to predict the video frame label of each first video frame.
  • Furthermore, before video frame labels are applied to the network-wide videos, the classification model may first be determined, for example as follows: (a) collecting original videos that have previously been labeled with scene labels; (b) extracting the video frames of the original videos by using the key frame extraction algorithm, the extracted frames being determined as second video frames; (c) labeling the second video frames with video frame labels according to the scene labels; (d) acquiring an image feature of each second video frame, determined as a second image feature; and (e) training on the video frame labels applied to the second video frames and the second image features to generate the classification model.
  • For ease of distinction, this embodiment refers to the video frames of the original videos as second video frames and to their image features as second image features; this does not limit the implementation of the scheme.
  • On this basis, "predicting the video frame label of each first video frame by using the preset classification model according to the first image features" may specifically be: predicting the video frame label of each first video frame by using this trained classification model according to the first image features.
  • Specifically, for example, video clips that have been manually labeled with scene labels (which may be called original videos) and the scene labels corresponding to the video clips are used as training data; the same key frame extraction algorithm as above is used to extract the video frames of the original videos, their image features are acquired in the same manner as above, and the SVM classification model is trained by an SVM training algorithm, completing the training process. A minimal sketch of this training step follows.
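A minimal scikit-learn sketch of that training step, assuming the 1024-dimensional GoogLeNet features and the inherited scene labels have already been prepared; the file names and the linear kernel are hypothetical, since the patent names only "an SVM training algorithm":

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical precomputed training data: one 1024-dim GoogLeNet feature per
# key frame, each labeled with the scene label of its manually tagged clip.
X_train = np.load("keyframe_features.npy")   # shape (n_frames, 1024)
y_train = np.load("keyframe_labels.npy")     # shape (n_frames,), e.g. "kiss scene"

clf = SVC(kernel="linear")
clf.fit(X_train, y_train)

# Prediction stage: each key frame of an unlabeled video gets a frame label.
frame_labels = clf.predict(np.load("new_video_features.npy"))
```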
  • It can be understood that there are many ways to determine the video frame label corresponding to the search request based on the labeling results of the video frame labels upon receiving a search request indicating a video search (step S103), for example: (21) upon receiving a search request indicating a video search, performing semantic recognition on the search request based on a preset neural network model; and (22) determining the video frame label corresponding to the search request by combining the result of the semantic recognition with the labeling results of the video frame labels.
  • That is, on the other hand, before a search request is processed, the network model used for semantic recognition needs to be determined, for example: search content and the intent tags corresponding to the search content are collected, and training is performed on the search content and the intent tags to generate a neural network model.
  • The search content and the corresponding intent tags may be obtained from actual user search requests; for example, the search content and the corresponding intent tags are used as training data to train a Deep Neural Network (DNN), thereby generating the neural network model.
  • On this basis, "performing semantic recognition on the search request based on a preset neural network model to determine the corresponding video frame label" may specifically be: performing semantic recognition on the search request based on this neural network model to determine the corresponding video frame label.
  • In step S104, among the candidate videos, the target video labeled with that video frame label is found, and the target video is displayed.
  • In one possible implementation, when the candidate video is a currently playing video, "receiving a search request indicating a video search" may specifically be: receiving, in a search box corresponding to the currently playing video, a search request indicating a video search;
  • on this basis, "displaying the target video" may specifically be: determining the playback position of the target video in the playback progress bar of the currently playing video, and marking a prompt at that playback position, so that the user can choose to play the segment.
  • That is, in this implementation, a search box is provided on the playback page of the current video; performing a video search in this search box retrieves the target video within that video, where the target video is a segment of the current video.
  • In another possible implementation, when the candidate videos are the network-wide video collection, "receiving a search request indicating a video search" may specifically be: receiving, in a network-wide search box, a search request indicating a video search;
  • on this basis, "displaying the target video" may specifically be: displaying the target videos and the attribute information of the target videos in the form of a list.
  • That is, in this implementation, the target video search is performed over the network-wide videos, and the found target videos are displayed in the form of a list; since a network-wide search yields many results, the attribute information corresponding to each target video is displayed as well, where the attribute information may include one or a combination of a TV series name, episode number, variety show name, installment number, clip duration, appearing characters, and the like.
  • As can be seen from the above, the video search method provided in this embodiment first predicts the video frame labels of the video frames in a video to be labeled by using a preset classification model, merges video frames that are temporally adjacent and have the same video frame label, and labels the video to be labeled with the corresponding video frame labels; thereafter, upon receiving a user's search request indicating a video search, the video frame label corresponding to the search request is determined based on the labeling results of the video frame labels, so that the target video labeled with that video frame label is found among the candidate videos and displayed. By predicting, merging, and labeling video frame labels, and determining the video frame label corresponding to a search request from those labeling results, the content of videos is searched using pre-labeled video frame labels; compared with the approach of manually adding titles, this greatly improves the efficiency of video search and the accuracy of search results.
  • In the prior art, before video search, the video is first split and edited by manual operation into multiple clip videos, and related titles are added. Referring to FIG. 2a, a schematic diagram of a prior-art video search: the user enters a search term in a unified comprehensive search box (marked with a black border) to search the whole network; the searched content may include OGC (Occupationally-generated Content) and PGC (Professionally-generated Content) videos such as movies, television, and variety shows, or long-tail UGC (User-generated Content) videos. When the search term exactly matches an added title, the search result is displayed; when it does not match, a prompt indicating that no related video was found is displayed.
  • In the embodiments of the present invention, first, training is performed based on existing original videos that have been labeled with scene labels, generating a classification model, and this classification model is used to predict and apply the video frame labels of the video frames in videos; at the same time, training is performed based on search content collected in practical applications and the corresponding intent tags, generating a neural network model. Thus, upon receiving a search request for video content, the neural network model is used to perform semantic recognition on the search request, determine the corresponding video frame label, and find the video content labeled with that video frame label, so as to display and play it to the user.
  • This technique can search video content within a specified episode or across the network-wide videos and identify the relevant video segments; by adopting advanced semantic recognition, it greatly improves the accuracy of colloquial search results, frees up manpower to a large extent, and provides users with richer scene-dimension search. Detailed description follows below.
  • Specifically, in the embodiment of the present invention, as shown in FIG. 2b, a search icon (marked with a black border) is provided at a corner of the video playback page (such as the upper right corner); as shown in FIG. 2c, when the mouse moves over the icon, a search box (marked with a black border) opens, in which the user can type search terms. Pressing "Enter" or the search icon submits the search request, and if there is no operation within a short time range (such as 3 seconds or 5 seconds), the search box automatically retracts.
  • On this basis, when the user enters the video to be searched for (also called a video clip) in the search box and confirms, the playback positions corresponding to the sought video clips are marked with prompts in the playback progress bar below the video playback page, as shown in FIG. 2d; the user can click a marked playback position according to the prompt to play that video clip. For example, when the search term "kiss scene" is entered in the search box and confirmed, the two matching video clips are prompted in the progress bar of the currently playing video.
  • Referring also to FIG. 2e, a schematic of the search flow: S21, the user enters a search term; S22, semantic recognition is performed; S23, if a relevant match exists, the video clip results are returned; S24, if only similar clips exist, results of the same category are returned; S25, if nothing matches, a prompt that no video result was found is returned. That is, during the video search, querying for content labeled with the corresponding video frame label may include: if video clips relevant to the user's search can be matched, marking prompts for the corresponding clips are given below; if no relevant video clip is matched but similar clips exist, a prompt can be given; and if neither a matching clip nor a similar clip exists, a prompt that no result was found is given.
  • For a video content search over the network-wide videos, as shown in FIG. 2f, the user enters the desired video content directly in the comprehensive search box (marked with a black border), such as "kiss scenes in Korean dramas"; after recognizing the intent, the system returns a list of pre-extracted video clip results from the network-wide video collection, displaying attribute information such as the clip titles (TV series name, episode number, variety show name, installment number, etc.), clip duration and time, and appearing characters, for the user to choose from.
  • Referring to FIG. 2g, the following illustrates the process of training on existing original videos that have been labeled with scene labels to generate a classification model, and of using that classification model to predict and apply the video frame labels of the video frames in a video.
  • Specifically, during model training, video clips (which may be called original videos or existing videos) and the scene labels corresponding to the clips are first obtained from an existing library of videos manually labeled with scene labels, as training data; the key frame extraction algorithm is used to extract the key frame pictures (i.e., the second video frames) in the original videos, and these pictures are labeled with video frame labels according to the scene labels of the original videos.
  • For each frame picture, the trained GoogLeNet network is used to extract 1024-dimensional floating-point numbers as the image feature (i.e., the second image feature); combined with the video frame labels, the SVM classification model is then trained by an SVM training algorithm, completing the training process. The SVM classification model is a supervised learning model, usually used for pattern recognition, classification, and regression analysis. A sketch of the feature-extraction step follows.
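For illustration only: torchvision's GoogLeNet produces a 1024-dimensional vector after its global average pooling, which matches the 1024-dimensional floating-point feature described here; the framework, pretrained weights, and preprocessing are our assumptions, not the patent's:

```python
import torch
from torchvision import models, transforms

net = models.googlenet(weights="IMAGENET1K_V1")
net.fc = torch.nn.Identity()   # drop the classifier; keep the 1024-dim pooled feature
net.eval()

preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def frame_feature(frame_bgr):
    """frame_bgr: HxWx3 uint8 array (e.g. from OpenCV). Returns a 1024-dim vector."""
    x = preprocess(frame_bgr[:, :, ::-1].copy())   # BGR -> RGB
    with torch.no_grad():
        return net(x.unsqueeze(0)).squeeze(0).numpy()
```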
  • During prediction, i.e., automatic video frame labeling, the same key frame extraction algorithm as in the training process is first used to extract the video frames (i.e., the first video frames) of a video clip (i.e., the video to be labeled); GoogLeNet is likewise used to acquire the 1024-dimensional image feature (i.e., the first image feature) of each video frame; the SVM classification model output by the training process is then used to predict the video frame label of each video frame; and finally, temporally adjacent video frames with the same video frame label are merged, yielding the labeling result of the video clip. Composed from the sketches above, this prediction stage reads as the short pipeline below.
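Composing the illustrative sketches above, with frame indices standing in for timestamps:

```python
def label_video(path, clf):
    """Predict-and-merge stage of the labeling flow (illustrative only)."""
    frames = extract_keyframes(path)                  # same algorithm as in training
    feats = [frame_feature(f) for f in frames]        # 1024-dim GoogLeNet features
    labels = clf.predict(feats)                       # SVM frame labels
    timeline = list(zip(range(len(labels)), labels))  # index as a stand-in for time
    return merge_adjacent_frames(timeline)            # -> [(start, end, label), ...]
```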
  • Referring to FIG. 2h, the following illustrates the process of training on search content collected in practical applications and the corresponding intent tags to generate a neural network model, and of using the neural network model to perform semantic recognition on search requests.
  • During training, the search content involved in practical applications and the intent tags corresponding to that search content are collected first; that is, real queries (i.e., the search content) and the search intent tags corresponding to the queries are used as training data to train a deep neural network (DNN, i.e., the neural network model), minimizing the classification cross-entropy loss so that the cos distance between a query and its corresponding label becomes small. A hedged sketch of such a setup follows.
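The patent does not disclose the network architecture, so the following is only a loose sketch under our own assumptions (a bag-of-words query encoder, 128-dimensional embeddings, cosine-similarity logits); it shows how a classification cross-entropy can pull a query's vector toward its intent label's vector:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_SIZE, NUM_LABELS, DIM = 10000, 50, 128        # hypothetical sizes

query_encoder = nn.EmbeddingBag(VOCAB_SIZE, DIM)    # bag-of-words query tower
label_emb = nn.Embedding(NUM_LABELS, DIM)           # one vector per frame label

def logits(token_ids, offsets):
    # Cosine similarity between the query vector and every label vector;
    # cross-entropy over these logits makes the true label's cos distance small.
    q = F.normalize(query_encoder(token_ids, offsets), dim=-1)
    lab = F.normalize(label_emb.weight, dim=-1)
    return q @ lab.t()

# One illustrative training step on fake data.
tokens = torch.randint(0, VOCAB_SIZE, (6,))
offsets = torch.tensor([0, 3])    # two queries of three tokens each
targets = torch.tensor([4, 17])   # their intent labels
opt = torch.optim.Adam(list(query_encoder.parameters()) + list(label_emb.parameters()))
loss = F.cross_entropy(logits(tokens, offsets), targets)
loss.backward()
opt.step()
```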
  • During prediction, the trained network model is first used to compute the semantic-level correlation between the search query and the video frame labels: the user's query is converted into a 128-dimensional vector, the cos distance between this vector and the 128-dimensional vector corresponding to each video frame label is computed, and the label with the smallest cos distance is taken as the predicted output, as in the sketch below.
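At query time the label with the smallest cos distance wins; a minimal sketch, with random vectors standing in for the DNN's 128-dimensional embeddings:

```python
import numpy as np

def cos_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def match_label(query_vec, label_vecs):
    """label_vecs maps each video frame label to its 128-dim vector; returns
    the label whose vector is nearest to the query in cos distance."""
    return min(label_vecs, key=lambda lbl: cos_distance(query_vec, label_vecs[lbl]))

rng = np.random.default_rng(0)
label_vecs = {n: rng.standard_normal(128) for n in ("kiss scene", "fight scene", "car chase")}
query_vec = label_vecs["kiss scene"] + 0.01 * rng.standard_normal(128)
print(match_label(query_vec, label_vecs))  # -> kiss scene
```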
  • As can be seen from the above, the video search method provided in this embodiment first predicts the video frame labels of the video frames in a video to be labeled by using a preset classification model, merges video frames that are temporally adjacent and have the same video frame label, and labels the video to be labeled with the corresponding video frame labels; thereafter, upon receiving a user's search request indicating a video search, the video frame label corresponding to the search request is determined based on the labeling results of the video frame labels, so that the target video labeled with that video frame label is found among the candidate videos and displayed. By predicting, merging, and labeling video frame labels, the content of videos is searched using pre-labeled video frame labels, which, compared with the approach of manually adding titles, greatly improves the efficiency of video search and the accuracy of search results.
  • To better implement the above video search method, an embodiment of the present invention further provides an apparatus based on that method. The terms have the same meanings as in the video search method described above; for specific implementation details, refer to the description in the method embodiment.
  • Referring to FIG. 3a, a schematic structural diagram of a video search apparatus according to an embodiment of the present invention, the apparatus may include a label prediction unit 301, a first labeling unit 302, a label determining unit 303, a searching unit 304, and a display unit 305.
  • The label prediction unit 301 is configured to acquire a video to be labeled and predict the video frame labels of the video frames in the video to be labeled by using a preset classification model; the first labeling unit 302 is configured to merge video frames that are temporally adjacent and have the same video frame label, and to label the video to be labeled with the corresponding video frame labels.
  • It can be understood that, before a search request is processed, video frame labels need to be applied to the network-wide videos: the video frame labels of the video frames in a video to be labeled are predicted by using a preset classification model, temporally adjacent video frames with the same video frame label are merged, and the video to be labeled is labeled with the corresponding video frame labels, so as to obtain the labeling results of the video frame labels.
  • That is, the video search provided in this embodiment is essentially a content-based video search: a preset classification model is first used to predict the video frame labels of the video frames in the video to be labeled; after prediction, the video frames are sorted, and those that are temporally adjacent and share the same video frame label are merged into video clips; finally, the video clips are each labeled with the corresponding video frame label, obtaining the labeling results of the video clips and completing the video frame labeling of the video to be labeled.
  • The label determining unit 303 is configured to, upon receiving a search request indicating a video search, determine the video frame label corresponding to the search request based on the labeling results of the video frame labels; the searching unit 304 is configured to find, among the candidate videos, the target video labeled with that video frame label; and the display unit 305 is configured to display the target video.
  • Further, the label prediction unit 301 may include: a prediction subunit, configured to predict the video frame label of each first video frame by using the preset classification model according to the first image features.
  • It can be understood that, for ease of distinction, this embodiment refers to the video frames of the video to be labeled as first video frames and to their image features as first image features; this does not limit the implementation of the solution.
  • On this basis, the first labeling unit 302 may be specifically configured to merge the first video frames that are temporally adjacent and have the same video frame label, and to label the video to be labeled with the corresponding video frame labels.
  • As before, the key frame extraction algorithm performs shot segmentation on the video sequence, extracts key frames representing each shot's content, and uses low-level features (color, texture, shape, etc.) extracted from the key frames to index and retrieve the shots; key frame extraction may be based on shot boundaries, on content analysis, on clustering, and so on, and is not specifically limited here.
  • For another example, after the video frames of the video to be labeled (i.e., the first video frames) have been extracted, a deep learning model (GoogLeNet) may be used to acquire the image feature of each first video frame (i.e., the first image feature), and a preset classification model, such as an SVM classification model, is then used to predict the video frame label of each first video frame.
  • Furthermore, referring to FIG. 3b, another schematic structural diagram of the video search apparatus, the classification model may be determined before video frame labels are applied to the network-wide videos; the apparatus may therefore further include:
  • a first collecting unit 306, configured to collect original videos that have previously been labeled with scene labels;
  • an extracting unit 307, configured to extract the video frames of the original videos by using the key frame extraction algorithm and determine them as second video frames;
  • a second labeling unit 308, configured to label the second video frames with video frame labels according to the scene labels;
  • a second acquiring unit 309, configured to acquire the image feature of each second video frame and determine it as a second image feature; and
  • a first training unit 310, configured to perform training based on the video frame labels applied to the second video frames and the second image features, to generate the classification model.
  • For ease of distinction, this embodiment refers to the video frames of the original videos labeled with scene labels as second video frames and to their image features as second image features.
  • On this basis, the prediction subunit may be specifically configured to predict the video frame label of each first video frame by using this classification model according to the first image features.
  • Specifically, for example, video clips that have been manually labeled with scene labels (which may be called original videos) and the scene labels corresponding to the video clips are used as training data; the same key frame extraction algorithm as above is used to extract the video frames of the original videos, their image features are acquired in the same manner as above, and the SVM classification model is trained by an SVM training algorithm, completing the training process.
  • It can be understood that there are many ways for the label determining unit 303 to determine the video frame label corresponding to the search request; for example, the label determining unit 303 may specifically include:
  • a receiving subunit, configured to receive a search request indicating a video search;
  • an identifying subunit, configured to perform semantic recognition on the search request based on a preset neural network model; and
  • a label determining subunit, configured to determine the video frame label corresponding to the search request by combining the result of the semantic recognition with the labeling results of the video frame labels.
  • That is, before the search request is processed, the network model used for semantic recognition needs to be determined; the apparatus may therefore further include:
  • a second collecting unit 311, configured to collect search content and the intent tags corresponding to the search content; and
  • a second training unit 312, configured to perform training based on the search content and the intent tags to generate the neural network model.
  • The search content and the corresponding intent tags may be obtained from actual user search requests; for example, the search content and the corresponding intent tags are used as training data to train a deep neural network (DNN), thereby generating the neural network model.
  • On this basis, the identifying subunit may be specifically configured to perform semantic recognition on the search request based on this neural network model.
  • In one possible implementation, when the candidate video is a currently playing video, the receiving subunit is specifically configured to receive, in a search box corresponding to the currently playing video, a search request indicating a video search;
  • on this basis, the display unit 305 may be specifically configured to determine the playback position of the target video in the playback progress bar of the currently playing video and mark a prompt at that playback position, so that the user can choose to play the segment.
  • That is, in this implementation, a search box is provided on the playback page of the current video; performing a video search in this search box retrieves the target video within that video, where the target video is a segment of the current video.
  • In another possible implementation, when the candidate videos are the network-wide video collection, the receiving subunit is specifically configured to receive, in a network-wide search box, a search request indicating a video search;
  • on this basis, the display unit 305 may be specifically configured to display the target videos and the attribute information of the target videos in the form of a list.
  • That is, in this implementation, the target video search is performed over the network-wide videos, and the found target videos are displayed in the form of a list; since a network-wide search yields many results, the attribute information corresponding to each target video is displayed as well, where the attribute information may include one or a combination of a TV series name, episode number, variety show name, installment number, clip duration, appearing characters, and the like.
  • In specific implementation, the foregoing units may be implemented as separate entities, or may be combined arbitrarily and implemented as one or several entities; for the specific implementation of each unit, refer to the foregoing method embodiments, and details are not repeated here.
  • the video search device may be specifically integrated in a network device such as a server or a gateway.
  • As can be seen from the above, the video search apparatus provided in this embodiment first predicts the video frame labels of the video frames in a video to be labeled by using a preset classification model, merges video frames that are temporally adjacent and have the same video frame label, and labels the video to be labeled with the corresponding video frame labels; thereafter, upon receiving a user's search request indicating a video search, the video frame label corresponding to the search request is determined based on the labeling results of the video frame labels, so that the target video labeled with that video frame label is found among the candidate videos and displayed. By predicting, merging, and labeling video frame labels, the content of videos is searched using pre-labeled video frame labels, which, compared with the approach of manually adding titles, greatly improves the efficiency of video search and the accuracy of search results.
  • An embodiment of the present invention further provides a terminal for implementing the above video search method. As shown in FIG. 4, the terminal may include one or more processors 201 (only one is shown in the figure), a memory 203, and a transmission device 205, and may further include an input/output device 207.
  • The memory 203 may be used to store software programs and modules, such as the program instructions/modules corresponding to the video search method and apparatus in the embodiments of the present invention; by running the software programs and modules stored in the memory 203, the processor 201 executes various functional applications and data processing, that is, implements the above video search method.
  • The memory 203 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 203 may further include memory remotely located relative to the processor 201, which may be connected to the terminal over a network; examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • The transmission device 205 is used to receive or transmit data via a network; specific examples of the network include wired and wireless networks. In one example, the transmission device 205 includes a Network Interface Controller (NIC) that can be connected to other network devices and routers via a network cable so as to communicate with the Internet or a local area network. In another example, the transmission device 205 is a Radio Frequency (RF) module used to communicate with the Internet wirelessly.
  • Specifically, the memory 203 is used to store an application, and the processor 201 may call the application stored in the memory 203 to perform the following steps: acquiring a video to be labeled, and predicting the video frame labels of the video frames in the video to be labeled by using a preset classification model; merging video frames that are temporally adjacent and have the same video frame label, and labeling the video to be labeled with the corresponding video frame labels; upon receiving a search request indicating a video search, determining the video frame label corresponding to the search request based on the labeling results of the video frame labels; and, among the candidate videos, finding the target video labeled with that video frame label and displaying the target video.
  • The processor 201 is further configured to: extract the video frames of the video to be labeled by using a key frame extraction algorithm and determine them as first video frames; acquire the image feature of each first video frame and determine it as a first image feature; predict the video frame label of each first video frame by using the preset classification model according to the first image features; and merge the first video frames that are temporally adjacent and have the same video frame label.
  • The processor 201 is further configured to: before acquiring the video to be labeled, collect original videos that have been labeled with scene labels; extract the video frames of the original videos by using the key frame extraction algorithm and determine them as second video frames; label the second video frames with video frame labels according to the scene labels; acquire the image feature of each second video frame and determine it as a second image feature; perform training based on the video frame labels applied to the second video frames and the second image features, to generate the classification model; and predict the video frame label of each first video frame by using the classification model according to the first image features.
  • The processor 201 is further configured to: upon receiving a search request indicating a video search, perform semantic recognition on the search request based on a preset neural network model; and determine the video frame label corresponding to the search request by combining the result of the semantic recognition with the labeling results of the video frame labels.
  • The processor 201 is further configured to: before receiving the search request indicating a video search, collect search content and the intent tags corresponding to the search content; perform training based on the search content and the intent tags to generate the neural network model; and perform semantic recognition on search requests based on the neural network model.
  • The processor 201 is further configured to: receive, in a search box corresponding to the currently playing video, a search request indicating a video search; determine the playback position of the target video in the playback progress bar of the currently playing video; and mark a prompt at that playback position, so that the user can choose to play the segment.
  • The processor 201 is further configured to: receive, in a network-wide search box, a search request indicating a video search; and display the target videos and the attribute information of the target videos in the form of a list.
  • Optionally, FIG. 4 is only schematic; the terminal may be a terminal device such as a smartphone (such as an Android phone or an iOS phone), a tablet computer, a palmtop computer, a Mobile Internet Device (MID), or a PAD. FIG. 4 does not limit the structure of the above electronic device; the terminal may also include more or fewer components (such as a network interface or a display device) than shown in FIG. 4, or have a configuration different from that shown in FIG. 4.
  • Embodiments of the present invention also provide a storage medium. Optionally, in this embodiment, the storage medium may be used to store the program code for executing the video search method, and may be located on at least one of a plurality of network devices in the network shown in the foregoing embodiments.
  • Optionally, in this embodiment, the storage medium is arranged to store program code for performing the following steps:
  • S1: acquiring a video to be labeled, and predicting the video frame labels of the video frames in the video to be labeled by using a preset classification model;
  • S2: merging video frames that are temporally adjacent and have the same video frame label, and labeling the video to be labeled with the corresponding video frame labels;
  • S3: upon receiving a search request indicating a video search, determining the video frame label corresponding to the search request based on the labeling results of the video frame labels;
  • S4: among the candidate videos, finding the target video labeled with that video frame label, and displaying the target video.
  • Optionally, the storage medium is further configured to store program code for performing the following steps: extracting the video frames of the video to be labeled by using a key frame extraction algorithm and determining them as first video frames; acquiring the image feature of each first video frame and determining it as a first image feature; predicting the video frame label of each first video frame by using the preset classification model according to the first image features; and merging the first video frames that are temporally adjacent and have the same video frame label.
  • Optionally, the storage medium is further configured to store program code for performing the following steps: before acquiring the video to be labeled, collecting original videos that have been labeled with scene labels; extracting the video frames of the original videos by using the key frame extraction algorithm and determining them as second video frames; labeling the second video frames with video frame labels according to the scene labels; acquiring the image feature of each second video frame and determining it as a second image feature; performing training based on the video frame labels applied to the second video frames and the second image features, to generate the classification model; and predicting the video frame label of each first video frame by using the classification model according to the first image features.
  • Optionally, the storage medium is further configured to store program code for performing the following steps: upon receiving a search request indicating a video search, performing semantic recognition on the search request based on a preset neural network model; and determining the video frame label corresponding to the search request by combining the result of the semantic recognition with the labeling results of the video frame labels.
  • Optionally, the storage medium is further configured to store program code for performing the following steps: before receiving the search request indicating a video search, collecting search content and the intent tags corresponding to the search content; performing training based on the search content and the intent tags to generate the neural network model; and performing semantic recognition on search requests based on the neural network model.
  • Optionally, the storage medium is further configured to store program code for performing the following steps: receiving, in a search box corresponding to the currently playing video, a search request indicating a video search; determining the playback position of the target video in the playback progress bar of the currently playing video; and marking a prompt at that playback position, so that the user can choose to play the segment.
  • Optionally, the storage medium is further configured to store program code for performing the following steps: receiving, in a network-wide search box, a search request indicating a video search; and displaying the target videos and the attribute information of the target videos in the form of a list.
  • Optionally, in this embodiment, the storage medium may include, but is not limited to, a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, or a magnetic memory.
  • The video search apparatus provided by the embodiments of the present invention may be, for example, a computer, a tablet computer, or a mobile phone with a touch function; the video search apparatus and the video search method in the above embodiments belong to the same concept, and any method provided in the video search method embodiments may be run on the video search apparatus. The specific implementation process is described in the video search method embodiments and is not repeated here.
  • It should be noted that all or part of the flow of the video search method embodiments may be implemented by a computer program controlling the relevant hardware; the computer program may be stored in a computer-readable storage medium, such as the memory of the terminal, and executed by at least one processor in the terminal, and its execution may include the flow of an embodiment of the video search method. The storage medium may be a magnetic disk, an optical disc, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
  • The functional modules of the video search apparatus may be integrated into one processing chip, or each module may exist physically separately, or two or more modules may be integrated into one module. The above integrated module can be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and sold or used as a standalone product, it may also be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disc.

Abstract

The present invention discloses a video search method and apparatus. The method includes: predicting the video frame labels of the video frames in a video to be labeled by using a preset classification model; merging video frames that are temporally adjacent and have the same video frame label, and labeling the video to be labeled with the corresponding video frame labels; upon receiving a search request indicating a video search, determining the video frame label corresponding to the search request based on the labeling results of the video frame labels; finding, among candidate videos, the target video labeled with that video frame label; and displaying the target video. In the embodiments of the present invention, by predicting, merging, and labeling video frame labels, and determining the video frame label corresponding to a search request from those labeling results, the content of videos is searched using pre-labeled video frame labels, which, compared with the approach based on manually added titles, greatly improves the efficiency of video search and the accuracy of search results.

Description

Video search method and apparatus
This application claims priority to Chinese Patent Application No. 201511017439.6, filed with the Chinese Patent Office on December 30, 2015 and entitled "Video search method and apparatus", the entire contents of which are incorporated herein by reference.
Technical Field
The present invention belongs to the field of communications technologies, and in particular relates to a video search method and apparatus.
Background
With the rapid development of Internet technologies, more and more network resources are available on the Internet for users to search and query; among these numerous resources, accurately finding the resources a user needs is particularly important.
Taking video resources as an example, the video is usually split and edited by manual operation into multiple clip videos, and related titles are added. A user enters a search term in a unified comprehensive search box to perform a network-wide search, and a search result is displayed if the search term exactly matches an added title.
During research and practice on the prior art, the inventors of the present invention found that, since in the prior art everything from splitting and editing the video to adding titles must be done manually, the large manual-operation component easily leads to incomplete or inaccurate titles, which directly causes the problems of low video search efficiency and low accuracy of search results.
Summary of the Invention
The object of the present invention is to provide a video search method and apparatus, aiming to improve search efficiency and the accuracy of search results.
To solve the above technical problem, the embodiments of the present invention provide the following technical solutions:
A video search method, including:
acquiring a video to be labeled, and predicting video frame labels of the video frames in the video to be labeled by using a preset classification model;
merging video frames that are temporally adjacent and have the same video frame label, and labeling the video to be labeled with the corresponding video frame labels;
upon receiving a search request indicating a video search, determining the video frame label corresponding to the search request based on the labeling results of the video frame labels;
among candidate videos, finding the target video labeled with that video frame label, and displaying the target video.
To solve the above technical problem, the embodiments of the present invention further provide the following technical solutions:
A video search apparatus, including:
a label prediction unit, configured to acquire a video to be labeled and predict video frame labels of the video frames in the video to be labeled by using a preset classification model;
a first labeling unit, configured to merge video frames that are temporally adjacent and have the same video frame label, and to label the video to be labeled with the corresponding video frame labels;
a label determining unit, configured to, upon receiving a search request indicating a video search, determine the video frame label corresponding to the search request based on the labeling results of the video frame labels;
a searching unit, configured to find, among candidate videos, the target video labeled with that video frame label;
a display unit, configured to display the target video.
Compared with the prior art, in the embodiments of the present invention, a preset classification model is first used to predict the video frame labels of the video frames in a video to be labeled, video frames that are temporally adjacent and have the same video frame label are merged, and the video to be labeled is labeled with the corresponding video frame labels; thereafter, upon receiving a user's search request indicating a video search, the video frame label corresponding to the search request is determined based on the labeling results of the video frame labels, so that the target video labeled with that video frame label is found among the candidate videos and displayed. In the embodiments, by predicting, merging, and labeling video frame labels, and determining the video frame label corresponding to a search request from those labeling results, the content of videos is searched using pre-labeled video frame labels; compared with the approach based on manually added titles, this greatly improves the efficiency of video search and the accuracy of search results.
Brief Description of the Drawings
The technical solutions and other beneficial effects of the present invention will become apparent from the following detailed description of the specific embodiments of the present invention in conjunction with the accompanying drawings.
FIG. 1a is a schematic diagram of the scenario of a video search apparatus according to an embodiment of the present invention;
FIG. 1b is a schematic flowchart of a video search method according to a first embodiment of the present invention;
FIGS. 2a to 2h are schematic diagrams of scenarios of a video search method according to a second embodiment of the present invention;
FIG. 3a is a schematic structural diagram of a video search apparatus according to a third embodiment of the present invention;
FIG. 3b is another schematic structural diagram of the video search apparatus according to the third embodiment of the present invention;
FIG. 4 is a structural block diagram of a terminal according to an embodiment of the present invention.
Detailed Description of the Embodiments
Please refer to the drawings, in which the same reference signs denote the same components; the principles of the present invention are illustrated as implemented in a suitable computing environment. The following description is based on the illustrated specific embodiments of the present invention and should not be regarded as limiting other specific embodiments of the present invention that are not detailed here.
In the following description, the specific embodiments of the present invention will be described with reference to steps and symbols executed by one or more computers, unless otherwise stated. Accordingly, these steps and operations will several times be referred to as computer-executed; computer execution as referred to herein includes operations by a computer processing unit on electronic signals representing data in a structured form. These operations transform the data or maintain it at locations in the computer's memory system, which may reconfigure or otherwise alter the operation of the computer in a manner well known to those skilled in the art. The data structures maintained by the data are physical locations of the memory that have particular properties defined by the data format. However, while the principles of the present invention are described in the above text, this does not represent a limitation; those skilled in the art will appreciate that the various steps and operations described below may also be implemented in hardware.
The principles of the present invention operate using many other general-purpose or special-purpose computing or communication environments or configurations. Well-known examples of computing systems, environments, and configurations suitable for use with the present invention may include, but are not limited to, hand-held phones, personal computers, servers, multi-processor systems, microcomputer-based systems, mainframe computers, and distributed computing environments including any of the above systems or devices.
Embodiments of the present invention provide a video search method and apparatus.
Referring to FIG. 1a, a schematic diagram of the scenario of the system in which the video search apparatus according to an embodiment of the present invention resides: the video search system may include a video search apparatus, mainly used to predict the video frame labels of the video frames in a video to be labeled by using a preset classification model, and to merge video frames that are temporally adjacent and have the same video frame label, thereby labeling the video to be labeled with the corresponding video frame labels. Thereafter, the apparatus receives a search request input by a user indicating a video search, for example a search for "kiss scenes in series A" or "funny scenes in series B"; based on the labeling results of the video frame labels, it determines the video frame label corresponding to the search request; then, among candidate videos, such as a specified video or the network-wide videos, it finds the target video labeled with that video frame label, and finally displays the target video.
In addition, the video search system may further include a video library, mainly used to store videos to be labeled so that the video search apparatus can label them with the corresponding video frame labels. The video library also stores the search content involved in actual scenarios and the intent tags corresponding to the search content, so that the video search apparatus can train on them to generate a neural network model; in addition, the video library stores a large number of candidate videos from which the video search apparatus finds target videos, and so on. Of course, the video search system may further include a user terminal, configured to receive a search request directly input by a user through an input device, such as a keyboard or a mouse, and, after the target video is determined, to play the target video through an output device, such as a terminal screen.
Detailed descriptions are given below.
First Embodiment
This embodiment is described from the perspective of the video search apparatus, which may specifically be integrated in a network device such as a server or a gateway.
A video search method includes: acquiring a video to be labeled, and predicting video frame labels of the video frames in the video to be labeled by using a preset classification model; merging video frames that are temporally adjacent and have the same video frame label, and labeling the video to be labeled with the corresponding video frame labels; upon receiving a search request indicating a video search, determining the video frame label corresponding to the search request based on the labeling results of the video frame labels; and, among candidate videos, finding the target video labeled with that video frame label and displaying the target video.
Please refer to FIG. 1b, a schematic flowchart of the video search method according to the first embodiment of the present invention. The specific flow may include the following steps:
In step S101, a video to be labeled is acquired, and the video frame labels of the video frames in the video to be labeled are predicted by using a preset classification model.
In step S102, video frames that are temporally adjacent and have the same video frame label are merged, and the video to be labeled is labeled with the corresponding video frame labels.
In step S103, upon receiving a search request indicating a video search, the video frame label corresponding to the search request is determined based on the labeling results of the video frame labels.
Steps S101 to S103 may specifically be as follows:
It can be understood that, on the one hand, before a search request is processed, video frame labels need to be applied to the network-wide videos first; that is, the video frame labels of the video frames in the video to be labeled are predicted by using a preset classification model, video frames that are temporally adjacent and have the same video frame label are merged, and the video to be labeled is labeled with the corresponding video frame labels, thereby obtaining the labeling results of the video frame labels.
That is, the video search provided in this embodiment is essentially a content-based video search. First, a preset classification model is used to predict the video frame labels of the video frames in the video to be labeled; after prediction, these video frames are sorted, and video frames that are temporally adjacent and have the same video frame label are merged, so that video clips are obtained; finally, these video clips are each labeled with the corresponding video frame label, thereby obtaining the labeling results of the video clips and completing the video frame labeling of the video to be labeled.
Further, "predicting the video frame labels of the video frames in the video to be labeled by using a preset classification model" may specifically be as follows:
(11) extracting the video frames of the video to be labeled by using a key frame extraction algorithm, and determining them as first video frames;
(12) acquiring an image feature of each first video frame, and determining it as a first image feature;
(13) predicting the video frame label of each first video frame by using the preset classification model according to the determined first image features.
It can be understood that, for ease of distinction and understanding, this embodiment refers to the video frames of the video to be labeled as first video frames, and to the image features of the first video frames as first image features; this does not limit the implementation of the solution.
On this basis, "merging video frames that are temporally adjacent and have the same video frame label" may specifically be: merging the first video frames that are temporally adjacent and have the same video frame label.
The key frame extraction algorithm performs shot segmentation on the video sequence and then extracts, within each shot, key frames that can represent the shot's content; the low-level features (color, texture, shape, etc.) extracted from the key frames are used to index and retrieve the shots. Key frame extraction may be performed based on shot boundaries, based on content analysis, based on clustering, and so on; it is not specifically limited here.
For another example, after the video frames of the video to be labeled (i.e., the first video frames) have been extracted, a deep learning model (GoogLeNet) may be used to acquire the image feature of each first video frame (i.e., the first image feature), and a preset classification model, such as a Support Vector Machine (SVM) classification model, may then be used to predict the video frame label of each first video frame.
Furthermore, before video frame labels are applied to the network-wide videos, the classification model may first be determined, for example as follows:
a. collecting original videos that have previously been labeled with scene labels;
b. extracting the video frames of the original videos by using the key frame extraction algorithm, and determining them as second video frames;
c. labeling the second video frames with video frame labels according to the scene labels;
d. acquiring an image feature of each second video frame, and determining it as a second image feature;
e. performing training based on the video frame labels applied to the second video frames and the second image features, to generate the classification model.
It is easy to see that, for ease of distinction and understanding, this embodiment refers to the video frames of the original videos that have been labeled with scene labels as second video frames, and to the image features of the second video frames as second image features; this does not limit the implementation of the solution.
On this basis, "predicting the video frame label of each first video frame by using the preset classification model according to the first image features" may specifically be: predicting the video frame label of each first video frame by using said classification model according to the first image features.
Specifically, for example, video clips that have been manually labeled with scene labels (which may be called original videos) and the scene labels corresponding to the video clips are used as training data; the same key frame extraction algorithm as above is used to extract the video frames of the original videos, the image features of the video frames of the original videos are acquired in the same manner as above, and the SVM classification model is trained by an SVM training algorithm, completing the training process.
It can be understood that there are many ways to determine, upon receiving a search request instructing a video search, the video frame label corresponding to the search request based on the annotation results of the video frame labels (step S103), for example:
(21) upon receiving a search request instructing a video search, performing semantic recognition on the search request based on a preset neural network model;
(22) determining the video frame label corresponding to the search request by combining the semantic recognition result with the annotation results of the video frame labels.
In other words, on the other hand, before search requests are processed, the network model used for semantic recognition needs to be determined, for example as follows:
collecting search content and intent labels corresponding to the search content, and training on the search content and the intent labels to generate the neural network model.
Here, the search content and the corresponding intent labels may be obtained from actual user search requests; for example, search content and the corresponding intent labels serve as training data to train a Deep Neural Network (DNN), thereby generating the neural network model.
On this basis, "performing semantic recognition on the search request based on a preset neural network model to determine the corresponding video frame label" may specifically be: performing semantic recognition on the search request based on the trained neural network model to determine the corresponding video frame label.
In step S104, candidate videos are searched for a target video annotated with that video frame label, and the target video is presented.
In one possible implementation, when the candidate video is a currently playing video, "receiving a search request instructing a video search" may specifically be: receiving the search request instructing a video search in the search box corresponding to the currently playing video.
On this basis, "presenting the target video" may specifically be: determining the play position of the target video on the play progress bar of the currently playing video, and providing an annotation prompt based on the play position for the user to select for segment playback.
That is, in this implementation, a search box is provided on the play page of the current video; searching in this box retrieves target videos within that video, the target video here being a particular segment of the current video.
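Positioning the annotation prompts only requires converting each matched segment's start and end times into fractions of the video duration. A toy helper is sketched below; all names and the returned layout are illustrative assumptions:

```python
def marker_positions(segments, video_duration):
    """Map matched segments (seconds) to fractional positions on the
    play progress bar, where 0.0 is the left edge and 1.0 the right."""
    return [{"label": s["label"],
             "left": s["start"] / video_duration,
             "width": (s["end"] - s["start"]) / video_duration}
            for s in segments]

# A match at 10-14 s of a 100 s video yields a marker starting at 10%
# of the bar and spanning 4% of it:
print(marker_positions([{"label": "kiss_scene", "start": 10, "end": 14}], 100))
```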
In another possible implementation, when the candidate videos are the whole-network video collection, "receiving a search request instructing a video search" may specifically be: receiving the search request instructing a video search in a whole-network search box.
On this basis, "presenting the target video" may specifically be: presenting the target videos together with their attribute information in list form.
That is, in this implementation, the target video search is performed over videos across the whole network, and the retrieved target videos are presented as a list. Since a whole-network search yields many results, the attribute information corresponding to each target video is presented as well; the attribute information may include one or a combination of the drama series name, episode number, variety show name, issue number, segment duration, appearing characters, and similar information.
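One illustrative schema for a row of such a result list is sketched below; the field set mirrors the attribute information enumerated above but is an assumption of the example, not a structure defined by the patent:

```python
from dataclasses import dataclass, asdict

@dataclass
class SearchResult:
    title: str        # drama series or variety show name
    episode: int      # episode / issue number
    start: float      # segment start within the episode, seconds
    duration: float   # segment length, seconds
    cast: list        # characters appearing in the segment

results = [SearchResult("Drama A", 3, 612.0, 45.0, ["Lead 1", "Lead 2"])]
for r in results:
    print(asdict(r))  # one line of the whole-network result list
```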
As can be seen from the above, in the video search method provided by this embodiment, a preset classification model first predicts video frame labels for the frames of a to-be-annotated video; temporally adjacent frames having the same video frame label are merged, and the to-be-annotated video is annotated with the corresponding labels. Thereafter, when a search request instructing a video search is received from a user, the video frame label corresponding to the request is determined based on the annotation results, so that a target video annotated with that label is found among the candidate videos and presented. In this embodiment, by predicting, merging, and annotating video frame labels and then determining the label corresponding to a search request from the annotation results, video content is searched using pre-annotated video frame labels; compared with the approach based on manually added titles, this greatly improves both the efficiency of video search and the accuracy of the search results.
Second Embodiment
The method described in the first embodiment is further illustrated below by way of example.
In the prior art, before video search, videos are first split and clipped through manual operation into multiple segment videos, and relevant titles are added. Referring to FIG. 2a, a schematic diagram of a prior-art video search, a user enters search terms in a unified comprehensive search box (marked with a black border) for a whole-network search; the searched content may include OGC (Occupationally-generated Content) such as films, television, and variety shows, PGC (Professionally-generated Content) videos, or long-tail UGC (User-generated Content) videos. When the search terms exactly match an added title, the search results are shown; when they do not match any added title, a prompt is shown indicating that no relevant video was found.
In the embodiments of the present invention, training is first performed on existing original videos already annotated with scene labels to generate a classification model, and this model is used to predict and annotate the video frame labels of the frames in videos. Meanwhile, training is performed on search content collected in real applications and the corresponding intent labels to generate a neural network model. Thus, when a search request for video content is received, the neural network model performs semantic recognition on the request, the corresponding video frame label is determined, and video content annotated with that label is found and presented to the user for playback.
This technique can search video content within a specified series or across whole-network videos and mark the relevant video segments. By adopting high-level semantic recognition, it greatly improves the result accuracy of colloquial searches, frees up manpower to a large extent, and offers users a richer scene-dimension search. Detailed descriptions follow.
(1) Searching video content within the currently watched video
Specifically, in the embodiments of the present invention, as shown in FIG. 2b, a search icon (marked with a black border) is placed in a corner (e.g., the upper right corner) of the video play page. As shown in FIG. 2c, when the mouse moves over this icon, a search box (marked with a black border) opens; the user can enter search terms in the box, and pressing the Enter key or clicking the search icon submits the search request. If no operation occurs within a short period (e.g., 3 or 5 seconds), the search box automatically retracts.
On this basis, when the user enters the desired video (which may also be called a video segment) in the search box and confirms, annotation prompts for the desired segments are placed at the corresponding play positions on the play progress bar below the video play page. As shown in FIG. 2d, the user clicks a prompted play position to play that segment; for example, entering the search term "kiss scene" in the box and confirming causes the two matching segments to be prompted on the progress bar of the currently playing video.
Reference may also be made to FIG. 2e, a schematic of the search flow, which includes: S21, the user enters search terms; S22, semantic recognition; S23, if related, video segment results are returned; S24, if similar, same-category video results are returned; S25, if there is no match, a prompt that no video result was found is returned. That is, during video search, querying for content annotated with the corresponding video frame label may include: if a segment relevant to the user's search can be matched, an annotation prompt for the corresponding segment is given below; if no relevant segment is matched but a similar segment exists, a prompt may be given; and if neither a matching nor a similar segment exists, a "no result found" prompt is given.
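One way this three-branch decision could be realized is with similarity thresholds over the query and label vectors described later in this embodiment; the two threshold values below are assumptions made purely for illustration:

```python
import numpy as np

def answer_search(query_vec, segments, related_t=0.85, similar_t=0.60):
    """Return ("related", seg), ("similar", seg) or ("not_found", None),
    mirroring steps S23-S25; each segment carries a label vector "vec"."""
    if not segments:
        return ("not_found", None)
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    score, best = max(((cos(query_vec, s["vec"]), s) for s in segments),
                      key=lambda p: p[0])
    if score >= related_t:
        return ("related", best)    # S23: annotate the matching segment
    if score >= similar_t:
        return ("similar", best)    # S24: offer a same-category segment
    return ("not_found", None)      # S25: prompt that nothing was found
```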
(2) Searching video content across whole-network videos
As shown in FIG. 2f, the video content to be found, such as "kiss scenes in Korean dramas", is entered directly in the comprehensive search box (marked with a black border). After the system recognizes the need, it returns a result list of pre-extracted video segments from the whole-network video collection and presents attribute information such as the segment title (drama series name, episode number, variety show name, issue number, etc.), segment duration and time, and appearing characters, for the user to choose from.
The technical architecture is set out below.
1) How the video frame label corresponding to a video segment is identified automatically
For example, refer to FIG. 2g, a schematic of the process of training on existing original videos already annotated with scene labels to generate a classification model, and of using that model to predict and annotate the video frame labels of frames in videos.
Specifically, during model training, video segments (which may be called original or existing videos) and their corresponding scene labels are first obtained from the existing manually scene-labeled video library as training data. A keyframe extraction algorithm extracts the keyframe images of the original videos (i.e., the second video frames), and these images are given video frame labels according to the scene labels of the original videos.
For each frame image, the trained GoogLeNet network extracts a 1024-dimensional floating-point vector as the image feature (i.e., the second image feature); combined with the video frame labels, an SVM training algorithm then trains the SVM classification model, completing the training process. The SVM classification model is a supervised learning model commonly used for pattern recognition, classification, and regression analysis.
During prediction, i.e., automatic video frame label annotation, the same keyframe extraction algorithm as in training first extracts the video frames of the video segment (i.e., the to-be-annotated video), referred to as the first video frames; GoogLeNet likewise acquires a 1024-dimensional image feature for each frame (i.e., the first image feature); the SVM classification model output by the training process then predicts the video frame label of each frame; and finally, temporally adjacent frames having the same label are merged, yielding the annotation result for the video segment.
2) How semantic recognition is performed on the user's search terms
For example, refer to FIG. 2h, a schematic of the process of training on search content collected in real applications and the corresponding intent labels to generate a neural network model, and of using that model to perform semantic recognition on search requests.
During training, search content involved in real applications and the intent labels corresponding to that content are first collected; that is, real Queries (i.e., search content) and the search intent labels corresponding to those Queries serve as training data to train a deep neural network, DNN (i.e., the neural network model), minimizing the classification cross-entropy loss so that the cosine distance between a Query and its corresponding label becomes small.
During prediction, the trained network model first computes the relevance between the search Query and the video frame labels at the semantic level: the user's Query is converted into a 128-dimensional vector, the cosine distance between this vector and the 128-dimensional vector of each video frame label is computed, and the label with the smallest cosine distance is taken as the prediction output.
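A minimal sketch of such an encoder and matcher follows; the network shape, the bag-of-words input, and the omission of the cross-entropy training loop are simplifying assumptions, with only the 128-dimensional embedding and the smallest-cosine-distance rule taken from the text:

```python
import numpy as np
import torch
import torch.nn as nn

EMBED_DIM = 128  # the 128-dimensional vectors described above

class QueryEncoder(nn.Module):
    """Maps a bag-of-words query vector to a unit-norm 128-d embedding."""
    def __init__(self, vocab_size, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(vocab_size, hidden), nn.ReLU(),
                                 nn.Linear(hidden, EMBED_DIM))
    def forward(self, x):
        return nn.functional.normalize(self.net(x), dim=-1)

def predict_label(query_vec, label_vecs):
    """Pick the video frame label whose 128-d vector has the smallest
    cosine distance (1 - cosine similarity) to the encoded query.
    Both inputs are assumed to be unit-norm numpy vectors."""
    dists = {name: 1.0 - float(np.dot(query_vec, v))
             for name, v in label_vecs.items()}
    return min(dists, key=dists.get)
```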
As can be seen from the above, in the video search method provided by this embodiment, a preset classification model first predicts video frame labels for the frames of a to-be-annotated video; temporally adjacent frames having the same video frame label are merged, and the to-be-annotated video is annotated with the corresponding labels. Thereafter, when a search request instructing a video search is received from a user, the video frame label corresponding to the request is determined based on the annotation results, so that a target video annotated with that label is found among the candidate videos and presented. In this embodiment, by predicting, merging, and annotating video frame labels and then determining the label corresponding to a search request from the annotation results, video content is searched using pre-annotated video frame labels; compared with the approach based on manually added titles, this greatly improves both the efficiency of video search and the accuracy of the search results.
Third Embodiment
To facilitate better implementation of the video search method provided by the embodiments of the present invention, an embodiment of the present invention further provides an apparatus based on the above video search method. The terms have the same meanings as in the above video search method, and for specific implementation details, reference may be made to the description in the method embodiments.
Referring to FIG. 3a, a schematic structural diagram of the video search apparatus according to an embodiment of the present invention, the apparatus may include a label prediction unit 301, a first annotation unit 302, a label determination unit 303, a search unit 304, and a presentation unit 305.
The label prediction unit 301 is configured to acquire a to-be-annotated video and predict, using a preset classification model, video frame labels for the video frames of the to-be-annotated video; the first annotation unit 302 is configured to merge temporally adjacent video frames having the same video frame label and annotate the to-be-annotated video with the corresponding video frame labels.
It can be understood that, on the one hand, before search requests are processed, videos across the whole network need to be annotated with video frame labels: the preset classification model predicts the video frame labels of the frames in the to-be-annotated video, temporally adjacent frames having the same label are merged, and the to-be-annotated video is annotated with the corresponding labels, yielding the annotation results of the video frame labels.
In other words, the video search provided by this embodiment is essentially content-based video search: the preset classification model first predicts the video frame labels of the frames in the to-be-annotated video; after prediction, the frames are organized and temporally adjacent frames with identical labels are merged, producing video segments; finally, each segment is annotated with its corresponding label, yielding the segment annotation results and completing the video frame label annotation of the to-be-annotated video.
The label determination unit 303 is configured to determine, upon receiving a search request instructing a video search, the video frame label corresponding to the search request based on the annotation results of the video frame labels; the search unit 304 is configured to search candidate videos for a target video annotated with that video frame label; the presentation unit 305 is configured to present the target video.
Further, the label prediction unit 301 may include:
(11) an extraction subunit, configured to extract the video frames of the to-be-annotated video using a keyframe extraction algorithm and determine them as first video frames;
(12) an acquisition subunit, configured to acquire the image feature of each first video frame and determine it as a first image feature;
(13) a prediction subunit, configured to predict the video frame label of each first video frame using the preset classification model, according to the first image features.
It can be understood that, for ease of distinction, this embodiment refers to the video frames of the to-be-annotated video as first video frames and to their image features as first image features; this does not limit the implementation of the solution.
On this basis, the first annotation unit 302 may specifically be configured to merge temporally adjacent first video frames having the same video frame label and annotate the to-be-annotated video with the corresponding video frame labels.
Here, a keyframe extraction algorithm performs shot segmentation on the video sequence and then extracts, within each shot, keyframes representative of the shot's content; low-level features extracted from the keyframes (color, texture, shape, etc.) are used to index and retrieve shots. Keyframes may be extracted based on shot boundaries, based on content analysis, based on clustering, and so on; no specific limitation is imposed here.
For another example, after the video frames of the to-be-annotated video (i.e., the first video frames) have been extracted, a deep learning model (GoogLeNet) may acquire the image feature of each first video frame (i.e., the first image feature), and a preset classification model, such as an SVM classification model, may then predict the video frame label of each first video frame.
Still further, reference may also be made to FIG. 3b, another schematic structural diagram of the video search apparatus. Before videos across the whole network are annotated with video frame labels, the classification model may first be determined; for example, the apparatus may further include:
a. a first collection unit 306, configured to collect original videos that have previously been annotated with scene labels;
b. an extraction unit 307, configured to extract the video frames of the original videos using a keyframe extraction algorithm and determine them as second video frames;
c. a second annotation unit 308, configured to annotate the second video frames with video frame labels according to the scene labels;
d. a second acquisition unit 309, configured to acquire the image feature of each second video frame and determine it as a second image feature;
e. a first training unit 310, configured to train on the video frame labels annotated on the second video frames and on the second image features, to generate the classification model.
It is readily conceivable that, for ease of distinction, this embodiment refers to the video frames of the scene-label-annotated original videos as second video frames and to their image features as second image features; this does not limit the implementation of the solution.
On this basis, the prediction subunit may specifically be configured to predict the video frame label of each first video frame using the trained classification model, according to the first image features.
Specifically, for example, video segments that have been manually annotated with scene labels (which may be called original videos) and their corresponding scene labels serve as training data; the same keyframe extraction algorithm as above extracts the video frames of the original videos, their image features are acquired in the same way as above, and the SVM classification model is trained using an SVM training algorithm, completing the training process.
It can be understood that the label determination unit 303 may determine the video frame label corresponding to the search request in many ways; for example, it may specifically include:
(21) a receiving subunit, configured to receive a search request instructing a video search;
(22) a recognition subunit, configured to perform semantic recognition on the search request based on a preset neural network model;
(23) a label determination subunit, configured to determine the video frame label corresponding to the search request by combining the semantic recognition result with the annotation results of the video frame labels.
In other words, on the other hand, before search requests are processed, the network model used for semantic recognition needs to be determined; for example, the apparatus may further include:
a second collection unit 311, configured to collect search content and intent labels corresponding to the search content;
a second training unit 312, configured to train on the search content and the intent labels to generate the neural network model.
Here, the search content and the corresponding intent labels may be obtained from actual user search requests; for example, they serve as training data to train a deep neural network (DNN), thereby generating the neural network model.
On this basis, the recognition subunit may specifically be configured to perform semantic recognition on the search request based on the trained neural network model.
As for presenting the target video: in one possible implementation, when the candidate video is a currently playing video, the receiving subunit is specifically configured to receive the search request instructing a video search in the search box corresponding to the currently playing video.
On this basis, the presentation unit 305 may specifically be configured to determine the play position of the target video on the play progress bar of the currently playing video and provide an annotation prompt based on the play position for the user to select for segment playback.
That is, in this implementation, a search box is provided on the play page of the current video; searching in this box retrieves target videos within that video, the target video here being a particular segment of the current video.
In another possible implementation, when the candidate videos are the whole-network video collection, the receiving subunit is specifically configured to receive the search request instructing a video search in a whole-network search box.
On this basis, the presentation unit 305 may specifically be configured to present the target videos together with their attribute information in list form.
That is, in this implementation, the target video search is performed over whole-network videos, and the retrieved target videos are presented as a list; since a whole-network search yields many results, the attribute information corresponding to each target video is presented as well, and may include one or a combination of the drama series name, episode number, variety show name, issue number, segment duration, appearing characters, and similar information.
In specific implementation, each of the above units may be realized as an independent entity, or combined arbitrarily and realized as one or several entities; for the specific implementation of each unit, reference may be made to the foregoing method embodiments, which are not repeated here.
The video search apparatus may specifically be integrated into a network device such as a server or a gateway.
As can be seen from the above, in the video search apparatus provided by this embodiment, a preset classification model first predicts video frame labels for the frames of a to-be-annotated video; temporally adjacent frames having the same video frame label are merged, and the to-be-annotated video is annotated with the corresponding labels. Thereafter, when a search request instructing a video search is received from a user, the video frame label corresponding to the request is determined based on the annotation results, so that a target video annotated with that label is found among the candidate videos and presented. By predicting, merging, and annotating video frame labels and then determining the label corresponding to a search request from the annotation results, video content is searched using pre-annotated video frame labels; compared with the approach based on manually added titles, this greatly improves both the efficiency of video search and the accuracy of the search results.
Fourth Embodiment
According to an embodiment of the present invention, a terminal for implementing the above video search method is further provided.
FIG. 4 is a structural block diagram of a terminal according to an embodiment of the present invention. As shown in FIG. 4, the terminal may include one or more processors 201 (only one is shown in the figure), a memory 203, and a transmission apparatus 205; as shown in FIG. 4, the terminal may further include an input/output device 207.
The memory 203 may be used to store software programs and modules, such as the program instructions/modules corresponding to the video search method and apparatus in the embodiments of the present invention; the processor 201 runs the software programs and modules stored in the memory 203 to execute various functional applications and data processing, i.e., to implement the above video search method. The memory 203 may include high-speed random access memory and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 203 may further include memory remotely located relative to the processor 201, which may be connected to the terminal through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission apparatus 205 is used to receive or send data via a network. Specific examples of such networks include wired and wireless networks. In one example, the transmission apparatus 205 includes a Network Interface Controller (NIC), which can be connected to other network devices and routers via network cables so as to communicate with the Internet or a local area network. In one example, the transmission apparatus 205 is a Radio Frequency (RF) module, which communicates with the Internet wirelessly.
Specifically, the memory 203 is used to store application programs.
The processor 201 may call the application programs stored in the memory 203 to execute the following steps: acquiring a to-be-annotated video and predicting, using a preset classification model, video frame labels for the video frames of the to-be-annotated video; merging temporally adjacent video frames having the same video frame label and annotating the to-be-annotated video with the corresponding video frame labels; upon receiving a search request instructing a video search, determining the video frame label corresponding to the search request based on the annotation results of the video frame labels; and searching candidate videos for a target video annotated with that video frame label and presenting the target video.
The processor 201 is further used to execute the following steps: extracting the video frames of the to-be-annotated video using a keyframe extraction algorithm and determining them as first video frames; acquiring the image feature of each first video frame and determining it as a first image feature; predicting the video frame label of each first video frame using the preset classification model according to the first image features; and merging temporally adjacent first video frames having the same video frame label.
The processor 201 is further used to execute the following steps: before the to-be-annotated video is acquired, collecting original videos previously annotated with scene labels; extracting the video frames of the original videos using a keyframe extraction algorithm and determining them as second video frames; annotating the second video frames with video frame labels according to the scene labels; acquiring the image feature of each second video frame and determining it as a second image feature; training on the video frame labels annotated on the second video frames and on the second image features to generate the classification model; and predicting the video frame label of each first video frame using the classification model according to the first image features.
The processor 201 is further used to execute the following steps: upon receiving a search request instructing a video search, performing semantic recognition on the search request based on a preset neural network model; and determining the video frame label corresponding to the search request by combining the semantic recognition result with the annotation results of the video frame labels.
The processor 201 is further used to execute the following steps: before the search request instructing a video search is received, collecting search content and intent labels corresponding to the search content; training on the search content and the intent labels to generate the neural network model; and performing semantic recognition on the search request based on the neural network model.
The processor 201 is further used to execute the following steps: receiving the search request instructing a video search in the search box corresponding to the currently playing video; determining the play position of the target video on the play progress bar of the currently playing video; and providing an annotation prompt based on the play position for the user to select for segment playback.
The processor 201 is further used to execute the following steps: receiving the search request instructing a video search in a whole-network search box; and presenting the target videos together with their attribute information in list form.
Those of ordinary skill in the art can understand that the structure shown in FIG. 4 is merely illustrative; the terminal may be a terminal device such as a smartphone (e.g., an Android or iOS phone), a tablet computer, a palmtop computer, a Mobile Internet Device (MID), or a PAD. FIG. 4 does not limit the structure of the above electronic apparatus; for example, the terminal may include more or fewer components than shown in FIG. 4 (e.g., a network interface or display device), or have a configuration different from that shown in FIG. 4.
Those of ordinary skill in the art can understand that all or some of the steps in the various methods of the above embodiments may be completed by a program instructing hardware related to the terminal device; the program may be stored in a computer-readable storage medium, and the storage medium may include a flash drive, Read-Only Memory (ROM), Random Access Memory (RAM), a magnetic disk, an optical disc, or the like.
Fifth Embodiment
An embodiment of the present invention further provides a storage medium. Optionally, in this embodiment, the storage medium may be used to store the program code for executing the video search method.
Optionally, in this embodiment, the storage medium may be located on at least one of a plurality of network devices in the network shown in the above embodiments.
Optionally, in this embodiment, the storage medium is configured to store program code for executing the following steps:
S1, acquiring a to-be-annotated video and predicting, using a preset classification model, video frame labels for the video frames of the to-be-annotated video;
S2, merging temporally adjacent video frames having the same video frame label and annotating the to-be-annotated video with the corresponding video frame labels;
S3, upon receiving a search request instructing a video search, determining the video frame label corresponding to the search request based on the annotation results of the video frame labels;
S4, searching candidate videos for a target video annotated with that video frame label and presenting the target video.
Optionally, the storage medium is further configured to store program code for executing the following steps: extracting the video frames of the to-be-annotated video using a keyframe extraction algorithm and determining them as first video frames; acquiring the image feature of each first video frame and determining it as a first image feature; predicting the video frame label of each first video frame using the preset classification model according to the first image features; and merging temporally adjacent first video frames having the same video frame label.
Optionally, the storage medium is further configured to store program code for executing the following steps: before the to-be-annotated video is acquired, collecting original videos previously annotated with scene labels; extracting the video frames of the original videos using a keyframe extraction algorithm and determining them as second video frames; annotating the second video frames with video frame labels according to the scene labels; acquiring the image feature of each second video frame and determining it as a second image feature; training on the video frame labels annotated on the second video frames and on the second image features to generate the classification model; and predicting the video frame label of each first video frame using the classification model according to the first image features.
Optionally, the storage medium is further configured to store program code for executing the following steps: upon receiving a search request instructing a video search, performing semantic recognition on the search request based on a preset neural network model; and determining the video frame label corresponding to the search request by combining the semantic recognition result with the annotation results of the video frame labels.
Optionally, the storage medium is further configured to store program code for executing the following steps: before the search request instructing a video search is received, collecting search content and intent labels corresponding to the search content; training on the search content and the intent labels to generate the neural network model; and performing semantic recognition on the search request based on the neural network model.
Optionally, the storage medium is further configured to store program code for executing the following steps: receiving the search request instructing a video search in the search box corresponding to the currently playing video; determining the play position of the target video on the play progress bar of the currently playing video; and providing an annotation prompt based on the play position for the user to select for segment playback.
Optionally, the storage medium is further configured to store program code for executing the following steps: receiving the search request instructing a video search in a whole-network search box; and presenting the target videos together with their attribute information in list form.
Optionally, in this embodiment, the storage medium may include, but is not limited to, various media capable of storing program code, such as a USB flash drive, Read-Only Memory (ROM), Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
In the above embodiments, the description of each embodiment has its own emphasis; for parts not detailed in a given embodiment, reference may be made to the detailed description of the video search method above, which is not repeated here.
The video search apparatus provided by the embodiments of the present invention may be, for example, a computer, a tablet computer, or a mobile phone with a touch function; the video search apparatus and the video search method in the above embodiments belong to the same concept, and any method provided in the video search method embodiments may run on the video search apparatus; the specific implementation process is detailed in the video search method embodiments and is not repeated here.
It should be noted that, for the video search method of the present invention, those of ordinary skill in the art can understand that all or part of the flow of implementing the video search method of the embodiments of the present invention may be completed by a computer program controlling related hardware; the computer program may be stored in a computer-readable storage medium, such as in the memory of a terminal, and executed by at least one processor within the terminal, and the execution process may include the flow of the embodiments of the video search method. The storage medium may be a magnetic disk, an optical disc, Read-Only Memory (ROM), Random Access Memory (RAM), or the like.
For the video search apparatus of the embodiments of the present invention, its functional modules may be integrated into one processing chip, or each module may exist physically on its own, or two or more modules may be integrated into one module. The integrated modules may be implemented in the form of hardware or in the form of software functional modules. When implemented as software functional modules and sold or used as an independent product, the integrated modules may also be stored in a computer-readable storage medium, such as read-only memory, a magnetic disk, or an optical disc.
The video search method and apparatus provided by the embodiments of the present invention have been introduced in detail above. Specific examples are used herein to expound the principles and implementations of the present invention, and the description of the above embodiments is only intended to help understand the method of the present invention and its core idea. Meanwhile, those skilled in the art may make changes to the specific implementations and the scope of application in accordance with the idea of the present invention. In summary, the contents of this specification should not be construed as limiting the present invention.

Claims (14)

  1. A video search method, comprising:
    acquiring a to-be-annotated video, and predicting, using a preset classification model, video frame labels for the video frames of the to-be-annotated video;
    merging temporally adjacent video frames having the same video frame label, and annotating the to-be-annotated video with the corresponding video frame labels;
    upon receiving a search request instructing a video search, determining the video frame label corresponding to the search request based on the annotation results of the video frame labels;
    searching candidate videos for a target video annotated with the video frame label, and presenting the target video.
  2. The video search method according to claim 1, wherein the predicting, using a preset classification model, video frame labels for the video frames of the to-be-annotated video comprises:
    extracting the video frames of the to-be-annotated video using a keyframe extraction algorithm, and determining them as first video frames;
    acquiring the image feature of each of the first video frames, and determining it as a first image feature;
    predicting the video frame label of each first video frame using the preset classification model, according to the first image features;
    the merging temporally adjacent video frames having the same video frame label is specifically: merging temporally adjacent first video frames having the same video frame label.
  3. The video search method according to claim 2, wherein, before the acquiring a to-be-annotated video, the method further comprises:
    collecting original videos that have previously been annotated with scene labels;
    extracting the video frames of the original videos using a keyframe extraction algorithm, and determining them as second video frames;
    annotating the second video frames with video frame labels according to the scene labels;
    acquiring the image feature of each of the second video frames, and determining it as a second image feature;
    training on the video frame labels annotated on the second video frames and on the second image features, to generate the classification model;
    the predicting the video frame label of each first video frame using the preset classification model according to the first image features is specifically: predicting the video frame label of each first video frame using the classification model according to the first image features.
  4. The video search method according to any one of claims 1 to 3, wherein the determining, upon receiving a search request instructing a video search, the video frame label corresponding to the search request based on the annotation results of the video frame labels comprises:
    upon receiving a search request instructing a video search, performing semantic recognition on the search request based on a preset neural network model;
    determining the video frame label corresponding to the search request by combining the semantic recognition result with the annotation results of the video frame labels.
  5. The video search method according to claim 4, wherein, before the receiving a search request instructing a video search, the method further comprises:
    collecting search content and intent labels corresponding to the search content;
    training on the search content and the intent labels to generate the neural network model;
    the performing semantic recognition on the search request based on a preset neural network model is specifically: performing semantic recognition on the search request based on the neural network model.
  6. The video search method according to claim 1, wherein,
    when the candidate video is a currently playing video, the receiving a search request instructing a video search is specifically: receiving the search request instructing a video search in the search box corresponding to the currently playing video;
    the presenting the target video is specifically: determining the play position of the target video on the play progress bar of the currently playing video, and providing an annotation prompt based on the play position for the user to select for segment playback.
  7. The video search method according to claim 1, wherein,
    when the candidate videos are the whole-network video collection, the receiving a search request instructing a video search is specifically: receiving the search request instructing a video search in a whole-network search box;
    the presenting the target video is specifically: presenting the target video together with attribute information of the target video in list form.
  8. A video search apparatus, comprising:
    a label prediction unit, configured to acquire a to-be-annotated video and predict, using a preset classification model, video frame labels for the video frames of the to-be-annotated video;
    a first annotation unit, configured to merge temporally adjacent video frames having the same video frame label and annotate the to-be-annotated video with the corresponding video frame labels;
    a label determination unit, configured to determine, upon receiving a search request instructing a video search, the video frame label corresponding to the search request based on the annotation results of the video frame labels;
    a search unit, configured to search candidate videos for a target video annotated with the video frame label;
    a presentation unit, configured to present the target video.
  9. The video search apparatus according to claim 8, wherein the label prediction unit comprises:
    an extraction subunit, configured to extract the video frames of the to-be-annotated video using a keyframe extraction algorithm and determine them as first video frames;
    an acquisition subunit, configured to acquire the image feature of each of the first video frames and determine it as a first image feature;
    a prediction subunit, configured to predict the video frame label of each first video frame using the preset classification model, according to the first image features;
    the first annotation unit is configured to merge temporally adjacent first video frames having the same video frame label and annotate the to-be-annotated video with the corresponding video frame labels.
  10. The video search apparatus according to claim 9, wherein the apparatus further comprises:
    a first collection unit, configured to collect original videos that have previously been annotated with scene labels;
    an extraction unit, configured to extract the video frames of the original videos using a keyframe extraction algorithm and determine them as second video frames;
    a second annotation unit, configured to annotate the second video frames with video frame labels according to the scene labels;
    a second acquisition unit, configured to acquire the image feature of each of the second video frames and determine it as a second image feature;
    a first training unit, configured to train on the video frame labels annotated on the second video frames and on the second image features, to generate the classification model;
    the prediction subunit is configured to predict the video frame label of each first video frame using the classification model, according to the first image features.
  11. The video search apparatus according to any one of claims 8 to 10, wherein the label determination unit comprises:
    a receiving subunit, configured to receive a search request instructing a video search;
    a recognition subunit, configured to perform semantic recognition on the search request based on a preset neural network model;
    a label determination subunit, configured to determine the video frame label corresponding to the search request by combining the semantic recognition result with the annotation results of the video frame labels.
  12. The video search apparatus according to claim 11, wherein the apparatus further comprises:
    a second collection unit, configured to collect search content and intent labels corresponding to the search content;
    a second training unit, configured to train on the search content and the intent labels to generate the neural network model;
    the recognition subunit is configured to perform semantic recognition on the search request based on the neural network model.
  13. The video search apparatus according to claim 8, wherein,
    when the candidate video is a currently playing video, the receiving subunit is configured to receive the search request instructing a video search in the search box corresponding to the currently playing video;
    the presentation unit is configured to determine the play position of the target video on the play progress bar of the currently playing video and provide an annotation prompt based on the play position for the user to select for segment playback.
  14. The video search apparatus according to claim 8, wherein,
    when the candidate videos are the whole-network video collection, the receiving subunit is configured to receive the search request instructing a video search in a whole-network search box;
    the presentation unit is configured to present the target video together with attribute information of the target video in list form.
PCT/CN2016/112390 2015-12-30 2016-12-27 Video search method and apparatus WO2017114388A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US15/712,316 US10642892B2 (en) 2015-12-30 2017-09-22 Video search method and apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201511017439.6 2015-12-30
CN201511017439.6A CN105677735B (zh) Video search method and apparatus

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US15/712,316 Continuation US10642892B2 (en) 2015-12-30 2017-09-22 Video search method and apparatus

Publications (1)

Publication Number Publication Date
WO2017114388A1 true WO2017114388A1 (zh) 2017-07-06

Family

ID=56189754

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/112390 WO2017114388A1 (zh) 2015-12-30 2016-12-27 一种视频搜索方法及装置

Country Status (3)

Country Link
US (1) US10642892B2 (zh)
CN (1) CN105677735B (zh)
WO (1) WO2017114388A1 (zh)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102207966A (zh) * 2011-06-01 2011-10-05 华南理工大学 Rapid video content retrieval method based on object labels
CN103761284A (zh) * 2014-01-13 2014-04-30 中国农业大学 Video retrieval method and system
CN104133875A (zh) * 2014-07-24 2014-11-05 北京中视广信科技有限公司 Face-based video annotation method and video retrieval method
CN105677735A (zh) * 2015-12-30 2016-06-15 腾讯科技(深圳)有限公司 Video search method and apparatus

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7739599B2 (en) * 2005-09-23 2010-06-15 Microsoft Corporation Automatic capturing and editing of a video
US8631439B2 (en) * 2007-04-06 2014-01-14 At&T Intellectual Property I, L.P. Methods, systems, and computer program products for implementing a navigational search structure for media content
US20090150947A1 (en) * 2007-10-05 2009-06-11 Soderstrom Robert W Online search, storage, manipulation, and delivery of video content
CN101566998B (zh) * 2009-05-26 2011-12-28 华中师范大学 Chinese question-answering system based on neural networks
CN102360431A (zh) * 2011-10-08 2012-02-22 大连海事大学 Method for automatically describing images
CN102663015B (zh) * 2012-03-21 2015-05-06 上海大学 Video semantic annotation method based on a bag-of-features model and supervised learning
US9280742B1 (en) * 2012-09-05 2016-03-08 Google Inc. Conceptual enhancement of automatic multimedia annotations
US10623821B2 (en) * 2013-09-10 2020-04-14 Tivo Solutions Inc. Method and apparatus for creating and sharing customized multimedia segments

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109151615A (zh) * 2018-11-02 2019-01-04 湖南双菱电子科技有限公司 Video processing method, computer device and computer storage medium
CN111382620A (zh) * 2018-12-28 2020-07-07 阿里巴巴集团控股有限公司 Video label adding method, computer storage medium and electronic device
CN111382620B (zh) * 2018-12-28 2023-06-09 阿里巴巴集团控股有限公司 Video label adding method, computer storage medium and electronic device
CN109977239A (zh) * 2019-03-31 2019-07-05 联想(北京)有限公司 Information processing method and electronic device
CN109977239B (zh) * 2019-03-31 2023-08-18 联想(北京)有限公司 Information processing method and electronic device
CN111263186A (zh) * 2020-02-18 2020-06-09 中国传媒大学 Video generation, playback, search and processing method, apparatus and storage medium
CN113742585A (zh) * 2021-08-31 2021-12-03 深圳Tcl新技术有限公司 Content search method, apparatus, electronic device and computer-readable storage medium
CN113806588A (zh) * 2021-09-22 2021-12-17 北京百度网讯科技有限公司 Method and apparatus for searching videos
CN113806588B (zh) * 2021-09-22 2024-04-12 北京百度网讯科技有限公司 Method and apparatus for searching videos

Also Published As

Publication number Publication date
CN105677735B (zh) 2020-04-21
CN105677735A (zh) 2016-06-15
US10642892B2 (en) 2020-05-05
US20180025079A1 (en) 2018-01-25

Legal Events

Code 121: Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 16881177; Country of ref document: EP; Kind code of ref document: A1)
Code NENP: Non-entry into the national phase (Ref country code: DE)
Code 122: Ep: pct application non-entry in european phase (Ref document number: 16881177; Country of ref document: EP; Kind code of ref document: A1)