WO2020029966A1 - Video processing method and apparatus, electronic device, and storage medium - Google Patents

Video processing method and apparatus, electronic device, and storage medium

Info

Publication number
WO2020029966A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
feature information
preselected
feature
target
Prior art date
Application number
PCT/CN2019/099486
Other languages
English (en)
French (fr)
Inventor
汤晓鸥
邵典
熊宇
赵岳
黄青虬
乔宇
林达华
Original Assignee
北京市商汤科技开发有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京市商汤科技开发有限公司
Priority to US16/975,347 (US11120078B2)
Priority to JP2020573569A (JP6916970B2)
Priority to SG11202008134YA
Priority to KR1020207030575A (KR102222300B1)
Publication of WO2020029966A1

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/232Content retrieval operation locally within server, e.g. reading video streams from disk arrays
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73Querying
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/74Browsing; Visualisation therefor
    • G06F16/743Browsing; Visualisation therefor a collection of video files or sequences
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/23418Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/25Management operations performed by the server for facilitating the content distribution or administrating data related to end-users or client devices, e.g. end-user or client device authentication, learning user preferences for recommending movies
    • H04N21/251Learning process for intelligent management, e.g. learning user preferences for recommending movies
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/432Content retrieval operation from a local storage medium, e.g. hard-disk
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/45Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
    • H04N21/466Learning process for intelligent management, e.g. learning user preferences for recommending movies
    • H04N21/4668Learning process for intelligent management, e.g. learning user preferences for recommending movies for recommending content, e.g. movies
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/482End-user interface for program selection
    • H04N21/4826End-user interface for program selection using recommendation lists, e.g. of programs or channels sorted out according to their score
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/482End-user interface for program selection
    • H04N21/4828End-user interface for program selection for searching program descriptors

Definitions

  • querying or retrieving videos in a video library through a sentence usually requires defining content tags for the videos in the video library in advance, and then retrieving the videos through the tags.
  • content tags are not extensible, and it is difficult to retrieve video content that the tags do not have.
  • the content tags of different videos may be duplicated, which may result in redundant search results, and tags cannot handle retrieval requests expressed in natural language.
  • a video processing method, including: determining, according to paragraph information of a query text paragraph and video information of multiple videos in a video library, a preselected video associated with the query text paragraph among the multiple videos; and determining a target video in the preselected video according to video frame information of the preselected video and sentence information of the query text paragraph.
  • a video processing device, including: a preselected video determining module, configured to determine, according to the paragraph information of the query text paragraph and the video information of multiple videos in the video library, a preselected video associated with the query text paragraph among the multiple videos;
  • a target video determination module configured to determine a target video in the preselected video according to video frame information of the preselected video and sentence information of the query text paragraph.
  • FIG. 1 shows a flowchart of a video processing method according to an embodiment of the present disclosure
  • FIG. 2 shows a flowchart of a video processing method according to an embodiment of the present disclosure
  • FIG. 3 shows a flowchart of a video processing method according to an embodiment of the present disclosure
  • FIG. 4 illustrates a flowchart of a video processing method according to an embodiment of the present disclosure
  • FIG. 5 shows an application diagram of a video processing method according to an embodiment of the present disclosure
  • FIG. 6 illustrates a block diagram of a video processing apparatus according to an embodiment of the present disclosure
  • FIG. 7 illustrates a block diagram of an electronic device according to an embodiment of the present disclosure
  • FIG. 8 illustrates a block diagram of an electronic device according to an embodiment of the present disclosure.
  • exemplary means “serving as an example, embodiment, or illustration.” Any embodiment described herein as “exemplary” is not necessarily to be construed as superior to or better than other embodiments.
  • FIG. 1 illustrates a flowchart of a video processing method according to an embodiment of the present disclosure. As shown in FIG. 1, the video processing method includes:
  • in step S11, according to the paragraph information of the query text paragraph and the video information of the multiple videos in the video library, a preselected video associated with the query text paragraph is determined from the multiple videos;
  • in step S12, the target video in the preselected video is determined according to the video frame information of the preselected video and the sentence information of the query text paragraph.
  • the preselected video is determined according to the paragraph information of the query text paragraph and the video information of the videos, and the target video is then determined according to the sentence information of the query text paragraph and the video frame information of the preselected video. Retrieving videos by their relevance to the query text paragraph can accurately find the target video, avoids redundant query results, and can process query text paragraphs in the form of natural language without being limited by the inherent content of content tags.
  • the video processing method may be executed by a terminal device, a server, or other processing devices, where the terminal device may be user equipment (UE), a mobile device, a user terminal, a terminal, a cellular phone, a cordless phone, a personal digital assistant (PDA), a handheld device, a computing device, an in-vehicle device, a wearable device, etc.
  • the video processing method may be implemented by a processor invoking computer-readable instructions stored in a memory.
  • query text paragraphs include one or more sentences.
  • the sentence information includes first feature information of one or more sentences of the query text paragraph
  • the paragraph information includes second feature information of the query text paragraph
  • the video frame information includes fourth feature information of multiple video frames of the video, and the video information includes third feature information of the video.
  • first feature information of one or more sentences in a query text paragraph may be acquired, and second feature information of the query text paragraph is determined.
  • the first feature information of the sentence may be a feature vector representing the semantics of the sentence.
  • the method further includes: performing feature extraction processing on one or more sentences of the query text paragraph to obtain the first features of the one or more sentences. information.
  • the second feature information of the query text paragraph is determined according to the first feature information of one or more sentences in the query text paragraph.
  • feature extraction may be performed on the content of one or more sentences through methods such as semantic recognition to obtain first feature information of the one or more sentences.
  • the content of one or more sentences may be semantically identified through a neural network to perform feature extraction on the content of one or more sentences, thereby obtaining first feature information of the one or more sentences.
  • the present disclosure does not limit the method of feature extraction of the content of one or more sentences.
  • the first feature information may be a feature vector representing the semantics of the sentence
  • the first feature information of one or more sentences in the query text paragraph may be fused to obtain the second feature information of the query text paragraph; the second feature information may be a feature vector representing the semantics of the query text paragraph.
  • the first feature information is a feature vector representing the semantics of the sentence
  • the first feature information of one or more sentences may be summed, averaged, or otherwise processed to obtain the second feature information of the query text paragraph
  • the query text paragraph includes M sentences, and the first feature information of the M sentences is s_1, s_2, ..., s_M;
  • s_1, s_2, ..., s_M can be summed, averaged, or otherwise processed to fuse into the second feature information P of the query text paragraph;
  • the second feature information P is a feature vector with the same dimensions as s_1, s_2, ..., s_M.
  • the disclosure does not limit the method for obtaining the second characteristic information of the query text paragraph.
  • the second feature information of the query text paragraph can be obtained by extracting the first feature information of each sentence in the query text paragraph, and the semantics of the query text paragraph can be accurately characterized by the second feature information.
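For illustration only (this is not part of the original disclosure), the sentence-to-paragraph fusion described above can be sketched in a few lines of Python; the `encode` stub and the choice of averaging are assumptions, since the patent leaves both the sentence encoder and the fusion operation open:

```python
import numpy as np

# Hypothetical sentence encoder stub; the text only requires that each
# sentence be mapped to a feature vector representing its semantics.
def encode(sentence: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(sentence)) % (2**32))
    return rng.normal(size=6)  # first feature information s_m

sentences = ["A man enters the room.", "He sits down.", "He reads a book."]
s = [encode(x) for x in sentences]   # s_1 ... s_M
P = np.mean(np.stack(s), axis=0)     # fuse (here: average) into the second
assert P.shape == s[0].shape         # feature information P, same dimension
```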
  • the fourth feature information of each video frame of the video may be obtained, and the third feature information of the video may be obtained according to the fourth feature information.
  • the method further includes: separately performing feature extraction processing on multiple video frames of the second video to obtain fourth feature information of the multiple video frames of the second video, where the second video is any one of the multiple videos; and determining the third feature information of the second video according to the fourth feature information of the multiple video frames of the second video.
  • feature extraction processing may be performed on multiple video frames of the second video separately to obtain fourth feature information of the multiple video frames of the second video.
  • feature extraction processing may be performed for each video frame in the second video, or one video frame may be selected for feature extraction processing every certain number of frames.
  • for example, one video frame may be selected for feature extraction processing out of every 6 video frames (that is, a video frame is selected every 5 video frames, and the feature information of the selected video frame among the 6 video frames is determined as the fourth feature information), or the feature information of the 6 video frames may be fused (for example, summed, averaged, or otherwise processed, that is, the feature information of the 6 video frames is merged into one, and the feature information obtained by fusing the feature information of the 6 video frames is determined as the fourth feature information), or the feature information of each video frame of the second video may be separately extracted as the fourth feature information.
  • the fourth feature information may be a feature vector that characterizes the feature information in the video frame.
  • the fourth feature information may characterize the feature information such as a person, clothing color, action, and scene in the video frame.
  • for example, a neural network may perform the feature extraction processing on the video frames; the present disclosure does not limit the method of extracting feature information from the video frames.
  • the fourth feature information of multiple video frames of the second video may be fused to obtain the third feature information of the second video.
  • the fourth feature information is a feature vector that characterizes the feature information in the video frame. Multiple fourth feature information may be summed, averaged, or otherwise processed to obtain the third feature information of the second video.
  • the third feature information may be a feature vector representing the feature information of the second video.
  • for example, the fourth feature information f_1, f_2, ..., f_T of T (T is a positive integer) video frames is obtained from the multiple video frames of the second video, and f_1, f_2, ..., f_T are summed, averaged, or otherwise processed to fuse into the third feature information V_i of the second video, 1 ≤ i ≤ N, where N is the number of videos in the video library.
  • the disclosure does not limit the method of obtaining the third characteristic information.
  • before step S11, feature extraction may be performed on all videos in the video library in advance to obtain the third feature information and fourth feature information of all videos in the video library.
  • when a new video is added to the video library, feature extraction may be performed on the new video to obtain the third feature information and fourth feature information of the new video.
  • the third feature information of the second video can be obtained by extracting the fourth feature information of the video frame in the second video, and the feature information of the second video can be accurately characterized by the third feature information.
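As a rough sketch (with an invented flattening stub standing in for a real frame feature extractor), the frame sampling and fusion just described might look like this:

```python
import numpy as np

def video_feature(frames, extract, step=6):
    """Select one frame out of every `step` frames, extract the fourth
    feature information f_1..f_T, and fuse (average) into the third
    feature information V_i of the video."""
    f = [extract(frame) for frame in frames[::step]]
    return np.mean(np.stack(f), axis=0)

frames = [np.random.rand(8, 8, 3) for _ in range(60)]  # toy video
extract = lambda frame: frame.reshape(-1)              # stand-in for a CNN
V = video_feature(frames, extract)                     # T = 10 sampled frames
```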
  • in step S111, according to the second feature information of the query text paragraph and the third feature information of the plurality of videos in the video library, a preselected video associated with the query text paragraph among the plurality of videos is determined.
  • determining the preselected video associated with the query text paragraph in the multiple videos according to the second feature information and the third feature information of multiple videos in the video library may include: according to the second feature The information and the third characteristic information of the multiple videos in the video library respectively determine a first correlation score between the query text paragraph and the multiple videos; and determine a preselected video among the multiple videos according to the first correlation score.
  • the second feature information may be a feature vector representing the semantics of the query text paragraph, and the third feature information may be a feature vector representing the feature information of the second video. The dimensions of the second feature information and the third feature information may be different, that is, the second feature information and the third feature information may not be in a vector space of the same dimension. Therefore, the second feature information and the third feature information may be processed so that the processed second feature information and third feature information are in a vector space of the same dimension.
  • the cosine similarity of the second feature vector and the third feature vector can be determined as the first correlation score between the query text paragraph and the first video, which accurately represents the correlation between the semantic content of the query text paragraph and the feature information of the first video.
  • the third feature information and the second feature information of the first video may be mapped to a vector space of the same dimension in a mapping manner.
  • in an example, the third feature information of the first video is a feature vector V_j (1 ≤ j ≤ N), and the second feature information of the query text paragraph is a feature vector P. P and V_j have different dimensions and can be mapped to a vector space of the same dimension to obtain the third feature vector Ṽ_j of the first video and the second feature vector P̃ of the query text paragraph.
  • a neural network may be used to map the third feature information and the second feature information to a vector space of the same dimension.
  • mapping the third feature information and the second feature information of the first video to a vector space of the same dimension, and obtaining the third feature vector of the first video and the second feature vector of the query text paragraph may include: using the first A neural network maps the third feature information to a third feature vector, and uses a second neural network to map the second feature information to a second feature vector.
  • the first neural network and the second neural network may be a back propagation (BP) neural network, a convolutional neural network, or a recurrent neural network, and the like.
  • in an example, the third feature information V_j has a dimension of 10 and the second feature information P has a dimension of 6. A vector space of the same dimension can be determined, for example, a vector space with a dimension of 8. The first neural network can be used to map the 10-dimensional third feature information V_j to the 8-dimensional vector space to obtain an 8-dimensional third feature vector Ṽ_j, and the second neural network can be used to map the 6-dimensional second feature information P to the 8-dimensional vector space to obtain an 8-dimensional second feature vector P̃. This disclosure does not limit the number of dimensions.
  • the cosine similarity between the second feature vector P̃ and the third feature vector Ṽ_j may be determined, and this cosine similarity is determined as the first correlation score S_t(V, P) between the query text paragraph and the first video.
  • the first neural network may be used to map the third feature information V_1, V_2, ..., V_N of each video in the video library to obtain the third feature vectors Ṽ_1, Ṽ_2, ..., Ṽ_N of all videos in the video library, and the cosine similarity between the second feature vector P̃ and the third feature vector of each video may be determined separately and used as the first relevance score between the query text paragraph and each video.
  • a preselected video among the plurality of videos may be determined according to the first correlation score. For example, a video with a first correlation score higher than a certain score threshold may be selected as a preselected video, or multiple videos may be sorted according to the first correlation score, and a predetermined number of videos in the sequence are selected as the preselected video.
  • the present disclosure does not limit the selection method and the number of preselected videos.
  • the first correlation score between the query text paragraph and the video can be determined through the second feature information and the third feature information, and the preselected video is selected according to the first correlation score, thereby improving the accuracy of the preselected video selection.
  • the preselected video can be processed without having to process all the videos in the video library, which saves computing overhead and improves processing efficiency.
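To make the preselection step concrete, here is a minimal sketch, not the patent's implementation: the single linear maps below stand in for the first and second neural networks, and the dimensions follow the 10/6/8 example above.

```python
import numpy as np

rng = np.random.default_rng(0)

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

W1 = rng.normal(size=(8, 10))  # "first neural network": V_j (10-d) -> 8-d
W2 = rng.normal(size=(8, 6))   # "second neural network": P (6-d) -> 8-d

videos = [rng.normal(size=10) for _ in range(5)]  # V_1..V_N (N = 5 here)
P = rng.normal(size=6)                            # paragraph feature

scores = [cos(W1 @ V, W2 @ P) for V in videos]    # first correlation scores
E = 3
preselected = np.argsort(scores)[::-1][:E]        # indices of the top-E videos
```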
  • the first neural network and the second neural network may be trained before the first neural network and the second neural network are used for mapping processing.
  • the method further includes: training the first neural network and the second neural network according to the third sample feature information of the sample video and the second sample feature information of the sample text paragraph.
  • the videos in the video library may be used as the sample videos, and the videos in other video libraries may also be used as the sample videos.
  • the disclosure does not limit the sample videos.
  • the fourth sample feature information of the video frame of the sample video may be extracted, and the third sample feature information of the sample video is determined according to the fourth sample feature information.
  • the third sample feature information of a plurality of sample videos may be input to a first neural network for mapping to obtain a third sample feature vector.
  • the second sample feature information of the sample text paragraph can be input to a second neural network to obtain a second sample feature vector.
  • the cosine similarity between the second sample feature vector and each third sample feature vector may be determined separately, and the first comprehensive network loss may be determined according to the cosine similarity.
  • the first comprehensive network loss may be determined according to the following formula (1):

    L_find = Σ_a Σ_{b≠a} max(0, S_t(V_b, P_a) − S_t(V_a, P_a) + α)   (1)
  • L find is the first comprehensive network loss
  • S_t(V_b, P_a) is the cosine similarity between the second sample feature vector of the a-th sample text paragraph and the third sample feature vector of the b-th sample video.
  • V a is the third sample feature information of the sample video corresponding to the a-th sample text paragraph
  • S_t(V_a, P_a) is the cosine similarity between the second sample feature vector of the a-th sample text paragraph and the third sample feature vector of the corresponding sample video.
  • a and b are both positive integers.
  • α is a set constant; in the example, α can be set to 0.2.
  • the first comprehensive network loss may be used to adjust the network parameter values of the first neural network and the second neural network.
  • the network parameter values of the first neural network and the second neural network are adjusted in a direction that minimizes the loss of the first comprehensive network, so that the adjusted first neural network and the second neural network have a higher fit. Goodness, while avoiding overfitting.
  • the disclosure does not limit the method of adjusting the network parameter values of the first neural network and the second neural network.
  • the steps of adjusting the network parameter values of the first neural network and the second neural network may be executed cyclically, adjusting the network parameter values of the first neural network and the second neural network successively in a manner that reduces or converges the first comprehensive network loss. The trained first neural network and second neural network may then be used in the process of mapping the third feature information of the first video and the second feature information of the query text paragraph.
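Under the reconstruction of formula (1) given above (a margin ranking loss, which is an assumption about the exact form), the loss computation can be sketched as follows; in practice the two mapping networks would be updated by gradient descent to reduce this value, as the surrounding text describes.

```python
import numpy as np

def find_loss(S, alpha=0.2):
    """S[a, b] = cosine similarity between the second sample feature vector
    of paragraph a and the third sample feature vector of video b; video a
    is the ground-truth match of paragraph a."""
    n = S.shape[0]
    return sum(max(0.0, S[a, b] - S[a, a] + alpha)
               for a in range(n) for b in range(n) if b != a)

S = np.array([[0.9, 0.2],
              [0.3, 0.8]])
print(find_loss(S))  # 0.0: each matched pair beats mismatches by > alpha
```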
  • FIG. 3 illustrates a flowchart of a video processing method according to an embodiment of the present disclosure. As shown in FIG. 3, step S12 includes:
  • in step S121, the target video in the preselected video is determined according to the first feature information of one or more sentences of the query text paragraph and the fourth feature information of the multiple video frames in the preselected video.
  • the correlation between the query text paragraph and each video in the preselected videos may be further determined according to the first feature information of one or more sentences and the fourth feature information of multiple video frames in the preselected video.
  • determining the target video in the preselected video according to the first feature information of one or more sentences and the fourth feature information of multiple video frames in the preselected video includes: determining a second correlation score between the query text paragraph and the preselected video according to the first feature information of the one or more sentences and the fourth feature information of the multiple video frames in the preselected video; and determining the target video in the preselected video according to the first correlation score and the second correlation score.
  • determining the second correlation score of the query text paragraph and the preselected video may include: mapping the fourth feature information of multiple video frames of the target preselected video and the first feature information of one or more sentences to a vector space of the same dimension, respectively obtaining the fourth feature vectors of the multiple video frames of the target preselected video and the first feature vectors of the one or more sentences, wherein the target preselected video is any one of the preselected videos; and determining the target feature vectors whose cosine similarity with the first feature vector of the target sentence is greater than or equal to the similarity threshold.
  • the second correlation score of the query text paragraph and the target preselected video can be determined according to the fourth feature vectors of multiple video frames of the target preselected video and the first feature vectors of one or more sentences, so that the relevance between the semantic content of the query text paragraph and the target preselected video can be accurately determined.
  • the fourth feature information of multiple video frames of the target preselected video has different dimensions from the first feature information of one or more sentences, and the fourth feature information and the first feature information may be mapped to a vector space of the same dimension.
  • the fourth feature information of multiple video frames of the target preselected video may be feature vectors f_1, f_2, ..., f_K (K is the number of video frames of the target preselected video, and K is a positive integer), and the first feature information of one or more sentences may be feature vectors s_1, s_2, ..., s_M (M is the number of sentences in the query text paragraph, and M is a positive integer). f_1, f_2, ..., f_K and s_1, s_2, ..., s_M may be mapped to a vector space of the same dimension to obtain the fourth feature vectors f̃_1, f̃_2, ..., f̃_K and the first feature vectors s̃_1, s̃_2, ..., s̃_M.
  • a neural network may be used to map the fourth feature information and the first feature information to a vector space of the same dimension.
  • mapping the fourth feature information of multiple video frames of the target preselected video and the first feature information of one or more sentences to a vector space of the same dimension, and obtaining the fourth feature vectors of the multiple video frames of the target preselected video and the first feature vectors of the one or more sentences, may include: using a third neural network to map the fourth feature information into the fourth feature vectors, and using a fourth neural network to map the first feature information into the first feature vectors.
  • the third neural network and the fourth neural network may be a BP neural network, a convolutional neural network, a recurrent neural network, or the like, and the disclosure does not limit the types of the third neural network and the fourth neural network.
  • in an example, the fourth feature information f_1, f_2, ..., f_K has a dimension of 10 and the first feature information s_1, s_2, ..., s_M has a dimension of 6. A vector space of the same dimension can be determined, for example, a vector space with a dimension of 8. The third neural network can be used to map the 10-dimensional fourth feature information f_1, f_2, ..., f_K to the 8-dimensional vector space to obtain 8-dimensional fourth feature vectors f̃_1, f̃_2, ..., f̃_K, and the fourth neural network can be used to map the 6-dimensional first feature information s_1, s_2, ..., s_M to the 8-dimensional vector space to obtain 8-dimensional first feature vectors s̃_1, s̃_2, ..., s̃_M. This disclosure does not limit the number of dimensions.
  • a target feature vector in which the cosine similarity between the fourth feature vector and the first feature vector of the target sentence is greater than or equal to a similarity threshold may be determined.
  • one sentence can be arbitrarily selected as the target sentence among the one or more sentences (for example, the y-th sentence is selected as the target sentence, 1 ≤ y ≤ M). The cosine similarity between each of the fourth feature vectors f̃_1, f̃_2, ..., f̃_K and the first feature vector s̃_y of the target sentence may be determined, and the fourth feature vectors whose cosine similarity with s̃_y is greater than or equal to the similarity threshold are determined as target feature vectors, for example f̃_h, f̃_u, ..., f̃_q, where 1 ≤ h ≤ K, 1 ≤ u ≤ K, 1 ≤ q ≤ K. The similarity threshold may be a preset threshold, such as 0.5; the disclosure does not limit the similarity threshold.
  • the video frames corresponding to the target feature vector may be aggregated into video fragments corresponding to the target sentence.
  • in an example, the fourth feature information may be feature vectors obtained by selecting one video frame for feature extraction processing out of every 6 video frames (that is, skipping 5 video frames each time) in the target preselected video, and the fourth feature vectors are the feature vectors obtained by mapping the fourth feature information. The video frames corresponding to each fourth feature vector may be the video frame used for extracting the fourth feature information together with the five video frames before or after it.
  • the video frames corresponding to all target feature vectors can be aggregated together to obtain a video segment corresponding to the target sentence; alternatively, the video frames corresponding to each group of adjacent target feature vectors can be aggregated separately to obtain the video segments corresponding to the target sentence.
  • the disclosure does not limit the video frames corresponding to the target feature vector.
  • in an example, the video segment corresponding to the feature vector of each sentence may be determined, and the corresponding position of the semantic content of each sentence in the target preselected video may be determined according to information such as the timestamps or frame numbers of the video frames included in the video segment corresponding to the feature vector of each sentence.
  • the fifth feature vector of the video segment corresponding to the target sentence is determined according to the target feature vector.
  • the target feature vectors can be summed, averaged, or otherwise processed to fuse into a fifth feature vector g_y.
  • the target sentence may have multiple corresponding video segments. For example, the target feature vectors may fall into several groups of adjacent target feature vectors, and each group of adjacent target feature vectors is fused into its own fifth feature vector, for example g_y1, g_y2, and g_y3. That is, each sentence may correspond to one or more fifth feature vectors. In an example, each fifth feature vector may correspond to one sentence.
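A compact sketch of the segment-building step just described: threshold the per-frame similarities, group adjacent hits into runs, and fuse each run into a fifth feature vector. The threshold value follows the 0.5 example above; everything else is illustrative.

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def segments_for_sentence(frame_vecs, sent_vec, thresh=0.5):
    """Return one fifth feature vector g per run of adjacent frames whose
    cosine similarity with the target sentence clears the threshold."""
    hits = [k for k, f in enumerate(frame_vecs) if cos(f, sent_vec) >= thresh]
    runs, cur = [], []
    for k in hits:                      # split hit indices into adjacent runs
        if cur and k != cur[-1] + 1:
            runs.append(cur)
            cur = []
        cur.append(k)
    if cur:
        runs.append(cur)
    return [np.mean([frame_vecs[k] for k in run], axis=0) for run in runs]
```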
  • the fifth feature vectors of the video segments corresponding to the one or more sentences and the first feature vectors of the one or more sentences may be used to determine the second correlation score of the query text paragraph and the target preselected video.
  • in an example, the first feature vectors of the multiple sentences are s̃_1, s̃_2, ..., s̃_M (M is a positive integer), and the fifth feature vectors of the multiple video segments are g_1, g_2, ..., g_W (W is a positive integer). The fifth feature vectors corresponding to the first feature vector s̃_1 are g_1, g_2, ..., g_O (O is the number of fifth feature vectors corresponding to s̃_1, where O is a positive integer less than W); the fifth feature vectors corresponding to s̃_2 are g_{O+1}, g_{O+2}, ..., g_V (V is a positive integer less than W and greater than O); and so on, until the fifth feature vector corresponding to s̃_M is g_Z (Z ≤ W).
  • the second correlation score of the query text paragraph and the target preselected video may be determined according to the following formula (2):

    S_p(V, P) = max_x Σ_i Σ_j x_ij · r_ij   (2)

    where the maximum is taken over x_ij ∈ {0, 1}, with each video segment corresponding to at most one sentence and at most u_max video segments selected
  • x_ij represents whether the i-th sentence corresponds to the j-th video segment;
  • u_max is the preset number of selected video segments, 1 ≤ u_max ≤ W;
  • r_ij is the cosine similarity between the first feature vector of the i-th sentence and the fifth feature vector of the j-th video segment.
  • the third correlation score S_r(V, P) of the query text paragraph and the target preselected video may be determined according to the first relevance score S_t(V, P) and the second relevance score S_p(V, P) of the query text paragraph and the target preselected video; for example, the product of the first correlation score and the second correlation score is determined as the third correlation score, that is, S_r(V, P) = S_t(V, P) · S_p(V, P). In this way, the third correlation score of the query text paragraph and each preselected video may be determined, and the target video is determined in the preselected videos according to the third correlation score.
  • the preselected videos can be sorted based on the third relevance score of the query text paragraph and each preselected video, and a predetermined number of videos in the sorted sequence can be selected as the target video, or the videos with a third relevance score greater than or equal to a certain threshold can be selected as the target video.
  • the disclosure does not limit the method of selecting the target video.
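The part-level scoring of formula (2) amounts to choosing the correspondence x that maximizes the summed similarities. Below is a greedy sketch (an exact solver would enumerate assignments or use bipartite matching; greedy is a simplification, and the small numbers are made up):

```python
import numpy as np

def part_score(r, u_max):
    """Greedy approximation of S_p: repeatedly pick the highest remaining
    r[i][j], use each video segment at most once, stop after u_max picks."""
    r = np.array(r, dtype=float)
    total = 0.0
    for _ in range(u_max):
        i, j = np.unravel_index(np.argmax(r), r.shape)
        if not np.isfinite(r[i, j]) or r[i, j] <= 0:
            break
        total += r[i, j]
        r[:, j] = -np.inf        # this segment may not be matched again
    return total

S_t = 0.8                         # first correlation score (Find stage)
r = [[0.9, 0.1],                  # r[i][j]: sentence i vs. segment j
     [0.2, 0.6]]
S_p = part_score(r, u_max=2)      # 0.9 + 0.6 = 1.5
S_r = S_t * S_p                   # third correlation score used for ranking
```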
  • the third neural network and the fourth neural network may be trained before the third neural network and the fourth neural network are used for mapping processing.
  • the method further includes training a third neural network and a fourth neural network according to the fourth sample feature information of the plurality of video frames in the sample video and the first sample feature information of one or more sentences of the sample text paragraph.
  • the videos in the video library may be used as the sample videos, and the videos in other video libraries may also be used as the sample videos.
  • the disclosure does not limit the sample videos.
  • the fourth sample feature information of the video frames of the sample video may be extracted, and any query text paragraph may be input as a sample text paragraph.
  • the sample text paragraph may include one or more sentences, and the first sample feature information of each sentence may be extracted.
  • the fourth sample feature information of multiple video frames of the sample video may be input to a third neural network to obtain a fourth sample feature vector.
  • the first sample feature information of one or more sentences of the sample text paragraph can be input into the fourth neural network to obtain the first sample feature vector.
  • a target sample feature vector whose cosine similarity with the first target sample feature vector is greater than or equal to a similarity threshold may be determined, where the first target sample feature vector is any one of the first sample feature vectors. Further, the target sample feature vectors may be fused into a fifth sample feature vector corresponding to the first target sample feature vector. In an example, the fifth sample feature vector corresponding to each first sample feature vector may be determined separately.
  • the cosine similarity between each fifth sample feature vector and the first sample feature vector can be determined separately, and the second comprehensive network loss is determined according to the cosine similarity.
  • the second comprehensive network loss may be determined according to the following formula (3):

    L_ref = Σ_{d: g_d ≠ g+} max(0, S(g_d, s) − S(g+, s) + α)   (3)

    where S(·, ·) denotes cosine similarity and s is the first target sample feature vector
  • L_ref is the second comprehensive network loss;
  • g_d is the d-th fifth sample feature vector;
  • g+ is the fifth sample feature vector corresponding to the first target sample feature vector;
  • α is a set constant; in the example, α can be set to 0.1.
  • the second comprehensive network loss may be used to adjust the network parameter values of the third neural network and the fourth neural network.
  • the network parameter values of the third neural network and the fourth neural network are adjusted in a direction that minimizes the loss of the second comprehensive network, so that the adjusted third neural network and the fourth neural network have a higher fit. Goodness, while avoiding overfitting.
  • the disclosure does not limit the method of adjusting the network parameter values of the third neural network and the fourth neural network.
  • the steps of adjusting the network parameter values of the third neural network and the fourth neural network may be executed cyclically, adjusting the network parameter values of the third neural network and the fourth neural network successively in a manner that reduces or converges the second comprehensive network loss. The trained third neural network and fourth neural network may then be used in the process of mapping the fourth feature information of multiple video frames of the target preselected video and the first feature information of one or more sentences.
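Similarly, under the reconstruction of formula (3) above (again a margin form, which is an assumption about the exact expression), the segment-level loss can be sketched as:

```python
def ref_loss(sims, pos, alpha=0.1):
    """sims[d] = cosine similarity between the d-th fifth sample feature
    vector g_d and the first target sample feature vector; `pos` indexes
    the ground-truth segment g+."""
    return sum(max(0.0, s - sims[pos] + alpha)
               for d, s in enumerate(sims) if d != pos)

print(ref_loss([0.3, 0.9, 0.5], pos=1))  # 0.0: g+ wins by more than alpha
```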
  • FIG. 4 illustrates a flowchart of a video processing method according to an embodiment of the present disclosure.
  • as shown in FIG. 4, the preselected video may be determined in step S111 according to the second feature information of the query text paragraph and the third feature information of the videos, and the target video may then be determined from the preselected videos in step S121 according to the first feature information of one or more sentences of the query text paragraph and the fourth feature information.
  • FIG. 5 shows an application diagram of a video processing method according to an embodiment of the present disclosure.
  • the video library may include N videos, and fourth feature information of multiple video frames of each video may be obtained separately, and third feature information of each video may be obtained according to the fourth feature information.
  • a query text paragraph may be input, and the query text paragraph may include one or more sentences.
  • the first feature information of each sentence may be extracted, and the second feature information of the query text paragraph may be determined according to the first feature information.
  • the dimensions of the third feature information and the second feature information may be different.
  • the third feature information may be mapped to a third feature vector through the first neural network, and the second feature information may be mapped to a second feature vector through the second neural network.
  • the third feature vector and the second feature vector are in a vector space of the same dimension.
  • the cosine similarity between the second feature vector of the query text paragraph and the third feature vector of each video may be determined, and the cosine similarity is determined as the first correlation score of the query text paragraph with each video.
  • the videos in the video library can be sorted according to the first correlation score, as shown in the video library on the left in FIG. 5. For example, the video sequence obtained by sorting the videos in the video library according to the first correlation score is video 1, video 2, video 3, ..., video N, and the first E (1 ≤ E ≤ N) videos are selected from the video sequence as the preselected videos.
  • the third neural network may be used to map the fourth feature information of the preselected videos to fourth feature vectors, and the fourth neural network may be used to map the first feature information of the one or more sentences of the query text paragraph to first feature vectors.
  • the fourth feature vector is in a vector space of the same dimension as the first feature vector.
  • a fourth feature vector whose cosine similarity to the first feature vector of the target sentence is greater than or equal to the similarity threshold may be determined as a target feature vector, and the video frames of the target preselected video corresponding to the target feature vectors may be aggregated into video segments; the target feature vectors can also be fused into fifth feature vectors, and the second correlation score of the query text paragraph and the target preselected video can be determined by formula (2). Further, the second relevance score of the query text paragraph and each preselected video may be determined.
  • the first correlation score of the query text paragraph and the preselected video and the second correlation score of the query text paragraph and the preselected video may be multiplied to obtain the third correlation score between the query text paragraph and the preselected video, and the E preselected videos may be sorted according to the third correlation score, as shown in the video library on the right in FIG. 5. For example, the video sequence obtained by sorting the E preselected videos according to the third correlation score is video 3, video 5, video 8, .... After this sorting, video 3 is the video with the highest third correlation score, that is, the video with the highest relevance to the semantic content of the query text paragraph, followed by video 5, video 8, and so on. Video 3 may be selected as the target video, or the first X (X ≤ E) videos may be selected as target videos.
  • in this way, the cosine similarity between the second feature vector of the query text paragraph and the third feature vector of the video is determined as the first correlation score between the query text paragraph and the video, which can accurately determine the correlation between the semantic content of the query text paragraph and the feature information of the video, so as to accurately select the preselected videos. In subsequent processing, only the preselected videos need to be processed instead of all the videos in the video library, which saves computing overhead and improves processing efficiency.
  • the second correlation score of the query text paragraph and the target preselected video may be determined according to the fourth feature vectors of multiple video frames of the target preselected video and the first feature vectors of one or more sentences, and the target video is determined according to the second correlation score and the first correlation score. Videos can thus be retrieved based on their correlation with the query text paragraph, the target video can be accurately found, redundant query results can be avoided, and query text paragraphs in natural language form can be processed without being limited by the inherent content of content tags.
  • the present disclosure also provides a video processing device, an electronic device, a computer-readable storage medium, and a program, all of which can be used to implement any one of the video processing methods provided by the present disclosure.
  • FIG. 6 illustrates a block diagram of a video processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 6, the device includes:
  • the preselected video determining module 11 is configured to determine a preselected video associated with the query text paragraph among the multiple videos according to the paragraph information of the query text paragraph and the video information of multiple videos in the video library;
  • the target video determining module 12 is configured to determine the target video in the preselected video according to the video frame information of the preselected video and the sentence information of the query text paragraph.
  • the sentence information includes first feature information of one or more sentences of the query text paragraph, the paragraph information includes second feature information of the query text paragraphs, the video frame information includes fourth feature information of a plurality of video frames of the video, and the video information includes video Third characteristic information.
  • the preselected video determining module is further configured to: determine, according to the second feature information of the query text paragraph and the third feature information of the multiple videos in the video library, a preselected video associated with the query text paragraph among the multiple videos.
  • the apparatus further includes:
  • Sentence feature extraction module configured to perform feature extraction processing on one or more sentences of a query text paragraph to obtain first feature information of one or more sentences;
  • the second determining module is configured to determine the second feature information of the query text paragraph according to the first feature information of one or more sentences in the query text paragraph.
  • the apparatus further includes:
  • a video feature extraction module configured to separately perform feature extraction processing on multiple video frames of the second video to obtain fourth feature information of the multiple video frames of the second video, where the second video is any one of the multiple videos;
  • the first determining module is configured to determine third feature information of the second video according to fourth feature information of multiple video frames of the second video.
  • the preselected video determining module is further configured to: determine a first correlation score between the query text paragraph and the multiple videos according to the second feature information and the third feature information of the multiple videos in the video library, and determine a preselected video among the multiple videos according to the first correlation score.
  • the preselected video determining module is further configured to: map the third feature information of the first video and the second feature information to a vector space of the same dimension to obtain the third feature vector and the second feature vector, and determine the cosine similarity of the second feature vector and the third feature vector as the first correlation score between the query text paragraph and the first video.
  • the target video determination module is further configured to:
  • the fourth feature information of multiple video frames of the target preselected video and the first feature information of one or more sentences are mapped to a vector space of the same dimension to respectively obtain the fourth feature vectors of the multiple video frames of the target preselected video and the first feature vectors of the one or more sentences, wherein the target preselected video is any one of the preselected videos;
  • the second correlation score of the query text paragraph and the target preselected video is determined.
  • the target video determining module is further configured to: determine the product of the first correlation score and the second correlation score as a third correlation score, and determine a target video among the preselected videos according to the third correlation score.
  • the functions of the apparatus provided in the embodiments of the present disclosure, or the modules it includes, may be used to execute the method described in the foregoing method embodiments.
  • An embodiment of the present disclosure also provides a computer-readable storage medium having computer program instructions stored thereon.
  • the computer program instructions implement the above method when executed by a processor.
  • the computer-readable storage medium may be a non-volatile computer-readable storage medium.
  • An embodiment of the present disclosure further provides an electronic device, including: a processor; and a memory for storing processor-executable instructions, wherein the processor is configured to perform the foregoing method.
  • the electronic device may be provided as a terminal, a server, or other forms of devices.
  • FIG. 7 is a block diagram of an electronic device 800 according to an exemplary embodiment.
  • the electronic device 800 may be a terminal such as a mobile phone, a computer, a digital broadcasting terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, and the like.
  • the electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input / output (I / O) interface 812, and a sensor component 814 , And communication component 816.
  • the processing component 802 generally controls overall operations of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations.
  • the processing component 802 may include one or more processors 820 to execute instructions to complete all or part of the steps of the method described above.
  • the processing component 802 may include one or more modules to facilitate the interaction between the processing component 802 and other components.
  • the processing component 802 may include a multimedia module to facilitate the interaction between the multimedia component 808 and the processing component 802.
  • the memory 804 is configured to store various types of data to support operation at the electronic device 800. Examples of such data include instructions for any application or method for operating on the electronic device 800, contact data, phone book data, messages, pictures, videos, and the like.
  • the memory 804 may be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk.
  • the power component 806 provides power to various components of the electronic device 800.
  • the power component 806 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the electronic device 800.
  • the multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and a user.
  • the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user.
  • the touch panel includes one or more touch sensors to sense touch, swipe, and gestures on the touch panel. The touch sensor can not only sense the boundary of a touch or slide action, but also detect duration and pressure related to the touch or slide operation.
  • the multimedia component 808 includes a front camera and/or a rear camera. When the electronic device 800 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera may use a fixed optical lens system or have focus and optical zoom capability.
  • the audio component 810 is configured to output and / or input audio signals.
  • the audio component 810 includes a microphone (MIC) that is configured to receive an external audio signal when the electronic device 800 is in an operation mode, such as a call mode, a recording mode, or a speech recognition mode.
  • the received audio signal may be further stored in the memory 804 or transmitted via the communication component 816.
  • the audio component 810 further includes a speaker for outputting audio signals.
  • the I / O interface 812 provides an interface between the processing component 802 and a peripheral interface module.
  • the peripheral interface module may be a keyboard, a click wheel, a button, or the like. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
  • the sensor component 814 includes one or more sensors for providing various aspects of the state evaluation of the electronic device 800.
  • the sensor component 814 can detect the on/off state of the electronic device 800 and the relative positioning of components, for example the display and keypad of the electronic device 800; the sensor component 814 can also detect a change in position of the electronic device 800 or of a component of the electronic device 800, the presence or absence of user contact with the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and a change in the temperature of the electronic device 800.
  • the sensor component 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact.
  • the sensor component 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications.
  • the sensor component 814 may further include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
  • the communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices.
  • the electronic device 800 can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof.
  • the communication component 816 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel.
  • the communication component 816 also includes a near field communication (NFC) module to facilitate short-range communication.
  • the NFC module can be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
  • the electronic device 800 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, to perform the above method.
  • In an exemplary embodiment, a non-volatile computer-readable storage medium is also provided, such as the memory 804 including computer program instructions, which may be executed by the processor 820 of the electronic device 800 to complete the above method.
  • Fig. 8 is a block diagram of an electronic device 1900 according to an exemplary embodiment.
  • the electronic device 1900 may be provided as a server.
  • the electronic device 1900 includes a processing component 1922, which further includes one or more processors, and a memory resource represented by a memory 1932, for storing instructions executable by the processing component 1922, such as an application program.
  • the application program stored in the memory 1932 may include one or more modules each corresponding to a set of instructions.
  • the processing component 1922 is configured to execute instructions to perform the method described above.
  • the electronic device 1900 may further include a power supply component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958.
  • the electronic device 1900 can operate based on an operating system stored in the memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
  • a non-volatile computer-readable storage medium is also provided, such as a memory 1932 including computer program instructions, which can be executed by the processing component 1922 of the electronic device 1900 to complete the above method.
  • the present disclosure may be a system, method, and / or computer program product.
  • the computer program product may include a computer-readable storage medium having computer-readable program instructions for causing a processor to implement various aspects of the present disclosure.
  • the computer-readable storage medium may be a tangible device that can hold and store instructions used by the instruction execution device.
  • the computer-readable storage medium may be, for example, but not limited to, an electric storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • A non-exhaustive list of computer-readable storage media includes: a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital versatile disc (DVD), memory stick, floppy disk, a mechanical encoding device such as a punched card or a raised structure in a groove having instructions stored thereon, and any suitable combination of the foregoing.
  • Computer-readable storage media used herein are not to be interpreted as transient signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (for example, light pulses through fiber-optic cables), or electrical signals transmitted through a wire.
  • the computer-readable program instructions described herein can be downloaded from a computer-readable storage medium to various computing / processing devices, or downloaded to an external computer or external storage device via a network, such as the Internet, a local area network, a wide area network, and / or a wireless network.
  • the network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and / or edge servers.
  • the network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in each computing/processing device.
  • Computer program instructions for performing the operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-related instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages, such as Smalltalk and C++, and conventional procedural programming languages, such as the "C" language or similar programming languages.
  • Computer-readable program instructions may execute entirely on a user's computer, partly on a user's computer, as a stand-alone software package, partly on a user's computer and partly on a remote computer, or entirely on a remote computer or server.
  • the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, through the Internet by using an Internet service provider).
  • an electronic circuit, such as a programmable logic circuit, a field-programmable gate array (FPGA), or a programmable logic array (PLA), can be personalized with the state information of the computer-readable program instructions, and the electronic circuit can execute the computer-readable program instructions to implement various aspects of the present disclosure.
  • These computer-readable program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
  • These computer-readable program instructions may also be stored in a computer-readable storage medium; these instructions cause a computer, a programmable data processing apparatus, and/or other devices to work in a specific manner, so that the computer-readable medium storing the instructions includes an article of manufacture that includes instructions implementing various aspects of the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
  • Computer-readable program instructions can also be loaded onto a computer, another programmable data processing apparatus, or another device, so that a series of operating steps are performed on the computer, other programmable data processing apparatus, or other device to produce a computer-implemented process, such that the instructions executed on the computer, other programmable data processing apparatus, or other device implement the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
  • each block in the flowcharts or block diagrams may represent a module, program segment, or part of an instruction, which contains one or more executable instructions for implementing the specified logical function.
  • in some alternative implementations, the functions marked in the blocks may occur in an order different from that marked in the drawings. For example, two consecutive blocks may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved.


Abstract

The present disclosure relates to a video processing method and apparatus, an electronic device, and a storage medium. The method includes: determining, according to paragraph information of a query text paragraph and video information of a plurality of videos in a video library, a preselected video associated with the query text paragraph among the plurality of videos; and determining a target video among the preselected videos according to video frame information of the preselected video and sentence information of the query text paragraph. According to the video processing method of the embodiments of the present disclosure, videos can be retrieved through their relevance to the query text paragraph, the target video can be located accurately, redundant query results are avoided, and query text paragraphs in natural-language form can be processed without being limited by the inherent content of content tags.

Description

Video processing method and apparatus, electronic device, and storage medium
The present disclosure claims priority to Chinese Patent Application No. 201810892997.4, entitled "Video processing method and apparatus, electronic device, and storage medium" and filed with the Chinese Patent Office on August 7, 2018, the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the field of computer technology, and in particular to a video processing method and apparatus, an electronic device, and a storage medium.
Background
In the related art, querying or retrieving videos in a video library through sentences usually requires content tags to be defined in advance for the videos in the library, with the videos then retrieved through the tags. For some videos it is difficult to define content tags, and content tags are not extensible, so video content not covered by the tags is hard to retrieve. In addition, the content tags of different videos may be duplicated, which may cause redundancy in the retrieval results. Content tags are therefore ill-suited to handling retrieval requests expressed in natural language.
Summary
The present disclosure provides a video processing method and apparatus, an electronic device, and a storage medium.
According to one aspect of the present disclosure, a video processing method is provided, including: determining, according to paragraph information of a query text paragraph and video information of a plurality of videos in a video library, a preselected video associated with the query text paragraph among the plurality of videos; and determining a target video among the preselected videos according to video frame information of the preselected video and sentence information of the query text paragraph.
According to the video processing method of the embodiments of the present disclosure, a preselected video is determined through the paragraph information of the query text paragraph and the video information of the videos, and the target video is determined according to the sentence information of the query text paragraph and the video frame information of the preselected video. Videos can thus be retrieved through their relevance to the query text paragraph, the target video can be located accurately, redundant query results are avoided, and query text paragraphs in natural-language form can be processed without being limited by the inherent content of content tags.
According to another aspect of the present disclosure, a video processing apparatus is provided, including: a preselected video determination module, configured to determine, according to paragraph information of a query text paragraph and video information of a plurality of videos in a video library, a preselected video associated with the query text paragraph among the plurality of videos; and a target video determination module, configured to determine a target video among the preselected videos according to video frame information of the preselected video and sentence information of the query text paragraph.
According to another aspect of the present disclosure, an electronic device is provided, including: a processor; and a memory for storing processor-executable instructions, wherein the processor is configured to execute the above video processing method.
According to another aspect of the present disclosure, a computer-readable storage medium is provided, having computer program instructions stored thereon, wherein the computer program instructions implement the above video processing method when executed by a processor.
It should be understood that the above general description and the following detailed description are merely exemplary and explanatory, and do not limit the present disclosure.
Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments with reference to the accompanying drawings.
Brief Description of the Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the technical solutions of the present disclosure.
Fig. 1 shows a flowchart of a video processing method according to an embodiment of the present disclosure;
Fig. 2 shows a flowchart of a video processing method according to an embodiment of the present disclosure;
Fig. 3 shows a flowchart of a video processing method according to an embodiment of the present disclosure;
Fig. 4 shows a flowchart of a video processing method according to an embodiment of the present disclosure;
Fig. 5 shows a schematic diagram of an application of a video processing method according to an embodiment of the present disclosure;
Fig. 6 shows a block diagram of a video processing apparatus according to an embodiment of the present disclosure;
Fig. 7 shows a block diagram of an electronic device according to an embodiment of the present disclosure;
Fig. 8 shows a block diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
Various exemplary embodiments, features, and aspects of the present disclosure are described in detail below with reference to the accompanying drawings. The same reference numerals in the drawings denote elements with the same or similar functions. Although various aspects of the embodiments are shown in the drawings, the drawings are not necessarily drawn to scale unless otherwise indicated.
The word "exemplary" used herein means "serving as an example, embodiment, or illustration". Any embodiment described herein as "exemplary" is not necessarily to be construed as superior to or better than other embodiments.
The term "and/or" herein merely describes an association relationship between associated objects, indicating that three relationships may exist; for example, A and/or B may indicate the three cases where A exists alone, both A and B exist, and B exists alone. In addition, the term "at least one" herein indicates any one of a plurality, or any combination of at least two of a plurality; for example, including at least one of A, B, and C may indicate including any one or more elements selected from the set consisting of A, B, and C.
In addition, numerous specific details are given in the following detailed description to better explain the present disclosure. Those skilled in the art should understand that the present disclosure can also be implemented without certain specific details. In some instances, methods, means, elements, and circuits well known to those skilled in the art are not described in detail, so as to highlight the gist of the present disclosure.
Fig. 1 shows a flowchart of a video processing method according to an embodiment of the present disclosure. As shown in Fig. 1, the video processing method includes:
In step S11, a preselected video associated with a query text paragraph is determined among a plurality of videos according to paragraph information of the query text paragraph and video information of the plurality of videos in a video library;
In step S12, a target video is determined among the preselected videos according to video frame information of the preselected video and sentence information of the query text paragraph.
According to the video processing method of the embodiments of the present disclosure, a preselected video is determined through the paragraph information of the query text paragraph and the video information of the videos, and the target video is determined according to the sentence information of the query text paragraph and the video frame information of the preselected video. Videos can thus be retrieved through their relevance to the query text paragraph, the target video can be located accurately, redundant query results are avoided, and query text paragraphs in natural-language form can be processed without being limited by the inherent content of content tags.
In a possible implementation, the video processing method may be executed by a terminal device, a server, or another processing device, where the terminal device may be user equipment (UE), a mobile device, a user terminal, a terminal, a cellular phone, a cordless phone, a personal digital assistant (PDA), a handheld device, a computing device, an in-vehicle device, a wearable device, etc. In some possible implementations, the video processing method may be implemented by a processor invoking computer-readable instructions stored in a memory. When retrieving or querying videos in a database, a query text paragraph including one or more sentences may be input, and the video in the database whose content is closest to that described by the query text paragraph is found.
In a possible implementation, the sentence information includes first feature information of the one or more sentences of the query text paragraph, the paragraph information includes second feature information of the query text paragraph, the video frame information includes fourth feature information of a plurality of video frames of a video, and the video information includes third feature information of the video.
In a possible implementation, the first feature information of the one or more sentences of the query text paragraph may be acquired, and the second feature information of the query text paragraph determined from it. The first feature information of a sentence may be a feature vector representing the semantics of the sentence. The method further includes: performing feature extraction processing on the one or more sentences of the query text paragraph respectively to obtain the first feature information of the one or more sentences; and determining the second feature information of the query text paragraph according to the first feature information of the one or more sentences in the query text paragraph.
In a possible implementation, feature extraction may be performed on the content of the one or more sentences through semantic recognition or similar methods to obtain the first feature information of the one or more sentences. For example, semantic recognition may be performed on the content of the one or more sentences through a neural network, so as to extract features from the content and obtain the first feature information. The present disclosure does not limit the feature extraction method for sentence content.
In a possible implementation, the first feature information may be a feature vector representing the semantics of a sentence; the first feature information of the one or more sentences in the query text paragraph may be fused to obtain the second feature information of the query text paragraph, which may be a feature vector representing the semantics of the query text paragraph. In an example, the first feature information of the one or more sentences may be summed, averaged, or otherwise processed to obtain the second feature information. For example, if the query text paragraph includes M sentences whose first feature information is s_1, s_2, ..., s_M, then s_1, s_2, ..., s_M may be summed, averaged, or otherwise processed and fused into the second feature information P of the query text paragraph, where P is a feature vector of the same dimension as s_1, s_2, ..., s_M. The present disclosure does not limit the method for obtaining the second feature information of the query text paragraph.
In this way, the second feature information of the query text paragraph can be obtained by extracting the first feature information of each sentence in the query text paragraph, and the semantics of the query text paragraph can be accurately represented by the second feature information.
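A minimal sketch of this fusion step follows. The disclosure leaves the sentence feature extractor unspecified, so the hypothetical encode_sentence below stands in for any semantic encoder; only the averaging fusion is shown, which is one of the options named above.

```python
import numpy as np

def paragraph_feature(sentence_features):
    """Fuse per-sentence feature vectors s_1..s_M into one paragraph vector P.

    The disclosure allows summing, averaging, or other fusion; this sketch
    averages, which keeps P in the same dimension as each s_m.
    """
    return np.mean(np.stack(sentence_features), axis=0)

# Hypothetical usage (encode_sentence is an assumed stand-in, not named in
# the disclosure):
# P = paragraph_feature([encode_sentence(s) for s in sentences])
```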
In a possible implementation, the fourth feature information of each video frame of a video may be acquired, and the third feature information of the video obtained according to the fourth feature information. The method further includes: performing feature extraction processing on a plurality of video frames of a second video respectively to obtain fourth feature information of the plurality of video frames of the second video, the second video being any one of the plurality of videos; and determining third feature information of the second video according to the fourth feature information of the plurality of video frames of the second video.
In a possible implementation, feature extraction processing may be performed on the plurality of video frames of the second video respectively to obtain their fourth feature information. In an example, feature extraction may be performed on every frame of the second video, or one frame may be selected every certain number of frames for feature extraction. In an example, one frame may be selected every 5 frames (i.e., out of every 6 frames) for feature extraction (that is, the feature information of the selected frame among the 6 frames is determined as the fourth feature information), or the feature information of those 6 frames may be fused (for example, summed, averaged, or otherwise processed, i.e., the feature information of the 6 frames is fused into one, and the fused feature information is determined as the fourth feature information); alternatively, the feature information of every frame of the second video may be extracted separately as the fourth feature information. In an example, the fourth feature information may be a feature vector representing the feature information in a video frame, such as persons, clothing colors, actions, and scenes; feature extraction may be performed on the video frames through a convolutional neural network. The present disclosure does not limit the method of extracting feature information from video frames.
In a possible implementation, the fourth feature information of the plurality of video frames of the second video may be fused to obtain the third feature information of the second video. In an example, the fourth feature information is a feature vector representing the feature information in a video frame; a plurality of pieces of fourth feature information may be summed, averaged, or otherwise processed to obtain the third feature information, which may be a feature vector representing the feature information of the second video. For example, if the fourth feature information f_1, f_2, ..., f_T of T video frames (T a positive integer) has been acquired from the second video, f_1, f_2, ..., f_T may be summed, averaged, or otherwise processed and fused into the third feature information V_i of the second video, 1 ≤ i ≤ N, where N is the number of videos in the video library. The present disclosure does not limit the method of obtaining the third feature information.
In a possible implementation, before step S11 is performed, feature extraction may be performed in advance on all videos in the video library to obtain the third feature information and fourth feature information of all videos. When a new video is added to the library, feature extraction may be performed on the new video to obtain its third feature information and fourth feature information.
In this way, the third feature information of the second video can be obtained by extracting the fourth feature information of the video frames of the second video, and the feature information of the second video can be accurately represented by the third feature information.
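A minimal sketch of the frame-level step, assuming per-frame vectors have already been produced by some backbone (the disclosure mentions a convolutional neural network without naming one); the 6-frame grouping and averaging follow the example above.

```python
import numpy as np

def frame_features(per_frame, stride=6):
    """Fourth feature information: one fused vector per group of `stride` frames.

    per_frame: (T, D) array, one precomputed feature vector per video frame.
    Each group of 6 consecutive frames is fused by averaging, one of the
    options in the text (selecting a single frame per group also qualifies).
    """
    groups = [per_frame[i:i + stride] for i in range(0, len(per_frame), stride)]
    return np.stack([g.mean(axis=0) for g in groups])

def video_feature(fourth):
    """Third feature information V_i: fuse the frame-level vectors f_1..f_T."""
    return fourth.mean(axis=0)
```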
Fig. 2 shows a flowchart of a video processing method according to an embodiment of the present disclosure. As shown in Fig. 2, step S11 includes:
In step S111, a preselected video associated with the query text paragraph is determined among the plurality of videos according to the second feature information of the query text paragraph and the third feature information of the plurality of videos in the video library.
In a possible implementation, determining, according to the second feature information and the third feature information of the plurality of videos in the video library, the preselected video associated with the query text paragraph among the plurality of videos may include: determining first relevance scores between the query text paragraph and the plurality of videos respectively according to the second feature information and the third feature information of the plurality of videos in the video library; and determining the preselected video among the plurality of videos according to the first relevance scores.
In a possible implementation, the second feature information may be a feature vector representing the semantics of the query text paragraph, and the third feature information may be a feature vector representing the feature information of a video. The dimensions of the second and third feature information may differ, i.e., they may not lie in vector spaces of the same dimension; therefore, the second and third feature information may be processed so that, after processing, they lie in a vector space of the same dimension.
In a possible implementation, determining the first relevance scores between the query text paragraph and the plurality of videos respectively according to the second feature information and the third feature information of the plurality of videos in the video library may include: mapping the third feature information of a first video and the second feature information into a vector space of the same dimension to obtain a third feature vector of the first video and a second feature vector of the query text paragraph, the first video being any one of the plurality of videos; and determining the cosine similarity between the second feature vector and the third feature vector as the first relevance score between the query text paragraph and the first video.
In this way, the cosine similarity between the second feature vector and the third feature vector can be determined as the first relevance score between the query text paragraph and the first video, and the relevance between the semantic content of the query text paragraph and the feature information of the first video can be determined accurately.
In a possible implementation, mapping may be used to map the third feature information of the first video and the second feature information into a vector space of the same dimension. In an example, the third feature information of the first video is a feature vector V_j (1 ≤ j ≤ N) and the second feature information of the query text paragraph is a feature vector P; P and V_j have different dimensions and may be mapped into a vector space of the same dimension, obtaining the third feature vector Ṽ_j of the first video and the second feature vector P̃ of the query text paragraph.
In a possible implementation, neural networks may be used to map the third feature information and the second feature information into the same-dimensional vector space. In an example, mapping the third feature information of the first video and the second feature information into a vector space of the same dimension to obtain the third feature vector of the first video and the second feature vector of the query text paragraph may include: mapping the third feature information into the third feature vector using a first neural network, and mapping the second feature information into the second feature vector using a second neural network.
In an example, the first neural network and the second neural network may be back-propagation (BP) neural networks, convolutional neural networks, recurrent neural networks, or the like; the present disclosure does not limit the types of the first and second neural networks. For example, if the dimension of the third feature information V_j is 10 and the dimension of the second feature information P is 6, a vector space of a common dimension, say 8, may be determined: the first neural network maps the 10-dimensional V_j into the 8-dimensional third feature vector Ṽ_j, and the second neural network maps the 6-dimensional P into the 8-dimensional second feature vector P̃. The present disclosure does not limit the dimensions.
In a possible implementation, the cosine similarity between the second feature vector P̃ and the third feature vector Ṽ_j may be determined, and this cosine similarity determined as the first relevance score S_t(V, P) between the query text paragraph and the first video.
In a possible implementation, the first neural network may be used to map the third feature information V_1, V_2, ..., V_N of every video in the video library, obtaining the third feature vectors Ṽ_1, Ṽ_2, ..., Ṽ_N of all videos; the cosine similarity between the second feature vector P̃ and the third feature vector of each video is determined respectively and taken as the first relevance score between the query text paragraph and that video. The preselected videos among the plurality of videos may be determined according to the first relevance scores. For example, videos whose first relevance score is above a score threshold may be selected as the preselected videos, or the plurality of videos may be sorted by first relevance score and a predetermined number of videos in the sequence selected as the preselected videos. The present disclosure does not limit the manner of selection or the number of preselected videos.
In this way, the first relevance score between the query text paragraph and a video can be determined through the second feature information and the third feature information, and the preselected videos selected according to the first relevance scores, which improves the accuracy of preselected-video selection; moreover, after the preselected videos are selected, processing can be restricted to the preselected videos rather than performed on all videos in the library, saving computational overhead and improving processing efficiency.
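A minimal sketch of the preselection stage. The disclosure only requires some neural network per side; single linear layers are an assumption here, as are the example dimensions (10, 6, 8) taken from the text above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    """Map video features and a paragraph feature into one joint space."""

    def __init__(self, video_dim=10, text_dim=6, common_dim=8):
        super().__init__()
        self.video_net = nn.Linear(video_dim, common_dim)  # first neural network
        self.text_net = nn.Linear(text_dim, common_dim)    # second neural network

    def forward(self, V, P):
        V_t = F.normalize(self.video_net(V), dim=-1)       # third feature vectors
        P_t = F.normalize(self.text_net(P), dim=-1)        # second feature vector
        return V_t @ P_t.T   # cosine similarities = first relevance scores S_t

# Usage sketch: score all N videos against one paragraph, keep the top E.
# scores = model(all_video_feats, paragraph_feat.unsqueeze(0)).squeeze(-1)
# preselected = scores.topk(k=E).indices
```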
In a possible implementation, the first neural network and the second neural network may be trained before they are used for the mapping processing. The method further includes: training the first neural network and the second neural network according to third sample feature information of sample videos and second sample feature information of sample text paragraphs.
In a possible implementation, videos in the video library may be used as sample videos, or videos in other video libraries may be used as sample videos; the present disclosure does not limit the sample videos. Fourth sample feature information of the video frames of a sample video may be extracted, and the third sample feature information of the sample video determined according to the fourth sample feature information.
In a possible implementation, an arbitrary query text paragraph may be input as a sample text paragraph. The sample text paragraph may include one or more sentences; the first sample feature information of the training sentences may be extracted, and the second sample feature information of the sample text paragraph determined according to the first sample feature information. Among the sample videos there is a video corresponding to the sample text paragraph, i.e., there is a sample video whose content matches the content of the sample text paragraph.
In a possible implementation, the third sample feature information of the plurality of sample videos may be input into the first neural network for mapping to obtain third sample feature vectors, and the second sample feature information of the sample text paragraph input into the second neural network to obtain a second sample feature vector.
In a possible implementation, the cosine similarity between the second sample feature vector and each third sample feature vector may be determined respectively, and a first comprehensive network loss determined according to the cosine similarities. In an example, the first comprehensive network loss may be determined according to the following formula (1):

$$L_{\text{find}} = \sum_{a}\sum_{b \neq a} \max\big(0,\; S_t(V_b, P_a) - S_t(V_a, P_a) + \alpha\big) \qquad (1)$$

where L_find is the first comprehensive network loss, S_t(V_b, P_a) is the cosine similarity between the second sample feature vector of the a-th sample text paragraph and the third sample feature vector of the b-th sample video, V_a is the third sample feature information of the sample video corresponding to the a-th sample text paragraph, and S_t(V_a, P_a) is the cosine similarity between the second sample feature vector of the a-th sample text paragraph and the third sample feature vector of its corresponding sample video. Both a and b are positive integers. α is a set constant; in an example, α may be set to 0.2.
In a possible implementation, the first comprehensive network loss may be used to adjust the network parameter values of the first and second neural networks. In an example, the network parameter values are adjusted in the direction that minimizes the first comprehensive network loss, so that the adjusted first and second neural networks have a high goodness of fit while overfitting is avoided. The present disclosure does not limit the method of adjusting the network parameter values of the first and second neural networks.
In a possible implementation, the step of adjusting the network parameter values of the first and second neural networks may be executed in a loop, adjusting the parameter values successively so that the first comprehensive network loss decreases or converges. In an example, sample text paragraphs may be input a predetermined number of times, i.e., the loop is executed a predetermined number of times. In an example, the number of loop iterations may instead be left open: when the first comprehensive network loss has decreased to a certain degree or converged within a certain threshold, the loop is stopped, and the adjusted first and second neural networks are obtained. The adjusted first and second neural networks can then be used in the process of mapping the third feature information of the first video and the second feature information of the query text paragraph.
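A minimal sketch of loss (1), assuming a batch where the a-th paragraph's matched video sits on the diagonal of a similarity matrix (the one-matched-video-per-paragraph setup described above).

```python
import torch

def l_find(scores, alpha=0.2):
    """Formula (1): ranking loss over a batch of paragraph-video pairs.

    scores[a][b] = S_t(V_b, P_a), the cosine similarity between paragraph a
    and video b; the diagonal holds the matched (a, a) pairs.
    """
    pos = scores.diagonal().unsqueeze(1)              # S_t(V_a, P_a) per row
    margin = (scores - pos + alpha).clamp(min=0)      # hinge on every (a, b)
    off_diag = 1.0 - torch.eye(scores.size(0), device=scores.device)
    return (margin * off_diag).sum()                  # drop the b == a terms
```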
Fig. 3 shows a flowchart of a video processing method according to an embodiment of the present disclosure. As shown in Fig. 3, step S12 includes:
In step S121, the target video among the preselected videos is determined according to the first feature information of the one or more sentences of the query text paragraph and the fourth feature information of the plurality of video frames of the preselected video.
In a possible implementation, the relevance between the query text paragraph and the preselected videos may be further determined according to the first feature information of the one or more sentences and the fourth feature information of the plurality of video frames of the preselected videos.
In a possible implementation, determining the target video among the preselected videos according to the first feature information of the one or more sentences and the fourth feature information of the plurality of video frames of the preselected video includes: determining a second relevance score between the query text paragraph and the preselected video according to the first feature information of the one or more sentences and the fourth feature information of the plurality of video frames of the preselected video; and determining the target video among the preselected videos according to the first relevance score and the second relevance score.
In a possible implementation, determining the second relevance score between the query text paragraph and the preselected video may include: mapping the fourth feature information of a plurality of video frames of a target preselected video and the first feature information of the one or more sentences into a vector space of the same dimension, to respectively obtain fourth feature vectors of the plurality of video frames of the target preselected video and first feature vectors of the one or more sentences, the target preselected video being any one of the preselected videos; determining, among the fourth feature vectors, target feature vectors whose cosine similarity with the first feature vector of a target sentence is greater than or equal to a similarity threshold, the target sentence being any one of the one or more sentences; aggregating the video frames corresponding to the target feature vectors into a video clip corresponding to the target sentence; determining a fifth feature vector of the video clip corresponding to the target sentence according to the target feature vectors; and determining the second relevance score between the query text paragraph and the target preselected video according to the fifth feature vectors of the video clips corresponding respectively to the one or more sentences and the first feature vectors of the one or more sentences.
In this way, the second relevance score between the query text paragraph and the target preselected video can be determined according to the fourth feature vectors of the plurality of video frames of the target preselected video and the first feature vectors of the one or more sentences, and the relevance between the semantic content of the query text paragraph and the target preselected video determined accurately.
In a possible implementation, the fourth feature information of the plurality of video frames of the target preselected video and the first feature information of the one or more sentences have different dimensions, and mapping may be used to map the fourth feature information and the first feature information into a vector space of the same dimension. In an example, the fourth feature information of the plurality of video frames of the target preselected video may be feature vectors f_1, f_2, ..., f_K (K being the number of video frames of the target preselected video, K a positive integer), and the first feature information of the one or more sentences may be feature vectors s_1, s_2, ..., s_M (M being the number of sentences of the query text paragraph, M a positive integer); f_1, ..., f_K and s_1, ..., s_M may be mapped into a vector space of the same dimension, obtaining the fourth feature vectors f̃_1, ..., f̃_K and the first feature vectors s̃_1, ..., s̃_M.
In a possible implementation, neural networks may be used to map the fourth feature information and the first feature information into the same-dimensional vector space. In an example, mapping the fourth feature information of the plurality of video frames of the target preselected video and the first feature information of the one or more sentences into a vector space of the same dimension includes: mapping the fourth feature information into the fourth feature vectors using a third neural network, and mapping the first feature information into the first feature vectors using a fourth neural network.
In an example, the third and fourth neural networks may be BP neural networks, convolutional neural networks, recurrent neural networks, or the like; the present disclosure does not limit the types of the third and fourth neural networks. For example, if the dimension of the fourth feature information f_1, ..., f_K is 10 and the dimension of the first feature information s_1, ..., s_M is 6, a vector space of a common dimension, say 8, may be determined: the third neural network maps the 10-dimensional fourth feature information into the 8-dimensional fourth feature vectors f̃_1, ..., f̃_K, and the fourth neural network maps the 6-dimensional first feature information into the 8-dimensional first feature vectors s̃_1, ..., s̃_M. The present disclosure does not limit the dimensions.
In a possible implementation, target feature vectors whose cosine similarity with the first feature vector of the target sentence is greater than or equal to the similarity threshold may be determined among the fourth feature vectors. In an example, any one of the one or more sentences may be selected as the target sentence (for example, the y-th sentence, 1 ≤ y ≤ M); the cosine similarities between the fourth feature vectors f̃_1, ..., f̃_K of the target preselected video and the first feature vector s̃_y of the target sentence are computed respectively, and the fourth feature vectors whose cosine similarity with s̃_y is greater than or equal to the similarity threshold, for example f̃_h, f̃_u, ..., f̃_q (1 ≤ h ≤ K, 1 ≤ u ≤ K, 1 ≤ q ≤ K), are determined as the target feature vectors. The similarity threshold may be a preset threshold, for example 0.5; the present disclosure does not limit the similarity threshold.
In a possible implementation, the video frames corresponding to the target feature vectors may be aggregated into the video clip corresponding to the target sentence. In an example, the fourth feature information may be feature vectors obtained by selecting one video frame every 5 frames (i.e., out of every 6 frames) of the target preselected video for feature extraction, and the fourth feature vectors are obtained by mapping the fourth feature information; the video frames corresponding to each fourth feature vector may be the frame used for extracting the fourth feature information together with the 5 frames before or after it. The video frames corresponding to all the target feature vectors may be aggregated together to obtain a video clip, which is the video clip corresponding to the target sentence; for example, the frames corresponding to f̃_h, f̃_u, ..., f̃_q are aggregated to obtain the clip corresponding to the target sentence. The present disclosure does not limit the video frames corresponding to a target feature vector.
In a possible implementation, the video clip corresponding to the feature vector of each sentence may be determined in the target preselected video, and the position in the target preselected video corresponding to the semantic content of each sentence determined according to information such as the timestamps or frame numbers of the video frames included in the video clip corresponding to each sentence's feature vector.
In a possible implementation, the fifth feature vector of the video clip corresponding to the target sentence is determined according to the target feature vectors. In an example, the target feature vectors f̃_h, f̃_u, ..., f̃_q may be summed, averaged, or otherwise processed and fused into the fifth feature vector g_y. In an example, the target sentence may have multiple corresponding video clips; for example, the target feature vectors may fall into several groups of adjacent vectors, and each group of adjacent target feature vectors may be fused into its own fifth feature vector, e.g., g_y1, g_y2, and g_y3. That is, each sentence may correspond to one or more fifth feature vectors. In an example, each fifth feature vector corresponds to one sentence.
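A minimal sketch of this grouping, assuming (as in the example above) that one fourth feature vector stands for each group of 6 frames, a 0.5 similarity threshold, and average fusion; runs of adjacent above-threshold positions become separate clips.

```python
import numpy as np

def clips_for_sentence(frame_vecs, sent_vec, thresh=0.5, stride=6):
    """Group frames similar to one sentence into contiguous clips.

    frame_vecs: (K, D) mapped fourth feature vectors, one per `stride` frames.
    Returns (clip_frame_ranges, fifth_vectors): each run of consecutive
    above-threshold positions is one clip, and its target feature vectors
    are averaged into that clip's fifth feature vector g.
    """
    f = frame_vecs / np.linalg.norm(frame_vecs, axis=1, keepdims=True)
    s = sent_vec / np.linalg.norm(sent_vec)
    hits = np.where(f @ s >= thresh)[0]
    clips, fifth = [], []
    if hits.size == 0:
        return clips, fifth
    start = prev = hits[0]
    for idx in list(hits[1:]) + [None]:
        if idx is None or idx != prev + 1:                       # run ended
            clips.append((start * stride, (prev + 1) * stride))  # frame range
            fifth.append(frame_vecs[start:prev + 1].mean(axis=0))
            if idx is not None:
                start = idx
        if idx is not None:
            prev = idx
    return clips, fifth
```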
In a possible implementation, the second relevance score between the query text paragraph and the target preselected video may be determined according to the fifth feature vectors of the video clips corresponding respectively to the one or more sentences and the first feature vectors of the one or more sentences. In an example, the first feature vectors of the plurality of sentences are s̃_1, s̃_2, ..., s̃_M (M a positive integer), and the fifth feature vectors of the plurality of video clips are g_1, g_2, ..., g_W (W a positive integer), where the fifth feature vectors corresponding to s̃_1 are g_1, g_2, ..., g_O (O, the number of fifth feature vectors corresponding to s̃_1, a positive integer less than W), the fifth feature vectors corresponding to s̃_2 are g_{O+1}, g_{O+2}, ..., g_V (V a positive integer less than W and greater than O), and the fifth feature vectors corresponding to s̃_M are g_Z, g_{Z+1}, ..., g_W.
In a possible implementation, the second relevance score between the query text paragraph and the target preselected video may be determined according to the following formula (2):

$$S_p(V, P) = \max_{x} \sum_{i=1}^{M} \sum_{j=1}^{W} x_{ij}\, r_{ij} \qquad (2)$$

where x_ij indicates whether the i-th sentence corresponds to the j-th video clip: x_ij = 1 when the fifth feature vector of the j-th clip is one of the fifth feature vectors corresponding to the first feature vector of the i-th sentence, and x_ij = 0 otherwise. In an example, it may be determined whether the i-th sentence and the j-th clip are matched in a bipartite graph: x_ij = 1 if they match, and x_ij = 0 otherwise. In an example, for the i-th sentence, Σ_j x_ij ≤ u_max, i.e., in the target preselected video one sentence has at most u_max corresponding video clips, where u_max is a preset number of clips, 1 ≤ u_max ≤ W. In an example, for the j-th clip, Σ_i x_ij = 1, i.e., in the target preselected video each clip corresponds to exactly one sentence. r_ij is the cosine similarity between the first feature vector of the i-th sentence and the fifth feature vector of the j-th clip. S_p(V, P) is the second relevance score between the query text paragraph and the target preselected video.
In a possible implementation, a third relevance score S_r(V, P) between the query text paragraph and the target preselected video may be determined according to the first relevance score S_t(V, P) and the second relevance score S_p(V, P) between the query text paragraph and the target preselected video, and the third relevance score between the query text paragraph and each preselected video determined. In an example, the product of the first relevance score and the second relevance score is determined as the third relevance score, and the target video determined among the preselected videos according to the third relevance score. The preselected videos may be sorted according to the third relevance score of the query text paragraph with each preselected video and a predetermined number of videos selected from the sorted sequence, or videos whose third relevance score is greater than or equal to a score threshold may be selected; the present disclosure does not limit the method of selecting the target video.
In a possible implementation, the third and fourth neural networks may be trained before they are used for the mapping processing. The method further includes: training the third neural network and the fourth neural network according to fourth sample feature information of a plurality of video frames of sample videos and first sample feature information of one or more sentences of sample text paragraphs.
In a possible implementation, videos in the video library may be used as sample videos, or videos in other video libraries may be used as sample videos; the present disclosure does not limit the sample videos. Fourth sample feature information of the video frames of a sample video may be extracted. An arbitrary query text paragraph may be input as a sample text paragraph; the sample text paragraph may include one or more sentences, and the first sample feature information of the training sentences may be extracted.
In a possible implementation, the fourth sample feature information of the plurality of video frames of the sample video may be input into the third neural network to obtain fourth sample feature vectors, and the first sample feature information of the one or more sentences of the sample text paragraph input into the fourth neural network to obtain first sample feature vectors.
In a possible implementation, target sample feature vectors whose cosine similarity with a first target sample feature vector is greater than or equal to the similarity threshold may be determined among the fourth sample feature vectors, the first target sample feature vector being any one of the first sample feature vectors. Further, the target sample feature vectors may be fused into a fifth sample feature vector corresponding to the first target sample feature vector. In an example, the fifth sample feature vector corresponding to each first sample feature vector may be determined respectively.
In a possible implementation, the cosine similarity between each fifth sample feature vector and the first sample feature vector may be determined respectively, and a second comprehensive network loss determined according to the cosine similarities. In an example, the second comprehensive network loss may be determined according to the following formula (3):

$$L_{\text{ref}} = \sum_{g_d \neq g_+} \max\big(0,\; \cos(g_d, \hat{s}) - \cos(g_+, \hat{s}) + \beta\big) \qquad (3)$$

where L_ref is the second comprehensive network loss, ŝ is the first target sample feature vector, g_d is the d-th fifth sample feature vector, g_+ is the fifth sample feature vector corresponding to the first target sample feature vector, cos(g_d, ŝ) is the cosine similarity between g_d and ŝ, and cos(g_+, ŝ) is the cosine similarity between g_+ and ŝ. β is a set constant; in an example, β may be set to 0.1.
In a possible implementation, the second comprehensive network loss may be used to adjust the network parameter values of the third and fourth neural networks. In an example, the network parameter values are adjusted in the direction that minimizes the second comprehensive network loss, so that the adjusted third and fourth neural networks have a high goodness of fit while overfitting is avoided. The present disclosure does not limit the method of adjusting the network parameter values of the third and fourth neural networks.
In a possible implementation, the step of adjusting the network parameter values of the third and fourth neural networks may be executed in a loop, adjusting the parameter values successively so that the second comprehensive network loss decreases or converges. In an example, sample text paragraphs or sample videos may be input a predetermined number of times, i.e., the loop is executed a predetermined number of times. In an example, the number of loop iterations may instead be left open: when the second comprehensive network loss has decreased to a certain degree or converged within a certain threshold, the loop is stopped, and the adjusted third and fourth neural networks are obtained. The adjusted third and fourth neural networks can then be used in the process of mapping the fourth feature information of the plurality of video frames of the target preselected video and the first feature information of the one or more sentences.
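A minimal sketch of loss (3), assuming one target sentence vector ŝ with a known matching clip index (the correspondence described above); terms with g_d equal to g_+ are masked out, mirroring the b ≠ a exclusion in loss (1).

```python
import torch
import torch.nn.functional as F

def l_ref(fifth_vecs, s_hat, pos, beta=0.1):
    """Formula (3): margin loss pulling the matching clip vector g_+ toward s_hat.

    fifth_vecs: (W, D) fifth sample feature vectors g_1..g_W; `pos` indexes
    g_+, the one corresponding to the target sentence vector s_hat (shape (D,)).
    """
    sims = F.cosine_similarity(fifth_vecs, s_hat.unsqueeze(0), dim=1)  # cos(g_d, s_hat)
    margin = (sims - sims[pos] + beta).clamp(min=0)
    mask = torch.ones_like(margin)
    mask[pos] = 0.0                      # exclude the g_d == g_+ term
    return (margin * mask).sum()
```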
Fig. 4 shows a flowchart of a video processing method according to an embodiment of the present disclosure. To summarize, in step S111 the preselected videos may be determined according to the second feature information of the query text paragraph and the third feature information, and in step S121 the target video determined among the preselected videos according to the first feature information of the one or more sentences of the query text paragraph and the fourth feature information. For the specific processing of the above video processing method, refer to the foregoing embodiments; it is not elaborated again here.
Fig. 5 shows a schematic diagram of an application of a video processing method according to an embodiment of the present disclosure. As shown in Fig. 5, the video library may include N videos; the fourth feature information of a plurality of video frames of each video may be obtained respectively, and the third feature information of each video obtained according to the fourth feature information.
In a possible implementation, a query text paragraph may be input; the query text paragraph may include one or more sentences, the first feature information of each sentence may be extracted, and the second feature information of the query text paragraph determined according to the first feature information.
In a possible implementation, the dimensions of the third feature information and the second feature information may differ; the third feature information may be mapped into a third feature vector through the first neural network, and the second feature information mapped into a second feature vector through the second neural network, the third and second feature vectors lying in a vector space of the same dimension. The cosine similarity between the second feature vector of the query text paragraph and the third feature vector of each video may be determined respectively, and the cosine similarity determined as the first relevance score of the query text paragraph with each video. The videos in the library may be sorted according to the first relevance scores; as shown by the video library on the left of Fig. 5, the video sequence obtained by this sorting is video 1, video 2, video 3, ..., video N. The first E (1 ≤ E ≤ N) videos of this sequence are selected as the preselected videos.
In a possible implementation, the third neural network may be used to map the fourth feature information of the preselected videos into fourth feature vectors, and the fourth neural network used to map the first feature information of the one or more sentences of the query text paragraph into first feature vectors, the fourth and first feature vectors lying in a vector space of the same dimension. In a target preselected video, the fourth feature vectors whose cosine similarity with the first feature vector of the target sentence is greater than or equal to the similarity threshold may be determined as target feature vectors; the video frames of the target preselected video corresponding to the target feature vectors may be aggregated into video clips, the target feature vectors fused into fifth feature vectors, and the second relevance score between the query text paragraph and the target preselected video determined through formula (2). Further, the second relevance score between the query text paragraph and each preselected video may be determined.
In a possible implementation, the first relevance score and the second relevance score of the query text paragraph with a preselected video may be multiplied to obtain the third relevance score of the query text paragraph with that preselected video, and the E preselected videos sorted according to the third relevance scores; as shown by the video library on the right of Fig. 5, the video sequence obtained by this sorting is video 3, video 5, video 8, .... After this sorting, video 3 is the video with the highest third relevance score, i.e., the video most relevant to the semantic content of the query text paragraph, followed by video 5, video 8, and so on. Video 3 may be selected as the target video, or the first X (X ≤ E) videos may be selected as target videos.
According to the video processing method of the embodiments of the present disclosure, the cosine similarity between the second feature vector of the query text paragraph and the third feature vector of a video is determined as the first relevance score between the query text paragraph and that video, so that the relevance between the semantic content of the query text paragraph and the feature information of the video can be determined accurately and the preselected videos selected accurately; after the preselected videos are selected, processing can be restricted to the preselected videos rather than performed on all videos in the library, saving computational overhead and improving processing efficiency. Further, the second relevance score between the query text paragraph and the target preselected video can be determined according to the fourth feature vectors of its video frames and the first feature vectors of the one or more sentences, and the target video determined according to the second relevance score and the first relevance score. Videos can thus be retrieved through their relevance to the query text paragraph, the target video located accurately, redundant query results avoided, and query text paragraphs in natural-language form processed without being limited by the inherent content of content tags.
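An end-to-end sketch tying together the helpers from the earlier sketches (clips_for_sentence and second_relevance_score are their names, not the disclosure's); it assumes the first relevance scores S_t are precomputed and that sentence and frame vectors are already mapped into the joint space by the fourth and third networks.

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(st, sentence_vecs, frame_vecs, E=20, u_max=2):
    """Two-stage retrieval: preselect top E by S_t, re-rank by S_r = S_t * S_p."""
    preselected = np.argsort(-st)[:E]                  # stage 1: top E by S_t
    best, best_sr = None, float("-inf")
    for v in preselected:                              # stage 2: re-score each
        fifth = []
        for s in sentence_vecs:                        # clips per sentence
            fifth.extend(clips_for_sentence(frame_vecs[v], s)[1])
        if not fifth:
            continue
        r = np.array([[cos(s, g) for g in fifth] for s in sentence_vecs])
        sr = st[v] * second_relevance_score(r, u_max)  # third relevance score
        if sr > best_sr:
            best, best_sr = int(v), sr
    return best
```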
It can be understood that the foregoing method embodiments mentioned in the present disclosure can be combined with each other to form combined embodiments without departing from the principles and logic; details are omitted here due to space limitations.
In addition, the present disclosure further provides a video processing apparatus, an electronic device, a computer-readable storage medium, and a program, all of which can be used to implement any video processing method provided in the present disclosure; for the corresponding technical solutions and descriptions, refer to the corresponding records in the method section, which are not repeated here.
Those skilled in the art can understand that, in the above methods of the specific implementations, the order in which the steps are written does not imply a strict execution order or constitute any limitation on the implementation process; the specific execution order of the steps should be determined by their functions and possible internal logic.
Fig. 6 shows a block diagram of a video processing apparatus according to an embodiment of the present disclosure. As shown in Fig. 6, the apparatus includes:
a preselected video determination module 11, configured to determine, according to paragraph information of a query text paragraph and video information of a plurality of videos in a video library, a preselected video associated with the query text paragraph among the plurality of videos;
a target video determination module 12, configured to determine a target video among the preselected videos according to video frame information of the preselected video and sentence information of the query text paragraph.
The sentence information includes first feature information of one or more sentences of the query text paragraph, the paragraph information includes second feature information of the query text paragraph, the video frame information includes fourth feature information of a plurality of video frames of a video, and the video information includes third feature information of the video.
The preselected video determination module is further configured to:
determine, according to the second feature information and the third feature information of the plurality of videos in the video library, the preselected video associated with the query text paragraph among the plurality of videos.
In a possible implementation, the apparatus further includes:
a sentence feature extraction module, configured to perform feature extraction processing on the one or more sentences of the query text paragraph respectively to obtain the first feature information of the one or more sentences;
a second determination module, configured to determine the second feature information of the query text paragraph according to the first feature information of the one or more sentences in the query text paragraph.
In a possible implementation, the apparatus further includes:
a video feature extraction module, configured to perform feature extraction processing on a plurality of video frames of a second video respectively to obtain fourth feature information of the plurality of video frames of the second video, the second video being any one of the plurality of videos;
a first determination module, configured to determine third feature information of the second video according to the fourth feature information of the plurality of video frames of the second video.
In a possible implementation, the preselected video determination module is further configured to:
determine first relevance scores between the query text paragraph and the plurality of videos respectively according to the second feature information and the third feature information of the plurality of videos in the video library;
determine the preselected video among the plurality of videos according to the first relevance scores.
In a possible implementation, the preselected video determination module is further configured to:
map the third feature information of a first video and the second feature information into a vector space of the same dimension to obtain a third feature vector of the first video and a second feature vector of the query text paragraph, the first video being any one of the plurality of videos;
determine the cosine similarity between the second feature vector and the third feature vector as the first relevance score between the query text paragraph and the first video.
In a possible implementation, the target video determination module is further configured to:
determine the target video among the preselected videos according to the first feature information of the one or more sentences and the fourth feature information of the plurality of video frames of the preselected video.
In a possible implementation, the target video determination module is further configured to:
determine a second relevance score between the query text paragraph and the preselected video according to the first feature information of the one or more sentences and the fourth feature information of the plurality of video frames of the preselected video;
determine the target video among the preselected videos according to the first relevance score and the second relevance score.
In a possible implementation, the target video determination module is further configured to:
map the fourth feature information of a plurality of video frames of a target preselected video and the first feature information of the one or more sentences into a vector space of the same dimension, to respectively obtain fourth feature vectors of the plurality of video frames of the target preselected video and first feature vectors of the one or more sentences, the target preselected video being any one of the preselected videos;
determine, among the fourth feature vectors, target feature vectors whose cosine similarity with the first feature vector of a target sentence is greater than or equal to a similarity threshold, the target sentence being any one of the one or more sentences;
aggregate the video frames corresponding to the target feature vectors into a video clip corresponding to the target sentence;
determine a fifth feature vector of the video clip corresponding to the target sentence according to the target feature vectors;
determine the second relevance score between the query text paragraph and the target preselected video according to the fifth feature vectors of the video clips corresponding respectively to the one or more sentences and the first feature vectors of the one or more sentences.
In a possible implementation, the target video determination module is further configured to:
determine the product of the first relevance score and the second relevance score as a third relevance score;
determine the target video among the preselected videos according to the third relevance score.
In some embodiments, the functions or modules of the apparatus provided in the embodiments of the present disclosure may be used to execute the methods described in the foregoing method embodiments; for their specific implementation, refer to the description of the method embodiments above, which is not repeated here for brevity.
An embodiment of the present disclosure further provides a computer-readable storage medium having computer program instructions stored thereon, where the computer program instructions implement the above method when executed by a processor. The computer-readable storage medium may be a non-volatile computer-readable storage medium.
An embodiment of the present disclosure further provides an electronic device, including: a processor; and a memory for storing processor-executable instructions, where the processor is configured to perform the above method.
The electronic device may be provided as a terminal, a server, or a device in another form.
Fig. 7 is a block diagram of an electronic device 800 according to an exemplary embodiment. For example, the electronic device 800 may be a terminal such as a mobile phone, a computer, a digital broadcasting terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, or a personal digital assistant.
Referring to Fig. 7, the electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls the overall operations of the electronic device 800, such as operations associated with display, telephone calls, data communication, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions so as to complete all or some of the steps of the above method. In addition, the processing component 802 may include one or more modules to facilitate interaction between the processing component 802 and other components. For example, the processing component 802 may include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operation at the electronic device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phone book data, messages, pictures, videos, and the like. The memory 804 may be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc.
The power component 806 provides power to the various components of the electronic device 800. The power component 806 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the electronic device 800.
The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may not only sense the boundary of a touch or swipe action, but also detect the duration and pressure related to the touch or swipe operation. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. When the electronic device 800 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera may use a fixed optical lens system or have focus and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a microphone (MIC), which is configured to receive external audio signals when the electronic device 800 is in an operation mode, such as a call mode, a recording mode, or a speech recognition mode. The received audio signals may be further stored in the memory 804 or transmitted via the communication component 816. In some embodiments, the audio component 810 further includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and a peripheral interface module, which may be a keyboard, a click wheel, a button, or the like. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor component 814 includes one or more sensors for providing state evaluation of various aspects of the electronic device 800. For example, the sensor component 814 can detect the on/off state of the electronic device 800 and the relative positioning of components, for example the display and keypad of the electronic device 800; the sensor component 814 can also detect a change in position of the electronic device 800 or of a component of the electronic device 800, the presence or absence of user contact with the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and a change in the temperature of the electronic device 800. The sensor component 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 814 may further include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 also includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module can be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, to perform the above method.
In an exemplary embodiment, a non-volatile computer-readable storage medium is also provided, such as the memory 804 including computer program instructions, which may be executed by the processor 820 of the electronic device 800 to complete the above method.
Fig. 8 is a block diagram of an electronic device 1900 according to an exemplary embodiment. For example, the electronic device 1900 may be provided as a server. Referring to Fig. 8, the electronic device 1900 includes a processing component 1922, which further includes one or more processors, and a memory resource represented by a memory 1932, for storing instructions executable by the processing component 1922, for example an application program. The application program stored in the memory 1932 may include one or more modules each corresponding to a set of instructions. In addition, the processing component 1922 is configured to execute instructions to perform the above method.
The electronic device 1900 may further include a power supply component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 can operate based on an operating system stored in the memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
In an exemplary embodiment, a non-volatile computer-readable storage medium is also provided, such as the memory 1932 including computer program instructions, which may be executed by the processing component 1922 of the electronic device 1900 to complete the above method.
The present disclosure may be a system, a method, and/or a computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions thereon for causing a processor to implement various aspects of the present disclosure.
The computer-readable storage medium may be a tangible device that can hold and store instructions used by an instruction execution device. The computer-readable storage medium may be, for example, but is not limited to, an electric storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include: a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital versatile disc (DVD), memory stick, floppy disk, a mechanical encoding device such as a punched card or a raised structure in a groove having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media used herein are not to be interpreted as transient signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (for example, light pulses through fiber-optic cables), or electrical signals transmitted through a wire.
The computer-readable program instructions described herein can be downloaded from a computer-readable storage medium to respective computing/processing devices, or downloaded to an external computer or external storage device via a network, such as the Internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber-optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in each computing/processing device.
Computer program instructions for performing the operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-related instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages, such as Smalltalk and C++, and conventional procedural programming languages, such as the "C" language or similar programming languages. The computer-readable program instructions may execute entirely on a user's computer, partly on a user's computer, as a stand-alone software package, partly on a user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, through the Internet by using an Internet service provider). In some embodiments, an electronic circuit, such as a programmable logic circuit, a field-programmable gate array (FPGA), or a programmable logic array (PLA), is personalized by using the state information of the computer-readable program instructions, and the electronic circuit can execute the computer-readable program instructions to implement various aspects of the present disclosure.
Various aspects of the present disclosure are described herein with reference to flowcharts and/or block diagrams of methods, apparatuses (systems), and computer program products according to embodiments of the present disclosure. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams. These computer-readable program instructions may also be stored in a computer-readable storage medium; these instructions cause a computer, a programmable data processing apparatus, and/or other devices to work in a specific manner, so that the computer-readable medium storing the instructions includes an article of manufacture that includes instructions implementing various aspects of the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
The computer-readable program instructions can also be loaded onto a computer, another programmable data processing apparatus, or another device, so that a series of operating steps are performed on the computer, other programmable data processing apparatus, or other device to produce a computer-implemented process, such that the instructions executed on the computer, other programmable data processing apparatus, or other device implement the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
The flowcharts and block diagrams in the accompanying drawings show the possible architectures, functions, and operations of systems, methods, and computer program products according to multiple embodiments of the present disclosure. In this regard, each block in the flowcharts or block diagrams may represent a module, program segment, or part of an instruction, which contains one or more executable instructions for implementing the specified logical function. In some alternative implementations, the functions marked in the blocks may occur in an order different from that marked in the drawings. For example, two consecutive blocks may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions.
The embodiments of the present disclosure have been described above; the above description is exemplary rather than exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terms used herein are chosen to best explain the principles of the embodiments, the practical application, or the technical improvement over technologies in the market, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (22)

  1. A video processing method, comprising:
    determining, according to paragraph information of a query text paragraph and video information of a plurality of videos in a video library, a preselected video associated with the query text paragraph among the plurality of videos;
    determining a target video among the preselected videos according to video frame information of the preselected video and sentence information of the query text paragraph.
  2. The method according to claim 1, wherein the paragraph information comprises second feature information of the query text paragraph, and the video information comprises third feature information of a video;
    the determining, according to paragraph information of a query text paragraph and video information of a plurality of videos in a video library, a preselected video associated with the query text paragraph among the plurality of videos comprises:
    determining, according to the second feature information and third feature information of the plurality of videos in the video library, the preselected video associated with the query text paragraph among the plurality of videos.
  3. The method according to claim 2, wherein the determining, according to the second feature information and third feature information of the plurality of videos in the video library, the preselected video associated with the query text paragraph among the plurality of videos comprises:
    determining first relevance scores between the query text paragraph and the plurality of videos respectively according to the second feature information and the third feature information of the plurality of videos in the video library;
    determining the preselected video among the plurality of videos according to the first relevance scores.
  4. The method according to claim 3, wherein the determining first relevance scores between the query text paragraph and the plurality of videos respectively according to the second feature information and the third feature information of the plurality of videos in the video library comprises:
    mapping third feature information of a first video and the second feature information into a vector space of the same dimension to obtain a third feature vector of the first video and a second feature vector of the query text paragraph, wherein the first video is any one of the plurality of videos;
    determining a cosine similarity between the second feature vector and the third feature vector as the first relevance score between the query text paragraph and the first video.
  5. The method according to any one of claims 1 to 4, wherein the sentence information comprises first feature information of one or more sentences of the query text paragraph, and the video frame information comprises fourth feature information of a plurality of video frames of the preselected video;
    the determining a target video among the preselected videos according to video frame information of the preselected video and sentence information of the query text paragraph comprises:
    determining the target video among the preselected videos according to the first feature information of the one or more sentences and the fourth feature information of the plurality of video frames of the preselected video.
  6. The method according to claim 5, wherein the determining the target video among the preselected videos according to the first feature information of the one or more sentences and the fourth feature information of the plurality of video frames of the preselected video comprises:
    determining a second relevance score between the query text paragraph and the preselected video according to the first feature information of the one or more sentences and the fourth feature information of the plurality of video frames of the preselected video;
    determining the target video among the preselected videos according to the first relevance score and the second relevance score.
  7. The method according to claim 6, wherein the determining a second relevance score between the query text paragraph and the preselected video according to the first feature information of the one or more sentences and the fourth feature information of the plurality of video frames of the preselected video comprises:
    mapping fourth feature information of a plurality of video frames of a target preselected video and the first feature information of the one or more sentences into a vector space of the same dimension to respectively obtain fourth feature vectors of the plurality of video frames of the target preselected video and first feature vectors of the one or more sentences, wherein the target preselected video is any one of the preselected videos;
    determining, among the fourth feature vectors, target feature vectors whose cosine similarity with the first feature vector of a target sentence is greater than or equal to a similarity threshold, wherein the target sentence is any one of the one or more sentences;
    aggregating video frames corresponding to the target feature vectors into a video clip corresponding to the target sentence;
    determining a fifth feature vector of the video clip corresponding to the target sentence according to the target feature vectors;
    determining the second relevance score between the query text paragraph and the target preselected video according to fifth feature vectors of video clips corresponding respectively to the one or more sentences and the first feature vectors of the one or more sentences.
  8. The method according to claim 6, wherein the determining the target video among the preselected videos according to the first relevance score and the second relevance score comprises:
    determining a product of the first relevance score and the second relevance score as a third relevance score;
    determining the target video among the preselected videos according to the third relevance score.
  9. The method according to any one of claims 1-8, further comprising:
    performing feature extraction processing on a plurality of video frames of a second video respectively to obtain fourth feature information of the plurality of video frames of the second video, wherein the second video is any one of the plurality of videos;
    determining third feature information of the second video according to the fourth feature information of the plurality of video frames of the second video.
  10. The method according to any one of claims 1-9, further comprising:
    performing feature extraction processing on one or more sentences of the query text paragraph respectively to obtain first feature information of the one or more sentences;
    determining second feature information of the query text paragraph according to the first feature information of the one or more sentences in the query text paragraph.
  11. A video processing apparatus, comprising:
    a preselected video determination module, configured to determine, according to paragraph information of a query text paragraph and video information of a plurality of videos in a video library, a preselected video associated with the query text paragraph among the plurality of videos;
    a target video determination module, configured to determine a target video among the preselected videos according to video frame information of the preselected video and sentence information of the query text paragraph.
  12. The apparatus according to claim 11, wherein the paragraph information comprises second feature information of the query text paragraph, and the video information comprises third feature information of a video;
    the preselected video determination module is further configured to:
    determine, according to the second feature information and third feature information of the plurality of videos in the video library, the preselected video associated with the query text paragraph among the plurality of videos.
  13. The apparatus according to claim 12, wherein the preselected video determination module is further configured to:
    determine first relevance scores between the query text paragraph and the plurality of videos respectively according to the second feature information and the third feature information of the plurality of videos in the video library;
    determine the preselected video among the plurality of videos according to the first relevance scores.
  14. The apparatus according to claim 13, wherein the preselected video determination module is further configured to:
    map third feature information of a first video and the second feature information into a vector space of the same dimension to obtain a third feature vector of the first video and a second feature vector of the query text paragraph, wherein the first video is any one of the plurality of videos;
    determine a cosine similarity between the second feature vector and the third feature vector as the first relevance score between the query text paragraph and the first video.
  15. The apparatus according to any one of claims 11 to 14, wherein the sentence information comprises first feature information of one or more sentences of the query text paragraph, and the video frame information comprises fourth feature information of a plurality of video frames of the preselected video;
    the target video determination module is further configured to:
    determine the target video among the preselected videos according to the first feature information of the one or more sentences and the fourth feature information of the plurality of video frames of the preselected video.
  16. The apparatus according to claim 15, wherein the target video determination module is further configured to:
    determine a second relevance score between the query text paragraph and the preselected video according to the first feature information of the one or more sentences and the fourth feature information of the plurality of video frames of the preselected video;
    determine the target video among the preselected videos according to the first relevance score and the second relevance score.
  17. The apparatus according to claim 16, wherein the target video determination module is further configured to:
    map fourth feature information of a plurality of video frames of a target preselected video and the first feature information of the one or more sentences into a vector space of the same dimension to respectively obtain fourth feature vectors of the plurality of video frames of the target preselected video and first feature vectors of the one or more sentences, wherein the target preselected video is any one of the preselected videos;
    determine, among the fourth feature vectors, target feature vectors whose cosine similarity with the first feature vector of a target sentence is greater than or equal to a similarity threshold, wherein the target sentence is any one of the one or more sentences;
    aggregate video frames corresponding to the target feature vectors into a video clip corresponding to the target sentence;
    determine a fifth feature vector of the video clip corresponding to the target sentence according to the target feature vectors;
    determine the second relevance score between the query text paragraph and the target preselected video according to fifth feature vectors of video clips corresponding respectively to the one or more sentences and the first feature vectors of the one or more sentences.
  18. The apparatus according to claim 16, wherein the target video determination module is further configured to:
    determine a product of the first relevance score and the second relevance score as a third relevance score;
    determine the target video among the preselected videos according to the third relevance score.
  19. The apparatus according to any one of claims 11-18, further comprising:
    a video feature extraction module, configured to perform feature extraction processing on a plurality of video frames of a second video respectively to obtain fourth feature information of the plurality of video frames of the second video, wherein the second video is any one of the plurality of videos;
    a first determination module, configured to determine third feature information of the second video according to the fourth feature information of the plurality of video frames of the second video.
  20. The apparatus according to any one of claims 11-19, further comprising:
    a sentence feature extraction module, configured to perform feature extraction processing on one or more sentences of the query text paragraph respectively to obtain first feature information of the one or more sentences;
    a second determination module, configured to determine second feature information of the query text paragraph according to the first feature information of the one or more sentences in the query text paragraph.
  21. An electronic device, comprising:
    a processor;
    a memory for storing processor-executable instructions;
    wherein the processor is configured to execute the method according to any one of claims 1 to 10.
  22. A computer-readable storage medium having computer program instructions stored thereon, wherein the computer program instructions, when executed by a processor, implement the method according to any one of claims 1 to 10.
PCT/CN2019/099486 2018-08-07 2019-08-06 Video processing method and apparatus, electronic device and storage medium WO2020029966A1 (zh)

Priority Applications (4)

Application Number Priority Date Filing Date Title
US16/975,347 US11120078B2 (en) 2018-08-07 2019-08-06 Method and device for video processing, electronic device, and storage medium
JP2020573569A JP6916970B2 (ja) 2018-08-07 2019-08-06 ビデオ処理方法及び装置、電子機器並びに記憶媒体
SG11202008134YA SG11202008134YA (en) 2018-08-07 2019-08-06 Method and device for video processing, electronic device, and storage medium
KR1020207030575A KR102222300B1 (ko) 2018-08-07 2019-08-06 비디오 처리 방법 및 장치, 전자 기기 및 저장 매체

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810892997.4 2018-08-07
CN201810892997.4A CN109089133B (zh) 2018-08-07 2018-08-07 Video processing method and apparatus, electronic device and storage medium

Publications (1)

Publication Number Publication Date
WO2020029966A1 true WO2020029966A1 (zh) 2020-02-13

Family

ID=64834271

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/099486 WO2020029966A1 (zh) 2018-08-07 2019-08-06 Video processing method and apparatus, electronic device and storage medium

Country Status (7)

Country Link
US (1) US11120078B2 (zh)
JP (1) JP6916970B2 (zh)
KR (1) KR102222300B1 (zh)
CN (1) CN109089133B (zh)
MY (1) MY187857A (zh)
SG (1) SG11202008134YA (zh)
WO (1) WO2020029966A1 (zh)

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110674331A (zh) * 2018-06-15 2020-01-10 Huawei Technologies Co., Ltd. Information processing method, related device, and computer storage medium
CN110163050B (zh) * 2018-07-23 2022-09-27 Tencent Technology (Shenzhen) Co., Ltd. Video processing method and apparatus, terminal device, server, and storage medium
CN109089133B (zh) 2018-08-07 2020-08-11 Beijing SenseTime Technology Development Co., Ltd. Video processing method and apparatus, electronic device and storage medium
US11621081B1 (en) * 2018-11-13 2023-04-04 Iqvia Inc. System for predicting patient health conditions
CN111435432B (zh) * 2019-01-15 2023-05-26 Beijing SenseTime Technology Development Co., Ltd. Network optimization method and apparatus, image processing method and apparatus, and storage medium
CN110213668A (zh) * 2019-04-29 2019-09-06 Beijing Sankuai Online Technology Co., Ltd. Video title generation method and apparatus, electronic device, and storage medium
CN110188829B (zh) * 2019-05-31 2022-01-28 Beijing SenseTime Technology Development Co., Ltd. Neural network training method, target recognition method, and related products
CN113094550B (zh) * 2020-01-08 2023-10-24 Baidu Online Network Technology (Beijing) Co., Ltd. Video retrieval method, apparatus, device, and medium
CN111209439B (zh) * 2020-01-10 2023-11-21 Beijing Baidu Netcom Science and Technology Co., Ltd. Video clip retrieval method and apparatus, electronic device, and storage medium
CN113641782A (zh) * 2020-04-27 2021-11-12 Beijing Paoding Technology Co., Ltd. Information retrieval method, apparatus, device, and medium based on a retrieval sentence
CN111918146B (zh) * 2020-07-28 2021-06-01 Guangzhou Kuaizi Information Technology Co., Ltd. Video synthesis method and system
CN112181982B (zh) * 2020-09-23 2021-10-12 Kuangke Technology (Beijing) Co., Ltd. Data selection method, electronic device, and medium
CN112738557A (zh) * 2020-12-22 2021-04-30 Shanghai Bilibili Technology Co., Ltd. Video processing method and apparatus
CN113032624B (zh) * 2021-04-21 2023-07-25 Beijing QIYI Century Science & Technology Co., Ltd. Method, apparatus, electronic device, and medium for determining video viewing interest
CN113254714B (zh) * 2021-06-21 2021-11-05 Ping An Technology (Shenzhen) Co., Ltd. Video feedback method, apparatus, device, and medium based on query analysis
CN113590881B (zh) * 2021-08-09 2024-03-19 Beijing Dajia Internet Information Technology Co., Ltd. Video clip retrieval method, and training method and apparatus for a video clip retrieval model
CN113792183B (zh) * 2021-09-17 2023-09-08 MIGU Digital Media Co., Ltd. Text generation method, apparatus, and computing device
WO2024015322A1 (en) * 2022-07-12 2024-01-18 Loop Now Technologies, Inc. Search using generative model synthesized images

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120131060A1 (en) * 2010-11-24 2012-05-24 Robert Heidasch Systems and methods performing semantic analysis to facilitate audio information searches
CN102750366A (zh) * 2012-06-18 2012-10-24 Hisense Group Co., Ltd. Video search system and method based on natural interactive input, and video search server
CN103593363A (zh) * 2012-08-15 2014-02-19 Institute of Acoustics, Chinese Academy of Sciences Method for building a video content index structure, and video retrieval method and apparatus
CN104798068A (zh) * 2012-11-30 2015-07-22 Thomson Licensing Video retrieval method and apparatus
CN106156204A (zh) * 2015-04-23 2016-11-23 Shenzhen Tencent Computer Systems Co., Ltd. Text tag extraction method and apparatus
CN109089133A (zh) * 2018-08-07 2018-12-25 Beijing SenseTime Technology Development Co., Ltd. Video processing method and apparatus, electronic device and storage medium

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110047163A1 (en) * 2009-08-24 2011-02-24 Google Inc. Relevance-Based Image Selection
CN101894170B (zh) * 2010-08-13 2011-12-28 Wuhan University Cross-modal information retrieval method based on a semantic association network
CN104239501B (zh) * 2014-09-10 2017-04-12 The 28th Research Institute of China Electronics Technology Group Corporation Spark-based massive video semantic annotation method
US9807473B2 (en) 2015-11-20 2017-10-31 Microsoft Technology Licensing, Llc Jointly modeling embedding and translation to bridge video and language
US11409791B2 (en) 2016-06-10 2022-08-09 Disney Enterprises, Inc. Joint heterogeneous language-vision embeddings for video tagging and search
US10346417B2 (en) * 2016-08-18 2019-07-09 Google Llc Optimizing digital video distribution
CN108304506B (zh) * 2018-01-18 2022-08-26 Tencent Technology (Shenzhen) Co., Ltd. Retrieval method, apparatus, and device
US11295783B2 (en) * 2018-04-05 2022-04-05 Tvu Networks Corporation Methods, apparatus, and systems for AI-assisted or automatic video production


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LIU, YAO ET AL.: "Research on Key Technologies and Application of Intelligent Search Engine", LIBRARY AND INFORMATION SERVICE, vol. 59, no. 5, 31 March 2015 (2015-03-31) *

Also Published As

Publication number Publication date
CN109089133A (zh) 2018-12-25
US20200394216A1 (en) 2020-12-17
CN109089133B (zh) 2020-08-11
KR102222300B1 (ko) 2021-03-03
MY187857A (en) 2021-10-26
JP6916970B2 (ja) 2021-08-11
JP2021519474A (ja) 2021-08-10
US11120078B2 (en) 2021-09-14
SG11202008134YA (en) 2020-09-29
KR20200128165A (ko) 2020-11-11

Similar Documents

Publication Publication Date Title
WO2020029966A1 (zh) Video processing method and apparatus, electronic device and storage medium
CN107102746B (zh) Candidate word generation method and apparatus, and apparatus for candidate word generation
WO2021031645A1 (zh) Image processing method and apparatus, electronic device, and storage medium
WO2017088245A1 (zh) Reference document recommendation method and apparatus
WO2021036382A1 (zh) Image processing method and apparatus, electronic device, and storage medium
WO2021208666A1 (zh) Character recognition method and apparatus, electronic device, and storage medium
CN111259967B (zh) Image classification and neural network training method, apparatus, device, and storage medium
CN110764627B (zh) Input method, apparatus, and electronic device
WO2021082463A1 (zh) Data processing method and apparatus, electronic device, and storage medium
CN111242303A (zh) Network training method and apparatus, and image processing method and apparatus
WO2023078414A1 (zh) Related article search method and apparatus, electronic device, and storage medium
CN112784142A (zh) Information recommendation method and apparatus
CN111046210A (zh) Information recommendation method and apparatus, and electronic device
CN112307281A (zh) Entity recommendation method and apparatus
CN111241844A (zh) Information recommendation method and apparatus
WO2023092975A1 (zh) Image processing method and apparatus, electronic device, storage medium, and computer program product
CN114302231B (zh) Video processing method and apparatus, electronic device, and storage medium
CN109144286B (zh) Input method and apparatus
CN112987941B (zh) Method and apparatus for generating candidate words
WO2021082461A1 (zh) Storage and reading method and apparatus, electronic device, and storage medium
CN111368161B (zh) Search intention recognition method, and intention recognition model training method and apparatus
CN108241438B (zh) Input method and apparatus, and apparatus for input
CN110929122A (zh) Data processing method and apparatus, and apparatus for data processing
WO2019047616A1 (zh) Multimedia content recommendation method and apparatus
CN111381685B (zh) Sentence association method and apparatus

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19848672

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 20207030575

Country of ref document: KR

Kind code of ref document: A

ENP Entry into the national phase

Ref document number: 2020573569

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19848672

Country of ref document: EP

Kind code of ref document: A1