WO2020029966A1 - Video processing method and apparatus, electronic device, and storage medium
- Publication number
- WO2020029966A1 (PCT/CN2019/099486)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- video
- feature information
- preselected
- feature
- target
- Prior art date
Classifications
- H04N21/232 — Content retrieval operation locally within server, e.g. reading video streams from disk arrays
- G06F16/7844 — Retrieval characterised by using metadata automatically derived from the content, using original textual content or text extracted from visual content or transcript of audio data
- G06F16/73 — Information retrieval of video data; Querying
- G06F16/743 — Browsing or visualisation of a collection of video files or sequences
- G06F18/22 — Pattern recognition; Matching criteria, e.g. proximity measures
- G06F40/20 — Natural language analysis
- G06F40/30 — Semantic analysis
- G06V20/46 — Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
- H04N21/23418 — Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics
- H04N21/251 — Learning process for intelligent management, e.g. learning user preferences for recommending movies
- H04N21/432 — Content retrieval operation from a local storage medium, e.g. hard-disk
- H04N21/44008 — Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
- H04N21/4668 — Learning process for intelligent management for recommending content, e.g. movies
- H04N21/4826 — End-user interface for program selection using recommendation lists, e.g. of programs or channels sorted according to their score
- H04N21/4828 — End-user interface for program selection for searching program descriptors
Definitions
- querying or retrieving videos in a video library through a sentence usually requires defining content tags for the videos in the video library in advance, and then retrieving the videos through the tags.
- content tags are not extensible, and it is difficult to retrieve video content that the tags do not cover.
- the content tags of different videos may be duplicated, which may result in redundant search results, and tags cannot handle retrieval of content expressed in natural language.
- a video processing method including: determining, according to paragraph information of a query text paragraph and video information of multiple videos in a video library, a preselected video among the multiple videos that is associated with the query text paragraph; and determining a target video in the preselected video according to the video frame information of the preselected video and the sentence information of the query text paragraph.
- a video processing device including: a preselected video determining module, configured to determine, according to the paragraph information of the query text paragraph and the video information of multiple videos in the video library, a preselected video among the multiple videos that is associated with the query text paragraph;
- a target video determining module, configured to determine a target video in the preselected video according to video frame information of the preselected video and sentence information of the query text paragraph.
- FIG. 1 shows a flowchart of a video processing method according to an embodiment of the present disclosure
- FIG. 2 shows a flowchart of a video processing method according to an embodiment of the present disclosure
- FIG. 3 shows a flowchart of a video processing method according to an embodiment of the present disclosure
- FIG. 4 illustrates a flowchart of a video processing method according to an embodiment of the present disclosure
- FIG. 5 shows an application diagram of a video processing method according to an embodiment of the present disclosure
- FIG. 6 illustrates a block diagram of a video processing apparatus according to an embodiment of the present disclosure
- FIG. 7 illustrates a block diagram of an electronic device according to an embodiment of the present disclosure
- FIG. 8 illustrates a block diagram of an electronic device according to an embodiment of the present disclosure.
- exemplary means “serving as an example, embodiment, or illustration.” Any embodiment described herein as “exemplary” is not necessarily to be construed as superior to or better than other embodiments.
- FIG. 1 illustrates a flowchart of a video processing method according to an embodiment of the present disclosure. As shown in Figure 1, the video processing method includes:
- step S11: determining, according to the paragraph information of the query text paragraph and the video information of the multiple videos in the video library, a preselected video associated with the query text paragraph from the multiple videos;
- step S12: determining the target video in the preselected video according to the video frame information of the preselected video and the sentence information of the query text paragraph.
- in this way, the preselected video is determined from the paragraph information of the query text paragraph and the video information of the videos
- the target video is then determined from the sentence information of the query text paragraph and the video frame information of the preselected video. Retrieving videos by their relevance to the query text paragraph can accurately find the target video, avoids redundant query results, and can process query text paragraphs in natural language form without being limited by the fixed content of content tags.
- the video processing method may be executed by a terminal device, a server, or another processing device, where the terminal device may be a user equipment (UE), a mobile device, a user terminal, a terminal, a cellular phone, a cordless phone, a personal digital assistant (PDA), a handheld device, a computing device, an in-vehicle device, a wearable device, etc.
- the video processing method may be implemented by a processor invoking computer-readable instructions stored in a memory.
- query text paragraphs include one or more sentences.
- the sentence information includes first feature information of one or more sentences of the query text paragraph
- the paragraph information includes second feature information of the query text paragraph
- the video frame information includes fourth feature information of multiple video frames of the video, and the video information includes third feature information of the video.
- first feature information of one or more sentences in a query text paragraph may be acquired, and second feature information of the query text paragraph is determined.
- the first feature information of the sentence may be a feature vector representing the semantics of the sentence.
- the method further includes: performing feature extraction processing on one or more sentences of the query text paragraph to obtain the first feature information of the one or more sentences.
- the second feature information of the query text paragraph is determined according to the first feature information of one or more sentences in the query text paragraph.
- feature extraction may be performed on the content of one or more sentences through methods such as semantic recognition to obtain first feature information of the one or more sentences.
- the content of one or more sentences may be semantically identified through a neural network to perform feature extraction on the content of one or more sentences, thereby obtaining first feature information of the one or more sentences.
- the present disclosure does not limit the method of feature extraction of the content of one or more sentences.
- the first feature information may be a feature vector representing the semantics of the sentence
- the first feature information of one or more sentences in the query text paragraph may be fused to obtain the second feature information of the query text paragraph; the second feature information may be a feature vector representing the semantics of the query text paragraph.
- the first feature information is a feature vector representing the semantics of the sentence
- the first feature information of one or more sentences may be summed, averaged, or otherwise processed to obtain the second feature information of the query text paragraph
- the query text paragraph includes M sentences
- the first feature information of the M sentences is s_1, s_2, ..., s_M
- s_1, s_2, ..., s_M can be summed, averaged, or otherwise processed and fused into the second feature information P of the query text paragraph
- the second feature information P is a feature vector with the same dimension as s_1, s_2, ..., s_M.
- the disclosure does not limit the method for obtaining the second feature information of the query text paragraph.
- the second feature information of the query text paragraph can be obtained by extracting the first feature information of each sentence in the query text paragraph, and the semantics of the query text paragraph can be accurately characterized by the second feature information.
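- As an illustration of this fusion step, the following minimal sketch (hypothetical helper names; it assumes the per-sentence features s_1..s_M have already been extracted by some sentence encoder, which the disclosure leaves open) fuses sentence vectors into a paragraph vector by averaging, one of the options named above:

```python
import numpy as np

def paragraph_feature(sentence_features: np.ndarray) -> np.ndarray:
    """Fuse per-sentence feature vectors s_1..s_M into a single paragraph
    vector P by averaging (summing is the other option the text names).

    sentence_features: array of shape (M, D), one row per sentence.
    Returns an array of shape (D,), the same dimension as each s_i.
    """
    return sentence_features.mean(axis=0)

# Example: a 3-sentence paragraph with 6-dimensional sentence features.
s = np.random.rand(3, 6)
P = paragraph_feature(s)  # P has shape (6,)
```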
- the fourth feature information of each video frame of the video may be obtained, and the third feature information of the video may be obtained according to the fourth feature information.
- the method further includes: performing feature extraction processing separately on multiple video frames of the second video to obtain fourth feature information of the multiple video frames of the second video, where the second video is any one of the multiple videos; and determining the third feature information of the second video according to the fourth feature information of the multiple video frames of the second video.
- feature extraction processing may be performed on multiple video frames of the second video separately to obtain fourth feature information of the multiple video frames of the second video.
- feature extraction processing may be performed for each video frame in the second video, or one video frame may be selected for feature extraction processing every certain number of frames.
- for example, one video frame may be selected every 5 video frames for feature extraction processing (that is, the feature information of one selected frame among every 6 video frames is determined as the fourth feature information); alternatively, the feature information of the 6 video frames may be fused (for example, summed, averaged, or otherwise processed, merging the feature information of the 6 frames into one, with the fused result determined as the fourth feature information); or the feature information of each video frame of the second video may be extracted separately as the fourth feature information.
- the fourth feature information may be a feature vector that characterizes the feature information in the video frame.
- the fourth feature information may characterize the feature information such as a person, clothing color, action, and scene in the video frame.
- a neural network may perform the feature extraction processing on the video frames, and the present disclosure does not limit the method of extracting feature information from the video frames.
- the fourth feature information of multiple video frames of the second video may be fused to obtain the third feature information of the second video.
- the fourth feature information is a feature vector that characterizes the feature information in the video frame; multiple pieces of fourth feature information may be summed, averaged, or otherwise processed to obtain the third feature information of the second video.
- the third feature information may be a feature vector representing the feature information of the second video.
- the fourth feature information f_1, f_2, ..., f_T of T video frames (T is a positive integer) is obtained from the multiple video frames of the second video, and f_1, f_2, ..., f_T are summed, averaged, or otherwise processed and fused into the third feature information V_i of the second video, 1 ≤ i ≤ N, where N is the number of videos in the video library.
- the disclosure does not limit the method of obtaining the third feature information.
- before step S11, feature extraction may be performed on all videos in the video library in advance to obtain the third feature information and fourth feature information of all videos in the video library.
- when a new video is added to the video library, feature extraction may be performed on the new video to obtain the third feature information and fourth feature information of the new video.
- the third feature information of the second video can be obtained by extracting the fourth feature information of the video frame in the second video, and the feature information of the second video can be accurately characterized by the third feature information.
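- A corresponding sketch for the video side (hypothetical names; the per-frame features are assumed to come from some frame encoder, which the disclosure does not fix): keep one feature vector every 6 frames as the fourth feature information and average them into the third feature information:

```python
import numpy as np

def video_features(frame_features: np.ndarray, stride: int = 6):
    """Derive the fourth and third feature information of one video.

    frame_features: (num_frames, D) array of per-frame features.
    Returns (fourth, third): `fourth` keeps one vector every `stride`
    frames (f_1..f_T); `third` is their average, the video-level vector V.
    """
    fourth = frame_features[::stride]
    third = fourth.mean(axis=0)
    return fourth, third

frames = np.random.rand(300, 10)  # e.g. 300 frames with 10-D features
f, V = video_features(frames)     # f: (50, 10), V: (10,)
```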
- step S111: determining, according to the second feature information of the query text paragraph and the third feature information of the plurality of videos in the video library, a preselected video associated with the query text paragraph among the multiple videos.
- determining the preselected video associated with the query text paragraph in the multiple videos according to the second feature information and the third feature information of multiple videos in the video library may include: determining, according to the second feature information and the third feature information of the multiple videos in the video library, a first correlation score between the query text paragraph and each of the multiple videos; and determining a preselected video among the multiple videos according to the first correlation score.
- the second feature information may be a feature vector representing the semantics of the query text paragraph
- the third feature information may be a feature vector representing the feature information of the second video. The dimensions of the second feature information and the third feature information may be different, that is, they may not lie in a vector space of the same dimension. Therefore, the second feature information and the third feature information may be processed so that the processed second feature information and third feature information lie in a vector space of the same dimension.
- the cosine similarity of the second feature vector and the third feature vector can be determined as the first correlation score between the query text paragraph and the first video, which characterizes the correlation between the semantic content of the query text paragraph and the feature information of the first video.
- the third feature information of the first video and the second feature information may be mapped to a vector space of the same dimension.
- for example, the third feature information of the first video is a feature vector V_j, 1 ≤ j ≤ N, and the second feature information of the query text paragraph is a feature vector P; P and V_j have different dimensions and can be mapped to a vector space of the same dimension to obtain the third feature vector V'_j of the first video and the second feature vector P' of the query text paragraph (the primes denote the mapped vectors).
- a neural network may be used to map the third feature information and the second feature information to a vector space of the same dimension.
- mapping the third feature information of the first video and the second feature information to a vector space of the same dimension, and obtaining the third feature vector of the first video and the second feature vector of the query text paragraph, may include: using a first neural network to map the third feature information to the third feature vector, and using a second neural network to map the second feature information to the second feature vector.
- the first neural network and the second neural network may be a back propagation (BP) neural network, a convolutional neural network, or a recurrent neural network, and the like.
- for example, the third feature information V_j has a dimension of 10 and the second feature information P has a dimension of 6.
- a common vector space of dimension 8 may be determined: the first neural network can map the 10-dimensional third feature information V_j into the 8-dimensional vector space to obtain the 8-dimensional third feature vector V'_j, and the second neural network can map the 6-dimensional second feature information P into the same space to obtain the 8-dimensional second feature vector P'. This disclosure does not limit the number of dimensions.
- the cosine similarity of the second feature vector P' and the third feature vector V'_j may be determined, and this cosine similarity is determined as the first correlation score S_t(V, P) between the query text paragraph and the first video.
- the first neural network may be used to map the third feature information V_1, V_2, ..., V_N of each video in the video library, obtaining the third feature vectors V'_1, V'_2, ..., V'_N of all videos in the video library, and the cosine similarity between the second feature vector P' and each video's third feature vector may be determined separately as the first correlation score of the query text paragraph and each video.
- a preselected video among the plurality of videos may be determined according to the first correlation score. For example, videos with a first correlation score higher than a certain score threshold may be selected as preselected videos, or the multiple videos may be sorted according to the first correlation score and a predetermined number of top-ranked videos in the sequence selected as the preselected videos.
- the present disclosure does not limit the selection method and the number of preselected videos.
- the first correlation score between the query text paragraph and the video can be determined through the second feature information and the third feature information, and the preselected video is selected according to the first correlation score, thereby improving the accuracy of the preselected video selection.
- the preselected video can be processed without having to process all the videos in the video library, which saves computing overhead and improves processing efficiency.
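- The preselection step can be sketched as follows (an illustration only: plain linear maps stand in for the first and second neural networks, which the disclosure allows to be BP, convolutional, or recurrent networks; dimensions follow the 10/6/8 example above):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-ins for the learned mappings into the shared 8-D space.
W_video = np.random.rand(8, 10)  # first network: 10-D video feature -> 8-D
W_text = np.random.rand(8, 6)    # second network: 6-D paragraph feature -> 8-D

def first_correlation_scores(V: np.ndarray, P: np.ndarray) -> np.ndarray:
    """S_t for every video: cosine similarity in the shared vector space.

    V: (N, 10) third feature information of the N videos in the library.
    P: (6,) second feature information of the query text paragraph.
    """
    P_mapped = W_text @ P
    return np.array([cosine(W_video @ v, P_mapped) for v in V])

def preselect(V: np.ndarray, P: np.ndarray, top_e: int = 5) -> np.ndarray:
    """Indices of the E videos with the highest first correlation score."""
    return np.argsort(first_correlation_scores(V, P))[::-1][:top_e]
```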
- the first neural network and the second neural network may be trained before the first neural network and the second neural network are used for mapping processing.
- the method further includes: training the first neural network and the second neural network according to the third sample feature information of the sample video and the second sample feature information of the sample text paragraph.
- the videos in the video library may be used as the sample videos, and the videos in other video libraries may also be used as the sample videos.
- the disclosure does not limit the sample videos.
- the fourth sample feature information of the video frame of the sample video may be extracted, and the third sample feature information of the sample video is determined according to the fourth sample feature information.
- the third sample feature information of a plurality of sample videos may be input to a first neural network for mapping to obtain a third sample feature vector.
- the second sample feature information of the sample text paragraph can be input to a second neural network to obtain a second sample feature vector.
- the cosine similarity between the second sample feature vector and each third sample feature vector may be determined separately, and the first comprehensive network loss may be determined according to the cosine similarity.
- the first comprehensive network loss may be determined according to the following formula (1), reconstructed here in a margin-based ranking form consistent with the definitions below (the original formula image is not preserved):

  L_find = Σ_a Σ_{b≠a} max(0, S_t(V_b, P_a) − S_t(V_a, P_a) + α)   (1)
- L_find is the first comprehensive network loss
- S_t(V_b, P_a) is the cosine similarity between the second sample feature vector of the a-th sample text paragraph and the third sample feature vector of the b-th sample video
- V_a is the third sample feature information of the sample video corresponding to the a-th sample text paragraph
- S_t(V_a, P_a) is the cosine similarity between the second sample feature vector of the a-th sample text paragraph and the third sample feature vector of the corresponding sample video
- a and b are both positive integers
- α is a set constant; in the example, α can be set to 0.2.
- the first comprehensive network loss may be used to adjust the network parameter values of the first neural network and the second neural network.
- the network parameter values of the first neural network and the second neural network are adjusted in a direction that minimizes the first comprehensive network loss, so that the adjusted first neural network and second neural network have a higher goodness of fit while avoiding overfitting.
- the disclosure does not limit the method of adjusting the network parameter values of the first neural network and the second neural network.
- the steps of adjusting the network parameter values of the first neural network and the second neural network may be executed cyclically, adjusting the first neural network and the second neural network successively so that the first comprehensive network loss decreases or converges.
- the trained first neural network and second neural network may then be used in the process of mapping the third feature information of the first video and the second feature information of the query text paragraph.
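- A sketch of this training objective, assuming the margin-ranking form reconstructed in formula (1) above (the exact original formula is not preserved in this text):

```python
import numpy as np

def l_find(S: np.ndarray, alpha: float = 0.2) -> float:
    """First comprehensive network loss per the reconstructed formula (1).

    S[a, b] = S_t(V_b, P_a): cosine similarity between the a-th sample
    paragraph's second sample feature vector and the b-th sample video's
    third sample feature vector, so S[a, a] scores the matching pair.
    """
    n = S.shape[0]
    loss = 0.0
    for a in range(n):
        for b in range(n):
            if b != a:
                # push the matching score above each mismatched one by alpha
                loss += max(0.0, S[a, b] - S[a, a] + alpha)
    return loss
```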
- FIG. 3 illustrates a flowchart of a video processing method according to an embodiment of the present disclosure. As shown in FIG. 3, step S12 includes:
- step S121: determining the target video in the preselected video according to the first feature information of one or more sentences of the query text paragraph and the fourth feature information of the multiple video frames in the preselected video.
- the correlation between the query text paragraph and each preselected video may be further determined according to the first feature information of one or more sentences and the fourth feature information of multiple video frames in the preselected video.
- determining the target video in the preselected video according to the first feature information of one or more sentences and the fourth feature information of multiple video frames in the preselected video includes: determining a second correlation score between the query text paragraph and the preselected video according to the first feature information of the one or more sentences and the fourth feature information of the multiple video frames in the preselected video; and determining the target video in the preselected video according to the first correlation score and the second correlation score.
- determining the second correlation score of the query text paragraph and the preselected video may include: mapping the fourth feature information of multiple video frames of the target preselected video and the first feature information of one or more sentences to a vector space of the same dimension, respectively obtaining the fourth feature vectors of the multiple video frames of the target preselected video and the first feature vectors of the one or more sentences, where the target preselected video is any one of the preselected videos; and determining the target feature vectors among the fourth feature vectors whose cosine similarity with the first feature vector of the target sentence is greater than or equal to a similarity threshold.
- the second correlation score of the query text paragraph and the target preselected video can be determined according to the fourth feature vectors of multiple video frames of the target preselected video and the first feature vectors of one or more sentences, which can accurately determine the relevance between the semantic content of the query text paragraph and the target preselected video.
- the fourth feature information of multiple video frames of the target preselected video has different dimensions from the first feature information of one or more sentences; the fourth feature information and the first feature information may be mapped to a vector space of the same dimension.
- the fourth feature information of multiple video frames of the target preselected video may be feature vectors f_1, f_2, ..., f_K (K is the number of video frames of the target preselected video, and K is a positive integer), and the first feature information of one or more sentences may be feature vectors s_1, s_2, ..., s_M (M is the number of sentences in the query text paragraph, and M is a positive integer)
- f_1, f_2, ..., f_K and s_1, s_2, ..., s_M may be mapped to a vector space of the same dimension to obtain the fourth feature vectors f'_1, ..., f'_K and the first feature vectors s'_1, ..., s'_M (the primes denote the mapped vectors).
- a neural network may be used to map the fourth feature information and the first feature information to a vector space of the same dimension.
- mapping the fourth feature information of multiple video frames of the target preselected video and the first feature information of one or more sentences to a vector space of the same dimension, and obtaining the fourth feature vectors of the multiple video frames of the target preselected video and the first feature vectors of the one or more sentences, may include: using a third neural network to map the fourth feature information into the fourth feature vectors, and using a fourth neural network to map the first feature information into the first feature vectors.
- the third neural network and the fourth neural network may be a BP neural network, a convolutional neural network, a recurrent neural network, or the like, and the disclosure does not limit the types of the third neural network and the fourth neural network.
- for example, the fourth feature information f_1, f_2, ..., f_K has a dimension of 10, and the first feature information s_1, s_2, ..., s_M has a dimension of 6.
- a common vector space of dimension 8 may be determined: the third neural network can map the 10-dimensional fourth feature information into the 8-dimensional vector space to obtain the 8-dimensional fourth feature vectors f'_1, ..., f'_K, and the fourth neural network can map the 6-dimensional first feature information into the same space to obtain the 8-dimensional first feature vectors s'_1, ..., s'_M. This disclosure does not limit the number of dimensions.
- the fourth feature vectors whose cosine similarity with the first feature vector of the target sentence is greater than or equal to a similarity threshold may be determined as target feature vectors.
- any one of the one or more sentences can be selected as the target sentence (for example, the y-th sentence is selected as the target sentence, 1 ≤ y ≤ M). The cosine similarity of each of the multiple fourth feature vectors with the first feature vector s'_y of the target sentence may be determined, and the fourth feature vectors whose cosine similarity with s'_y is greater than or equal to the similarity threshold are determined as the target feature vectors, for example f'_h, f'_u, and f'_q, where 1 ≤ h ≤ K, 1 ≤ u ≤ K, 1 ≤ q ≤ K. The similarity threshold may be a preset threshold, such as 0.5, and the disclosure does not limit the similarity threshold.
- the video frames corresponding to the target feature vectors may be aggregated into a video segment corresponding to the target sentence.
- the fourth feature information may be a feature vector obtained by selecting a video frame for feature extraction processing every 5 video frames (that is, every 6 video frames) in the target preselected video.
- the fourth feature vector is the feature vector obtained by mapping the fourth feature information, and the video frames corresponding to each fourth feature vector may be the video frame used for extracting the fourth feature information together with the five video frames before or after it.
- the video frames corresponding to all target feature vectors can be aggregated to obtain a video segment corresponding to the target sentence.
- the disclosure does not limit the video frames corresponding to the target feature vector.
- a video segment corresponding to the feature vector of each sentence may be determined, and the position in the target preselected video that corresponds to the semantic content of each sentence may be determined according to information such as the timestamps or frame numbers of the video frames included in that video segment.
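- The thresholding-and-aggregation step can be sketched as follows (illustrative names; it assumes, as in the example above, that each kept feature vector stands for 6 consecutive frames):

```python
import numpy as np

def sentence_segments(frame_vecs: np.ndarray, sent_vec: np.ndarray,
                      threshold: float = 0.5, stride: int = 6):
    """Group the frames matching a target sentence into video segments.

    frame_vecs: (K, D) mapped fourth feature vectors f'_1..f'_K.
    sent_vec:   (D,) mapped first feature vector s'_y of the target sentence.
    Returns a list of (start_frame, end_frame) ranges; each kept vector
    stands for `stride` consecutive original frames.
    """
    sims = frame_vecs @ sent_vec / (
        np.linalg.norm(frame_vecs, axis=1) * np.linalg.norm(sent_vec))
    segments = []
    for k in np.where(sims >= threshold)[0]:  # indices of target vectors
        start, end = int(k) * stride, (int(k) + 1) * stride - 1
        if segments and start == segments[-1][1] + 1:
            segments[-1] = (segments[-1][0], end)  # extend an adjacent run
        else:
            segments.append((start, end))
    return segments
```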
- the fifth feature vector of the video segment corresponding to the target sentence is determined according to the target feature vector.
- the target feature vectors can be summed, averaged, or otherwise processed and fused into a fifth feature vector g_y.
- the target sentence may have multiple corresponding video segments: the target feature vectors may fall into several runs of adjacent vectors, and each run of adjacent target feature vectors is fused into its own fifth feature vector, for example into g_y1, g_y2, and g_y3. That is, each sentence may correspond to one or more fifth feature vectors. In an example, each fifth feature vector may correspond to one sentence.
- the fifth feature vectors of the video segments corresponding to the one or more sentences and the first feature vectors of the one or more sentences may be used to determine the second correlation score of the query text paragraph and the target preselected video.
- for example, the first feature vectors of the M sentences are s'_1, s'_2, ..., s'_M (M is a positive integer), and the fifth feature vectors of the plurality of video segments are g_1, g_2, ..., g_W (W is a positive integer)
- the fifth feature vectors corresponding to the first feature vector s'_1 are g_1, g_2, ..., g_O (O is the number of fifth feature vectors corresponding to s'_1, where O is a positive integer less than W)
- the fifth feature vectors corresponding to the first feature vector s'_2 are g_{O+1}, g_{O+2}, ..., g_V (V is a positive integer less than W and greater than O)
- and so on, until the fifth feature vectors corresponding to the last first feature vector s'_M end with g_Z, ..., g_W.
- the second correlation score of the query text paragraph and the target preselected video may be determined according to the following formula (2), reconstructed here as an assignment-maximization form consistent with the definitions below (the original formula image is not preserved):

  S_p(V, P) = max over {x_ij} of Σ_i Σ_j x_ij · r_ij, subject to Σ_j x_ij ≤ u_max and x_ij ∈ {0, 1}   (2)

- x_ij represents whether the i-th sentence corresponds to the j-th video segment
- u_max is the preset number of video segments that a sentence may correspond to, 1 ≤ u_max ≤ W
- r_ij is the cosine similarity between the first feature vector of the i-th sentence and the fifth feature vector of the j-th video segment.
- the third correlation score S_r(V, P) of the query text paragraph and the target preselected video may be determined according to the first correlation score S_t(V, P) and the second correlation score S_p(V, P) of the query text paragraph and the target preselected video, and the third correlation score of the query text paragraph and each preselected video may be determined in the same way.
- the product of the first correlation score and the second correlation score is determined as the third correlation score, that is, S_r(V, P) = S_t(V, P) · S_p(V, P); and the target video is determined in the preselected videos according to the third correlation score.
- the preselected videos can be sorted based on the third correlation score of the query text paragraph and each preselected video, and a predetermined number of videos at the top of the sorted sequence can be selected, or the videos with a third correlation score greater than or equal to a certain threshold can be selected, as the target video.
- the disclosure does not limit the method of selecting the target video.
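- A sketch of the final scoring, assuming the per-sentence reading of formula (2)'s constraint (each sentence may be assigned up to u_max segments; under that reading the maximum is attained by each sentence's u_max best segments):

```python
import numpy as np

def second_correlation(r: np.ndarray, u_max: int = 1) -> float:
    """S_p per the reconstructed formula (2).

    r: (M, W) matrix with r[i, j] = cosine similarity between sentence i's
    first feature vector and segment j's fifth feature vector.
    With the per-sentence constraint sum_j x_ij <= u_max, the optimum
    simply takes each sentence's u_max highest-scoring segments.
    """
    return float(np.sort(r, axis=1)[:, -u_max:].sum())

def third_correlation(s_t: float, s_p: float) -> float:
    """S_r = S_t * S_p, the product used to rank the preselected videos."""
    return s_t * s_p
```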
- the third neural network and the fourth neural network may be trained before the third neural network and the fourth neural network are used for mapping processing.
- the method further includes training a third neural network and a fourth neural network according to the fourth sample feature information of the plurality of video frames in the sample video and the first sample feature information of one or more sentences of the sample text paragraph.
- the videos in the video library may be used as the sample videos, and the videos in other video libraries may also be used as the sample videos.
- the disclosure does not limit the sample videos.
- the fourth sample feature information of the video frames of the sample video may be extracted, and any query text paragraph may be input as a sample text paragraph.
- the sample text paragraph may include one or more sentences, and the first sample feature information of these sentences may be extracted.
- the fourth sample feature information of multiple video frames of the sample video may be input to a third neural network to obtain a fourth sample feature vector.
- the first sample feature information of one or more sentences of the sample text paragraph can be input into the fourth neural network to obtain the first sample feature vector.
- a target sample feature vector whose cosine similarity with the first target sample feature vector is greater than or equal to a similarity threshold may be determined, where the first target sample feature vector is any one of the first sample feature vectors. Further, the target sample feature vectors may be fused into a fifth sample feature vector corresponding to the first target sample feature vector. In an example, a fifth sample feature vector corresponding to each first sample feature vector may be determined separately.
- the cosine similarity between each fifth sample feature vector and the first sample feature vector can be determined separately, and the second comprehensive network loss is determined according to the cosine similarity.
- the second comprehensive network loss may be determined according to the following formula (3), reconstructed here in a margin-based ranking form consistent with the definitions below (the original formula image is not preserved):

  L_ref = Σ_{g_d ≠ g_+} max(0, S(g_d, s) − S(g_+, s) + β)   (3)

- L_ref is the second comprehensive network loss
- g_d is the d-th fifth sample feature vector
- g_+ is the fifth sample feature vector corresponding to the first target sample feature vector
- s is the first target sample feature vector and S(·, ·) denotes cosine similarity
- β is a set constant; in the example, β can be set to 0.1.
- the second comprehensive network loss may be used to adjust the network parameter values of the third neural network and the fourth neural network.
- the network parameter values of the third neural network and the fourth neural network are adjusted in a direction that minimizes the second comprehensive network loss, so that the adjusted third neural network and fourth neural network have a higher goodness of fit while avoiding overfitting.
- the disclosure does not limit the method of adjusting the network parameter values of the third neural network and the fourth neural network.
- the steps of adjusting the network parameter values of the third neural network and the fourth neural network may be executed cyclically, adjusting the third neural network and the fourth neural network successively so that the second comprehensive network loss decreases or converges.
- the trained third neural network and fourth neural network may then be used in the process of mapping the fourth feature information of multiple video frames of the target preselected video and the first feature information of one or more sentences.
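- A sketch of this second training objective, again assuming the margin-ranking form reconstructed in formula (3) above:

```python
import numpy as np

def l_ref(seg_vecs: np.ndarray, sent_vec: np.ndarray, pos: int,
          beta: float = 0.1) -> float:
    """Second comprehensive network loss per the reconstructed formula (3).

    seg_vecs: (W, D) fifth sample feature vectors g_1..g_W.
    sent_vec: (D,) first target sample feature vector s.
    pos: index of g_+, the segment vector matching the target sentence.
    """
    sims = seg_vecs @ sent_vec / (
        np.linalg.norm(seg_vecs, axis=1) * np.linalg.norm(sent_vec))
    loss = 0.0
    for d in range(len(seg_vecs)):
        if d != pos:
            # push the matching segment's score above each other's by beta
            loss += max(0.0, sims[d] - sims[pos] + beta)
    return loss
```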
- FIG. 4 illustrates a flowchart of a video processing method according to an embodiment of the present disclosure.
- as shown in FIG. 4, the preselected video may be determined in step S111 according to the second feature information of the query text paragraph and the third feature information of the videos, and the target video may then be determined from the preselected videos according to the first feature information of one or more sentences of the query text paragraph and the fourth feature information.
- FIG. 5 shows an application diagram of a video processing method according to an embodiment of the present disclosure.
- the video library may include N videos, and fourth feature information of multiple video frames of each video may be obtained separately, and third feature information of each video may be obtained according to the fourth feature information.
- a query text paragraph may be input, and the query text paragraph may include one or more sentences.
- the first feature information of each sentence may be extracted, and the second feature information of the query text paragraph may be determined according to the first feature information.
- the dimensions of the third feature information and the second feature information may be different.
- the third feature information may be mapped to a third feature vector through the first neural network, and the second feature information may be mapped to a second feature vector through the second neural network.
- the third feature vector and the second feature vector are in a vector space of the same dimension.
- the cosine similarity between the second feature vector of the query text paragraph and the third feature vector of each video may be determined, and the cosine similarity is determined as the first correlation score of the query text paragraph with each video.
- the videos in the video library can be sorted according to the first correlation score, as in the video library on the left in FIG. 5: the video sequence obtained by sorting the videos in the video library according to the first correlation score is video 1, video 2, video 3, ..., video N, and the first E (1 ≤ E ≤ N) videos are selected from the video sequence as the preselected videos.
- the third neural network may be used to map the fourth feature information of the preselected video to fourth feature vectors, and the fourth neural network may be used to map the first feature information of the one or more sentences of the query text paragraph to first feature vectors.
- the fourth feature vector is in a vector space of the same dimension as the first feature vector.
- a fourth feature vector whose cosine similarity with the first feature vector of the target sentence is greater than or equal to the similarity threshold may be determined as a target feature vector; the video frames of the target preselected video corresponding to the target feature vectors may be aggregated into video segments; the target feature vectors may be fused into fifth feature vectors; and the second correlation score of the query text paragraph and the target preselected video may be determined by formula (2). Further, a second correlation score of the query text paragraph and each preselected video may be determined.
- the first correlation score of the query text paragraph and each preselected video may be multiplied by the corresponding second correlation score to obtain the third correlation score between the query text paragraph and each preselected video, and the E preselected videos may be sorted according to the third correlation score, as in the video library on the right in FIG. 5: the video sequence obtained by sorting the E preselected videos according to the third correlation score is video 3, video 5, video 8, and so on. After this sorting, video 3 is the video with the highest third correlation score, that is, the video most relevant to the semantic content of the query text paragraph, followed by video 5, video 8, and so on. Video 3 may be selected as the target video, or the first X (X ≤ E) videos may be selected as the target videos.
- in this way, the cosine similarity between the second feature vector of the query text paragraph and the third feature vector of the video is determined as the first correlation score between the query text paragraph and the video, which can accurately determine the correlation between the semantic content of the query text paragraph and the feature information of the video, so as to accurately select the preselected videos.
- in subsequent processing, only the preselected videos need to be processed rather than all the videos in the video library, which saves computing overhead and improves processing efficiency.
- the second correlation score of the query text paragraph and the target preselected video may be determined according to the fourth feature vectors of multiple video frames of the target preselected video and the first feature vectors of one or more sentences, and the target video determined according to the second correlation score and the first correlation score. In this way, videos can be retrieved based on their correlation with the query text paragraph, the target video can be accurately found, redundant query results are avoided, and query text paragraphs in natural language form can be processed without being limited by the fixed content of content tags.
- the present disclosure also provides a video processing device, an electronic device, a computer-readable storage medium, and a program, all of which can be used to implement any one of the video processing methods provided by the present disclosure.
- FIG. 6 illustrates a block diagram of a video processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 6, the device includes:
- the preselected video determining module 11 is configured to determine a preselected video associated with the query text paragraph among the multiple videos according to the paragraph information of the query text paragraph and the video information of multiple videos in the video library;
- the target video determining module 12 is configured to determine the target video in the preselected video according to the video frame information of the preselected video and the sentence information of the query text paragraph.
- the sentence information includes first feature information of one or more sentences of the query text paragraph, the paragraph information includes second feature information of the query text paragraph, the video frame information includes fourth feature information of a plurality of video frames of the video, and the video information includes third feature information of the video.
- the preselected video determining module is further configured to:
- determine, according to the second feature information of the query text paragraph and the third feature information of the multiple videos in the video library, a preselected video associated with the query text paragraph among the multiple videos.
- the apparatus further includes:
- Sentence feature extraction module configured to perform feature extraction processing on one or more sentences of a query text paragraph to obtain first feature information of one or more sentences;
- the second determining module is configured to determine the second feature information of the query text paragraph according to the first feature information of one or more sentences in the query text paragraph.
- the apparatus further includes:
- a video feature extraction module configured to separately perform feature extraction processing on multiple video frames of the second video to obtain fourth feature information of the multiple video frames of the second video, where the second video is any one of the multiple videos;
- the first determining module is configured to determine third feature information of the second video according to fourth feature information of multiple video frames of the second video.
- the preselected video determining module is further configured to:
- determine a first correlation score between the query text paragraph and the multiple videos according to the second feature information and the third feature information of the multiple videos in the video library, and determine a preselected video among the plurality of videos according to the first correlation score.
- the preselected video determination module is further configured to:
- the cosine similarity of the second feature vector and the third feature vector is determined as a first correlation score between the query text paragraph and the first video.
- the target video determination module is further configured to:
- map the fourth feature information of multiple video frames of the target preselected video and the first feature information of one or more sentences to a vector space of the same dimension, obtaining the fourth feature vectors of multiple video frames of the target preselected video and the first feature vectors of the one or more sentences, wherein the target preselected video is any one of the preselected videos;
- the second correlation score of the query text paragraph and the target preselected video is determined.
- the target video determining module is further configured to:
- determine a third correlation score according to the first correlation score and the second correlation score, and determine a target video among the preselected videos according to the third correlation score.
- the functions of the apparatus provided in the embodiments of the present disclosure, or the modules it includes, may be used to execute the method described in the foregoing method embodiments.
- An embodiment of the present disclosure also provides a computer-readable storage medium having computer program instructions stored thereon.
- the computer program instructions implement the above method when executed by a processor.
- the computer-readable storage medium may be a non-volatile computer-readable storage medium.
- An embodiment of the present disclosure further provides an electronic device including: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to execute the foregoing method.
- the electronic device may be provided as a terminal, a server, or other forms of devices.
- FIG. 7 is a block diagram of an electronic device 800 according to an exemplary embodiment.
- the electronic device 800 may be a terminal such as a mobile phone, a computer, a digital broadcasting terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, and the like.
- the electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
- the processing component 802 generally controls overall operations of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations.
- the processing component 802 may include one or more processors 820 to execute instructions to complete all or part of the steps of the method described above.
- the processing component 802 may include one or more modules to facilitate the interaction between the processing component 802 and other components.
- the processing component 802 may include a multimedia module to facilitate the interaction between the multimedia component 808 and the processing component 802.
- the memory 804 is configured to store various types of data to support operation at the electronic device 800. Examples of such data include instructions for any application or method for operating on the electronic device 800, contact data, phone book data, messages, pictures, videos, and the like.
- the memory 804 may be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk.
- the power component 806 provides power to various components of the electronic device 800.
- the power component 806 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the electronic device 800.
- the multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and a user.
- the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user.
- the touch panel includes one or more touch sensors to sense touch, swipe, and gestures on the touch panel. The touch sensor can not only sense the boundary of a touch or slide action, but also detect duration and pressure related to the touch or slide operation.
- the multimedia component 808 includes a front camera and/or a rear camera. When the electronic device 800 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each of the front camera and the rear camera may have a fixed optical lens system or have focal length and optical zoom capability.
- the audio component 810 is configured to output and/or input audio signals.
- the audio component 810 includes a microphone (MIC).
- the microphone is configured to receive an external audio signal.
- the received audio signal may be further stored in the memory 804 or transmitted via the communication component 816.
- the audio component 810 further includes a speaker for outputting audio signals.
- the I / O interface 812 provides an interface between the processing component 802 and a peripheral interface module.
- the peripheral interface module may be a keyboard, a click wheel, a button, or the like. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
- the sensor component 814 includes one or more sensors for providing various aspects of the state evaluation of the electronic device 800.
- the sensor component 814 can detect the on / off state of the electronic device 800, and the relative positioning of the components, such as the display and keypad of the electronic device 800.
- the sensor component 814 can also detect a change in position of the electronic device 800 or a component of the electronic device 800, the presence or absence of user contact with the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and a change in temperature of the electronic device 800.
- the sensor component 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact.
- the sensor component 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications.
- the sensor component 814 may further include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
- the communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices.
- the electronic device 800 can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof.
- the communication component 816 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel.
- the communication component 816 also includes a near field communication (NFC) module to facilitate short-range communication.
- the NFC module can be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
- the electronic device 800 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, to perform the above method.
- a non-volatile computer-readable storage medium is also provided, such as the memory 804 including computer program instructions, which may be executed by the processor 820 of the electronic device 800 to complete the above method.
- Fig. 8 is a block diagram of an electronic device 1900 according to an exemplary embodiment.
- the electronic device 1900 may be provided as a server.
- the electronic device 1900 includes a processing component 1922, which further includes one or more processors, and a memory resource represented by a memory 1932, for storing instructions executable by the processing component 1922, such as an application program.
- the application program stored in the memory 1932 may include one or more modules each corresponding to a set of instructions.
- the processing component 1922 is configured to execute instructions to perform the method described above.
- the electronic device 1900 may further include a power supply component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958.
- the electronic device 1900 can operate based on an operating system stored in the memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
- a non-volatile computer-readable storage medium is also provided, such as a memory 1932 including computer program instructions, which can be executed by the processing component 1922 of the electronic device 1900 to complete the above method.
- the present disclosure may be a system, a method, and/or a computer program product.
- the computer program product may include a computer-readable storage medium having computer-readable program instructions for causing a processor to implement various aspects of the present disclosure.
- the computer-readable storage medium may be a tangible device that can hold and store instructions used by the instruction execution device.
- the computer-readable storage medium may be, for example, but not limited to, an electric storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
- A non-exhaustive list of computer-readable storage media includes: a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital versatile disc (DVD), memory stick, floppy disk, a mechanical encoding device such as a punch card or a raised structure in a groove having instructions stored thereon, and any suitable combination of the foregoing.
- Computer-readable storage media used herein are not to be interpreted as transient signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (for example, light pulses through fiber optic cables), or electrical signals transmitted via electrical wires.
- the computer-readable program instructions described herein can be downloaded from a computer-readable storage medium to various computing/processing devices, or downloaded to an external computer or external storage device via a network, such as the Internet, a local area network, a wide area network, and/or a wireless network.
- the network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers.
- the network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in each computing/processing device.
- Computer program instructions for performing the operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-related instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages.
- The programming languages include object-oriented programming languages, such as Smalltalk, C++, and the like, as well as conventional procedural programming languages, such as the "C" language or similar programming languages.
- Computer-readable program instructions may be executed entirely on a user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server.
- In the latter case, the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, through the Internet using an Internet service provider).
- an electronic circuit, such as a programmable logic circuit, a field-programmable gate array (FPGA), or a programmable logic array (PLA), can be personalized by using state information of the computer-readable program instructions, and the electronic circuit can execute the computer-readable program instructions to implement various aspects of the present disclosure.
- These computer-readable program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus, thereby producing a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, produce means for implementing the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
- These computer-readable program instructions may also be stored in a computer-readable storage medium, and these instructions cause a computer, a programmable data processing apparatus, and/or other devices to work in a specific manner, such that the computer-readable medium storing the instructions constitutes an article of manufacture that includes instructions implementing various aspects of the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
- Computer-readable program instructions can also be loaded onto a computer, other programmable data processing apparatus, or other device, so that a series of operational steps are performed on the computer, other programmable apparatus, or other device to produce a computer-implemented process, such that the instructions executed on the computer, other programmable apparatus, or other device implement the functions/actions specified in one or more blocks of the flowchart and/or block diagram.
- each block in the flowchart or block diagram may represent a module, program segment, or part of an instruction, which contains one or more executable instructions for implementing the specified logical function.
- the functions marked in the blocks may also occur in an order different from that marked in the drawings. For example, two consecutive blocks may actually be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved.
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Theoretical Computer Science (AREA)
- Signal Processing (AREA)
- Databases & Information Systems (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Library & Information Science (AREA)
- Computational Linguistics (AREA)
- Human Computer Interaction (AREA)
- Artificial Intelligence (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Biology (AREA)
- Software Systems (AREA)
- Computing Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
Abstract
Description
Claims (22)
- A video processing method, comprising: determining, according to paragraph information of a query text paragraph and video information of a plurality of videos in a video library, a preselected video associated with the query text paragraph among the plurality of videos; and determining a target video in the preselected video according to video frame information of the preselected video and sentence information of the query text paragraph.
- The method according to claim 1, wherein the paragraph information comprises second feature information of the query text paragraph, and the video information comprises third feature information of a video; and the determining, according to paragraph information of a query text paragraph and video information of a plurality of videos in a video library, a preselected video associated with the query text paragraph among the plurality of videos comprises: determining, according to the second feature information and third feature information of the plurality of videos in the video library, the preselected video associated with the query text paragraph among the plurality of videos.
- The method according to claim 2, wherein the determining, according to the second feature information and third feature information of the plurality of videos in the video library, the preselected video associated with the query text paragraph among the plurality of videos comprises: determining first correlation scores between the query text paragraph and the plurality of videos, respectively, according to the second feature information and the third feature information of the plurality of videos in the video library; and determining the preselected video among the plurality of videos according to the first correlation scores.
- The method according to claim 3, wherein the determining first correlation scores between the query text paragraph and the plurality of videos, respectively, according to the second feature information and the third feature information of the plurality of videos in the video library comprises: mapping the third feature information of a first video and the second feature information to a vector space of the same dimension to obtain a third feature vector of the first video and a second feature vector of the query text paragraph, wherein the first video is any one of the plurality of videos; and determining the cosine similarity between the second feature vector and the third feature vector as the first correlation score between the query text paragraph and the first video.
- The method according to any one of claims 1 to 4, wherein the sentence information comprises first feature information of one or more sentences of the query text paragraph, and the video frame information comprises fourth feature information of multiple video frames of the preselected video; and the determining a target video in the preselected video according to video frame information of the preselected video and sentence information of the query text paragraph comprises: determining the target video in the preselected video according to the first feature information of the one or more sentences and the fourth feature information of the multiple video frames of the preselected video.
- The method according to claim 5, wherein the determining the target video in the preselected video according to the first feature information of the one or more sentences and the fourth feature information of the multiple video frames of the preselected video comprises: determining a second correlation score between the query text paragraph and the preselected video according to the first feature information of the one or more sentences and the fourth feature information of the multiple video frames of the preselected video; and determining the target video in the preselected video according to the first correlation score and the second correlation score.
- The method according to claim 6, wherein the determining a second correlation score between the query text paragraph and the preselected video according to the first feature information of the one or more sentences and the fourth feature information of the multiple video frames of the preselected video comprises: mapping the fourth feature information of multiple video frames of a target preselected video and the first feature information of the one or more sentences to a vector space of the same dimension to obtain fourth feature vectors of the multiple video frames of the target preselected video and first feature vectors of the one or more sentences, respectively, wherein the target preselected video is any one of the preselected videos; determining, among the fourth feature vectors, target feature vectors whose cosine similarity with the first feature vector of a target sentence is greater than or equal to a similarity threshold, wherein the target sentence is any one of the one or more sentences; aggregating the video frames corresponding to the target feature vectors into a video clip corresponding to the target sentence; determining a fifth feature vector of the video clip corresponding to the target sentence according to the target feature vectors; and determining the second correlation score between the query text paragraph and the target preselected video according to the fifth feature vectors of the video clips respectively corresponding to the one or more sentences and the first feature vectors of the one or more sentences.
- The method according to claim 6, wherein the determining the target video in the preselected video according to the first correlation score and the second correlation score comprises: determining the product of the first correlation score and the second correlation score as a third correlation score; and determining the target video among the preselected videos according to the third correlation score.
- The method according to any one of claims 1-8, further comprising: performing feature extraction processing on multiple video frames of a second video, respectively, to obtain fourth feature information of the multiple video frames of the second video, wherein the second video is any one of the plurality of videos; and determining third feature information of the second video according to the fourth feature information of the multiple video frames of the second video.
- The method according to any one of claims 1-9, further comprising: performing feature extraction processing on one or more sentences of the query text paragraph, respectively, to obtain first feature information of the one or more sentences; and determining second feature information of the query text paragraph according to the first feature information of the one or more sentences in the query text paragraph.
- A video processing apparatus, comprising: a preselected video determination module configured to determine, according to paragraph information of a query text paragraph and video information of a plurality of videos in a video library, a preselected video associated with the query text paragraph among the plurality of videos; and a target video determination module configured to determine a target video in the preselected video according to video frame information of the preselected video and sentence information of the query text paragraph.
- The apparatus according to claim 11, wherein the paragraph information comprises second feature information of the query text paragraph, and the video information comprises third feature information of a video; and the preselected video determination module is further configured to: determine, according to the second feature information and third feature information of the plurality of videos in the video library, the preselected video associated with the query text paragraph among the plurality of videos.
- The apparatus according to claim 12, wherein the preselected video determination module is further configured to: determine first correlation scores between the query text paragraph and the plurality of videos, respectively, according to the second feature information and the third feature information of the plurality of videos in the video library; and determine the preselected video among the plurality of videos according to the first correlation scores.
- The apparatus according to claim 13, wherein the preselected video determination module is further configured to: map the third feature information of a first video and the second feature information to a vector space of the same dimension to obtain a third feature vector of the first video and a second feature vector of the query text paragraph, wherein the first video is any one of the plurality of videos; and determine the cosine similarity between the second feature vector and the third feature vector as the first correlation score between the query text paragraph and the first video.
- The apparatus according to any one of claims 11 to 14, wherein the sentence information comprises first feature information of one or more sentences of the query text paragraph, and the video frame information comprises fourth feature information of multiple video frames of the preselected video; and the target video determination module is further configured to: determine the target video in the preselected video according to the first feature information of the one or more sentences and the fourth feature information of the multiple video frames of the preselected video.
- The apparatus according to claim 15, wherein the target video determination module is further configured to: determine a second correlation score between the query text paragraph and the preselected video according to the first feature information of the one or more sentences and the fourth feature information of the multiple video frames of the preselected video; and determine the target video in the preselected video according to the first correlation score and the second correlation score.
- The apparatus according to claim 16, wherein the target video determination module is further configured to: map the fourth feature information of multiple video frames of a target preselected video and the first feature information of the one or more sentences to a vector space of the same dimension to obtain fourth feature vectors of the multiple video frames of the target preselected video and first feature vectors of the one or more sentences, respectively, wherein the target preselected video is any one of the preselected videos; determine, among the fourth feature vectors, target feature vectors whose cosine similarity with the first feature vector of a target sentence is greater than or equal to a similarity threshold, wherein the target sentence is any one of the one or more sentences; aggregate the video frames corresponding to the target feature vectors into a video clip corresponding to the target sentence; determine a fifth feature vector of the video clip corresponding to the target sentence according to the target feature vectors; and determine the second correlation score between the query text paragraph and the target preselected video according to the fifth feature vectors of the video clips respectively corresponding to the one or more sentences and the first feature vectors of the one or more sentences.
- The apparatus according to claim 16, wherein the target video determination module is further configured to: determine the product of the first correlation score and the second correlation score as a third correlation score; and determine the target video among the preselected videos according to the third correlation score.
- The apparatus according to any one of claims 11-18, further comprising: a video feature extraction module configured to perform feature extraction processing on multiple video frames of a second video, respectively, to obtain fourth feature information of the multiple video frames of the second video, wherein the second video is any one of the plurality of videos; and a first determining module configured to determine third feature information of the second video according to the fourth feature information of the multiple video frames of the second video.
- The apparatus according to any one of claims 11-19, further comprising: a sentence feature extraction module configured to perform feature extraction processing on one or more sentences of the query text paragraph, respectively, to obtain first feature information of the one or more sentences; and a second determining module configured to determine second feature information of the query text paragraph according to the first feature information of the one or more sentences in the query text paragraph.
- An electronic device, comprising: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to execute the method according to any one of claims 1 to 10.
- A computer-readable storage medium having computer program instructions stored thereon, wherein the computer program instructions, when executed by a processor, implement the method according to any one of claims 1 to 10.
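To make the clip aggregation recited in claims 7 and 17 concrete, the following non-authoritative sketch gathers, for each sentence, the frames whose vectors clear a cosine-similarity threshold, forms the clip's fifth feature vector by mean pooling (one plausible reading; the claims do not fix the aggregation), and averages the sentence-clip similarities into the second correlation score. All names, the threshold value, the pooling, and the final averaging are illustrative assumptions.

```python
import numpy as np

def _cos(a: np.ndarray, b: np.ndarray) -> float:
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def second_correlation_score(frame_vecs, sentence_vecs, threshold=0.5):
    """Sketch of the claim-7 procedure: frame_vecs are the fourth feature
    vectors of a target preselected video, sentence_vecs the first feature
    vectors of the query sentences, both already in a shared vector space."""
    clip_scores = []
    for s in sentence_vecs:
        # Target feature vectors: frames similar enough to this sentence.
        targets = [f for f in frame_vecs if _cos(f, s) >= threshold]
        if not targets:
            clip_scores.append(0.0)
            continue
        # Fifth feature vector of the aggregated clip (mean pooling assumed).
        fifth = np.mean(targets, axis=0)
        clip_scores.append(_cos(fifth, s))
    # Second correlation score across sentences (averaging assumed).
    return float(np.mean(clip_scores)) if clip_scores else 0.0
```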
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/975,347 US11120078B2 (en) | 2018-08-07 | 2019-08-06 | Method and device for video processing, electronic device, and storage medium |
JP2020573569A JP6916970B2 (ja) | 2018-08-07 | 2019-08-06 | ビデオ処理方法及び装置、電子機器並びに記憶媒体 |
SG11202008134YA SG11202008134YA (en) | 2018-08-07 | 2019-08-06 | Method and device for video processing, electronic device, and storage medium |
KR1020207030575A KR102222300B1 (ko) | 2018-08-07 | 2019-08-06 | 비디오 처리 방법 및 장치, 전자 기기 및 저장 매체 |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810892997.4 | 2018-08-07 | ||
CN201810892997.4A CN109089133B (zh) | 2018-08-07 | 2018-08-07 | 视频处理方法及装置、电子设备和存储介质 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020029966A1 true WO2020029966A1 (zh) | 2020-02-13 |
Family
ID=64834271
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2019/099486 WO2020029966A1 (zh) | 2018-08-07 | 2019-08-06 | 视频处理方法及装置、电子设备和存储介质 |
Country Status (7)
Country | Link |
---|---|
US (1) | US11120078B2 (zh) |
JP (1) | JP6916970B2 (zh) |
KR (1) | KR102222300B1 (zh) |
CN (1) | CN109089133B (zh) |
MY (1) | MY187857A (zh) |
SG (1) | SG11202008134YA (zh) |
WO (1) | WO2020029966A1 (zh) |
Families Citing this family (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110674331A (zh) * | 2018-06-15 | 2020-01-10 | 华为技术有限公司 | 信息处理方法、相关设备及计算机存储介质 |
CN110163050B (zh) * | 2018-07-23 | 2022-09-27 | 腾讯科技(深圳)有限公司 | 一种视频处理方法及装置、终端设备、服务器及存储介质 |
CN109089133B (zh) | 2018-08-07 | 2020-08-11 | 北京市商汤科技开发有限公司 | 视频处理方法及装置、电子设备和存储介质 |
US11621081B1 (en) * | 2018-11-13 | 2023-04-04 | Iqvia Inc. | System for predicting patient health conditions |
CN111435432B (zh) * | 2019-01-15 | 2023-05-26 | 北京市商汤科技开发有限公司 | 网络优化方法及装置、图像处理方法及装置、存储介质 |
CN110213668A (zh) * | 2019-04-29 | 2019-09-06 | 北京三快在线科技有限公司 | 视频标题的生成方法、装置、电子设备和存储介质 |
CN110188829B (zh) * | 2019-05-31 | 2022-01-28 | 北京市商汤科技开发有限公司 | 神经网络的训练方法、目标识别的方法及相关产品 |
CN113094550B (zh) * | 2020-01-08 | 2023-10-24 | 百度在线网络技术(北京)有限公司 | 视频检索方法、装置、设备和介质 |
CN111209439B (zh) * | 2020-01-10 | 2023-11-21 | 北京百度网讯科技有限公司 | 视频片段检索方法、装置、电子设备及存储介质 |
CN113641782A (zh) * | 2020-04-27 | 2021-11-12 | 北京庖丁科技有限公司 | 基于检索语句的信息检索方法、装置、设备和介质 |
CN111918146B (zh) * | 2020-07-28 | 2021-06-01 | 广州筷子信息科技有限公司 | 一种视频合成方法和系统 |
CN112181982B (zh) * | 2020-09-23 | 2021-10-12 | 况客科技(北京)有限公司 | 数据选取方法、电子设备和介质 |
CN112738557A (zh) * | 2020-12-22 | 2021-04-30 | 上海哔哩哔哩科技有限公司 | 视频处理方法及装置 |
CN113032624B (zh) * | 2021-04-21 | 2023-07-25 | 北京奇艺世纪科技有限公司 | 视频观影兴趣度确定方法、装置、电子设备及介质 |
CN113254714B (zh) * | 2021-06-21 | 2021-11-05 | 平安科技(深圳)有限公司 | 基于query分析的视频反馈方法、装置、设备及介质 |
CN113590881B (zh) * | 2021-08-09 | 2024-03-19 | 北京达佳互联信息技术有限公司 | 视频片段检索方法、视频片段检索模型的训练方法及装置 |
CN113792183B (zh) * | 2021-09-17 | 2023-09-08 | 咪咕数字传媒有限公司 | 一种文本生成方法、装置及计算设备 |
WO2024015322A1 (en) * | 2022-07-12 | 2024-01-18 | Loop Now Technologies, Inc. | Search using generative model synthesized images |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120131060A1 (en) * | 2010-11-24 | 2012-05-24 | Robert Heidasch | Systems and methods performing semantic analysis to facilitate audio information searches |
CN102750366A (zh) * | 2012-06-18 | 2012-10-24 | 海信集团有限公司 | 基于自然交互输入的视频搜索系统及方法和视频搜索服务器 |
CN103593363A (zh) * | 2012-08-15 | 2014-02-19 | 中国科学院声学研究所 | 视频内容索引结构的建立方法、视频检索方法及装置 |
CN104798068A (zh) * | 2012-11-30 | 2015-07-22 | 汤姆逊许可公司 | 视频检索方法和装置 |
CN106156204A (zh) * | 2015-04-23 | 2016-11-23 | 深圳市腾讯计算机系统有限公司 | 文本标签的提取方法和装置 |
CN109089133A (zh) * | 2018-08-07 | 2018-12-25 | 北京市商汤科技开发有限公司 | 视频处理方法及装置、电子设备和存储介质 |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110047163A1 (en) * | 2009-08-24 | 2011-02-24 | Google Inc. | Relevance-Based Image Selection |
CN101894170B (zh) * | 2010-08-13 | 2011-12-28 | 武汉大学 | 基于语义关联网络的跨模信息检索方法 |
CN104239501B (zh) * | 2014-09-10 | 2017-04-12 | 中国电子科技集团公司第二十八研究所 | 一种基于Spark的海量视频语义标注方法 |
US9807473B2 (en) | 2015-11-20 | 2017-10-31 | Microsoft Technology Licensing, Llc | Jointly modeling embedding and translation to bridge video and language |
US11409791B2 (en) | 2016-06-10 | 2022-08-09 | Disney Enterprises, Inc. | Joint heterogeneous language-vision embeddings for video tagging and search |
US10346417B2 (en) * | 2016-08-18 | 2019-07-09 | Google Llc | Optimizing digital video distribution |
CN108304506B (zh) * | 2018-01-18 | 2022-08-26 | 腾讯科技(深圳)有限公司 | 检索方法、装置及设备 |
US11295783B2 (en) * | 2018-04-05 | 2022-04-05 | Tvu Networks Corporation | Methods, apparatus, and systems for AI-assisted or automatic video production |
- 2018
- 2018-08-07 CN CN201810892997.4A patent/CN109089133B/zh active Active
- 2019
- 2019-08-06 US US16/975,347 patent/US11120078B2/en active Active
- 2019-08-06 WO PCT/CN2019/099486 patent/WO2020029966A1/zh active Application Filing
- 2019-08-06 SG SG11202008134YA patent/SG11202008134YA/en unknown
- 2019-08-06 JP JP2020573569A patent/JP6916970B2/ja active Active
- 2019-08-06 KR KR1020207030575A patent/KR102222300B1/ko active IP Right Grant
- 2019-08-06 MY MYPI2020004347A patent/MY187857A/en unknown
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120131060A1 (en) * | 2010-11-24 | 2012-05-24 | Robert Heidasch | Systems and methods performing semantic analysis to facilitate audio information searches |
CN102750366A (zh) * | 2012-06-18 | 2012-10-24 | 海信集团有限公司 | 基于自然交互输入的视频搜索系统及方法和视频搜索服务器 |
CN103593363A (zh) * | 2012-08-15 | 2014-02-19 | 中国科学院声学研究所 | 视频内容索引结构的建立方法、视频检索方法及装置 |
CN104798068A (zh) * | 2012-11-30 | 2015-07-22 | 汤姆逊许可公司 | 视频检索方法和装置 |
CN106156204A (zh) * | 2015-04-23 | 2016-11-23 | 深圳市腾讯计算机系统有限公司 | 文本标签的提取方法和装置 |
CN109089133A (zh) * | 2018-08-07 | 2018-12-25 | 北京市商汤科技开发有限公司 | 视频处理方法及装置、电子设备和存储介质 |
Non-Patent Citations (1)
Title |
---|
LIU, YAO ET AL.: "Research on Key Technologies and Application of Intelligent Search Engine", LIBRARY AND INFORMATION SERVICE, vol. 59, no. 5, 31 March 2015 (2015-03-31) *
Also Published As
Publication number | Publication date |
---|---|
CN109089133A (zh) | 2018-12-25 |
US20200394216A1 (en) | 2020-12-17 |
CN109089133B (zh) | 2020-08-11 |
KR102222300B1 (ko) | 2021-03-03 |
MY187857A (en) | 2021-10-26 |
JP6916970B2 (ja) | 2021-08-11 |
JP2021519474A (ja) | 2021-08-10 |
US11120078B2 (en) | 2021-09-14 |
SG11202008134YA (en) | 2020-09-29 |
KR20200128165A (ko) | 2020-11-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2020029966A1 (zh) | 视频处理方法及装置、电子设备和存储介质 | |
CN107102746B (zh) | 候选词生成方法、装置以及用于候选词生成的装置 | |
WO2021031645A1 (zh) | 图像处理方法及装置、电子设备和存储介质 | |
WO2017088245A1 (zh) | 参考文档的推荐方法及装置 | |
WO2021036382A1 (zh) | 图像处理方法及装置、电子设备和存储介质 | |
WO2021208666A1 (zh) | 字符识别方法及装置、电子设备和存储介质 | |
CN111259967B (zh) | 图像分类及神经网络训练方法、装置、设备及存储介质 | |
CN110764627B (zh) | 一种输入方法、装置和电子设备 | |
WO2021082463A1 (zh) | 数据处理方法及装置、电子设备和存储介质 | |
CN111242303A (zh) | 网络训练方法及装置、图像处理方法及装置 | |
WO2023078414A1 (zh) | 相关文章搜索方法、装置、电子设备和存储介质 | |
CN112784142A (zh) | 一种信息推荐方法及装置 | |
CN111046210A (zh) | 一种信息推荐方法、装置和电子设备 | |
CN112307281A (zh) | 一种实体推荐方法及装置 | |
CN111241844A (zh) | 一种信息推荐方法及装置 | |
WO2023092975A1 (zh) | 图像处理方法及装置、电子设备、存储介质及计算机程序产品 | |
CN114302231B (zh) | 视频处理方法及装置、电子设备和存储介质 | |
CN109144286B (zh) | 一种输入方法及装置 | |
CN112987941B (zh) | 生成候选词的方法及装置 | |
WO2021082461A1 (zh) | 存储和读取方法、装置、电子设备和存储介质 | |
CN111368161B (zh) | 一种搜索意图的识别方法、意图识别模型训练方法和装置 | |
CN108241438B (zh) | 一种输入方法、装置和用于输入的装置 | |
CN110929122A (zh) | 一种数据处理方法、装置和用于数据处理的装置 | |
WO2019047616A1 (zh) | 多媒体内容的推荐方法及装置 | |
CN111381685B (zh) | 一种句联想方法和装置 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 19848672 Country of ref document: EP Kind code of ref document: A1 |
ENP | Entry into the national phase |
Ref document number: 20207030575 Country of ref document: KR Kind code of ref document: A |
ENP | Entry into the national phase |
Ref document number: 2020573569 Country of ref document: JP Kind code of ref document: A |
NENP | Non-entry into the national phase |
Ref country code: DE |
122 | Ep: pct application non-entry in european phase |
Ref document number: 19848672 Country of ref document: EP Kind code of ref document: A1 |