CN106529492A - Video topic classification and description method based on multi-image fusion in view of network query - Google Patents
- Publication number
- CN106529492A CN106529492A CN201611035152.0A CN201611035152A CN106529492A CN 106529492 A CN106529492 A CN 106529492A CN 201611035152 A CN201611035152 A CN 201611035152A CN 106529492 A CN106529492 A CN 106529492A
- Authority
- CN
- China
- Prior art keywords
- video
- text
- event
- classification
- multi-graph
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2323—Non-hierarchical techniques based on graph theory, e.g. minimum spanning trees [MST] or graph cuts
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/44—Event detection
Abstract
The invention belongs to the technical field of video processing. Based on the characteristics of multi-video data, the method detects video events, generates a textual description of each event, and thereby realizes content-based video topic classification and description oriented to network queries. The technical scheme adopted by the invention, a video topic classification and description method based on multi-graph fusion for network queries, comprises the steps of: 1) combining the text information and visual information of the videos, building a multi-graph model, and classifying events with a graph-cut method; and 2) extracting the keywords of each video event with tf-idf or a word-vector technique, and refining those keywords with prior information about the topic gathered from websites such as Wikipedia, so as to produce a textual description of the event. The method of the invention is mainly applied to video classification scenarios.
Description
Technical field
The invention belongs to the technical field of video processing. Addressing the fact that the multimedia field produces massive amounts of video data from which users cannot easily obtain the information they need, the invention provides a method for classifying, by topic, the multiple videos returned for a single query, and, on that basis, extracts the keywords that describe each sub-topic, thereby realizing topic classification and description of videos oriented to network queries.
Background technology
With the rapid development of information technology, video data have multiplied and become one of the main channels through which people obtain information. However, because of the sharp increase in the number of videos, a large amount of redundant and repeated information appears in video data. Faced with a huge number of web videos, users find it extremely difficult to obtain the correct information. When searching for the topic of a related event, most users are interested in the main events of the topic and in how they develop, yet tracking the progress of an event across a huge number of video search results is very difficult. In this situation, a technique is urgently needed that can integrate and analyse the massive video data under the same topic, satisfying people's demand to browse the main information of videos quickly and accurately and improving their ability to obtain information.
Generally, a news topic is composed of a series of related events that occur at specific times and specific places and share a common focus, and an event is described by a number of discriminative, representative words. In the past few decades, in order to improve the efficiency of video data management and allow users to obtain the information they want quickly and accurately, researchers have studied the properties of video data and proposed methods for classifying and describing Internet videos, but the technology is still at an early stage, mainly for the following reasons. 1) Because visual features suffer from the semantic gap, it is difficult to classify events from the visual modality alone, so the text information of the videos must be combined to classify video events; yet the text information uploaded by users is limited and often noisy, ambiguous, incomplete or even misleading, so event classification and description based on it carry a certain error. 2) In addition, tag information describes the whole video rather than a specific scene or shot, and longer videos often cover multiple topics, which brings further difficulty to video classification.
In recent years, with the development of multimedia technology, researchers have proposed some countermeasures to the multi-video topic classification and description problem. Among them, exploring the event structure of Internet videos is a classical approach. That method first analyses the text features of the videos with a co-occurrence model to explore the text patterns of events; it then classifies events by transitive closure and describes them from the text perspective; finally, it detects the main events of the videos by near-duplicate frame detection, and fuses events with similar visual and textual properties to explore and describe them. Although that method improves event exploration to some extent, it explores events from the visual and textual perspectives separately: it neither detects events with a structure spanning multiple modalities nor exploits, during detection, the complementary advantages of the multi-modal information of videos.
The present invention proposes a multi-graph model: the graphs are fused, and video classification is realized with a graph-cut method. Tf-idf is then used to extract the keywords of each event class, and the events are described. This scheme makes full use of the complementary advantages of the multi-modal information of videos, and better realizes multi-graph-fusion-based video topic classification and description for network queries.
Summary of the invention
To overcome the deficiencies of the prior art, the invention aims to propose a content-based video topic classification and description method oriented to network queries. According to the characteristics of multi-video data, the method detects video events, forms a textual description of each event, and thereby realizes content-based video topic classification and description for network queries. The technical scheme adopted by the invention, a video topic classification and description method based on multi-graph fusion for network queries, comprises the steps of: 1) combining the text information and visual information of the videos, building a multi-graph model, and classifying events with a graph-cut method; 2) extracting the keywords of each video event with term frequency/inverse document frequency (tf-idf) or the text deep-representation model word2vec, and refining the keywords with prior information about the topic gathered from websites such as Wikipedia, so as to produce a textual description of the event.
In one example, the concrete steps are as follows.
First, a topic query is given; related content is then searched from related websites such as Wikipedia, and prior information relevant to the topic is obtained.
Given M videos under the same event, let T = {t_1, t_2, ..., t_M} denote the set of text labels of the videos, with t_i the i-th text feature in T, and let V = {v_1, v_2, ..., v_M}, where v_i is the visual-similarity vector between the i-th video and the M videos, v_i(j) is the number of near-duplicate frames between the i-th and j-th videos, and v_i(i) = 0. Two graphs G_1 = (T, E_1) and G_2 = (V, E_2) are built, where T and V are the vertex sets of the two graphs and E_1 and E_2 are the edge sets, representing the relation between any two videos from the text and visual information respectively. In the edge-weight formulas, s_ij is the average number of shots of videos i and j, and v_i(j) is the number of near-duplicate frames between the i-th and j-th videos. The two graphs are fused by a linear fusion technique, in which α is a positive number in (0, 1) that balances the two terms.
Video events are then classified by the graph-cut method. Finally, the keywords of each sub-topic are extracted from the text features of the videos, and the keyword set of each sub-topic is modified and expanded according to the information about the event on Wikipedia.
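The linear fusion of the two affinity graphs can be sketched as follows. This is an illustrative sketch, not the patent's implementation: it assumes the text graph and visual graph have already been expressed as symmetric affinity matrices with zero diagonals (consistent with v_i(i) = 0), and the toy values are invented for the example.

```python
import numpy as np

def fuse_graphs(W_text, W_visual, alpha=0.5):
    """Linear fusion of the text-graph and visual-graph affinities:
    W = alpha * W_text + (1 - alpha) * W_visual, with alpha in (0, 1)."""
    assert 0.0 < alpha < 1.0, "alpha must lie strictly between 0 and 1"
    return alpha * np.asarray(W_text, float) + (1.0 - alpha) * np.asarray(W_visual, float)

# Toy affinities for M = 3 videos; diagonals are zero since v_i(i) = 0.
W_t = np.array([[0.0, 0.8, 0.1],
                [0.8, 0.0, 0.2],
                [0.1, 0.2, 0.0]])
W_v = np.array([[0.0, 0.6, 0.0],
                [0.6, 0.0, 0.4],
                [0.0, 0.4, 0.0]])
W = fuse_graphs(W_t, W_v, alpha=0.5)  # fused affinity, later used by the graph cut
```

With α = 0.5 the fusion is a plain average; α closer to 1 trusts the (often noisy) text labels more, α closer to 0 trusts the near-duplicate-frame evidence more.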
Features and beneficial effects of the invention:
The invention mainly addresses the shortcomings of existing multi-video event classification and description methods, and designs a content-based video topic classification and description method for network queries that suits the structural characteristics of multi-video data and makes full use of the information peculiar to the data. Its advantages are mainly the following.
(1) Novelty: the multi-graph model is applied to video event classification, the multi-modal information of the videos is fully used, and event detection over multi-video sets is better achieved.
(2) Multi-modality: in the video sub-topic detection process, on the one hand the text information of the videos is used to compute inter-video similarity; on the other hand the visual information is used to detect near-duplicate frames between videos, from which inter-video similarity is also computed. The two aspects jointly realize the sub-topic detection of videos.
(3) Effectiveness: experiments confirm that, compared with methods typically applied to video topic classification and description, the multi-graph-model-based classification and description method designed by the invention clearly outperforms them, and is therefore better suited to video topic classification and description for network queries.
(4) Practicality: the method is simple and feasible, and can be used in the field of multimedia signal processing.
Description of the drawings:
Fig. 1 is a flow chart of the video topic detection and keyword extraction process of the invention, based on the graph-cut algorithm over the multi-graph model.
Specific embodiment
The object of the invention is to provide content-based video topic classification and description for network queries. According to the characteristics of multi-video data, a multi-graph model is first built from the text information and visual information of the videos, and the videos are clustered, i.e. video events are detected, by methods such as graph cut. Then the keywords of each event class are extracted with tf-idf or a similar technique such as the text deep-representation model word2vec, where TF is term frequency and IDF is inverse document frequency. A textual description of each event is formed, realizing content-based video topic classification and description for network queries.
The method provided by the invention is broadly divided into two processes: 1) combining the text information and visual information of the videos, building the multi-graph model, and classifying events by methods such as graph cut; 2) extracting the keywords of the video events with tf-idf, word2vec or a similar technique, and refining the keywords with prior information about the topic gathered from websites such as Wikipedia, so as to produce a textual description of each event. The general procedure is described below.
First, a topic query is given; related content is then searched from related websites such as Wikipedia, and prior information relevant to the topic is obtained.
Given M videos under the same event, let T = {t_1, t_2, ..., t_M} denote the set of text labels of the videos, with t_i the i-th text feature in T; the text labels correspond one-to-one with the videos. Let V = {v_1, v_2, ..., v_M}, where v_i is the visual-similarity vector between the i-th video and the M videos, v_i(j) is the number of near-duplicate frames between the i-th and j-th videos, and v_i(i) = 0. Two graphs G_1 = (T, E_1) and G_2 = (V, E_2) are built, where T and V are the vertex sets of the two graphs and E_1 and E_2 are the edge sets, representing the relation between any two videos from the text and visual information respectively. In the edge-weight formulas, s_ij is the average number of shots of videos i and j, and v_i(j) is the number of near-duplicate frames between the i-th and j-th videos. The two graphs are fused by a linear fusion technique, in which α is a positive number in (0, 1) that balances the two terms.
Video events are then classified by the graph-cut method. Finally, the keywords of each sub-topic are extracted from the text features of the videos, and the keyword set of each sub-topic is modified and expanded according to the information about the event on Wikipedia.
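The tf-idf keyword extraction above can be sketched in a few lines. This is a minimal, self-contained illustration over invented toy text labels, not the patent's implementation, and it omits the Wikipedia-based refinement step:

```python
import math
from collections import Counter

def tfidf_keywords(docs, top_k=3):
    """Rank the words of each document by tf-idf = tf(w, d) * log(N / df(w))
    and return the top_k highest-scoring words per document."""
    n = len(docs)
    tokenised = [d.lower().split() for d in docs]
    df = Counter(w for toks in tokenised for w in set(toks))  # document frequency
    keywords = []
    for toks in tokenised:
        tf = Counter(toks)
        scores = {w: (c / len(toks)) * math.log(n / df[w]) for w, c in tf.items()}
        ranked = sorted(scores, key=scores.get, reverse=True)
        keywords.append(ranked[:top_k])
    return keywords

# Invented text labels for three videos returned by one query.
event_texts = [
    "earthquake rescue teams earthquake relief",
    "football match final goal football",
    "earthquake aftershock damage report",
]
keywords = tfidf_keywords(event_texts, top_k=2)
```

Words shared by many labels (here "earthquake") are down-weighted by the idf factor, so the extracted keywords are the ones that discriminate one sub-topic from the others.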
Fig. 1 depicts the proposed multi-video topic detection process. Suppose there are M videos under the same event. First, the text features of the texts corresponding to the videos are extracted: T = {t_1, t_2, ..., t_M} denotes the text set of the M videos, with t_i the text feature of the i-th video. Then the visual features of the video frames are extracted, and the similarity between videos is computed from them; the number of near-duplicate frames between any two videos is detected by a similarity-detection algorithm such as MinHash. V = {v_1, v_2, ..., v_M}, where v_i is the vector formed by the numbers of near-duplicate keyframes between the i-th video and the M videos, v_i(j) is the number of near-duplicate frames between the i-th and j-th videos, and v_i(i) = 0.
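The MinHash step estimates how much two frame sets overlap. The sketch below is an illustrative stand-in, not the patent's implementation: it assumes each video has already been reduced to a set of hashable keyframe fingerprints (the fingerprint names are hypothetical), and it estimates Jaccard similarity from signature agreement.

```python
import random

def minhash_signature(items, num_hashes=64, seed=0):
    """MinHash signature of a set: for each of num_hashes random affine hash
    functions h(x) = (a*x + b) mod p, keep the minimum value over the set."""
    rng = random.Random(seed)
    p = (1 << 61) - 1  # a large Mersenne prime
    params = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(num_hashes)]
    return [min((a * (hash(x) & 0xFFFFFFFF) + b) % p for x in items)
            for a, b in params]

def estimated_jaccard(sig_a, sig_b):
    """The fraction of agreeing signature positions estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

frames_i = {"kf1", "kf2", "kf3", "kf4"}  # hypothetical keyframe fingerprints
frames_j = {"kf2", "kf3", "kf4", "kf5"}  # true Jaccard similarity is 3/5
sim = estimated_jaccard(minhash_signature(frames_i), minhash_signature(frames_j))
```

The estimate converges to the true Jaccard similarity as `num_hashes` grows; comparing fixed-length signatures is far cheaper than comparing all frame pairs across M videos.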
Finally, the multi-graph models G_1 = (T, E_1) and G_2 = (V, E_2) are built from the text information and visual information of the videos, where T and V are the vertex sets of the two graphs and E_1 and E_2 are the edge sets representing the relation between any two videos, i.e. formulas (1) and (2). The graphs are then fused with the weighted-average fusion technique, i.e. formula (3). Sub-topic detection is realized by the graph-cut method, and the keywords of each sub-topic are extracted with tf-idf, realizing content-based video topic classification and description for network queries.
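The graph-cut step over the fused affinity matrix can be approximated by spectral clustering. The sketch below is one common stand-in, not necessarily the cut algorithm the patent intends: it thresholds the Fiedler vector of the symmetric normalized Laplacian for a 2-way cut (a k-way cut would use k eigenvectors plus k-means), and the affinity values are invented for the example.

```python
import numpy as np

def spectral_bisection(W):
    """Approximate a 2-way normalized graph cut: build the symmetric normalized
    Laplacian L = I - D^{-1/2} W D^{-1/2}, take the eigenvector of its
    second-smallest eigenvalue (the Fiedler vector) and split by sign."""
    W = np.asarray(W, float)
    d = W.sum(axis=1)
    safe_d = np.where(d > 0, d, 1.0)            # guard isolated vertices
    d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(safe_d), 0.0)
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)                 # eigh: ascending eigenvalues
    fiedler = vecs[:, 1]
    return (fiedler > 0).astype(int)            # sign of entry = cluster id

# Fused affinities for five videos: two tight groups with weak cross links.
W = np.array([[0.0, 0.9, 0.8, 0.0, 0.1],
              [0.9, 0.0, 0.7, 0.1, 0.0],
              [0.8, 0.7, 0.0, 0.0, 0.0],
              [0.0, 0.1, 0.0, 0.0, 0.9],
              [0.1, 0.0, 0.0, 0.9, 0.0]])
labels = spectral_bisection(W)  # videos {0,1,2} vs {3,4} form the two events
```

Because the Fiedler vector is orthogonal to the all-positive leading eigenvector, it must change sign, so the split always produces two non-empty clusters; which cluster is labelled 0 or 1 is arbitrary.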
Claims (2)
1. A video topic classification and description method based on multi-graph fusion for network queries, characterized in that the steps are: 1) combining the text information and visual information of the videos, building a multi-graph model, and classifying events with a graph-cut method; 2) extracting the keywords of the video events with term frequency/inverse document frequency (tf-idf) or the text deep-representation model word2vec, and refining the keywords with prior information about the topic gathered from websites such as Wikipedia, so as to produce a textual description of the events.
2. The video topic classification and description method based on multi-graph fusion for network queries as claimed in claim 1, characterized in that, in one example, the concrete steps are:
a topic query is given first; related content is then searched from related websites such as Wikipedia, and prior information relevant to the topic is obtained;
given M videos under the same event, let T = {t_1, t_2, ..., t_M} denote the set of text labels of the videos, with t_i the i-th text feature in T, and let V = {v_1, v_2, ..., v_M}, where v_i is the visual-similarity vector between the i-th video and the M videos, v_i(j) is the number of near-duplicate frames between the i-th and j-th videos, and v_i(i) = 0; two graphs G_1 = (T, E_1) and G_2 = (V, E_2) are built, where T and V are the vertex sets of the two graphs and E_1 and E_2 are the edge sets, representing the relation between any two videos from the text and visual information respectively; in the edge-weight formulas, s_ij is the average number of shots of videos i and j, and v_i(j) is the number of near-duplicate frames between them; the two graphs are fused by a linear fusion technique, in which α is a positive number in (0, 1) that balances the two terms;
video events are then classified by the graph-cut method; finally, the keywords of each sub-topic are extracted from the text features of the videos, and the keyword set of each sub-topic is modified and expanded according to the information about the event on Wikipedia.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611035152.0A CN106529492A (en) | 2016-11-17 | 2016-11-17 | Video topic classification and description method based on multi-image fusion in view of network query |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106529492A true CN106529492A (en) | 2017-03-22 |
Family
ID=58356169
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611035152.0A Pending CN106529492A (en) | 2016-11-17 | 2016-11-17 | Video topic classification and description method based on multi-image fusion in view of network query |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106529492A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102332031A (en) * | 2011-10-18 | 2012-01-25 | 中国科学院自动化研究所 | Method for clustering retrieval results based on video collection hierarchical theme structure |
CN103778443A (en) * | 2014-02-20 | 2014-05-07 | 公安部第三研究所 | Method for achieving scene analysis description based on theme model method and field rule library |
CN104199933A (en) * | 2014-09-04 | 2014-12-10 | 华中科技大学 | Multi-modal information fusion football video event detection and semantic annotation method |
Non-Patent Citations (1)
Title |
---|
DONG-QING ZHANG et al.: "SEMANTIC VIDEO CLUSTERING ACROSS SOURCES USING BIPARTITE SPECTRAL CLUSTERING", 《IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME)》 *
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108932252A (en) * | 2017-05-25 | 2018-12-04 | 合网络技术(北京)有限公司 | Video aggregation method and device |
CN109190471A (en) * | 2018-07-27 | 2019-01-11 | 天津大学 | The attention model method of video monitoring pedestrian search based on natural language description |
CN109190471B (en) * | 2018-07-27 | 2021-07-13 | 天津大学 | Attention model method for video monitoring pedestrian search based on natural language description |
CN109688428A (en) * | 2018-12-13 | 2019-04-26 | 连尚(新昌)网络科技有限公司 | Video comments generation method and device |
CN109688428B (en) * | 2018-12-13 | 2022-01-21 | 连尚(新昌)网络科技有限公司 | Video comment generation method and device |
CN109933709A (en) * | 2019-01-31 | 2019-06-25 | 平安科技(深圳)有限公司 | Public sentiment tracking, device and the computer equipment of videotext data splitting |
CN109933709B (en) * | 2019-01-31 | 2023-09-26 | 平安科技(深圳)有限公司 | Public opinion tracking method and device for video text combined data and computer equipment |
CN111259851A (en) * | 2020-01-23 | 2020-06-09 | 清华大学 | Multi-mode event detection method and device |
CN114201622A (en) * | 2021-12-13 | 2022-03-18 | 北京百度网讯科技有限公司 | Method and device for acquiring event information, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170322 |