CN110489593A - Video topic processing method and apparatus, electronic device, and storage medium - Google Patents

Video topic processing method and apparatus, electronic device, and storage medium

Info

Publication number
CN110489593A
CN110489593A (application CN201910770189.5A; granted as CN110489593B)
Authority
CN
China
Prior art keywords
video
topic
word
words
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910770189.5A
Other languages
Chinese (zh)
Other versions
CN110489593B (en)
Inventor
何奕江
郑茂
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201910770189.5A priority Critical patent/CN110489593B/en
Publication of CN110489593A publication Critical patent/CN110489593A/en
Application granted granted Critical
Publication of CN110489593B publication Critical patent/CN110489593B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 — Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/75 — Clustering; Classification
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00 — Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03 — Data mining

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a video topic processing method and apparatus, an electronic device, and a storage medium. The method includes: extracting video frames from a video to form a video frame set, and obtaining the video text corresponding to the video; obtaining a first word set that characterizes the semantics of the video frames, and extracting a second word set from the video text; obtaining, based on the first word set and the second word set, the probability that the video belongs to each topic in a topic set; and, when the probability exceeds a topic probability threshold, determining the topic corresponding to that probability as the topic to which the video belongs. By combining information from the video frame modality and the text modality, the present invention can determine video topics accurately.

Description

Video topic processing method and apparatus, electronic device, and storage medium
Technical field
The present invention relates to data mining technology, and in particular to a video topic processing method and apparatus, an electronic device, and a storage medium.
Background technique
With the continuous development of Internet technology, terminal users have become accustomed to uploading and browsing videos through application clients on terminal devices, and media data such as videos have become the main body of big data. Accurately perceiving user interests in video recommendation, and automatically generating topics and mining points of interest based on video content, are significant for meeting users' information access needs.
Automatic topic mining of video content is a technique that analyzes and understands video content to automatically form video topics. In the related art, video topic mining is mainly based on video text: effective words are obtained from the video text to infer the topic to which a video belongs. This over-reliance on video text means that, when the video text is missing or sparse, it is difficult to determine the video topic accurately.
Summary of the invention
Embodiments of the present invention provide a video topic processing method and apparatus, an electronic device, and a storage medium, which can accurately determine video topics by combining information from the video frame modality and the text modality.
The technical solutions of the embodiments of the present invention are implemented as follows:
An embodiment of the present invention provides a video topic processing method, including:
extracting video frames from a video to form a video frame set, and obtaining video text corresponding to the video;
obtaining a first word set that characterizes the semantics of the video frames, and extracting a second word set from the video text;
obtaining, based on the first word set and the second word set, the probability that the video belongs to each topic in a topic set; and
when the probability exceeds a topic probability threshold, determining the topic corresponding to the probability as the topic to which the video belongs.
An embodiment of the present invention provides a video topic processing apparatus, including:
a video extraction module, configured to extract video frames from a video to form a video frame set, and to obtain video text corresponding to the video;
a word extraction module, configured to obtain a first word set that characterizes the semantics of the video frames, and to extract a second word set from the video text;
a topic probability calculation module, configured to obtain, based on the first word set and the second word set, the probability that the video belongs to each topic in a topic set; and
a topic determination module, configured to determine, when the probability exceeds a topic probability threshold, the topic corresponding to the probability as the topic to which the video belongs.
In the above solution, the video extraction module is further configured to:
perform frame extraction on the video at sampling-time intervals to form the video frame set; or
obtain key plot positions of the video, and extract the video frames corresponding to the key plot positions to form the video frame set.
In the above solution, the word extraction module is further configured to:
obtain the effective visual words of each video frame in the video frame set;
obtain the number of occurrences of each effective visual word across all video frames of the video frame set; and
when the number of occurrences is greater than an effective-visual-word threshold, combine the effective visual words whose numbers of occurrences are greater than the effective-visual-word threshold to form the first word set.
In the above solution, the word extraction module is further configured to:
extract image features from each video frame through a neural network model, and connect and convert the extracted image features into visual word probabilities corresponding to multiple visual words; and
when a visual word probability is greater than a visual-word probability threshold, determine the visual word corresponding to that probability as an effective visual word.
In the above solution, the word extraction module is further configured to:
perform word segmentation and part-of-speech tagging on the video text through a conditional random field, to obtain an effective text word set;
determine the inverse document frequency of each effective text word in the effective text word set; and
when the inverse document frequency is greater than an inverse-document-frequency threshold, combine the effective text words whose inverse document frequencies are greater than the inverse-document-frequency threshold to form the second word set.
In the above solution, the word extraction module is further configured to:
map the numbers of occurrences of the effective visual words in the first word set to the same value interval as the numbers of occurrences of the effective text words, to update the numbers of occurrences of the effective visual words;
where the effective text words are the words in the second word set that characterize the semantics of the video text.
In the above solution, the video extraction module is further configured to:
obtain the video text corresponding to the video; and
when the data volume of the obtained video text is insufficient, on its own, to determine the topic to which the video belongs, extract the video frame set from the video.
The word extraction module is further configured to: when the data volume of the obtained video text is sufficient, on its own, to determine the topic to which the video belongs, extract the second word set from the video text.
The topic probability calculation module is further configured to: obtain, based on the second word set, the probability that the video belongs to each topic in the topic set.
The topic determination module is further configured to: when the probability exceeds the topic probability threshold, determine the topic corresponding to the probability as the topic to which the video belongs.
In the above solution, the apparatus further includes a historical topic obtaining module and a recommended video determination module.
The historical topic obtaining module is configured to obtain, in response to a login operation of a client, the historical viewing topics corresponding to the login account.
The recommended video determination module is configured to determine the video as a video to be recommended when the topic corresponding to the video matches the historical viewing topics.
In the above solution, the recommended video determination module is further configured to:
determine, based on the topic corresponding to the video and the historical viewing topics, the distance between the topic corresponding to the video and the historical viewing topics; and
when the distance is less than a distance threshold, determine that the topic corresponding to the video matches the historical viewing topics, and determine the video as a video to be recommended.
An embodiment of the present invention provides an electronic device for video topic processing, including:
a memory, configured to store executable instructions; and
a processor, configured to implement the video topic processing method provided by the embodiments of the present invention when executing the executable instructions stored in the memory.
An embodiment of the present invention provides a storage medium storing executable instructions which, when executed, cause a processor to implement the video topic processing method provided by the embodiments of the present invention.
The embodiments of the present invention have the following beneficial effects:
By obtaining a first word set that characterizes the semantics of the video frames and extracting a second word set from the video text, the topic to which a video belongs is determined by combining information from both the video frame modality and the text modality. This makes it possible to further subdivide topics visually on the basis of text topics, and to determine the video topic through visual words when the text modality performs poorly.
Brief description of the drawings
Fig. 1 is a schematic architecture diagram of a video topic processing system provided by an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a video topic processing apparatus provided by an embodiment of the present invention;
Figs. 3A-3D are schematic flowcharts of a video topic processing method provided by an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a deep residual module provided by an embodiment of the present invention;
Fig. 5 is a schematic diagram of the application of a conditional random field provided by an embodiment of the present invention;
Fig. 6 is a schematic diagram of the principle of a topic generation model provided by an embodiment of the present invention;
Fig. 7 is a flowchart of a practical application of the video topic processing method provided by an embodiment of the present invention;
Fig. 8 is a flowchart of the video topic processing method provided by an embodiment of the present invention applied to video recommendation;
Fig. 9 is a schematic structural diagram of a recommender system provided by an embodiment of the present invention.
Detailed description
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings. The described embodiments should not be construed as limiting the present invention, and all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of the present invention.
In the following description, the term "some embodiments" describes subsets of all possible embodiments. It can be understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and these subsets can be combined with each other where no conflict arises.
In the following description, the terms "first" and "second" are used only to distinguish similar objects and do not denote any particular ordering of the objects. It can be understood that "first" and "second" may be interchanged in a specific order or sequence where permitted, so that the embodiments of the present invention described herein can be implemented in an order other than that illustrated or described herein.
Unless otherwise defined, all technical and scientific terms used herein have the same meanings as commonly understood by those skilled in the technical field of the present invention. The terms used herein are only for the purpose of describing the embodiments of the present invention and are not intended to limit the present invention.
Before the embodiments of the present invention are further described, the nouns and terms involved in the embodiments are explained; the following interpretations apply to them.
1) Visual word: a word obtained from the image features of a video frame; for example, the visual words obtained from a video frame may be "forest", "desert", etc.
2) Text word: a word extracted directly from the video text.
3) Effective visual word: among the multiple visual words of a video frame, a visual word that can characterize the semantic information of the video frame.
4) Effective text word: among the multiple text words of the video text, a text word that can characterize its semantic information.
5) Latent Dirichlet Allocation (LDA): a document topic generation model with a three-layer structure of words, topics, and documents, also known as a three-layer Bayesian probability model.
6) Gibbs sampling: a Markov chain Monte Carlo (MCMC) algorithm used in statistics to draw an approximate sample sequence when direct sampling from a multivariate probability distribution is difficult. The sequence can be used to approximate the joint distribution or the marginal distribution of some of the variables, or to compute integrals (such as the expected value of a variable).
In the video topic mining problem of personalized recommendation products, the related art considers only the text modality of a video and ignores the information provided by the video frame modality. Because video text titles are limited in length, the information they contain is often very limited, which greatly reduces the effectiveness of topic generation algorithms; for new video text, the topic cannot be judged at all if effective words are lacking.
In view of the above problems, embodiments of the present invention propose a video topic processing method based on video content. Media data such as videos have become the main body of big data; accurately perceiving user interests in recommendation, and automatically mining topics and points of interest from video content, are significant for meeting users' information access needs. Topic processing of video content is a technique that analyzes and understands video content to automatically form video topics and automatically assign new videos to the topics that have been mined.
The multi-modal automatic topic mining solution proposed by the embodiments of the present invention takes into account the information of both the video frame modality and the text modality. Using both modalities, topics can be further subdivided visually on the basis of text topics, and the video topic can be determined through visual words when the text modality performs poorly. Topic processing of video content can accurately identify the topics in a video that may interest users; the identified topic information is supplied to an online recommender system for video recommendation, which significantly improves user interest profiling, cold start, and the overall product experience.
The video topic processing method of the embodiments of the present invention includes three processes: effective visual word extraction for the video frame modality, effective text word extraction for the text modality, and multi-modal topic generation:
1) Effective visual word extraction for the video frame modality: a certain number of video frames are extracted from each video, entity words in the video frames are extracted using a convolutional neural network, and the entity words extracted from multiple frames are combined to form visual words and then effective visual words; the total number of occurrences of each effective visual word is smoothed according to the number of video frames, to obtain the occurrence counts of the effective visual words of the video frame modality.
2) Effective text word extraction for the text modality: the title of the video and the labels provided by users are used as the data of the text modality; effective text words in the text modality are selected by part of speech and statistical indicators, and then all effective word information in the text modality is extracted.
3) Multi-modal topic generation: the visual words of the video frame modality and the effective words of the text modality are represented uniformly, and a topic generation model is then used to determine themes; the theme generated from the combination of the visual words of the video frame modality and the effective words of the text modality is taken as the topic obtained by the processing.
The video topic processing method provided by the embodiments of the present invention is described below with reference to exemplary applications and implementations of the terminal provided by the embodiments of the present invention.
The embodiments of the present invention provide a video topic processing method and apparatus, an electronic device, and a storage medium, which can solve the problem that the video topic cannot be inferred accurately when the text modality performs poorly. Exemplary applications of the electronic device provided by the embodiments of the present invention are described below; the electronic device can be implemented as a server or a terminal. In the following, the exemplary application in which the electronic device is implemented as a server is described.
Referring to Fig. 1, Fig. 1 is a schematic diagram of an optional architecture of a video topic processing system 100 provided by an embodiment of the present invention. A terminal 400 is connected to an electronic device 200 through a network 300; the network 300 can be a wide area network, a local area network, or a combination of the two.
The video topic processing system includes the electronic device 200, the user terminal 400, a document storage system 500, and a database 600. The electronic device 200 includes a video recommendation system. The electronic device 200 reads videos from the document storage system 500 and determines the topics of the read videos through the video topic processing method provided by the embodiments of the present invention. The server also reads the user's historical viewing videos from the database 600 to obtain the user's historical viewing topics. According to the matching result between the video topics and the user's historical viewing topics, the server recommends or suppresses videos through the video recommendation system: videos whose topics match the user's historical viewing topics are pushed to the user, and videos whose topics do not match the user's historical viewing topics are suppressed.
Referring to Fig. 2, Fig. 2 is a schematic structural diagram of the electronic device 200 for video topic processing provided by an embodiment of the present invention. The structure shown in Fig. 2 is applicable to both terminals and servers, and its components can be implemented selectively according to actual needs. The electronic device 200 shown in Fig. 2 includes at least one processor 210, a memory 250, at least one network interface 220, and a user interface 230. The components of the electronic device 200 are coupled by a bus system 240. It can be understood that the bus system 240 is used to implement connection and communication between these components. In addition to a data bus, the bus system 240 also includes a power bus, a control bus, and a status signal bus. For clarity, however, all buses are labeled as the bus system 240 in Fig. 2.
The processor 210 can be an integrated circuit chip with signal processing capability, such as a general-purpose processor, a Digital Signal Processor (DSP), another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component; the general-purpose processor can be a microprocessor, any conventional processor, or the like.
The user interface 230 includes one or more output devices 231 that make it possible to present media content, including one or more speakers and/or one or more visual display screens. The user interface 230 also includes one or more input devices 232, including user interface components that facilitate user input, such as a keyboard, a mouse, a microphone, a touch-screen display, a camera, and other input buttons and controls.
The memory 250 can be removable, non-removable, or a combination of the two. Exemplary hardware devices include solid-state memory, hard disk drives, and optical disc drives. The memory 250 optionally includes one or more storage devices physically located remotely from the processor 210.
The memory 250 includes volatile memory or non-volatile memory, and may include both. The non-volatile memory can be a Read-Only Memory (ROM), and the volatile memory can be a Random Access Memory (RAM). The memory 250 described in the embodiments of the present invention is intended to include any suitable type of memory.
In some embodiments, the memory 250 can store data to support various operations. Examples of such data include programs, modules, and data structures, or subsets or supersets thereof, as illustrated below:
an operating system 251, including system programs for handling various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, and a driver layer, used to implement various basic services and process hardware-based tasks;
a network communication module 252, configured to reach other computing devices via one or more (wired or wireless) network interfaces 220, exemplary network interfaces 220 including Bluetooth, Wireless Fidelity (WiFi), and Universal Serial Bus (USB);
a presentation module 253, configured to make it possible to present information via one or more output devices 231 (for example, a display screen or speakers) associated with the user interface 230 (for example, a user interface for operating peripheral devices and displaying content and information); and
an input processing module 254, configured to detect one or more user inputs or interactions from one of the one or more input devices 232, and to translate the detected inputs or interactions.
In some embodiments, the topic processing apparatus provided by the embodiments of the present invention can be implemented in software. Fig. 2 shows a video topic processing apparatus 255 stored in the memory 250, which can be software in the form of a program, a plug-in, or the like, including the following software modules: a video extraction module 2551, a word extraction module 2552, a topic probability calculation module 2553, and a topic determination module 2554. These modules can be embedded in various clients. The modules are logical, and can therefore be arbitrarily combined or further split according to the functions to be implemented; the functions of the modules are described in detail below.
In other embodiments, the video topic processing apparatus provided by the embodiments of the present invention can be implemented in hardware. As an example, it can be a processor in the form of a hardware decoding processor that is programmed to execute the video topic processing method provided by the embodiments of the present invention; for example, the processor in the form of a hardware decoding processor can employ one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field-Programmable Gate Arrays (FPGAs), or other electronic components.
Referring to Fig. 3A, Fig. 3A is a schematic flowchart of an optional flow of the video topic processing method provided by an embodiment of the present invention. The steps shown in Fig. 3A are described below; the following method can be implemented on any of the above-mentioned types of electronic devices (such as a terminal or a server).
The video topic processing method of the embodiments of the present invention is described below by taking the electronic device being a server as an example.
In step 101, video frames are extracted from a video to form a video frame set, and the video text corresponding to the video is obtained.
Taking the electronic device being a server as an example, the server can pull short videos watched or collected by users, so that the recommender system run by the server can recall videos of the same topic and recommend them to users. The server can also pull newly launched videos, which have titles but lack labels. Based on the technical solution of the embodiments of the present invention, the topic to which a video belongs is determined by combining the information of both the video frame modality and the text modality, so that the server can quickly predict topics for newly launched videos and classify them by video topic.
Referring to Fig. 3B, which is based on Fig. 3A, step 101 can also be implemented by step 101A or step 101B.
When the number of video frames in a video exceeds a frame-count threshold, or the resolution of the video frames exceeds a resolution threshold, extracting all video frames of the video would consume considerable computing resources during subsequent processing; moreover, most video frames within the same video differ little in content.
Therefore, to improve topic inference efficiency, as an alternative to the server extracting all video frames, in some embodiments the server extracts only some of the video frames in the video; for example, the server can use various frame extraction algorithms to extract some of the video frames to form the video frame set.
In step 101A, frame extraction is performed on the video at sampling-time intervals to form the video frame set.
In some embodiments, when the number of video frames in the video exceeds the frame-count threshold or the resolution of the video frames exceeds the resolution threshold, the server performs frame extraction on the video at sampling-time intervals to obtain video frames and form a set of video frames, i.e., the video frame set.
For example, the frame extraction here can use a uniform sampling mechanism or a non-uniform sampling mechanism. With uniform sampling, the sampling-time interval is fixed; with non-uniform sampling, it is not constant. When the sampling-time interval is fixed, its setting is related to the playback duration of the video file; for example, the sampling-time interval can be proportional to the playback duration. When the sampling-time interval is not constant, it can be random, or it can be related to the video content: assuming that the middle part of a video file is more relevant to the video content than its beginning and end, the sampling-time interval in the middle part is smaller than at the beginning and end.
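As an illustration of step 101A, the following is a minimal sketch of uniform frame sampling with OpenCV; the sampling interval and the fallback frame rate are illustrative assumptions, not values fixed by this embodiment.

    import cv2

    def sample_frames(video_path: str, interval_s: float = 1.0):
        """Extract one frame every interval_s seconds to form a video frame set."""
        cap = cv2.VideoCapture(video_path)
        fps = cap.get(cv2.CAP_PROP_FPS) or 25.0      # fall back if FPS metadata is missing
        step = max(1, int(round(fps * interval_s)))  # frames to skip per sample
        frames, index = [], 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if index % step == 0:
                frames.append(frame)
            index += 1
        cap.release()
        return frames

A non-uniform variant would vary interval_s over the playback position, for example shrinking it in the middle of the file as described above.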
As an alternative to step 101A, in step 101B, the key plot positions of the video are obtained, and the video frames corresponding to the key plot positions are extracted to form the video frame set.
In some embodiments, when the number of video frames in the video exceeds the frame-count threshold or the resolution of the video frames exceeds the resolution threshold, the server obtains the key plot positions of the video and extracts the corresponding video frames to form the video frame set. The key plot positions can also be prior information; according to the key plot positions, all video frames corresponding to the key plots are extracted to form the video frame set. A key plot is a plot that can characterize the representative content of the video, for example, a plot in which a main character appears, or a plot in which a scene change occurs.
In some embodiments, the frame extraction of the video can also be performed based on inter-frame differences. The server differences two frames of images and obtains the average pixel intensity of the difference, which is used to measure the change between the two frames. Based on the inter-frame difference intensity, whenever a frame in the video differs greatly in content from the previous frame, the server determines that frame as a frame to be extracted, and extracts it.
Frame extraction based on inter-frame differences is explained below. The server computes the inter-frame difference between every two adjacent frames in turn to obtain the average inter-frame difference intensity, sorts all frames by average inter-frame difference intensity, and selects the frames whose average inter-frame difference intensity exceeds a difference-intensity threshold as the frames to be extracted; alternatively, the server selects the frames whose average inter-frame difference intensity is a local maximum as the frames to be extracted. In the latter case, the time series of average inter-frame difference intensities is smoothed, to remove noise effectively and avoid extracting several frames of a similar scene at the same time.
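A minimal sketch of this inter-frame-difference extraction, assuming grayscale differencing and a simple moving-average smoother (both illustrative choices; the embodiment does not fix them):

    import cv2
    import numpy as np

    def keyframes_by_difference(video_path: str, window: int = 5):
        """Pick frames whose smoothed average inter-frame difference is a local maximum."""
        cap = cv2.VideoCapture(video_path)
        frames, scores, prev = [], [], None
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            if prev is not None:
                scores.append(cv2.absdiff(gray, prev).mean())  # average pixel intensity of the difference
                frames.append(frame)
            prev = gray
        cap.release()
        smoothed = np.convolve(scores, np.ones(window) / window, mode="same")  # smooth the intensity series
        return [frames[i] for i in range(1, len(smoothed) - 1)
                if smoothed[i] > smoothed[i - 1] and smoothed[i] > smoothed[i + 1]]  # local maxima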
In step 102, a first word set that characterizes the semantics of the video frames in the video frame set is obtained, and a second word set is extracted from the video text.
Here, the first word set is the set of words that characterize the semantics of the video frames in the video frame set, and the second word set is the set of words extracted directly from the video text.
Referring to Fig. 3C, which is based on Fig. 3A, obtaining the first word set that characterizes the semantics of the video frames in step 102 can be implemented specifically through steps 1021-1023.
In step 1021, the effective visual words of each video frame in the video frame set are obtained.
In some embodiments, the effective visual words of each video frame in the video frame set are identified by a neural network model. The neural network model can be a ResNet-101 model or another model that can classify video frames and obtain semantic visual words characterizing the classification results. The input of the neural network model is the video frames obtained in step 101; the output predicts, for each video frame, the probability of each visual word the neural network model can predict. The visual word probabilities are sorted, and when a probability is greater than the visual-word probability threshold, the corresponding visual word is determined as an effective visual word of the video frame.
In step 1022, the number of occurrences of each effective visual word across the video frames of the video frame set is obtained.
In some embodiments, the number of occurrences of each effective visual word obtained in step 1021 is counted across all video frames of the video frame set. For example, suppose the video frame set contains three video frames: video frame A, video frame B, and video frame C. The effective visual words of video frame A are a and b, those of video frame B are a and c, and those of video frame C are a and b. Then effective visual word a occurs 3 times across all video frames of the video frame set, effective visual word b occurs 2 times, and effective visual word c occurs 1 time.
In step 1023, when the number of occurrences of an effective visual word across all video frames of the video frame set is greater than the effective-visual-word threshold, the effective visual words whose numbers of occurrences are greater than the effective-visual-word threshold are combined to form the first word set.
In some embodiments, when the number of occurrences of an effective visual word across all video frames of the video frame set is greater than the effective-visual-word threshold, that effective visual word has not appeared accidentally but is related to the video content. The server therefore determines that the effective visual words whose numbers of occurrences are greater than the effective-visual-word threshold belong to the first word set. The effective visual words in the first word set can effectively characterize the semantics of the video frames in the video frame set; each effective visual word in the first word set is distinct, and each carries, as an attribute, its occurrence-count information across all video frames of the video frame set.
In some embodiments, the effective-visual-word threshold is set according to the number of video frames in the video frame set: when the number of video frames in the video frame set is large, the effective-visual-word threshold is correspondingly large, and when the number is small, the threshold is correspondingly small. That is, the effective-visual-word threshold is positively correlated with the frame count of the video frame set.
The video frame sets obtained from different videos differ in frame count: some videos last only 2 seconds while others last 15 seconds, and the frame count of the video frame set of a 2-second video is far smaller than that of a 15-second video. If the effective-visual-word threshold were held constant, then for a 2-second video the server might not obtain any effective visual word reaching the threshold at all, while for a 15-second video the server would obtain many effective visual words belonging to the first word set, and these might not truly characterize the semantics of the video content. When the effective-visual-word threshold is positively correlated with the frame count of the video frame set, it is ensured that the effective visual words determined by the server as belonging to the first word set can truly characterize the semantics of the video frames in the video frame set, which improves the accuracy of subsequent topic inference.
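A minimal sketch of steps 1021-1023, assuming the per-frame effective visual words are already available and that the adaptive threshold is a fixed fraction of the frame count (the fraction is an illustrative assumption):

    from collections import Counter

    def build_first_word_set(frame_words: list, ratio: float = 0.5):
        """frame_words[i] holds the effective visual words of video frame i."""
        counts = Counter(word for words in frame_words for word in words)
        threshold = ratio * len(frame_words)   # positively correlated with the frame count
        # Each surviving word keeps its occurrence count as a carried attribute.
        return {word: n for word, n in counts.items() if n > threshold}

    # Worked example from the text: frames A, B, C.
    print(build_first_word_set([["a", "b"], ["a", "c"], ["a", "b"]]))
    # threshold = 1.5, so a (3 occurrences) and b (2) survive while c (1) does not.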
In some embodiments, the effective visual words of the video frames can first be clustered and placed into buckets (i.e., data structures for storing effective visual words), and each effective visual word is distinguished by its bucket number, thereby reducing the space occupied by the effective visual words.
Specifically, obtaining the effective visual words of each video frame in the video frame set in step 1021 can be implemented through steps 10211-10212.
In step 10211, image features are extracted from each video frame through a neural network model, and the extracted image features are connected and converted into visual word probabilities corresponding to multiple visual words.
In some embodiments, image features are extracted from each video frame by the convolutional layers of the neural network model, the extracted image features are connected through a fully connected layer, and a softmax layer converts them into visual word probabilities corresponding to multiple prior visual words.
In some embodiments, the visual word extraction model for video frames is a neural network model, which can be a ResNet-101 model. When the number of layers of a convolutional neural network is large, gradient vanishing or gradient explosion can occur. To solve this gradient problem, a deep residual module is used, as shown in Fig. 4, a schematic structural diagram of the deep residual module in the neural network of the embodiment of the present invention. The neural networks provided by the related art have only one data flow and are not provided with the cross-layer direct connection on the right side of Fig. 4; the deep residual module is realized through a shortcut connection, that is, an identity mapping is added to the model, and the input and output of the residual module are added together through the shortcut. x is the input of the neural network module; F(x) is the ordinary connection, including, for example, convolution, pooling, and activation-function operations; and H(x) is the output of the neural module with the residual connection added, H(x) = F(x) + x. In the original neural network model, H(x) would need to be fitted by the model, but after the shortcut connection is added, the function that originally had to be learned, H(x), is converted into F(x) + x, so only the residual H(x) - x needs to be learned. The number of layers of the neural network can thus be increased continuously to improve performance, which effectively optimizes training.
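A minimal sketch of the deep residual module H(x) = F(x) + x in PyTorch; the two-convolution composition of F(x) is a common ResNet choice assumed here for illustration:

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        """H(x) = F(x) + x: the shortcut adds the input to the ordinary path F."""
        def __init__(self, channels: int):
            super().__init__()
            self.f = nn.Sequential(                       # F(x): convolution + activation
                nn.Conv2d(channels, channels, 3, padding=1),
                nn.ReLU(),
                nn.Conv2d(channels, channels, 3, padding=1),
            )
        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return torch.relu(self.f(x) + x)              # identity shortcut; only the residual must be learned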
Here, a visual word is essentially the semantic expression of the category of an image frame; the visual words the neural network model can predict depend on the image samples obtained from a preset training database, with each kind of image sample mapped to a visual word. The video frames obtained in step 101 are input into the neural network model, and its output predicts, for each video frame, the probability of each visual word the model can predict. The process by which the neural network model extracts visual word probabilities is as follows: the model extracts image features from a video frame and compresses the dimensionality of the image features so that the compressed dimensionality matches the number of predictable visual words; classification and normalization are then performed by the softmax function, and the probabilities corresponding to the predictable visual words are obtained.
The preset training database here is used to train the neural network model and is the basis of video frame classification. For example, if the database contains image samples corresponding to 10 types of visual words, and each image sample corresponds to one or more of these 10 visual words, then after the neural network model is trained on this database it can predict, for an image frame, the probabilities of the 10 visual words.
In step 10212, when a visual word probability is greater than the visual-word probability threshold, the visual word corresponding to that probability is determined as an effective visual word.
In some embodiments, when a visual word probability obtained in step 10211 is greater than the preset visual-word probability threshold, the visual word corresponding to that probability is an effective visual word.
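The following sketch illustrates steps 10211-10212, with torchvision's ResNet-101 standing in for the neural network model; the probability threshold and preprocessing values are illustrative assumptions, and vocab is assumed to align with the model's output classes:

    import torch
    from PIL import Image
    from torchvision import models, transforms

    model = models.resnet101(pretrained=True)   # stand-in classifier; its output classes play the role of visual words
    model.eval()
    preprocess = transforms.Compose([
        transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
    ])

    def effective_visual_words(frame: Image.Image, vocab: list, p_min: float = 0.1):
        """Keep the visual words whose softmax probability exceeds the threshold (step 10212)."""
        with torch.no_grad():
            logits = model(preprocess(frame).unsqueeze(0))   # conv features -> fully connected layer
            probs = torch.softmax(logits, dim=1).squeeze(0)  # normalized visual-word probabilities
        return [(vocab[i], float(p)) for i, p in enumerate(probs) if float(p) > p_min]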
Referring to Fig. 3 D, it is based on Fig. 3 A, the second word collection is extracted from videotext in step 102 can be especially by step 1024-1026 is realized.
In step 1024, videotext is carried out at word segmentation processing and mark part of speech respectively by condition random field Reason, to obtain effective text set of words.
In some embodiments, participle and main use condition random field (CRF, the Conditional of part-of-speech tagging Random Field), CRF is broadly divided into three layers.Referring to the application signal that Fig. 5, Fig. 5 are conditional of embodiment of the present invention random field Figure.When completing word segmentation processing using condition random field, the input of bottom isThe word of i-th of word as in sentence It is embedded in vector, middle layer is two-way shot and long term memory network (LSTM, Long Short-Term Memory), and LSTM is a kind of special Recognition with Recurrent Neural Network (RNN, Recurrent Neural Networks) structure, be for a long time and special in order to solve the problems, such as Door designs, and two-way LSTM can use contextual information, h simultaneously compared with common LSTMi lIt is i-th of word in sentence double The input of left hidden layer in shot and long term memory network (LSTM, Long Short-Term Memory), mainly contains sentence The information that word in son before i-th of word transmits backward, hiR is right hidden layer input of i-th of word in LSTM in sentence, main Information of the word in sentence after i-th of word to front transfer is contained, is obtained respectively according in left hidden layer and right hidden layer To sentence in letter from word in the information transmitted backward of word before i-th of word and sentence after i-th of word to front transfer Breath, available hi, the probability for the participle that as i-th of word provides in LSTM in sentence, by CRF top layer using in LSTM Obtained in probability binding transfer probability obtained final result ti, final result is the division result to word in sentence.When When completing part-of-speech tagging using CRF, the input of bottom is the term vector of word, and term vector is to indicate word with a vector, In the training process, if input sentence is made of 120 words, each word indicates that then model is corresponding defeated by the term vector of 100 dimensions Entering is (120,100), and the hidden layer after Bi-LSTM, vector becomes (120,128), wherein 128 be Bi-LSTM in model Export dimension, it is assumed that the target labels for segmenting task are B (Begin), M (Middle), E (End), S (Single), then model is most Output dimension is the vector of (120,4) eventually, respectively indicates the probability of corresponding BMES, the label for finally taking probability big is as pre- mark Label.By largely labeled data and the continuous iteration optimization of model, this mode, which can train, learns participle model out, above-mentioned For LSTM model is only for sequence labelling task, the label of current location and previous position, the latter position have potential Relationship, in order to be learnt in entire sequence using the contextual information between this label by the CRF layer after LSTM Optimal sequence label calculates optimal annotated sequence, and by the prediction result combination tag transition probability of LSTM, upper layer CRF is defeated The probability of the possible situation of all parts of speech is marked out.
The text words obtained through the above word segmentation and part-of-speech tagging are combined to form the effective text word set.
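A minimal sketch of the Bi-LSTM emission stage with the shapes described above (a batch of one 120-character sentence, 100-dimensional embeddings, a 128-dimensional bidirectional hidden state, 4 BMES scores); the CRF transition layer that decodes the optimal label sequence is omitted for brevity:

    import torch
    import torch.nn as nn

    class BiLstmTagger(nn.Module):
        """Emission scores for BMES segmentation labels; the CRF decoding layer is omitted."""
        def __init__(self, vocab_size: int, emb_dim: int = 100, hidden: int = 64, n_tags: int = 4):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, emb_dim)
            self.lstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
            self.out = nn.Linear(2 * hidden, n_tags)      # 2 * 64 = 128 -> 4 BMES scores

        def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
            h, _ = self.lstm(self.emb(char_ids))          # (batch, 120, 128)
            return self.out(h)                            # (batch, 120, 4)

    scores = BiLstmTagger(vocab_size=5000)(torch.randint(0, 5000, (1, 120)))
    print(scores.shape)  # torch.Size([1, 120, 4])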
In step 1025, the inverse document frequency of each effective text word in the effective text word set is determined.
In some embodiments, after segmentation and part-of-speech determination of the text, words such as stop words, conjunctions, and locative words will not become words related to the topic. To prevent such words from acting as interference when topics are generated, they are excluded from the effective word range. After filtering by part of speech is completed, some frequent words without special meaning also need to be removed. This is done mainly by counting the inverse document frequency (IDF) and keeping only the portion with higher IDF within the effective word range. The IDF of a word w over a document collection D is computed as IDF(w) = log(|D| / (1 + |{d ∈ D : w ∈ d}|)), where |D| is the total number of documents and |{d ∈ D : w ∈ d}| is the number of documents containing w.
In step 1026, when the inverse document frequency is greater than the inverse-document-frequency threshold, the effective text words whose inverse document frequencies are greater than the inverse-document-frequency threshold are combined to form the second word set.
In some embodiments, when the inverse document frequency of an effective text word is greater than the inverse-document-frequency threshold, it is determined that the effective text word belongs to the second word set, and its number of occurrences in the video text is obtained. Through the effective text words in the second word set and the occurrence count carried by each of them, combined with the effective visual words belonging to the first word set obtained in step 102 (which characterize the semantics of the video frames in the video frame set) and their occurrence counts, topic inference is subsequently performed on the video.
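A minimal sketch of steps 1025-1026 under the IDF formula above; the corpus representation, threshold, and tokenization are illustrative assumptions:

    import math

    def second_word_set(video_words: list, corpus: list, idf_min: float = 2.0):
        """Keep the words whose inverse document frequency exceeds the threshold."""
        def idf(word: str) -> float:
            df = sum(1 for doc in corpus if word in doc)   # documents containing the word
            return math.log(len(corpus) / (1 + df))
        selected = {}
        for word in set(video_words):
            if idf(word) > idf_min:
                # carry the occurrence count in the video text as an attribute
                selected[word] = video_words.count(word)
        return selected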
After step 1023 is executed (when the number of occurrences is greater than the effective-visual-word threshold, the effective visual words whose numbers of occurrences are greater than the effective-visual-word threshold are combined to form the first word set), step 1027 can also be performed.
In step 1027, the numbers of occurrences of the effective visual words in the first word set are mapped to the same value interval as the numbers of occurrences of the effective text words, to update the numbers of occurrences of the effective visual words; the effective text words are the words in the second word set that characterize the semantics of the video text.
Here, the effective visual words and the effective text words are, respectively, the effective visual words belonging to the first word set and the effective text words belonging to the second word set.
In some embodiments, the effective visual words in the first word set appear multiple times in the video frame set, and the effective text words in the second word set can also appear multiple times in the video text. When video topic inference is performed, the factors influencing the inference lie not only in the effective visual words in the first word set and the effective text words in the second word set; the numbers of occurrences of the effective visual words and effective text words also affect the result of topic inference. Here, each effective visual word in the first word set carries its own occurrence-count information, as does each effective text word in the second word set. Mapping the occurrence counts of the effective visual words in the first word set to the same value interval as the occurrence counts of the effective text words, so as to update the occurrence counts of the effective visual words, serves to bring the two kinds of occurrence counts into the same value interval.
In some embodiments, each effective visual word in the first word set, obtained from the multiple video frames in the video frame set, can appear many times in the video frame set. One reason is that the number of video frames is large: a video usually contains at least 24 frames per second, and a high-definition video contains 48 frames per second. The video text, by contrast, is usually the title of the video, which typically has a word-count limit, so the occurrence counts of the effective text words extracted from the video text are objectively far smaller than those of the effective visual words in the first word set, which would cause single visual words with excessive occurrence counts to dominate the topic. To prevent this, the occurrence counts of the effective visual words and of the effective text words are mapped to the same value interval. Here, the occurrence counts of the effective visual words can be uniformly recorded as 1, or they can be reduced by a preset multiple according to the number of video frames in the video frame set, the preset multiple being positively correlated with the frame count of the video frame set, so that the occurrence counts of the effective visual words and of the effective text words are mapped to the same value interval.
For example, when the video frame set contains 20 frames and the number of occurrences of effective visual word A in the first word set is 10, the occurrence count of effective visual word A is reduced by a factor of 5, so the value after mapping to the value interval is 2. When the video frame set contains 40 frames and the number of occurrences of effective visual word A in the first word set is 18, the occurrence count of effective visual word A is reduced by a factor of 10, so the value after mapping to the value interval is 1.8, which is recorded as 2.
In some embodiments, in order to map the occurrence counts of the effective visual words and of the effective text words to the same value interval, the occurrence counts of the effective text words can be kept unchanged while those of the effective visual words are scaled down; the occurrence counts of the effective visual words can be kept unchanged while those of the effective text words are scaled up; or the occurrence counts of the effective visual words and of the effective text words can be scaled up or down by different multiples, as long as the two kinds of occurrence counts end up in the same value interval.
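A minimal sketch of step 1027, assuming the "reduce by a preset multiple proportional to the frame count" variant with the factors from the example above (frame count divided by 4, an illustrative choice):

    def rescale_visual_counts(visual_counts: dict, n_frames: int) -> dict:
        """Map visual-word occurrence counts into the value interval of text-word counts."""
        factor = max(1, n_frames // 4)   # preset multiple, positively correlated with the frame count
        return {word: max(1, round(n / factor)) for word, n in visual_counts.items()}

    print(rescale_visual_counts({"A": 10}, n_frames=20))  # factor 5 -> {'A': 2}
    print(rescale_visual_counts({"A": 18}, n_frames=40))  # factor 10 -> 1.8, recorded as {'A': 2}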
Extracting the video frame set from the video and obtaining the video text corresponding to the video in step 101 can specifically include steps 1011-1012.
In step 1011, the video text corresponding to the video is obtained.
In step 1012, when the data volume of the obtained video text is insufficient, on its own, to determine the topic to which the video belongs, the video frame set is extracted from the video.
When the data volume of the obtained video text is less than a data-volume threshold, the obtained video text is insufficient on its own to determine the topic to which the video belongs. The data-volume threshold here can be set according to historical experience as the video text data volume that can be used on its own to determine the topic to which a video belongs.
After step 1011 in step 101 is executed, steps 105-107 can also be performed in place of the subsequent steps 102-104.
In step 105, when the data volume of the obtained video text is sufficient to determine the topic of the video on its own, the second word set is extracted from the video text.
In step 106, based on the second word set, the probability that the video belongs to each topic in the topic set is obtained.
In step 107, when a probability exceeds the topic probability threshold, the topic corresponding to that probability is determined to be the topic of the video.
In steps 105-107, when the data volume of the obtained video text is sufficient to determine the topic of the video on its own, the probability that the video belongs to each topic in the topic set can be obtained based only on the effective text words in the second word set. When the data volume of the obtained video text reaches the data-volume threshold, the video text alone suffices to determine the topic of the video; the data-volume threshold here is likewise set according to historical experience as the data volume that suffices, on its own, to determine the topic.
In some embodiments, when the semantics that the video text can express are rich enough, topic inference can be performed using the video text alone. In practice, the data-volume threshold serves as a proxy for semantic richness: when the data volume of the obtained video text reaches the threshold, the video text is considered semantically rich enough and topic inference uses the text alone; when it falls below the threshold, the video text is considered semantically insufficient, and topic inference combines the video content with the video text.
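A minimal sketch of this branching logic; extract_text_words, extract_visual_words, sample_frames, and topic_probabilities are hypothetical helper names standing in for the steps of this embodiment, and the threshold value is an assumption:

TEXT_DATA_THRESHOLD = 10  # assumed: minimum title word count for text-only inference

def infer_video_topics(video):
    # When the video text reaches the data-volume threshold, it is considered
    # semantically rich enough and is used alone (steps 105-107).
    text_words = extract_text_words(video.title)      # hypothetical helper
    if len(text_words) >= TEXT_DATA_THRESHOLD:
        return topic_probabilities(text_words)        # hypothetical helper
    # Otherwise combine the visual words of the frame set with the text words
    # (steps 102-104).
    visual_words = extract_visual_words(sample_frames(video.path))
    return topic_probabilities(visual_words + text_words)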
In step 103, based on the first word set and the second word set, the probability that the video belongs to each topic in the topic set is obtained.
In some embodiments, once the extraction of the effective visual words and effective text words is complete, the video frames and the text can be regarded as lying on the same semantic level, and Latent Dirichlet Allocation (LDA) is then used to determine the topic. Referring to Fig. 6, which is a schematic diagram of the principle of the LDA model in this embodiment of the present invention: the LDA model automatically builds K topics, so that all distributions are expanded over the K topics; the number of videos is D and the number of words is N. Words are generated in the LDA model as follows: a per-video topic distribution θ_d is sampled from the Dirichlet distribution with parameter α; the topic assignment z_{d,n} of each word is sampled from the multinomial topic distribution θ_d; a per-topic word distribution β_k is sampled from the Dirichlet distribution with parameter η; and the word w_{d,n} is finally sampled from the multinomial word distribution β_{z_{d,n}}. During training of the LDA model, Gibbs sampling is used to infer the topic distributions and word distributions of the model. Based on the effective visual words in the first word set and the effective text words in the second word set, each carrying its occurrence count, and on the topic and word distributions obtained during training, the probability that the video of step 101 belongs to each topic in the topic set is inferred using an iterative pseudo-count method; the computation of the topic probabilities is the reverse of the generative process for words.
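For concreteness, the following minimal sketch infers a video's topic distribution from the merged bag of visual and text words with an off-the-shelf LDA implementation (gensim); the toy corpus and word lists are placeholders, and gensim's built-in inference stands in for the Gibbs sampling and iterative pseudo-count procedure described above:

from gensim import corpora
from gensim.models import LdaModel

# Each training "document" is the merged word bag (effective visual words plus
# effective text words) of one video.
training_docs = [
    ["cat", "grass", "pet", "cute", "kitten"],
    ["car", "race", "track", "speed", "driver"],
]
dictionary = corpora.Dictionary(training_docs)
corpus = [dictionary.doc2bow(doc) for doc in training_docs]

# K topics; alpha and eta are the Dirichlet priors of the generative process.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               alpha="auto", eta="auto", passes=10)

# For a new video, merge its effective visual words (counts already mapped to
# the text-word value interval) with its effective text words, then infer.
new_video_words = ["cat", "pet", "kitten"]
bow = dictionary.doc2bow(new_video_words)
print(lda.get_document_topics(bow))  # [(topic_id, probability), ...]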
In some embodiments, the topics in the topic set are generated with the LDA model based on the effective visual words and effective text words. Each of the K topics generated by the LDA model contains visual words, text words, and the relevance of each word to the topic, where the relevance is given by the word distribution described above. The topics generated automatically at first are not necessarily effective topics: some may be overly broad topics, such as "star" or "teenager", that fail to express a definite meaning, and such broad topics are not effective topics. Based on the word distributions obtained above, when the relevance of a word to a topic exceeds a relevance threshold, the topic corresponding to that word is marked as a pending topic. An invalid-topic database is provisioned in the server in advance; the server uses it to judge whether a pending topic is an effective topic, and when a pending topic is determined not to be effective it is removed from the topic set, yielding a topic set of K effective topics.
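A sketch of this screening step, with an assumed relevance threshold and a toy stand-in for the server's invalid-topic database:

RELEVANCE_THRESHOLD = 0.05               # assumed value
INVALID_TOPIC_DB = {"star", "teenager"}  # toy stand-in for the database

def effective_topic_set(topics):
    """topics: dict mapping a topic label to its {word: relevance} distribution."""
    effective = {}
    for label, word_dist in topics.items():
        # A word whose relevance exceeds the threshold marks the topic as pending.
        pending = any(rel > RELEVANCE_THRESHOLD for rel in word_dist.values())
        # Pending topics listed in the invalid-topic database are removed.
        if pending and label not in INVALID_TOPIC_DB:
            effective[label] = word_dist
    return effective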
In step 104, when a probability exceeds the topic probability threshold, the topic corresponding to that probability is determined to be the topic of the video.
In some embodiments, multiple probability values can be obtained in step 103; each topic whose probability exceeds the topic probability threshold is determined to be a topic of the video, thereby completing the topic inference for the video.
After step 104, steps 108-109 can also be performed.
In step 108, in response to a login operation of a client, the historical browsing topics corresponding to the login account are obtained.
In some embodiments, after the topic of the video has been determined, video recommendation can be performed based on that topic. Referring to Fig. 9, which is a schematic structural diagram of the recommender system provided by this embodiment of the present invention: the recommender system can be roughly divided into a data layer, a recall layer, and a ranking layer. The data layer covers data generation and data storage; it mainly uses various data-processing tools to clean raw logs, process them into formatted data, and land them in different types of storage systems for use by downstream algorithms and models. In response to a login operation of the client, the historical browsing topics corresponding to the login account are obtained, where a historical browsing topic is a topic to which a video historically browsed under the login account belongs.
In step 109, when a topic of the video matches a historical browsing topic, the video is determined to be a video to be recommended.
In some embodiments, the recall layer of the recommender system mainly generates the candidate set for recommendation from angles such as the user's historical behavior and real-time behavior, using various trigger policies; the candidate set here is the set of videos to be recommended, and when the topic of the video of step 101 matches a historical browsing topic, the video is determined to be a video to be recommended and joins the candidate set. The recall layer can also merge candidate sets and filter them according to product rules. Even after merging and filtering, the candidate set is usually still too large for the online system to rank in full when a request arrives, so the recall layer generally also performs a coarse ranking of the merged candidate set and filters out candidates with low coarse-ranking scores. The ranking layer then mainly uses machine-learning models to perform fine ranking on the candidates screened by the recall layer.
Step 109 can be implemented by steps 1091-1092, which are described below.
In step 1091, based on the topic of the video and the historical browsing topic, the distance between the topic of the video and the historical browsing topic is determined.
In step 1092, when the distance is less than a distance threshold, the topic of the video is determined to match the historical browsing topic, and the video is determined to be a video to be recommended.
In some embodiments, the topic of the video is taken as the source string and the historical browsing topic as the target string, and the similarity between the source string and the target string is obtained with a minimum edit distance algorithm. The distance in step 1091 is the edit distance, that is, the minimum number of edit operations required to transform the source string into the target string. When the distance between the topic of the video and the historical browsing topic is less than the distance threshold, the topic of the video is determined to match the historical browsing topic, and the video is determined to be a video to be recommended.
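A minimal dynamic-programming edit distance for this matching step; the distance threshold is an assumed value:

def edit_distance(source, target):
    """Minimum number of edits turning the source string into the target string."""
    m, n = len(source), len(target)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[m][n]

DISTANCE_THRESHOLD = 2  # assumed value

def topics_match(video_topic, history_topic):
    return edit_distance(video_topic, history_topic) < DISTANCE_THRESHOLD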
In some embodiments, when the data volume of the obtained video text is below the data-volume threshold, the video text is considered semantically insufficient and cannot, on its own, determine the topic of the video. The topic processing method for a video of this embodiment of the present invention can also be implemented based on the first word set alone, specifically comprising the following steps.
In step 201, video frames are extracted from the video to form a video frame set.
In step 202, a first word set for characterizing the semantics of the video frames is obtained.
In step 203, based on the first word set, the probability that the video belongs to each topic in the topic set is obtained.
In step 204, when a probability exceeds the topic probability threshold, the topic corresponding to that probability is determined to be the topic of the video.
In some embodiments, extracting video frames from the video to form the video frame set in step 201 can be implemented through step 202A or step 202B.
In step 202A, frames are extracted from the video at sampling-time intervals to form the video frame set (a sketch of this sampling follows step 202B below).
In step 202B, key plot positions of the video are obtained, and the video frames corresponding to the key plot positions are extracted to form the video frame set.
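A minimal sketch of the interval sampling of step 202A, assuming OpenCV; the one-second sampling interval is an assumed parameter:

import cv2

def sample_frames(video_path, interval_sec=1.0):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 24.0  # fall back when FPS is unreported
    step = max(int(fps * interval_sec), 1)
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:  # keep one frame per sampling interval
            frames.append(frame)
        index += 1
    cap.release()
    return frames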
In some embodiments, obtaining the first word set for characterizing the semantics of the video frames in the video frame set in step 202 is implemented through steps 2021-2023.
In step 2021, the effective visual words of each video frame in the video frame set are obtained.
In step 2022, the occurrence count of each effective visual word across the video frames of the video frame set is obtained.
In step 2023, when the occurrence count of an effective visual word across the video frames of the video frame set is greater than the effective-visual-word threshold, the effective visual words whose occurrence counts are greater than the effective-visual-word threshold are combined to form the first word set.
In some embodiments, obtaining the effective visual words of each video frame in the video frame set in step 2021 is specifically implemented through steps 20211-20212.
In step 20211, image features are extracted from each video frame by a neural network model, and the extracted image features are concatenated and converted into visual-word probabilities for the corresponding visual words.
In step 20212, when a visual-word probability is greater than the visual-word probability threshold, the visual word corresponding to that visual-word probability is determined to be an effective visual word.
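A hedged sketch of steps 20211-20212 using a pretrained ResNet-101 from torchvision; the embodiment's model is trained on an ML-Images-derived vocabulary, so the ImageNet categories and the probability threshold here are stand-in assumptions:

import torch
from torchvision import models

weights = models.ResNet101_Weights.IMAGENET1K_V1
model = models.resnet101(weights=weights).eval()
preprocess = weights.transforms()
categories = weights.meta["categories"]

VISUAL_WORD_PROB_THRESHOLD = 0.3  # assumed threshold

def effective_visual_words(frame_image):
    """frame_image: a PIL.Image of one video frame."""
    batch = preprocess(frame_image).unsqueeze(0)
    with torch.no_grad():
        probs = torch.softmax(model(batch), dim=1)[0]
    # Step 20212: keep visual words whose probability exceeds the threshold.
    return [(categories[i], float(p)) for i, p in enumerate(probs)
            if float(p) > VISUAL_WORD_PROB_THRESHOLD]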
The following continues the description of the exemplary structure in which the topic processing device 255 for a video provided by this embodiment of the present invention is implemented as software modules. In some embodiments, as shown in Fig. 2, the software modules of the topic processing device 255 for a video stored in the memory 250 may include:
a video extraction module 2551, configured to extract video frames from a video to form a video frame set, and to obtain the video text corresponding to the video;
a word extraction module 2552, configured to obtain a first word set for characterizing the semantics of the video frames, and to extract a second word set from the video text;
a topic probability determination module 2553, configured to obtain, based on the first word set and the second word set, the probability that the video belongs to each topic in the topic set;
a topic determination module 2554, configured to determine, when a probability exceeds the topic probability threshold, that the topic corresponding to the probability is the topic of the video.
In some embodiments, the video extraction module 2551 is further configured to:
extract frames from the video at sampling-time intervals to form the video frame set; or
obtain key plot positions of the video, and extract the video frames corresponding to the key plot positions to form the video frame set.
In some embodiments, the word extraction module 2552 is further configured to:
obtain the effective visual words of each video frame in the video frame set;
obtain the occurrence count of each effective visual word across all video frames of the video frame set;
when the occurrence count of an effective visual word across all video frames of the video frame set is greater than the effective-visual-word threshold, combine the effective visual words whose occurrence counts are greater than the effective-visual-word threshold, to form the first word set.
In some embodiments, the word extraction module 2552 is further configured to:
extract image features from each video frame by a neural network model, and concatenate and convert the extracted image features into visual-word probabilities for the corresponding visual words;
when a visual-word probability is greater than the visual-word probability threshold, determine that the visual word corresponding to the probability is an effective visual word.
In some embodiments, the word extraction module 2552 is further configured to:
perform word segmentation and part-of-speech tagging on the video text by a conditional random field, to obtain an effective text word set;
determine the inverse document frequency of each effective text word in the effective text word set;
when an inverse document frequency is greater than the inverse-document-frequency threshold, combine the effective text words whose inverse document frequencies are greater than the inverse-document-frequency threshold, to form the second word set (a sketch of this extraction follows below).
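A minimal sketch of this extraction under stated assumptions: jieba's part-of-speech segmenter stands in for the conditional random field, the noun/verb filter approximates the effective text word set, and the IDF threshold is an assumed value:

import math
import jieba.posseg as pseg  # stand-in for the CRF segmenter

IDF_THRESHOLD = 2.0  # assumed value

def inverse_document_frequency(word, documents):
    """documents: list of word lists drawn from historical video texts."""
    df = sum(1 for doc in documents if word in doc)
    return math.log(len(documents) / (1 + df))

def second_word_set(video_title, documents):
    words = [w for w, flag in pseg.cut(video_title)
             if flag.startswith(("n", "v"))]  # keep content words
    return {w for w in words
            if inverse_document_frequency(w, documents) > IDF_THRESHOLD}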
In some embodiments, the word extraction module 2552 is further configured to:
map the occurrence counts of the effective visual words in the first word set to the same value interval as the occurrence counts of the effective text words, to update the occurrence counts of the effective visual words;
where an effective text word is a word in the second word set for characterizing the semantics of the video text.
In some embodiments, the video extraction module 2551 is further configured to:
obtain the video text corresponding to the video;
when the data volume of the obtained video text is insufficient to determine the topic of the video on its own, extract the video frame set from the video.
The word extraction module 2552 is further configured to: when the data volume of the obtained video text is sufficient to determine the topic of the video on its own, extract the second word set from the video text.
The topic probability determination module 2553 is further configured to: obtain, based on the second word set, the probability that the video belongs to each topic in the topic set.
The topic determination module 2554 is further configured to: when the probability exceeds the topic probability threshold, determine that the topic corresponding to the probability is the topic of the video.
In some embodiments, the topic processing device for a video further includes a historical topic obtaining module 2555 and a recommended video determination module 2556.
The historical topic obtaining module 2555 is configured to obtain, in response to a login operation of a client, the historical browsing topics corresponding to the login account.
The recommended video determination module 2556 is configured to determine the video as a video to be recommended when a topic of the video matches a historical browsing topic.
In some embodiments, the recommended video determination module 2556 is further configured to:
determine, based on the topic of the video and the historical browsing topic, the distance between the topic of the video and the historical browsing topic;
when the distance is less than the distance threshold, determine that the topic of the video matches the historical browsing topic, and determine the video as a video to be recommended.
In the following, an exemplary application of this embodiment of the present invention in a real application scenario is described.
Referring to Fig. 7, Fig. 7 is a flowchart of a practical application of the topic processing method for a video in this embodiment of the present invention.
In some embodiments, the main flow of the content-based topic processing method is as follows. A video is first obtained, and multimodal information is extracted from it; the modalities can be video and text. After the multimodal information is obtained, visual-word extraction from the video frames and effective-word extraction from the video title are performed separately; the visual words and effective words here correspond to the effective visual words and effective text words, respectively. For a received video, the video text and video frames are extracted. The text of the video is segmented and stop words are removed, and each word is checked against the effective text word set: if the word is an effective text word, its count is incremented by one. The effective text word set mostly comes from online data; after segmentation and stop-word removal, the IDF statistic of every word is computed, and the words with higher IDF values are retained as the effective text word set. After the video frames of the video are obtained, the visual-word probabilities of each frame are obtained using a ResNet-101 convolutional neural network model; when a visual-word probability exceeds the set threshold, the count of the corresponding effective visual word is incremented by one. After the visual-word occurrence counts over all video frames have been tallied, the effective visual words whose counts exceed the effective-visual-word threshold are retained, determining the effective visual words that occur in the video. The effective-visual-word threshold here is set according to the number of video frames: the more frames, the higher the threshold; the fewer frames, the lower the threshold, so as to prevent a single visual word with an excessive occurrence count from dominating the topic. The effective visual words are taken as the words characterizing the semantics of the video content, and their occurrence counts are smoothed so that they fall into the same value interval as those of the effective text words; here, the occurrence count of every effective visual word can be uniformly set to 1. The visual-word vocabulary of the ResNet-101 convolutional neural network model mostly comes from the classification taxonomy of the ML-Images project; the taxonomy was modified according to online video data, retaining the object categories suitable as visual words. Once the extraction of the effective visual words and effective text words of the video frames is complete, the video frames and the text can be regarded as lying on the same semantic level. Finally, topic inference is performed: similarity is computed between the occurrence counts of the obtained effective visual words and effective text words and the generated topics, and the topic with the highest similarity is taken as the topic of the current video. While generating the topics, the LDA model also obtains the topic distributions of the visual words and effective words, and those topic distributions are used for inference with the iterative pseudo-count method. Topic generation mainly uses the LDA model to extract a number of topics; each topic contains the previously obtained effective visual words and effective text words, together with the relevance of each word to the topic, where the relevance here is the word distribution. The similarity computation actually uses the generated LDA model and the iterative pseudo-count method to perform topic inference on the effective visual words and effective text words of a video request, obtaining the probability that the video request belongs to each topic; the topics whose probability exceeds the topic probability threshold are determined to be the topics of the video. After the whole flow ends, the next video request is processed.
Video recommendation can be performed based on the topic processing method for a video of this embodiment of the present invention. Referring to Fig. 8, Fig. 8 is a flowchart of applying the topic processing method for a video of this embodiment of the present invention to video recommendation.
After a topic has been inferred by the topic processing method for a video of the foregoing embodiment, the recommender system performs recommendation processing on videos in combination with the obtained historical browsing topics of the user, to obtain recommendation results. Because the topic to which a video belongs is inferred online by the content-based topic processing method, videos can be recommended to the user more accurately. Referring to Fig. 9, Fig. 9 is a schematic structural diagram of the recommender system provided by this embodiment of the present invention. The recommender system can be roughly divided into a data layer, a recall layer, and a ranking layer. The data layer covers data generation and data storage; it mainly uses various data-processing tools to clean raw logs, process them into formatted data, and land them in different types of storage systems for use by downstream algorithms and models. In response to a login operation of the client, the historical browsing topics corresponding to the login account are obtained, where a historical browsing topic is a topic to which a video historically browsed under the login account belongs. When a topic of the video matches a historical browsing topic, the video is determined to be a video to be recommended. The recall layer mainly generates the candidate set for recommendation from angles such as the user's historical behavior and real-time behavior, using various trigger policies; the candidate set here is the set of videos to be recommended, and a video whose topic matches a historical browsing topic is determined to be a video to be recommended and joins the candidate set. The recall layer can also merge candidate sets and filter them according to product rules; even after merging and filtering, the candidate set is usually still too large for the online system to rank in full when a request arrives, so the recall layer also performs a coarse ranking of the merged candidate set and filters out candidates with low coarse-ranking scores. The ranking layer mainly uses machine-learning models to perform coarse and fine ranking on the candidates screened by the recall layer.
An embodiment of the present invention provides a storage medium storing executable instructions which, when executed by a processor, cause the processor to execute the topic processing method for a video provided by the embodiments of the present invention, for example, the method shown in Figs. 3A-3D.
In some embodiments, the storage medium can be a memory such as an FRAM, a ROM, a PROM, an EPROM, an EEPROM, a flash memory, a magnetic surface memory, an optical disc, or a CD-ROM, or any device including one of, or any combination of, the above memories.
In some embodiments, the executable instructions can take the form of a program, software, a software module, a script, or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and can be deployed in any form, including as a standalone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
As an example, the executable instructions can, but need not, correspond to a file in a file system, and can be stored as part of a file that holds other programs or data, for example in one or more scripts in a Hyper Text Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (for example, files storing one or more modules, subprograms, or code sections).
As an example, the executable instructions can be deployed to be executed on one computing device, on multiple computing devices located at one site, or on multiple computing devices distributed across multiple sites and interconnected by a communication network.
The embodiments of the present invention consider not only the information of the text but also the visual words in the video, and can determine the topic to which a video belongs even when text information is extremely scarce. Topic distributions are generated over the visual words of the video frame modality and the effective words of the text modality, and the topics of new online videos are then inferred from those topic distributions. By combining content-based topic inference with title-based topic inference, the generated topics are more representative and more robust.
In summary, the embodiments of the present invention take into account the information of both the video frame modality and the text modality, so that the video topic can be determined through visual words when the text modality performs poorly. This enhances the accuracy of video topic inference, enabling the backend to recommend videos that match the user profile.
The above descriptions are merely embodiments of the present invention and are not intended to limit the protection scope of the present invention. Any modification, equivalent replacement, improvement, and the like made within the spirit and scope of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A topic processing method for a video, wherein the method comprises:
extracting video frames from a video to form a video frame set, and obtaining video text corresponding to the video;
obtaining a first word set for characterizing semantics of the video frames, and extracting a second word set from the video text;
obtaining, based on the first word set and the second word set, a probability that the video belongs to each topic in a topic set;
when the probability exceeds a topic probability threshold, determining that the topic corresponding to the probability is a topic to which the video belongs.
2. The method according to claim 1, wherein the extracting video frames from a video to form a video frame set comprises:
extracting frames from the video at sampling-time intervals to form the video frame set; or
obtaining key plot positions of the video, and extracting the video frames corresponding to the key plot positions to form the video frame set.
3. The method according to claim 1, wherein the obtaining a first word set for characterizing semantics of the video frames comprises:
obtaining effective visual words of each video frame in the video frame set;
obtaining an occurrence count of each effective visual word across all video frames of the video frame set;
when the occurrence count is greater than an effective-visual-word threshold, combining the effective visual words whose occurrence counts are greater than the effective-visual-word threshold, to form the first word set.
4. The method according to claim 3, wherein the obtaining effective visual words of each video frame in the video frame set comprises:
extracting image features from each video frame by a neural network model, and concatenating and converting the extracted image features into visual-word probabilities for the corresponding visual words;
when a visual-word probability is greater than a visual-word probability threshold, determining that the visual word corresponding to the visual-word probability is the effective visual word.
5. The method according to claim 1, wherein the extracting a second word set from the video text comprises:
performing word segmentation and part-of-speech tagging on the video text by a conditional random field, to obtain an effective text word set;
determining an inverse document frequency of each effective text word in the effective text word set;
when the inverse document frequency is greater than an inverse-document-frequency threshold, combining the effective text words whose inverse document frequencies are greater than the inverse-document-frequency threshold, to form the second word set.
6. The method according to claim 3, wherein the method further comprises:
mapping the occurrence counts of the effective visual words in the first word set to the same value interval as the occurrence counts of effective text words, to update the occurrence counts of the effective visual words;
wherein the effective text words are words in the second word set for characterizing semantics of the video text.
7. The method according to claim 1, wherein the extracting video frames from a video to form a video frame set, and obtaining video text corresponding to the video, comprises:
obtaining the video text corresponding to the video;
when a data volume of the obtained video text is insufficient to determine the topic of the video on its own, extracting video frames from the video to form the video frame set;
the method further comprising:
when the data volume of the obtained video text is sufficient to determine the topic of the video on its own, extracting the second word set from the video text;
obtaining, based on the second word set, the probability that the video belongs to each topic in the topic set;
when the probability exceeds the topic probability threshold, determining that the topic corresponding to the probability is the topic to which the video belongs.
8. The method according to claim 1, wherein the method further comprises:
in response to a login operation of a client, obtaining historical browsing topics corresponding to the login account;
when a topic of the video matches a historical browsing topic, determining the video as a video to be recommended.
9. The method according to claim 8, wherein the determining the video as a video to be recommended when a topic of the video matches a historical browsing topic comprises:
determining, based on the topic of the video and the historical browsing topic, a distance between the topic of the video and the historical browsing topic;
when the distance is less than a distance threshold, determining that the topic of the video matches the historical browsing topic, and determining the video as a video to be recommended.
10. A topic processing device for a video, wherein the device comprises:
a video extraction module, configured to extract video frames from a video to form a video frame set, and to obtain video text corresponding to the video;
a word extraction module, configured to obtain a first word set for characterizing semantics of the video frames, and to extract a second word set from the video text;
a topic probability determination module, configured to obtain, based on the first word set and the second word set, a probability that the video belongs to each topic in a topic set;
a topic determination module, configured to determine, when the probability exceeds a topic probability threshold, that the topic corresponding to the probability is a topic to which the video belongs.
CN201910770189.5A 2019-08-20 2019-08-20 Topic processing method and device for video, electronic equipment and storage medium Active CN110489593B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910770189.5A CN110489593B (en) 2019-08-20 2019-08-20 Topic processing method and device for video, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN110489593A true CN110489593A (en) 2019-11-22
CN110489593B CN110489593B (en) 2023-04-28

Family

ID=68552314

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910770189.5A Active CN110489593B (en) 2019-08-20 2019-08-20 Topic processing method and device for video, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110489593B (en)


Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070109446A1 (en) * 2005-11-15 2007-05-17 Samsung Electronics Co., Ltd. Method, medium, and system generating video abstract information
CN101281593A (en) * 2008-04-16 2008-10-08 安防科技(中国)有限公司 Method and system for researching intelligent video monitoring case
CN102902821A (en) * 2012-11-01 2013-01-30 北京邮电大学 Methods for labeling and searching advanced semantics of imagse based on network hot topics and device
CN103731737A (en) * 2013-12-19 2014-04-16 乐视网信息技术(北京)股份有限公司 Video information updating method and electronic device
CN108509465A (en) * 2017-02-28 2018-09-07 阿里巴巴集团控股有限公司 A kind of the recommendation method, apparatus and server of video data
CN109359636A (en) * 2018-12-14 2019-02-19 腾讯科技(深圳)有限公司 Video classification methods, device and server
CN109429078A (en) * 2017-08-24 2019-03-05 北京搜狗科技发展有限公司 Method for processing video frequency and device, for the device of video processing
CN109472232A (en) * 2018-10-31 2019-03-15 山东师范大学 Video semanteme characterizing method, system and medium based on multi-modal fusion mechanism
CN109657100A (en) * 2019-01-25 2019-04-19 深圳市商汤科技有限公司 Video Roundup generation method and device, electronic equipment and storage medium
CN110070067A (en) * 2019-04-29 2019-07-30 北京金山云网络技术有限公司 The training method of video classification methods and its model, device and electronic equipment


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SHENG LI et al.: "Visual to Text: Survey of Image and Video Captioning" *
LI Tao: "Research on Key Technologies of Video Retrieval Based on a Multi-Feature Feedback Fusion Mechanism" *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111562945A (en) * 2020-04-01 2020-08-21 杭州博雅鸿图视频技术有限公司 Multimedia processing method, device, equipment and storage medium
CN111562945B (en) * 2020-04-01 2021-12-21 杭州博雅鸿图视频技术有限公司 Multimedia processing method, device, equipment and storage medium
CN111556366A (en) * 2020-04-02 2020-08-18 北京达佳互联信息技术有限公司 Multimedia resource display method, device, terminal, server and system
CN113569091A (en) * 2021-02-08 2021-10-29 腾讯科技(深圳)有限公司 Video data processing method and device
CN114048348A (en) * 2021-10-14 2022-02-15 盐城金堤科技有限公司 Video quality scoring method and device, storage medium and electronic equipment
CN114449342A (en) * 2022-01-21 2022-05-06 腾讯科技(深圳)有限公司 Video recommendation method and device, computer readable storage medium and computer equipment
CN114449342B (en) * 2022-01-21 2024-02-27 腾讯科技(深圳)有限公司 Video recommendation method, device, computer readable storage medium and computer equipment

Also Published As

Publication number Publication date
CN110489593B (en) 2023-04-28


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant