CN109325148A - Method and apparatus for generating information - Google Patents


Info

Publication number
CN109325148A
CN109325148A
Authority
CN
China
Prior art keywords
video
label
identified
frame
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810878632.6A
Other languages
Chinese (zh)
Inventor
Li Fu (李甫)
Wen Shilei (文石磊)
Sun Hao (孙昊)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baidu Online Network Technology Beijing Co Ltd
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201810878632.6A priority Critical patent/CN109325148A/en
Publication of CN109325148A publication Critical patent/CN109325148A/en
Pending legal-status Critical Current

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Abstract

Embodiments of the present application disclose a method and apparatus for generating information. One specific embodiment of the method includes: obtaining a video to be identified; understanding the content of the video using video understanding technology to obtain video content labels; analyzing the text of the video using text data analysis technology to obtain video text labels; and determining the semantic labels of the video based on the video content labels and video text labels. Through video content understanding and video text analysis, this embodiment automatically extracts video content labels and video text labels and determines the semantic labels of the video to be identified from both, effectively improving the accuracy and completeness of those semantic labels.

Description

Method and apparatus for generating information
Technical field
This application relates to the field of computer technology, in particular to computer network technology, and more particularly to a method and apparatus for generating information.
Background technique
Short videos have become an important channel through which people obtain information. The number of short videos is increasing sharply, video sources are diversifying, and the share of UGC (User Generated Content) video is rising significantly, so helping users quickly find videos of interest has become an urgent problem. Current video recommendation and video retrieval techniques rely mainly on a video's text information; because short videos carry little text, there is as yet no mature solution for constructing general descriptive information from video content.
Existing video label extraction schemes mainly use the video's title data, applying NLP (natural language processing) techniques such as word segmentation, part-of-speech analysis, and entity analysis to extract candidate entity words from the title. The candidate words are then filtered against a manually built knowledge base for a given scenario (usually a multistage tag database) to obtain video labels and the labels' upper-level category information.
Summary of the invention
Embodiments of the present application provide a method and apparatus for generating information.
In a first aspect, an embodiment of the present application provides a method for generating information, including: obtaining a video to be identified; understanding the content of the video using video understanding technology to obtain video content labels; analyzing the text of the video using text data analysis technology to obtain video text labels; and determining the semantic labels of the video based on the video content labels and video text labels.
In some embodiments, understanding the video content using video understanding technology to obtain video content labels includes at least one of the following: inputting the video to be identified into a video classification model to obtain category labels; detecting faces frame by frame in the video frames, matching the detected faces against face samples in a face database, and obtaining the person-name label of the matching face sample together with the person-information labels associated with that name; identifying actions frame by frame in the video frames using a pre-trained action detection model, obtaining action information, and fusing the per-frame action information to obtain action labels; and identifying scenes and entities frame by frame in the video frames using a pre-trained recognition and classification model, fusing the recognition results to obtain the scene labels and entity labels in the video frames.
In some embodiments, inputting the video into a video classification model to obtain category labels includes: uniformly sampling frames from the video to obtain a video frame sequence; extracting features from the frame sequence with an image classification network to obtain the image feature sequence of the video; extracting the video's audio signal; inputting the audio signal into a speech classification convolutional neural network and extracting features for each second of audio to obtain the speech feature sequence of the video; determining, from the image feature sequence and the speech feature sequence, the probability that the video corresponds to each label; and taking the labels whose probability exceeds a threshold as the category labels of the video.
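The uniform frame sampling and the probability-threshold screening in the step above can be sketched in plain Python. This is a minimal illustration under assumed names and an assumed 0.5 cutoff, not the patent's implementation:

```python
def uniform_sample_indices(num_frames: int, num_samples: int) -> list:
    """Pick num_samples frame indices spread evenly across a video of num_frames frames."""
    if num_samples >= num_frames:
        return list(range(num_frames))
    step = num_frames / num_samples
    return [int(i * step) for i in range(num_samples)]

def labels_above_threshold(label_probs: dict, threshold: float = 0.5) -> list:
    """Keep every label whose predicted probability exceeds the threshold."""
    return [label for label, p in label_probs.items() if p > threshold]

print(uniform_sample_indices(100, 5))  # [0, 20, 40, 60, 80]
print(labels_above_threshold({"sports": 0.92, "news": 0.18, "music": 0.61}))  # ['sports', 'music']
```

The feature extraction itself would be performed by the trained image and speech networks; only the surrounding plumbing is shown here.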
In some embodiments, determining the per-label probabilities from the image feature sequence and the speech feature sequence includes: inputting the image feature sequence and the speech feature sequence into a pre-trained two-stream long short-term memory (LSTM) network to obtain the probability that the video corresponds to each label.
In some embodiments, the image classification network is trained from features of video frames modeled with a temporal segment network and the labels of the corresponding video samples; and/or the speech classification convolutional neural network is determined by the following steps: extracting Mel-scale filter bank features from the audio signals of video samples; and training the speech classification convolutional neural network on the Mel-scale filter bank features and the labels corresponding to the audio signals.
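The Mel-scale spacing behind those filter bank features can be computed directly. The sketch below uses the common HTK-style Hz/mel formula as an assumption (the patent does not specify a variant) and returns only the filter edge frequencies, not the full triangular filters:

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Convert a frequency in Hz to the (HTK-style) mel scale."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_edges(n_filters: int, f_min: float, f_max: float) -> list:
    """Edge frequencies (Hz) of n_filters triangular filters spaced evenly on the mel scale."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_filters + 1)
    return [mel_to_hz(m_min + i * step) for i in range(n_filters + 2)]

edges = mel_filter_edges(40, 0.0, 8000.0)
print(len(edges))  # 42 edge frequencies for 40 filters
```

In practice a library such as librosa would compute the full filter bank; the point here is only the mel-scale spacing.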
In some embodiments, the text of the video to be identified includes at least one of the following: the title text of the video; and text detected in the video frames using video OCR.
In some embodiments, analyzing the text of the video using text data analysis technology to obtain video text labels includes: extracting candidate entity labels for the video from a multistage tag database based on the video's text; and analyzing the part of speech and importance of the entity labels using NLP, then screening them to obtain the video text labels.
In some embodiments, determining the semantic labels of the video based on the video content labels and video text labels includes: determining, from a pre-established multistage tag database, the categories of the video content labels and the relationships between the video content labels and other labels in the database; analyzing, with natural language processing, the part of speech and importance of the video content labels, the labels of their categories, and the labels determined from those relationships; and, based on part of speech and importance, ranking and screening the video content labels and video text labels to obtain the semantic labels of the video.
In some embodiments, the method further includes: pushing videos to users based on the video text labels.
In a second aspect, an embodiment of the present application provides an apparatus for generating information, including: a video acquisition unit configured to obtain a video to be identified; a video understanding unit configured to understand the content of the video using video understanding technology to obtain video content labels; a video analysis unit configured to analyze the text of the video using text data analysis technology to obtain video text labels; and a label determination unit configured to determine the semantic labels of the video based on the video content labels and video text labels.
In some embodiments, the video understanding unit includes at least one of the following: a video classification subunit configured to input the video into a video classification model to obtain category labels; a face detection subunit configured to detect faces frame by frame in the video frames, match the detected faces against face samples in a face database, and obtain the person-name label of the matching face sample and the person-information labels associated with that name; an action recognition subunit configured to identify actions frame by frame using a pre-trained action detection model, obtain action information, and fuse the per-frame action information to obtain action labels; and a scene and entity recognition subunit configured to identify scenes and entities frame by frame using a pre-trained recognition and classification model and fuse the recognition results to obtain the scene labels and entity labels in the video frames.
In some embodiments, the video classification subunit includes: a video frame extraction subunit configured to uniformly sample frames from the video to obtain a video frame sequence; an image feature extraction subunit configured to extract features from the frame sequence with an image classification network to obtain the image feature sequence of the video; an audio signal extraction subunit configured to extract the video's audio signal; a speech feature extraction subunit configured to input the audio signal into a speech classification convolutional neural network and extract features for each second of audio to obtain the speech feature sequence of the video; a probability determination subunit configured to determine, from the image and speech feature sequences, the probability that the video corresponds to each label; and a category label determination subunit configured to take the labels whose probability exceeds a threshold as the category labels of the video.
In some embodiments, the probability determination subunit is further configured to input the image feature sequence and speech feature sequence into a pre-trained two-stream long short-term memory network to obtain the probability that the video corresponds to each label.
In some embodiments, the image classification network in the image feature extraction subunit is trained from features of video frames modeled with a temporal segment network and the labels of the corresponding video samples; and/or the speech classification convolutional neural network is determined by the following steps: extracting Mel-scale filter bank features from the audio signals of video samples; and training the speech classification convolutional neural network on the Mel-scale filter bank features and the labels corresponding to the audio signals.
In some embodiments, the text of the video to be identified in the video analysis unit includes at least one of the following: the title text of the video; and text detected in the video frames using video OCR.
In some embodiments, the video analysis unit is further configured to: extract candidate entity labels for the video from a multistage tag database based on the video's text; and analyze the part of speech and importance of the entity labels using NLP, screening them to obtain the video text labels.
In some embodiments, the label determination unit includes: a label relationship determination subunit configured to determine, from a pre-established multistage tag database, the categories of the video content labels and the relationships between the video content labels and other labels in the database; a part-of-speech and importance determination subunit configured to analyze, with natural language processing, the part of speech and importance of the video content labels, the labels of their categories, and the labels determined from those relationships; and a label ranking and screening subunit configured to rank and screen the video content labels and video text labels based on part of speech and importance to obtain the semantic labels of the video.
In some embodiments, the apparatus further includes: a video push unit configured to push videos to users based on the video text labels.
The third aspect, the embodiment of the present application provide a kind of equipment, comprising: one or more processors;Storage device is used In the one or more programs of storage;When one or more programs are executed by one or more processors, so that at one or more It manages device and realizes as above any method.
Fourth aspect, the embodiment of the present application provide a kind of computer-readable medium, are stored thereon with computer program, should As above any method is realized when program is executed by processor.
The method and apparatus provided by the embodiments of the present application for generating information, firstly, obtaining video to be identified;Later, it uses Video understands that technology understands the content of video to be identified, obtains video content label;Later, using text data analytical technology point The associated text for analysing video to be identified obtains video semanteme label;Finally, it is based on video content label and video semanteme label, Determine video tab to be identified.In this course, it can be understood by video content and videotext is analyzed, it is automatic to extract view Frequency content tab and videotext label, and determine based on video content label and videotext label the semanteme of video to be identified Label effectively improves the accuracy and integrality of the semantic label of video to be identified.
Detailed description of the invention
Other features, objects, and advantages will become more apparent by reading the following detailed description of non-limiting embodiments with reference to the accompanying drawings:
Fig. 1 is an exemplary system architecture diagram to which the present application may be applied;
Fig. 2 is a schematic flowchart of one embodiment of the method for generating information according to the present application;
Fig. 3 is a schematic diagram of an application scenario of the method for generating information according to an embodiment of the present application;
Fig. 4a is a schematic flowchart of one embodiment of a method for determining the category labels of a video to be identified in the method for generating information according to the present application;
Fig. 4b is an exemplary block diagram of one embodiment of the two-stream long short-term memory network in Fig. 4a;
Fig. 5 is a schematic structural diagram of one embodiment of the apparatus for generating information of the present application;
Fig. 6 is a schematic structural diagram of a computer system suitable for implementing the server of the embodiments of the present application.
Specific embodiment
The application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the related invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts relevant to the related invention.
It should be noted that, in the absence of conflict, the embodiments of the present application and the features in the embodiments may be combined with one another. The application is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.
As shown in Fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and servers 105, 106. The network 104 provides the medium for communication links between the terminal devices 101, 102, 103 and the servers 105, 106, and may include various connection types, such as wired links, wireless communication links, or fiber optic cables.
A user 110 may use the terminal devices 101, 102, 103 to interact with the servers 105, 106 through the network 104 to receive or send messages and the like. Various communication client applications may be installed on the terminal devices 101, 102, 103, such as search engine applications, shopping applications, instant messaging tools, email clients, social platform software, and video playback applications.
The terminal devices 101, 102, 103 may be various electronic devices with display screens, including but not limited to smartphones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (Moving Picture Experts Group Audio Layer IV), laptop computers, desktop computers, and so on.
The servers 105, 106 may be servers that provide various services, such as background servers that support the terminal devices 101, 102, 103. A background server may process, for example analyze, store, or compute, the data submitted by a terminal, and push the analysis, storage, or computation results to the terminal device.
It should be noted that, in practice, the method for generating information provided by the embodiments of the present application is generally executed by the servers 105, 106, and the apparatus for generating information is accordingly generally disposed in the servers 105, 106. However, when the performance of a terminal device satisfies the execution conditions of the method or the installation conditions of the apparatus, the method may also be executed by the terminal devices 101, 102, 103, and the apparatus for generating information may also be disposed in the terminal devices 101, 102, 103.
It should be understood that the numbers of terminal devices, networks, and servers in Fig. 1 are merely illustrative. There may be any number of terminal devices, networks, and servers as required by the implementation.
With continued reference to Fig. 2, a flow 200 of one embodiment of the method for generating information according to the present application is shown. The method for generating information includes the following steps:
Step 201: obtain a video to be identified.
In this embodiment, the electronic device on which the method for generating information runs (such as the server or terminal shown in Fig. 1) may obtain the video to be identified from a video library or from other terminals.
Step 202: understand the content of the video to be identified using video understanding technology to obtain video content labels.
In this embodiment, the content of the video to be identified may be understood from multiple dimensions of interest, for example video classification labels, person names and person information, action information, scene information, and entity information in the video. Here, a video understanding technique suitable for each dimension of interest may be adopted to understand the content of the video, for example an artificial intelligence recognition model that identifies the points of interest corresponding to each dimension.
In some optional implementations of this embodiment, understanding the video content using video understanding technology to obtain video content labels may include at least one of the following: inputting the video into a video classification model to obtain category labels; detecting faces frame by frame in the video frames, matching the detected faces against face samples in a face database, and obtaining the person-name label of the matching face sample and the person-information labels associated with that name; identifying actions frame by frame using a pre-trained action detection model, obtaining action information, and fusing the per-frame action information to obtain action labels; and identifying scenes and entities frame by frame using a pre-trained recognition and classification model and fusing the recognition results to obtain the scene labels and entity labels in the video frames.
In this implementation, a video classification model detects the video to obtain video labels; face detection is performed on the video, and the detected faces are matched against labeled samples to determine the person-name labels and associated person-information labels of the people in the video; an action detection model identifies the action information in the video frames and fuses the per-frame results to obtain action labels; and a recognition and classification model identifies the scenes and entities in the video frames and fuses the recognition results to obtain the scene labels and entity labels. Through these recognition and detection steps, video content labels for the video to be identified can be obtained from each dimension, improving the comprehensiveness and accuracy of the video content labels.
Here, the video classification model may be obtained by machine learning training. After training, it is a machine learning model with video classification ability that produces a classification result for an input video. The machine learning model may be a neural network model, a support vector machine, a logistic regression model, etc.; neural network models include, for example, convolutional neural networks, back-propagation neural networks, feedback neural networks, radial basis function neural networks, and self-organizing neural networks. When training the video classification model, a fine-grained general video label taxonomy may be constructed in advance, covering most common video text labels and commonly used video types.
When detecting faces in the video frames, face detection methods in the prior art or developed in the future may be used; the application does not limit this. For example, detection may be implemented with face detection methods such as Active Shape Models (ASM), Active Appearance Models (AAM), Cascaded Pose Regression (CPR), or deep convolutional neural networks (DCNN).
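Once a face has been detected and embedded, matching it against the face database reduces to a nearest-neighbor search. The sketch below assumes cosine similarity over embedding vectors and an illustrative 0.8 threshold; the patent does not prescribe a particular metric:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def match_face(query_embedding, face_database, threshold=0.8):
    """Return the name of the closest face sample above the threshold, or None."""
    best_name, best_sim = None, threshold
    for name, embedding in face_database.items():
        sim = cosine_similarity(query_embedding, embedding)
        if sim >= best_sim:
            best_name, best_sim = name, sim
    return best_name

face_db = {"Alice": [1.0, 0.0, 0.2], "Bob": [0.0, 1.0, 0.1]}
print(match_face([0.9, 0.1, 0.2], face_db))  # Alice
```

The person-information labels associated with the matched name would then be looked up in the same database.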
The action detection model here may likewise be implemented with action detection methods in the prior art or developed in the future; the application does not limit this. For example, action detection may be implemented with single-frame recognition methods or CNN-based recognition methods. After the action in each frame is detected, the per-frame recognition results may be fused to obtain objective and comprehensive action information for the video.
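One simple way to fuse per-frame action results, shown here purely as an illustrative assumption, is to collapse runs of consecutive frames that carry the same detected action into segments:

```python
def merge_action_segments(frame_actions):
    """Collapse consecutive frames with the same detected action into
    (action, start_frame, end_frame) segments."""
    segments = []
    for i, action in enumerate(frame_actions):
        if segments and segments[-1][0] == action:
            name, start, _ = segments[-1]
            segments[-1] = (name, start, i)  # extend the current segment
        else:
            segments.append((action, i, i))  # open a new segment
    return segments

frames = ["run", "run", "run", "jump", "jump", "run"]
print(merge_action_segments(frames))
# [('run', 0, 2), ('jump', 3, 4), ('run', 5, 5)]
```

Each distinct action surviving fusion could then become an action label for the video.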
The recognition and classification model here may also be obtained by machine learning training. After training, it is a machine learning model with recognition and classification ability that produces classification results for the scenes and entities in an input video. The machine learning model may be a neural network model, a support vector machine, a logistic regression model, etc.; neural network models include, for example, convolutional neural networks, back-propagation neural networks, feedback neural networks, radial basis function neural networks, and self-organizing neural networks.
When performing scene and entity recognition, a series of fine-grained scene and entity label taxonomies may be constructed in advance, such as vehicles and animals, and a fine-grained classification model trained for each vertical. The scenes and entities in the video frames are predicted frame by frame, and the scenes and entities extracted from each frame are then fused to obtain the main entity labels and scene labels in the video.
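A minimal fusion rule for the per-frame scene/entity predictions, assumed here for illustration, is to keep tags that recur in a sufficient fraction of frames, suppressing one-off detections:

```python
from collections import Counter

def fuse_frame_tags(per_frame_tags, min_ratio=0.3):
    """Keep scene/entity tags that appear in at least min_ratio of the frames."""
    n = len(per_frame_tags)
    counts = Counter(tag for tags in per_frame_tags for tag in set(tags))
    return sorted(tag for tag, c in counts.items() if c / n >= min_ratio)

frames = [["street", "car"], ["street", "car", "dog"], ["street"], ["street", "car"]]
print(fuse_frame_tags(frames))  # ['car', 'street']
```

Here "dog" appears in only one of four frames and is dropped, while "street" and "car" survive as the video's main scene and entity tags.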
Step 203: analyze the text of the video to be identified using text data analysis technology to obtain video text labels.
In this embodiment, the associated text may include the title text, additional information text (such as synopses and description information other than the title), text in the video frames, and so on. For the title text and additional information text, any text data analysis technique of the prior art or developed in the future may be used; the application does not limit this. For example, the title text and additional information text of the video may be segmented into words, and the part of speech and importance of each word determined, yielding the information of the title text and additional information text. For the text in the video frames, video text recognition is first used to detect the text, which is then further segmented and analyzed for part of speech and importance, yielding the detected text information. Finally, the video text labels are obtained from the information of the title text and additional information text together with the detected text information.
In some optional implementations of this embodiment, the text of the video may mainly include at least one of the following: the title text of the video; and the text in the video frames detected using video OCR. Restricting the text of the video to its title text and the text detected in its frames by video OCR can reduce the computation of the text data analysis.
In some optional implementations of this embodiment, analyzing the text of the video based on text data analysis technology may include: extracting candidate entity labels for the video from a multistage tag database based on the video's text; and analyzing the part of speech and importance of the entity labels with NLP, then screening them to obtain the video text labels.
In this implementation, the candidate entity labels corresponding to the associated text may be extracted from the multistage tag database, and NLP is further used to analyze the part of speech and importance of the entity labels, so that screening yields the video text labels.
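The candidate extraction and importance screening can be sketched as a simple two-stage filter. The substring matching and the precomputed importance scores below are illustrative assumptions; a real system would use proper word segmentation and NLP-derived scores:

```python
def candidate_entity_tags(text, tag_database):
    """Candidate entity labels: every tag from the multistage tag database found in the text."""
    return [tag for tag in tag_database if tag in text]

def screen_by_importance(candidates, importance, min_importance=0.5):
    """Keep candidates whose (precomputed) importance score passes the cutoff."""
    return [tag for tag in candidates if importance.get(tag, 0.0) >= min_importance]

tag_db = ["basketball", "NBA", "finals", "weather"]
importance = {"basketball": 0.9, "NBA": 0.8, "finals": 0.4}
title = "NBA finals: a classic basketball night"
candidates = candidate_entity_tags(title, tag_db)
print(screen_by_importance(candidates, importance))  # ['basketball', 'NBA']
```

The surviving tags would serve as the video text labels for the downstream semantic-label step.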
Step 204: determine the semantic labels of the video to be identified based on the video content labels and video text labels.
In this embodiment, the labels among the video content labels and video text labels above are ranked and screened based on predetermined label weights. In this way, the final semantic labels of the video to be identified can be obtained by reordering and re-screening the video content labels and video text labels.
In some optional implementations of this embodiment, determining the semantic labels of the video based on the video content labels and video text labels includes: determining, from a pre-established multistage tag database, the categories of the video content labels and the relationships between the video content labels and other labels in the database; analyzing, with natural language processing (NLP), the part of speech and importance of the video content labels, the labels of their categories, and the labels determined from those relationships; and, based on part of speech and importance, ranking and screening the video content labels and video text labels to obtain the semantic labels of the video.
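The ranking-and-screening step can be sketched as scoring each label by its importance weighted by its part of speech and keeping the top few. The POS weights and the top-3 cutoff are illustrative assumptions, not values from the patent:

```python
POS_WEIGHTS = {"noun": 1.0, "proper_noun": 1.2, "verb": 0.6, "adjective": 0.4}

def rank_and_screen(labels, top_k=3):
    """labels: list of (tag, pos, importance) tuples.
    Score = importance x POS weight; return the top_k tags by score."""
    scored = sorted(
        ((tag, imp * POS_WEIGHTS.get(pos, 0.3)) for tag, pos, imp in labels),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return [tag for tag, _ in scored[:top_k]]

labels = [
    ("basketball", "noun", 0.9),     # score 0.90
    ("NBA", "proper_noun", 0.8),     # score 0.96
    ("play", "verb", 0.7),           # score 0.42
    ("exciting", "adjective", 0.6),  # score 0.24
]
print(rank_and_screen(labels))  # ['NBA', 'basketball', 'play']
```

The combined pool of video content labels and video text labels would be fed through this kind of scorer to produce the final semantic labels.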
In this implementation, by establishing the relationships between the video content labels, the video text labels, and the labels in the multistage tag database, and by reordering and re-screening the video content labels and video text labels, the semantic labels of the video to be identified can be obtained.
In some optional implementations of this embodiment, the method further includes: pushing videos to users based on the video text labels. This can improve the accuracy of the videos pushed to users.
An exemplary application scenario of the method for generating information of the present application is described below in conjunction with Fig. 3.
As shown in Fig. 3, Fig. 3 shows a schematic flowchart of an application scenario of the method for generating information according to the present application.
As shown in Fig. 3, the method 300 for generating information runs in an electronic device 310 and may include:
first, obtaining a video 301 to be identified;
then, understanding the content of the video to be identified using video understanding technology 302, to obtain video content labels 303;
then, analyzing the text of the video to be identified using text data analysis technology 304, to obtain video text labels 305;
finally, determining the semantic label 306 of the video to be identified based on the video content labels 303 and the video text labels 305.
It should be appreciated that the application scenario of the method for generating information shown in Fig. 3 above is only an exemplary description of the method and does not limit it. For example, each step shown in Fig. 3 may further adopt a more detailed implementation.
With the method for generating information of the above embodiments of the present application, a video to be identified can be obtained; the content of the video to be identified is understood using video understanding technology to obtain video content labels; the associated text of the video to be identified is analyzed using text data analysis technology to obtain video text labels; and the semantic label of the video to be identified is determined based on the video content labels and the video text labels. In this process, video content labels and video text labels can be extracted automatically through video content understanding and video text analysis, and the semantic label of the video to be identified is determined from both, effectively improving the accuracy and completeness of the semantic label of the video to be identified. In some optional implementations, videos can be recommended to a user based on the semantic label of the video to be identified, which solves the cold-start recommendation problem for new videos, enables personalized recommendation, and improves the targeting of new videos pushed to the user.
Referring to Fig. 4, it illustrates a flow chart of one embodiment of a method for determining the classification labels of a video to be identified in the method for generating information according to the present application.
As shown in Fig. 4, the process 400 of the method for generating information of the present embodiment may include the following steps:
In step 401, video frames of the video to be identified are uniformly sampled, to obtain a sequence of video frames to be identified.
In the present embodiment, uniformly sampling video frames can greatly reduce the amount of data of the video to be identified, thereby speeding up the computation of the final result.
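The uniform sampling of step 401 can be sketched as follows. The center-of-segment strategy is an assumption, since the patent states only that frames are extracted uniformly.

```python
def uniform_sample(num_frames, num_samples):
    """Return the indices of frames sampled uniformly across the video."""
    if num_samples >= num_frames:
        # Short video: keep every frame.
        return list(range(num_frames))
    step = num_frames / num_samples
    # Take the frame at the middle of each of num_samples equal segments.
    return [int(step * i + step / 2) for i in range(num_samples)]
```

For a 100-frame video sampled at 4 frames, this yields indices 12, 37, 62, and 87, one per quarter of the video.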
In step 402, feature extraction is performed on the sequence of video frames to be identified using an image classification network, to obtain an image feature sequence of the video to be identified.
In the present embodiment, the image classification network is a trained convolutional neural network with image classification capability, which produces an image classification result from the features of each input image. The convolutional neural network may use AlexNet, VGG, GoogLeNet, ResNet, or the like as its backbone architecture.
In a specific example, the image classification network is trained from the features of video frames modeled with a Temporal Segment Network (TSN) and the labels corresponding to the video samples.
In this implementation, the TSN consists of a two-stream CNN, comprising a temporal convolutional neural network and a spatial convolutional neural network. After video clips are extracted from the video frames of a video sample, each clip containing one frame image, the sequence of video clips can be input into the two streams of the TSN respectively; each clip yields a clip-level feature, and the clip features are then fed into a segmental consensus module, which outputs the feature of the video. Based on the output features and the labels corresponding to the video samples, the image classification network can be trained.
In step 403, the audio signal of the video to be identified is extracted.
In the present embodiment, the audio signal of the video to be identified can be extracted using existing methods for extracting audio from video, which the present application does not limit. For example, the audio file of the video may be obtained directly, or a tool may be used to convert the video format into an audio format, thereby obtaining the audio signal.
In step 404, the audio signal of the video to be identified is input into a convolutional neural network for speech classification, and feature extraction is performed on each second of speech, to obtain a speech feature sequence of the video to be identified.
In the present embodiment, the convolutional neural network for speech classification is a trained convolutional neural network with speech classification capability, which produces an audio classification result from the features of each input audio. The convolutional neural network may use AlexNet, VGG, GoogLeNet, ResNet, or the like as its backbone architecture.
In a specific example, the convolutional neural network for speech classification is determined based on the following steps: extracting Mel-scale filter bank features from the audio signals of video samples; and training the convolutional neural network for speech classification based on the Mel-scale filter bank features and the labels corresponding to the audio signals.
In this implementation, the features extracted for the convolutional neural network for speech classification are the Mel-scale filter bank (Fbank) features of the audio signal; using these features and the labels corresponding to the audio signals of the video samples, the convolutional neural network for speech classification can be trained.
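The Mel-scale filter bank (Fbank) features mentioned above are built on the Mel frequency scale. Below is a minimal sketch of the standard Hz-to-Mel conversion and the placement of triangular-filter edge frequencies in FFT bins; the specific parameters (number of filters, FFT size, sample rate) are illustrative assumptions, not values from the patent.

```python
import math

def hz_to_mel(hz):
    """Standard conversion from Hertz to the Mel scale."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_filter_edges(num_filters, sample_rate, n_fft):
    """FFT-bin indices of the edges of a Mel-scale filter bank.

    Returns num_filters + 2 bin indices: edges are spaced evenly on the
    Mel scale between 0 Hz and the Nyquist frequency, then mapped back
    to linear frequency and finally to FFT bins.
    """
    low, high = hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0)
    mels = [low + (high - low) * i / (num_filters + 1)
            for i in range(num_filters + 2)]
    hzs = [700.0 * (10.0 ** (m / 2595.0) - 1.0) for m in mels]
    return [int((n_fft + 1) * h / sample_rate) for h in hzs]
```

Each adjacent triple of these edges defines one triangular filter; the log of each filter's output energy per frame gives one Fbank coefficient.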
In step 405, the image feature sequence and the speech feature sequence are input into a pre-trained dual-stream long short-term memory (LSTM) network, to obtain the probability value of the video to be identified corresponding to each label.
In the present embodiment, the pre-trained dual-stream LSTM network takes the image feature sequence and the speech feature sequence as input, then extracts features again from each sequence, considering the features of the object of study at different times separately. An attention mechanism is applied to merge the extracted image features into a longer vector and the speech features into another longer vector; the two merged vectors are then concatenated into a still longer vector. A fully connected layer maps the learned "distributed feature representation" to the sample label space, and a classifier finally determines the probability value of the video to be identified corresponding to each label.
In a specific example, the pre-trained dual-stream LSTM network can be illustrated with reference to Fig. 4b. As shown in Fig. 4b, the dual-stream LSTM network may include a bidirectional sequence model, an attention model, fully connected layers, and a sigmoid classifier. The bidirectional sequence model processes the RGB image feature sequence and the speech feature sequence of the input video to be identified recursively and separately; the attention model merges the processed image feature sequence into a longer vector and the speech feature sequence into another longer vector, and the two merged vectors are concatenated into a still longer vector. The learned "distributed feature representation" is finally mapped to the sample label space using two fully connected layers, improving the accuracy of the final classification result, and the sigmoid classifier determines the probability value of the video to be identified corresponding to each label. Since the sigmoid classifier has relatively good interference resistance, an artificial neural network built from sigmoid units also has good robustness.
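The attention-based merging of each feature stream into a longer vector, followed by concatenation, can be sketched as softmax-weighted pooling. This is a simplified illustration of the fusion step only; the recurrent layers, fully connected layers, and sigmoid classifier of Fig. 4b are omitted, and the scores would in practice be learned by the attention model.

```python
import math

def attention_pool(seq, scores):
    """Pool a feature sequence into one vector with softmax attention weights."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(seq[0])
    # Weighted sum of the per-step feature vectors.
    return [sum(w * frame[d] for w, frame in zip(weights, seq))
            for d in range(dim)]

def fuse_streams(image_seq, image_scores, audio_seq, audio_scores):
    """Concatenate the pooled image and speech vectors into one longer vector."""
    return attention_pool(image_seq, image_scores) + attention_pool(audio_seq, audio_scores)
```

With equal attention scores, the pooling reduces to a simple average over time; learned scores let the network emphasize the most informative frames or seconds of audio.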
Returning to Fig. 4a, in a specific example, the pre-trained dual-stream LSTM network is determined via the following steps: obtaining video samples with video labels; uniformly sampling the video frames of each video sample; performing feature extraction on the sampled video frames using the image classification network, to obtain the image feature sequence of the video sample; extracting the audio signal from the video sample; inputting the audio signal of the video sample into the convolutional neural network for speech classification and performing feature extraction on each second of speech, to obtain the speech feature sequence of the video sample; and training the dual-stream LSTM network using the image feature sequence and the speech feature sequence of the video sample as input and the video label of the video sample as output.
In this implementation, by training the dual-stream LSTM network with the image feature sequence and the speech feature sequence as input and the video label of the video sample as output, the features of the object of study at different times are considered separately to produce the output result, improving the accuracy of the classification results of the dual-stream LSTM network.
The above video samples can be obtained directly from the annotated tag set in an information flow library, or further data cleaning can be performed on the annotated tag set obtained from the information flow library to obtain the video samples used for training.
In a specific example, the video samples can be determined based on the following steps: obtaining the annotated tag set of all videos in an information flow database; sorting the annotated tags from high to low by frequency of occurrence; extracting a preset number of tags from the sorted annotated tags as a candidate tag set; screening the candidate tag set to filter out words that match the filtering rules; vectorizing the candidate tags in the filtered candidate tag set and computing the pairwise similarity between candidate tags; merging any two candidate tags whose similarity exceeds a predetermined threshold; judging whether the videos under each tag in the merged candidate tags have appearance consistency and semantic similarity, and filtering out ambiguous tags to obtain the selected tags; and constructing the video samples based on the selected tags.
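The first stages of this cleaning pipeline, sorting annotated tags by frequency and keeping a preset number after rule-based filtering, can be sketched as below. The filtering rule here is a simple stop-tag list, an assumption standing in for the manual rules (adjectives, verbs, non-visual words, star names) described later in this section.

```python
from collections import Counter

def build_candidate_tags(tag_lists, top_n, stop_tags=()):
    """Rank annotated tags by frequency; keep the top-n after filtering."""
    # Count how often each annotated tag appears across all videos.
    counts = Counter(tag for tags in tag_lists for tag in tags)
    # Sort from high to low frequency, dropping tags hit by the filtering rule.
    ranked = [tag for tag, _ in counts.most_common() if tag not in stop_tags]
    return ranked[:top_n]
```

In the patent's example, the same procedure is applied at much larger scale: all tags from a million-scale annotation set are counted and the top 10,000 kept as candidates.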
In this implementation, the selected tags can also be organized into a multi-level tag database according to their major classes and subclasses, so that the tag finally used can be adjusted according to the probability of the subclass tag. If the probability of a subclass tag is relatively high, it is considered credible, and its corresponding second-level and first-level tags can be output as well, increasing the number of tags and the tag granularity. If the probability of a subclass tag is relatively low, it is considered not credible, and the tag can be mapped to its second-level or first-level tag; on a coarser-grained tag, the accuracy rate is generally somewhat higher.
In a specific example of this implementation, since the videos in the Feed (information flow) library have million-scale outsourced annotation results, all the tag results can be collected, the tags sorted from high to low by frequency of occurrence, and the top 10,000 tags taken as the candidate tag set.
Then, these 10,000 entity tag words can be inspected manually to filter out the words that match the filtering rules, for example filtering out adjectives, verbs, words that cannot be visualized (such as tongue twisters), and star names (which can be recognized by face recognition technology and are therefore not added to the video tag set), as well as other words that do not meet the requirements for video tags.
Next, for each tag, its corresponding video content is watched to judge whether the videos under the same tag have appearance consistency and semantic similarity. For example, the tag "koala" is both the pet name of a kind of animal and the daughter of a certain star; because it is ambiguous, it is filtered out directly.
Finally, through the above steps, about 3,000 tags can be obtained, and each tag is built into a three-level hierarchy, such as sport -> ball games -> football. All video data corresponding to these tags are retained at the same time, totaling about 10 million videos, and these data can be used for subsequent model training. For example, third-level tags can be used directly for training: if the probability of a tag is relatively high, it is considered credible, and its corresponding second-level and first-level tags can also be output, increasing the number of tags and the tag granularity; if the probability of a tag is relatively low, it is considered not credible, and the tag can be mapped to its second-level or first-level tag, since on a coarser-grained tag the accuracy rate is generally somewhat higher.
In step 406, the labels whose probability values are greater than a threshold are determined as the classification labels of the video to be identified.
In the present embodiment, after the probability of the video to be identified corresponding to each label is determined, the labels whose probability values are greater than the threshold can be determined, as valuable labels, to be the classification labels of the video to be identified.
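Step 406 reduces to a simple threshold filter over the per-label probabilities. A minimal sketch is shown below; the threshold value of 0.5 is an assumption, since the patent does not specify it.

```python
def labels_above_threshold(probs, threshold=0.5):
    """Keep labels whose predicted probability exceeds the threshold."""
    return [label for label, p in probs.items() if p > threshold]
```

With independent sigmoid outputs, several labels can pass the threshold at once, so a video may receive multiple classification labels.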
In some optional implementations of the present embodiment, the method for generating information, on the basis of the embodiments described in Fig. 2 to Fig. 4 above, further includes the following steps: extracting the feature vector output by the fully connected layer of the dual-stream LSTM network; comparing this feature vector with the feature vectors of candidate videos, to obtain video similarities; and determining, from the candidate videos based on the video similarities, the videos to be recommended to the user. This implementation can improve the accuracy of the videos recommended to the user.
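The similarity comparison between the extracted feature vector and those of the candidate videos can be sketched with cosine similarity. The choice of cosine as the similarity measure and the top-k selection are assumptions, since the patent does not name the metric.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def recommend(query_vec, candidates, top_k=1):
    """Rank (name, vector) candidates by similarity to the query video."""
    ranked = sorted(candidates,
                    key=lambda item: cosine_similarity(query_vec, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_k]]
```

Because the vectors come from the fully connected layer of the trained network, videos with similar content cluster together in this feature space, which is what makes the nearest candidates reasonable recommendations.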
With the method for generating information of the above embodiments of the present application, an LSTM recurrent neural network can be used to model a complete event using the temporal structure of the video, while also considering the dual-stream features of image and speech, so that the output classification labels are more accurate and richer.
In the present embodiment, the two feature sequences serve as the input of the dual-stream LSTM network, so that they are merged at the feature sequence stage, and the final probability value of the video to be identified corresponding to each label is obtained from the merged features.
It should be appreciated that after the image feature sequence and the speech feature sequence are obtained in steps 401-404, the probability value of the video to be identified corresponding to each label can also be determined directly based on the image feature sequence and the speech feature sequence, and the labels whose probability values are greater than the threshold are determined as the classification labels of the video to be identified.
Specifically, after the image feature sequence and the speech feature sequence are obtained from the image classification network and the convolutional neural network for speech classification respectively, image classification labels and speech classification labels can be determined from the two feature sequences respectively, and the score of each label is then obtained from the preset weight and preset score of each label among the image classification labels and speech classification labels, thereby determining the probability value of the video to be identified corresponding to each label. The preset weights and preset scores here can be determined based on NLP (natural language processing) technology.
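This late-fusion alternative, combining per-stream label scores with preset weights, can be sketched as below. The per-stream weights and the additive combination rule are illustrative assumptions; the patent states only that preset weights and scores are used.

```python
def fuse_label_scores(image_labels, speech_labels,
                      image_weight=0.6, speech_weight=0.4):
    """Combine per-stream label scores into one weighted score per label."""
    fused = {}
    for label, score in image_labels.items():
        fused[label] = fused.get(label, 0.0) + image_weight * score
    for label, score in speech_labels.items():
        fused[label] = fused.get(label, 0.0) + speech_weight * score
    return fused
```

A label supported by both streams accumulates contributions from each, so it naturally outranks labels seen in only one stream; the fused scores can then be passed through the same threshold filter as in step 406.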
With further reference to Fig. 5, as an implementation of the methods shown in the figures above, the present application provides one embodiment of an apparatus for generating information. This apparatus embodiment corresponds to the method embodiments shown in Fig. 2 to Fig. 4, and the apparatus can be applied in various electronic devices.
As shown in Fig. 5, the apparatus 500 for generating information of the present embodiment may include: a video acquisition unit 510, configured to obtain a video to be identified; a video understanding unit 520, configured to understand the content of the video to be identified using video understanding technology, to obtain video content labels; a video analysis unit 530, configured to analyze the text of the video to be identified using text data analysis technology, to obtain video text labels; and a label determination unit 540, configured to determine the semantic label of the video to be identified based on the video content labels and the video text labels.
In some optional implementations of the present embodiment, the video understanding unit 520 includes at least one of the following: a video classification subunit 521, configured to input the video to be identified into a video classification model to obtain classification labels; a face detection subunit 522, configured to detect faces in the video frames of the video to be identified frame by frame, match the detected faces with face samples in a face database, and obtain the person name labels of the face samples matching the detected faces and the person information labels associated with those names; an action recognition subunit 523, configured to identify, frame by frame using a pre-trained action detection model, the actions in the video frames of the video to be identified, obtain action information, and merge the action information of each frame to obtain action labels; and a scene and entity recognition subunit 524, configured to identify, frame by frame using a pre-trained recognition and classification model, the scenes and entities in the video frames of the video to be identified, and fuse the recognition results, to obtain the scene labels and entity labels in the video frames.
In some optional implementations of the present embodiment, the video classification subunit 521 includes (not shown): a video frame extraction subunit, configured to uniformly sample the video frames of the video to be identified, to obtain a sequence of video frames to be identified; an image feature extraction subunit, configured to perform feature extraction on the sequence of video frames to be identified using an image classification network, to obtain an image feature sequence of the video to be identified; an audio signal extraction subunit, configured to extract the audio signal of the video to be identified; a speech feature extraction subunit, configured to input the audio signal of the video to be identified into a convolutional neural network for speech classification and perform feature extraction on each second of speech, to obtain a speech feature sequence of the video to be identified; a probability value determination subunit, configured to determine, based on the image feature sequence and the speech feature sequence, the probability value of the video to be identified corresponding to each label; and a classification label determination subunit, configured to determine the labels whose probability values are greater than a threshold as the classification labels of the video to be identified.
In some optional implementations of the present embodiment, the probability value determination subunit is further configured to: input the image feature sequence and the speech feature sequence into a pre-trained dual-stream LSTM network, to obtain the probability value of the video to be identified corresponding to each label.
In some optional implementations of the present embodiment, the image classification network in the image feature extraction subunit is trained from the features of video frames modeled with a temporal segment network and the labels corresponding to the video samples; and/or the convolutional neural network for speech classification in the speech feature extraction subunit is determined based on the following steps: extracting the Mel-scale filter bank features from the audio signals of the video samples; and training the convolutional neural network for speech classification based on the Mel-scale filter bank features and the labels corresponding to the audio signals.
In some optional implementations of the present embodiment, the text of the video to be identified in the video analysis unit includes at least one of the following: the title text of the video to be identified; and the text in the video frames of the video to be identified obtained using video OCR detection.
In some optional implementations of the present embodiment, the video analysis unit is further configured to: extract candidate entity labels of the video to be identified from the multi-level tag database based on the text of the video to be identified; and analyze the part of speech and importance of the entity labels based on NLP technology, screening them to obtain the video text labels.
In some optional implementations of the present embodiment, the label determination unit includes (not shown): a label relationship determination subunit, configured to determine, based on the pre-established multi-level tag database, the category to which each label in the video content labels belongs and the relationships between the labels in the video content labels and the other labels in the multi-level tag database; a part-of-speech and importance determination subunit, configured to use natural language processing techniques to analyze the part of speech and importance of each label in the video content labels, of the category to which it belongs, and of the labels determined from the relationships; and a label sorting and screening subunit, configured to sort and screen the labels in the video content labels and the video text labels based on part of speech and importance, to obtain the semantic label of the video to be identified.
In some optional implementations of the present embodiment, the apparatus further includes: a video pushing unit 550, configured to push videos to a user based on the video text labels.
It should be appreciated that all the units recorded in the apparatus 500 correspond to the steps in the methods described with reference to Fig. 2 to Fig. 4. Therefore, the operations and features described above for the methods are equally applicable to the apparatus 500 and the units contained therein, and are not repeated here.
Referring now to Fig. 6, it illustrates a schematic structural diagram of a computer system 600 of a server suitable for implementing the embodiments of the present application. The terminal device or server shown in Fig. 6 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present application.
As shown in Fig. 6, the computer system 600 includes a central processing unit (CPU) 601, which can execute various appropriate actions and processing according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage section 608 into a random access memory (RAM) 603. The RAM 603 also stores various programs and data required for the operation of the system 600. The CPU 601, the ROM 602, and the RAM 603 are connected to one another through a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
The following components are connected to the I/O interface 605: an input section 606 including a keyboard, a mouse, and the like; an output section 607 including a cathode-ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card or a modem. The communication section 609 performs communication processing via a network such as the Internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disc, a magneto-optical disc, or a semiconductor memory, is mounted on the drive 610 as needed, so that a computer program read therefrom can be installed into the storage section 608 as needed.
In particular, according to the embodiments of the present disclosure, the process described above with reference to the flow chart may be implemented as a computer software program. For example, an embodiment of the present disclosure includes a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for executing the method shown in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 609, and/or installed from the removable medium 611. When the computer program is executed by the central processing unit (CPU) 601, the above functions defined in the methods of the present application are performed. It should be noted that the computer-readable medium described herein may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example but not limited to, an electric, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above.
More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection with one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any appropriate combination of the above. In the present application, a computer-readable storage medium may be any tangible medium containing or storing a program that can be used by or in connection with an instruction execution system, apparatus, or device. In the present application, a computer-readable signal medium may include a data signal propagated in a baseband or as a part of a carrier wave, which carries computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any appropriate combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, and such a computer-readable medium can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium can be transmitted by any appropriate medium, including but not limited to: wireless, wire, optical cable, RF, or any appropriate combination of the above.
The flow charts and block diagrams in the accompanying drawings illustrate the possible architectures, functions, and operations of the systems, methods, and computer program products according to various embodiments of the present application. In this regard, each box in a flow chart or block diagram may represent a module, a program segment, or a part of code, and the module, program segment, or part of code contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions marked in the boxes may occur in an order different from that marked in the drawings. For example, two boxes shown in succession may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should also be noted that each box in the block diagrams and/or flow charts, and combinations of boxes in the block diagrams and/or flow charts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units described in the embodiments of the present application can be implemented by software or by hardware. The described units can also be provided in a processor; for example, a processor can be described as including a video acquisition unit, a video understanding unit, a video analysis unit, and a label determination unit. The names of these units do not, in certain cases, constitute a limitation on the units themselves; for example, the video acquisition unit can also be described as "a unit for obtaining a video to be identified".
As another aspect, the present application also provides a computer-readable medium, which may be included in the apparatus described in the above embodiments, or may exist alone without being assembled into the apparatus. The computer-readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to: obtain a video to be identified; understand the content of the video to be identified using video understanding technology, to obtain video content labels; analyze the text of the video to be identified using text data analysis technology, to obtain video text labels; and determine the semantic label of the video to be identified based on the video content labels and the video text labels.
The above description is only the preferred embodiments of the present application and an explanation of the applied technical principles. Those skilled in the art should appreciate that the scope of the invention involved in the present application is not limited to technical solutions formed by the specific combination of the above technical features, and should also cover, without departing from the above inventive concept, other technical solutions formed by any combination of the above technical features or their equivalent features, for example, technical solutions formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) the present application.

Claims (20)

1. A method for generating information, comprising:
obtaining a video to be identified;
understanding the content of the video to be identified using video understanding technology, to obtain video content labels;
analyzing the text of the video to be identified using text data analysis technology, to obtain video text labels; and
determining a semantic label of the video to be identified based on the video content labels and the video text labels.
2. The method according to claim 1, wherein the understanding the content of the video using video understanding technology to obtain video content labels comprises at least one of the following:
inputting the video to be identified into a video classification model, to obtain classification labels;
detecting faces in the video frames of the video to be identified frame by frame, matching the detected faces with face samples in a face database, and obtaining the person name labels of the face samples matching the detected faces and the person information labels associated with those names;
identifying, frame by frame using a pre-trained action detection model, the actions in the video frames of the video to be identified, obtaining action information, and merging the action information of each frame to obtain action labels; and
identifying, frame by frame using a pre-trained recognition and classification model, the scenes and entities in the video frames of the video to be identified, and fusing the recognition results, to obtain the scene labels and entity labels in the video frames.
3. The method according to claim 2, wherein the inputting the video to be identified into the video classification model to obtain a category label comprises:
uniformly sampling the video frames of the video to be identified to obtain a sequence of video frames to be identified;
performing feature extraction on the sequence of video frames using an image classification network to obtain an image feature sequence of the video to be identified;
extracting the audio signal of the video to be identified;
inputting the audio signal of the video to be identified into a convolutional neural network for speech classification, and performing feature extraction on each second of speech to obtain a speech feature sequence of the video to be identified;
determining, based on the image feature sequence and the speech feature sequence, the probability value of the video to be identified corresponding to each label;
determining labels whose probability value exceeds a threshold as the category labels of the video to be identified.
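Two steps of claim 3 lend themselves to a short sketch: uniform frame sampling and the final thresholding of per-label probabilities. The frame count, sample count, and label scores below are invented for illustration, and actual frame decoding is out of scope:

```python
# Minimal sketch of uniform frame sampling and probability thresholding.

def uniform_frame_indices(total_frames, num_samples):
    """Pick num_samples frame indices spread evenly across the video."""
    step = total_frames / num_samples
    return [int(i * step) for i in range(num_samples)]

def category_labels(label_probs, threshold=0.5):
    """Keep every label whose probability exceeds the threshold."""
    return [label for label, p in label_probs.items() if p > threshold]

print(uniform_frame_indices(300, 8))   # [0, 37, 75, 112, 150, 187, 225, 262]
print(category_labels({"sports": 0.9, "news": 0.2, "music": 0.6}))
```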
4. The method according to claim 3, wherein the determining, based on the image feature sequence and the speech feature sequence, the probability value of the video to be identified corresponding to each label comprises:
inputting the image feature sequence and the speech feature sequence into a pre-trained dual-stream long short-term memory network to obtain the probability value of the video to be identified corresponding to each label.
5. The method according to claim 3, wherein the image classification network is trained based on features of video frames modeled using a temporal segment network and labels corresponding to video samples; and/or
the convolutional neural network for speech classification is determined by the following steps: extracting mel-scale filter bank features from the audio signal of a video sample; and training the convolutional neural network for speech classification based on the mel-scale filter bank features and the labels corresponding to the audio signal.
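As background to the mel-scale filter bank features named in claim 5, a minimal sketch of how filter band edges are laid out on the mel scale (standard HTK-style formula). The framing, FFT, and triangular weighting that a full feature extractor needs are omitted, and the filter count and frequency range are arbitrary:

```python
import math

def hz_to_mel(f):
    # HTK-style mel scale conversion.
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_band_edges(num_filters, f_min, f_max):
    """Edge frequencies (Hz) of triangular filters spaced uniformly in mel."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (num_filters + 1)
    return [mel_to_hz(lo + i * step) for i in range(num_filters + 2)]

edges = mel_band_edges(num_filters=4, f_min=0.0, f_max=8000.0)
print([round(e, 1) for e in edges])
```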
6. The method according to claim 1, wherein the text of the video to be identified comprises at least one of the following:
the title text of the video to be identified;
text detected in the video frames of the video to be identified using video OCR.
7. The method according to claim 1 or 6, wherein the analyzing the text of the video to be identified using text data analysis technology to obtain video text labels comprises:
extracting candidate entity labels of the video to be identified from a multi-level tag database based on the text of the video to be identified;
analyzing the part of speech and importance of the entity labels based on NLP technology, and screening them to obtain video text labels.
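The screening step of claim 7 can be sketched as a filter over candidate entity labels by part of speech plus a ranking by importance. A real system would obtain both from an NLP toolkit; here the candidates, their parts of speech, and their importance scores are all invented stubs:

```python
# Hypothetical part-of-speech and importance screening of candidate labels.

CANDIDATES = [
    {"label": "basketball", "pos": "noun",   "importance": 0.9},
    {"label": "quickly",    "pos": "adverb", "importance": 0.8},
    {"label": "NBA",        "pos": "noun",   "importance": 0.95},
    {"label": "watch",      "pos": "verb",   "importance": 0.4},
]

def screen_text_labels(candidates, keep_pos=("noun",), min_importance=0.5):
    """Keep labels of the wanted parts of speech, ranked by importance."""
    kept = [c for c in candidates
            if c["pos"] in keep_pos and c["importance"] >= min_importance]
    kept.sort(key=lambda c: c["importance"], reverse=True)
    return [c["label"] for c in kept]

print(screen_text_labels(CANDIDATES))  # ['NBA', 'basketball']
```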
8. The method according to claim 1, wherein the determining the semantic label of the video to be identified based on the video content labels and the video text labels comprises:
determining, based on a pre-established multi-level tag database, the categories to which the labels in the video content labels belong, and the relationships between the labels in the video content labels and other labels in the multi-level tag database;
analyzing, using natural language processing technology, the part of speech and importance of the labels in the video content labels, the labels of the categories to which those labels belong, and the labels determined based on the relationships;
sorting and screening the labels in the video content labels and the video text labels based on the part of speech and importance to obtain the semantic label of the video to be identified.
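The multi-level tag database of claim 8 can be pictured as a label hierarchy in which each label points to its parent category, so that a content label can be expanded with the categories above it. The hierarchy below is invented purely for illustration:

```python
# Hypothetical multi-level tag hierarchy: label -> parent category.
PARENT = {"NBA": "basketball", "basketball": "sports", "sports": None}

def expand_with_categories(label):
    """Return the label plus every ancestor category in the hierarchy."""
    chain = []
    while label is not None:
        chain.append(label)
        label = PARENT.get(label)
    return chain

print(expand_with_categories("NBA"))  # ['NBA', 'basketball', 'sports']
```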
9. The method according to claim 1, wherein the method further comprises:
pushing a video to a user based on the video text labels.
10. An apparatus for generating information, comprising:
a video acquisition unit configured to obtain a video to be identified;
a video understanding unit configured to understand the content of the video to be identified using video understanding technology to obtain video content labels;
a video analysis unit configured to analyze the text of the video to be identified using text data analysis technology to obtain video text labels;
a label determination unit configured to determine the semantic label of the video to be identified based on the video content labels and the video text labels.
11. The apparatus according to claim 10, wherein the video understanding unit includes at least one of the following:
a video classification subunit configured to input the video to be identified into a video classification model to obtain a category label;
a face detection subunit configured to detect faces in the video frames of the video to be identified frame by frame, match the detected faces against face samples in a face database, and obtain the person-name label of the face sample that matches a detected face and a person-information label associated with that name;
an action recognition subunit configured to identify, using a pre-trained action detection model, the actions in the video frames of the video to be identified frame by frame to obtain action information, and fuse the action information of each frame to obtain an action label;
a scene and entity recognition subunit configured to identify, using a pre-trained recognition and classification model, the scenes and entities in the video frames of the video to be identified frame by frame, and fuse the recognition results to obtain scene labels and entity labels in the video frames.
12. The apparatus according to claim 11, wherein the video classification subunit includes:
a video frame extraction subunit configured to uniformly sample the video frames of the video to be identified to obtain a sequence of video frames to be identified;
an image feature extraction subunit configured to perform feature extraction on the sequence of video frames using an image classification network to obtain an image feature sequence of the video to be identified;
an audio signal extraction subunit configured to extract the audio signal of the video to be identified;
a speech feature extraction subunit configured to input the audio signal of the video to be identified into a convolutional neural network for speech classification and perform feature extraction on each second of speech to obtain a speech feature sequence of the video to be identified;
a probability value determination subunit configured to determine, based on the image feature sequence and the speech feature sequence, the probability value of the video to be identified corresponding to each label;
a category label determination subunit configured to determine labels whose probability value exceeds a threshold as the category labels of the video to be identified.
13. The apparatus according to claim 12, wherein the probability value determination subunit is further configured to:
input the image feature sequence and the speech feature sequence into a pre-trained dual-stream long short-term memory network to obtain the probability value of the video to be identified corresponding to each label.
14. The apparatus according to claim 12, wherein the image classification network in the image feature extraction subunit is trained based on features of video frames modeled using a temporal segment network and labels corresponding to video samples; and/or
the convolutional neural network for speech classification in the speech feature extraction subunit is determined by the following steps: extracting mel-scale filter bank features from the audio signal of a video sample; and training the convolutional neural network for speech classification based on the mel-scale filter bank features and the labels corresponding to the audio signal.
15. The apparatus according to claim 10, wherein the text of the video to be identified in the video analysis unit includes at least one of the following:
the title text of the video to be identified;
text detected in the video frames of the video to be identified using video OCR.
16. The apparatus according to claim 10 or 15, wherein the video analysis unit is further configured to:
extract candidate entity labels of the video to be identified from a multi-level tag database based on the text of the video to be identified;
analyze the part of speech and importance of the entity labels based on NLP technology, and screen them to obtain video text labels.
17. The apparatus according to claim 10, wherein the label determination unit includes:
a label relationship determination subunit configured to determine, based on a pre-established multi-level tag database, the categories to which the labels in the video content labels belong, and the relationships between the labels in the video content labels and other labels in the multi-level tag database;
a part-of-speech and importance determination subunit configured to analyze, using natural language processing technology, the part of speech and importance of the labels in the video content labels, the labels of the categories to which those labels belong, and the labels determined based on the relationships;
a label sorting and screening subunit configured to sort and screen the labels in the video content labels and the video text labels based on the part of speech and importance to obtain the semantic label of the video to be identified.
18. The apparatus according to claim 10, wherein the apparatus further includes:
a video push unit configured to push a video to a user based on the video text labels.
19. A server, comprising:
one or more processors; and
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1-9.
20. A computer-readable medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the method according to any one of claims 1-9.
CN201810878632.6A 2018-08-03 2018-08-03 The method and apparatus for generating information Pending CN109325148A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810878632.6A CN109325148A (en) 2018-08-03 2018-08-03 The method and apparatus for generating information

Publications (1)

Publication Number Publication Date
CN109325148A true CN109325148A (en) 2019-02-12

Family

ID=65263242

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810878632.6A Pending CN109325148A (en) 2018-08-03 2018-08-03 The method and apparatus for generating information

Country Status (1)

Country Link
CN (1) CN109325148A (en)

Cited By (63)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109275027A (en) * 2018-09-26 2019-01-25 Tcl海外电子(惠州)有限公司 Speech output method, electronic playback devices and the storage medium of video
CN109819284A (en) * 2019-02-18 2019-05-28 平安科技(深圳)有限公司 A kind of short video recommendation method, device, computer equipment and storage medium
CN109886335A (en) * 2019-02-21 2019-06-14 厦门美图之家科技有限公司 Disaggregated model training method and device
CN109933688A (en) * 2019-02-13 2019-06-25 北京百度网讯科技有限公司 Determine the method, apparatus, equipment and computer storage medium of video labeling information
CN109947989A (en) * 2019-03-18 2019-06-28 北京字节跳动网络技术有限公司 Method and apparatus for handling video
CN110147711A (en) * 2019-02-27 2019-08-20 腾讯科技(深圳)有限公司 Video scene recognition methods, device, storage medium and electronic device
CN110213668A (en) * 2019-04-29 2019-09-06 北京三快在线科技有限公司 Generation method, device, electronic equipment and the storage medium of video title
CN110222234A (en) * 2019-06-14 2019-09-10 北京奇艺世纪科技有限公司 A kind of video classification methods and device
CN110267097A (en) * 2019-06-26 2019-09-20 北京字节跳动网络技术有限公司 Video pushing method, device and electronic equipment based on characteristic of division
CN110263214A (en) * 2019-06-21 2019-09-20 北京百度网讯科技有限公司 Generation method, device, server and the storage medium of video title
CN110278447A (en) * 2019-06-26 2019-09-24 北京字节跳动网络技术有限公司 Video pushing method, device and electronic equipment based on continuous feature
CN110287371A (en) * 2019-06-26 2019-09-27 北京字节跳动网络技术有限公司 Video pushing method, device and electronic equipment end to end
CN110300329A (en) * 2019-06-26 2019-10-01 北京字节跳动网络技术有限公司 Video pushing method, device and electronic equipment based on discrete features
CN110503076A (en) * 2019-08-29 2019-11-26 腾讯科技(深圳)有限公司 Video classification methods, device, equipment and medium based on artificial intelligence
CN110532433A (en) * 2019-09-03 2019-12-03 北京百度网讯科技有限公司 Entity recognition method, device, electronic equipment and the medium of video scene
CN110598651A (en) * 2019-09-17 2019-12-20 腾讯科技(深圳)有限公司 Information processing method, device and storage medium
CN110674349A (en) * 2019-09-27 2020-01-10 北京字节跳动网络技术有限公司 Video POI (Point of interest) identification method and device and electronic equipment
CN110688526A (en) * 2019-11-07 2020-01-14 山东舜网传媒股份有限公司 Short video recommendation method and system based on key frame identification and audio textualization
CN110704680A (en) * 2019-08-20 2020-01-17 咪咕文化科技有限公司 Label generation method, electronic device and storage medium
CN110769267A (en) * 2019-10-30 2020-02-07 北京达佳互联信息技术有限公司 Video display method and device, electronic equipment and storage medium
CN110781348A (en) * 2019-10-25 2020-02-11 北京威晟艾德尔科技有限公司 Video file analysis method
CN110781347A (en) * 2019-10-23 2020-02-11 腾讯科技(深圳)有限公司 Video processing method, device, equipment and readable storage medium
CN110826471A (en) * 2019-11-01 2020-02-21 腾讯科技(深圳)有限公司 Video label labeling method, device, equipment and computer readable storage medium
CN111143611A (en) * 2019-12-31 2020-05-12 新疆联海创智信息科技有限公司 Information acquisition method and device
CN111222011A (en) * 2020-01-06 2020-06-02 腾讯科技(深圳)有限公司 Video vector determination method and device
CN111274442A (en) * 2020-03-19 2020-06-12 聚好看科技股份有限公司 Method for determining video label, server and storage medium
CN111444331A (en) * 2020-03-12 2020-07-24 腾讯科技(深圳)有限公司 Content-based distributed feature extraction method, device, equipment and medium
CN111582360A (en) * 2020-05-06 2020-08-25 北京字节跳动网络技术有限公司 Method, apparatus, device and medium for labeling data
CN111586494A (en) * 2020-04-30 2020-08-25 杭州慧川智能科技有限公司 Intelligent strip splitting method based on audio and video separation
CN111626202A (en) * 2020-05-27 2020-09-04 北京百度网讯科技有限公司 Method and device for identifying video
CN111695422A (en) * 2020-05-06 2020-09-22 Oppo(重庆)智能科技有限公司 Video tag acquisition method and device, storage medium and server
CN111737523A (en) * 2020-04-22 2020-10-02 聚好看科技股份有限公司 Video tag, search content generation method and server
WO2020199904A1 (en) * 2019-04-02 2020-10-08 腾讯科技(深圳)有限公司 Video description information generation method, video processing method, and corresponding devices
CN111767765A (en) * 2019-04-01 2020-10-13 Oppo广东移动通信有限公司 Video processing method and device, storage medium and electronic equipment
CN111767796A (en) * 2020-05-29 2020-10-13 北京奇艺世纪科技有限公司 Video association method, device, server and readable storage medium
CN111797850A (en) * 2019-04-09 2020-10-20 Oppo广东移动通信有限公司 Video classification method and device, storage medium and electronic equipment
CN111859947A (en) * 2019-04-24 2020-10-30 北京嘀嘀无限科技发展有限公司 Text processing device and method, electronic equipment and storage medium
CN112052356A (en) * 2020-08-14 2020-12-08 腾讯科技(深圳)有限公司 Multimedia classification method, apparatus and computer-readable storage medium
CN112115299A (en) * 2020-09-17 2020-12-22 北京百度网讯科技有限公司 Video searching method and device, recommendation method, electronic device and storage medium
CN112163560A (en) * 2020-10-22 2021-01-01 腾讯科技(深圳)有限公司 Video information processing method and device, electronic equipment and storage medium
CN112784111A (en) * 2021-03-12 2021-05-11 有半岛(北京)信息科技有限公司 Video classification method, device, equipment and medium
CN112822506A (en) * 2021-01-22 2021-05-18 百度在线网络技术(北京)有限公司 Method and apparatus for analyzing video stream
WO2021099858A1 (en) * 2019-11-19 2021-05-27 International Business Machines Corporation Video segmentation based on weighted knowledge graph
CN112948631A (en) * 2019-12-11 2021-06-11 北京金山云网络技术有限公司 Video tag generation method and device and electronic terminal
CN113038175A (en) * 2021-02-26 2021-06-25 北京百度网讯科技有限公司 Video processing method and device, electronic equipment and computer readable storage medium
CN113032342A (en) * 2021-03-03 2021-06-25 北京车和家信息技术有限公司 Video labeling method and device, electronic equipment and storage medium
CN113132752A (en) * 2019-12-30 2021-07-16 阿里巴巴集团控股有限公司 Video processing method and device
CN113163272A (en) * 2020-01-07 2021-07-23 海信集团有限公司 Video editing method, computer device and storage medium
CN113254814A (en) * 2021-05-12 2021-08-13 平安国际智慧城市科技股份有限公司 Network course video labeling method and device, electronic equipment and medium
CN113365102A (en) * 2020-03-04 2021-09-07 阿里巴巴集团控股有限公司 Video processing method and device and label processing method and device
CN113382279A (en) * 2021-06-15 2021-09-10 北京百度网讯科技有限公司 Live broadcast recommendation method, device, equipment, storage medium and computer program product
CN113435443A (en) * 2021-06-28 2021-09-24 中国兵器装备集团自动化研究所有限公司 Method for automatically identifying landmark from video
CN113569088A (en) * 2021-09-27 2021-10-29 腾讯科技(深圳)有限公司 Music recommendation method and device and readable storage medium
CN113673427A (en) * 2021-08-20 2021-11-19 北京达佳互联信息技术有限公司 Video identification determination method and device, electronic equipment and storage medium
CN113821681A (en) * 2021-09-17 2021-12-21 深圳力维智联技术有限公司 Video tag generation method, device and equipment
CN113901263A (en) * 2021-09-30 2022-01-07 宿迁硅基智能科技有限公司 Label generating method and device for video material
CN113987267A (en) * 2021-10-28 2022-01-28 上海数禾信息科技有限公司 Video file label generation method and device, computer equipment and storage medium
CN114140673A (en) * 2022-02-07 2022-03-04 人民中科(济南)智能技术有限公司 Illegal image identification method, system and equipment
CN114693353A (en) * 2022-03-31 2022-07-01 方付春 Electronic commerce data processing method, electronic commerce system and cloud platform
CN116028593A (en) * 2022-12-14 2023-04-28 北京百度网讯科技有限公司 Character identity information recognition method and device in text, electronic equipment and medium
CN116680624A (en) * 2023-08-03 2023-09-01 国网浙江省电力有限公司宁波供电公司 Classification method, system and storage medium for metadata of power system
CN117573870A (en) * 2023-11-20 2024-02-20 中国人民解放军国防科技大学 Text label extraction method, device, equipment and medium for multi-mode data
CN111859947B (en) * 2019-04-24 2024-05-10 北京嘀嘀无限科技发展有限公司 Text processing device, method, electronic equipment and storage medium


Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1909195A1 (en) * 2006-10-05 2008-04-09 Kubj Limited Various methods and apparatuses for moving thumbnails with metadata
CN103164471A (en) * 2011-12-15 2013-06-19 盛乐信息技术(上海)有限公司 Recommendation method and system of video text labels
CN103377381A (en) * 2012-04-26 2013-10-30 富士通株式会社 Method and device for identifying content attribute of image
US20150139610A1 (en) * 2013-11-15 2015-05-21 Clipmine, Inc. Computer-assisted collaborative tagging of video content for indexing and table of contents generation
CN105046630A (en) * 2014-04-04 2015-11-11 影像搜索者公司 image tag add system
CN105095288A (en) * 2014-05-14 2015-11-25 腾讯科技(深圳)有限公司 Data analysis method and data analysis device
CN105930841A (en) * 2016-05-13 2016-09-07 百度在线网络技术(北京)有限公司 Method and device for automatic semantic annotation of image, and computer equipment
CN108229662A (en) * 2018-01-03 2018-06-29 华南理工大学 A kind of multi-modal time series modeling method based on two benches study
CN108228911A (en) * 2018-02-11 2018-06-29 北京搜狐新媒体信息技术有限公司 The computational methods and device of a kind of similar video
CN108256513A (en) * 2018-03-23 2018-07-06 中国科学院长春光学精密机械与物理研究所 A kind of intelligent video analysis method and intelligent video record system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
罗世操 (Luo Shicao): "Research on Image Semantic Extraction and Image Retrieval Technology Based on Deep Learning", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (102)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109275027A (en) * 2018-09-26 2019-01-25 Tcl海外电子(惠州)有限公司 Speech output method, electronic playback devices and the storage medium of video
CN109933688A (en) * 2019-02-13 2019-06-25 北京百度网讯科技有限公司 Determine the method, apparatus, equipment and computer storage medium of video labeling information
CN109819284A (en) * 2019-02-18 2019-05-28 平安科技(深圳)有限公司 A kind of short video recommendation method, device, computer equipment and storage medium
CN109819284B (en) * 2019-02-18 2022-11-15 平安科技(深圳)有限公司 Short video recommendation method and device, computer equipment and storage medium
CN109886335B (en) * 2019-02-21 2021-11-26 厦门美图之家科技有限公司 Classification model training method and device
CN109886335A (en) * 2019-02-21 2019-06-14 厦门美图之家科技有限公司 Disaggregated model training method and device
CN110147711A (en) * 2019-02-27 2019-08-20 腾讯科技(深圳)有限公司 Video scene recognition methods, device, storage medium and electronic device
CN110147711B (en) * 2019-02-27 2023-11-14 腾讯科技(深圳)有限公司 Video scene recognition method and device, storage medium and electronic device
CN109947989A (en) * 2019-03-18 2019-06-28 北京字节跳动网络技术有限公司 Method and apparatus for handling video
CN109947989B (en) * 2019-03-18 2023-08-29 北京字节跳动网络技术有限公司 Method and apparatus for processing video
CN111767765A (en) * 2019-04-01 2020-10-13 Oppo广东移动通信有限公司 Video processing method and device, storage medium and electronic equipment
WO2020199904A1 (en) * 2019-04-02 2020-10-08 腾讯科技(深圳)有限公司 Video description information generation method, video processing method, and corresponding devices
US11861886B2 (en) 2019-04-02 2024-01-02 Tencent Technology (Shenzhen) Company Limited Method and apparatus for generating video description information, and method and apparatus for video processing
CN111797850A (en) * 2019-04-09 2020-10-20 Oppo广东移动通信有限公司 Video classification method and device, storage medium and electronic equipment
CN111859947A (en) * 2019-04-24 2020-10-30 北京嘀嘀无限科技发展有限公司 Text processing device and method, electronic equipment and storage medium
CN111859947B (en) * 2019-04-24 2024-05-10 北京嘀嘀无限科技发展有限公司 Text processing device, method, electronic equipment and storage medium
CN110213668A (en) * 2019-04-29 2019-09-06 北京三快在线科技有限公司 Generation method, device, electronic equipment and the storage medium of video title
CN110222234A (en) * 2019-06-14 2019-09-10 北京奇艺世纪科技有限公司 A kind of video classification methods and device
CN110263214A (en) * 2019-06-21 2019-09-20 北京百度网讯科技有限公司 Generation method, device, server and the storage medium of video title
CN110278447A (en) * 2019-06-26 2019-09-24 北京字节跳动网络技术有限公司 Video pushing method, device and electronic equipment based on continuous feature
CN110267097A (en) * 2019-06-26 2019-09-20 北京字节跳动网络技术有限公司 Video pushing method, device and electronic equipment based on characteristic of division
CN110300329B (en) * 2019-06-26 2022-08-12 北京字节跳动网络技术有限公司 Video pushing method and device based on discrete features and electronic equipment
CN110287371A (en) * 2019-06-26 2019-09-27 北京字节跳动网络技术有限公司 Video pushing method, device and electronic equipment end to end
CN110278447B (en) * 2019-06-26 2021-07-20 北京字节跳动网络技术有限公司 Video pushing method and device based on continuous features and electronic equipment
CN110300329A (en) * 2019-06-26 2019-10-01 北京字节跳动网络技术有限公司 Video pushing method, device and electronic equipment based on discrete features
CN110704680A (en) * 2019-08-20 2020-01-17 咪咕文化科技有限公司 Label generation method, electronic device and storage medium
CN110704680B (en) * 2019-08-20 2022-10-04 咪咕文化科技有限公司 Label generation method, electronic device and storage medium
CN110503076B (en) * 2019-08-29 2023-06-30 腾讯科技(深圳)有限公司 Video classification method, device, equipment and medium based on artificial intelligence
CN110503076A (en) * 2019-08-29 2019-11-26 腾讯科技(深圳)有限公司 Video classification methods, device, equipment and medium based on artificial intelligence
CN110532433A (en) * 2019-09-03 2019-12-03 北京百度网讯科技有限公司 Entity recognition method, device, electronic equipment and the medium of video scene
CN110598651B (en) * 2019-09-17 2021-03-12 腾讯科技(深圳)有限公司 Information processing method, device and storage medium
CN110598651A (en) * 2019-09-17 2019-12-20 腾讯科技(深圳)有限公司 Information processing method, device and storage medium
CN110674349A (en) * 2019-09-27 2020-01-10 北京字节跳动网络技术有限公司 Video POI (Point of interest) identification method and device and electronic equipment
CN110674349B (en) * 2019-09-27 2023-03-14 北京字节跳动网络技术有限公司 Video POI (Point of interest) identification method and device and electronic equipment
CN110781347A (en) * 2019-10-23 2020-02-11 腾讯科技(深圳)有限公司 Video processing method, device, equipment and readable storage medium
CN110781348A (en) * 2019-10-25 2020-02-11 北京威晟艾德尔科技有限公司 Video file analysis method
CN110769267A (en) * 2019-10-30 2020-02-07 北京达佳互联信息技术有限公司 Video display method and device, electronic equipment and storage medium
CN110769267B (en) * 2019-10-30 2022-02-08 北京达佳互联信息技术有限公司 Video display method and device, electronic equipment and storage medium
CN110826471B (en) * 2019-11-01 2023-07-14 腾讯科技(深圳)有限公司 Video tag labeling method, device, equipment and computer readable storage medium
CN110826471A (en) * 2019-11-01 2020-02-21 腾讯科技(深圳)有限公司 Video label labeling method, device, equipment and computer readable storage medium
CN110688526A (en) * 2019-11-07 2020-01-14 山东舜网传媒股份有限公司 Short video recommendation method and system based on key frame identification and audio textualization
WO2021099858A1 (en) * 2019-11-19 2021-05-27 International Business Machines Corporation Video segmentation based on weighted knowledge graph
US11093755B2 (en) 2019-11-19 2021-08-17 International Business Machines Corporation Video segmentation based on weighted knowledge graph
CN114746857B (en) * 2019-11-19 2023-05-09 国际商业机器公司 Video segmentation based on weighted knowledge graph
CN114746857A (en) * 2019-11-19 2022-07-12 国际商业机器公司 Video segmentation based on weighted knowledge graph
GB2605723A (en) * 2019-11-19 2022-10-12 Ibm Video segmentation based on weighted knowledge graph
CN112948631A (en) * 2019-12-11 2021-06-11 北京金山云网络技术有限公司 Video tag generation method and device and electronic terminal
CN113132752A (en) * 2019-12-30 2021-07-16 阿里巴巴集团控股有限公司 Video processing method and device
CN113132752B (en) * 2019-12-30 2023-02-24 阿里巴巴集团控股有限公司 Video processing method and device
CN111143611B (en) * 2019-12-31 2024-01-16 新疆联海创智信息科技有限公司 Information acquisition method and device
CN111143611A (en) * 2019-12-31 2020-05-12 新疆联海创智信息科技有限公司 Information acquisition method and device
CN111222011B (en) * 2020-01-06 2023-11-14 腾讯科技(深圳)有限公司 Video vector determining method and device
CN111222011A (en) * 2020-01-06 2020-06-02 腾讯科技(深圳)有限公司 Video vector determination method and device
CN113163272A (en) * 2020-01-07 2021-07-23 海信集团有限公司 Video editing method, computer device and storage medium
CN113365102B (en) * 2020-03-04 2022-08-16 阿里巴巴集团控股有限公司 Video processing method and device and label processing method and device
CN113365102A (en) * 2020-03-04 2021-09-07 阿里巴巴集团控股有限公司 Video processing method and device and label processing method and device
CN111444331A (en) * 2020-03-12 2020-07-24 腾讯科技(深圳)有限公司 Content-based distributed feature extraction method, device, equipment and medium
CN111444331B (en) * 2020-03-12 2023-04-07 腾讯科技(深圳)有限公司 Content-based distributed feature extraction method, device, equipment and medium
CN111274442A (en) * 2020-03-19 2020-06-12 聚好看科技股份有限公司 Method for determining video label, server and storage medium
CN111274442B (en) * 2020-03-19 2023-10-27 聚好看科技股份有限公司 Method for determining video tag, server and storage medium
CN111737523B (en) * 2020-04-22 2023-11-14 聚好看科技股份有限公司 Video tag, generation method of search content and server
CN111737523A (en) * 2020-04-22 2020-10-02 聚好看科技股份有限公司 Video tag, search content generation method and server
CN111586494A (en) * 2020-04-30 2020-08-25 杭州慧川智能科技有限公司 Intelligent strip splitting method based on audio and video separation
CN111695422A (en) * 2020-05-06 2020-09-22 Oppo(重庆)智能科技有限公司 Video tag acquisition method and device, storage medium and server
CN111582360A (en) * 2020-05-06 2020-08-25 北京字节跳动网络技术有限公司 Method, apparatus, device and medium for labeling data
CN111695422B (en) * 2020-05-06 2023-08-18 Oppo(重庆)智能科技有限公司 Video tag acquisition method and device, storage medium and server
CN111582360B (en) * 2020-05-06 2023-08-15 北京字节跳动网络技术有限公司 Method, apparatus, device and medium for labeling data
CN111626202A (en) * 2020-05-27 2020-09-04 北京百度网讯科技有限公司 Method and device for identifying video
CN111626202B (en) * 2020-05-27 2023-08-29 北京百度网讯科技有限公司 Method and device for identifying video
US11657612B2 (en) 2020-05-27 2023-05-23 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for identifying video
CN111767796B (en) * 2020-05-29 2023-12-15 北京奇艺世纪科技有限公司 Video association method, device, server and readable storage medium
CN111767796A (en) * 2020-05-29 2020-10-13 北京奇艺世纪科技有限公司 Video association method, device, server and readable storage medium
CN112052356A (en) * 2020-08-14 2020-12-08 腾讯科技(深圳)有限公司 Multimedia classification method, apparatus and computer-readable storage medium
CN112052356B (en) * 2020-08-14 2023-11-24 腾讯科技(深圳)有限公司 Multimedia classification method, apparatus and computer readable storage medium
CN112115299A (en) * 2020-09-17 2020-12-22 北京百度网讯科技有限公司 Video searching method and device, recommendation method, electronic device and storage medium
CN112163560A (en) * 2020-10-22 2021-01-01 腾讯科技(深圳)有限公司 Video information processing method and device, electronic equipment and storage medium
CN112163560B (en) * 2020-10-22 2024-03-05 腾讯科技(深圳)有限公司 Video information processing method and device, electronic equipment and storage medium
CN112822506A (en) * 2021-01-22 2021-05-18 百度在线网络技术(北京)有限公司 Method and apparatus for analyzing video stream
CN113038175A (en) * 2021-02-26 2021-06-25 北京百度网讯科技有限公司 Video processing method and device, electronic equipment and computer readable storage medium
CN113032342A (en) * 2021-03-03 2021-06-25 北京车和家信息技术有限公司 Video labeling method and device, electronic equipment and storage medium
CN113032342B (en) * 2021-03-03 2023-09-05 北京车和家信息技术有限公司 Video labeling method and device, electronic equipment and storage medium
CN112784111A (en) * 2021-03-12 2021-05-11 有半岛(北京)信息科技有限公司 Video classification method, device, equipment and medium
CN113254814A (en) * 2021-05-12 2021-08-13 平安国际智慧城市科技股份有限公司 Network course video labeling method and device, electronic equipment and medium
CN113382279A (en) * 2021-06-15 2021-09-10 北京百度网讯科技有限公司 Live broadcast recommendation method, device, equipment, storage medium and computer program product
CN113382279B (en) * 2021-06-15 2022-11-04 北京百度网讯科技有限公司 Live broadcast recommendation method, device, equipment, storage medium and computer program product
CN113435443A (en) * 2021-06-28 2021-09-24 中国兵器装备集团自动化研究所有限公司 Method for automatically identifying landmark from video
CN113673427B (en) * 2021-08-20 2024-03-22 北京达佳互联信息技术有限公司 Video identification method, device, electronic equipment and storage medium
CN113673427A (en) * 2021-08-20 2021-11-19 北京达佳互联信息技术有限公司 Video identification determination method and device, electronic equipment and storage medium
CN113821681B (en) * 2021-09-17 2023-09-26 深圳力维智联技术有限公司 Video tag generation method, device and equipment
CN113821681A (en) * 2021-09-17 2021-12-21 深圳力维智联技术有限公司 Video tag generation method, device and equipment
CN113569088A (en) * 2021-09-27 2021-10-29 腾讯科技(深圳)有限公司 Music recommendation method and device and readable storage medium
CN113569088B (en) * 2021-09-27 2021-12-21 腾讯科技(深圳)有限公司 Music recommendation method and device and readable storage medium
CN113901263A (en) * 2021-09-30 2022-01-07 宿迁硅基智能科技有限公司 Label generating method and device for video material
CN113987267A (en) * 2021-10-28 2022-01-28 上海数禾信息科技有限公司 Video file label generation method and device, computer equipment and storage medium
CN114140673A (en) * 2022-02-07 2022-03-04 人民中科(济南)智能技术有限公司 Illegal image identification method, system and equipment
CN114140673B (en) * 2022-02-07 2022-05-20 人民中科(北京)智能技术有限公司 Method, system and equipment for identifying violation image
CN114693353A (en) * 2022-03-31 2022-07-01 方付春 Electronic commerce data processing method, electronic commerce system and cloud platform
CN116028593A (en) * 2022-12-14 2023-04-28 北京百度网讯科技有限公司 Character identity information recognition method and device in text, electronic equipment and medium
CN116680624B (en) * 2023-08-03 2023-10-20 国网浙江省电力有限公司宁波供电公司 Classification method, system and storage medium for metadata of power system
CN116680624A (en) * 2023-08-03 2023-09-01 国网浙江省电力有限公司宁波供电公司 Classification method, system and storage medium for metadata of power system
CN117573870A (en) * 2023-11-20 2024-02-20 中国人民解放军国防科技大学 Text label extraction method, device, equipment and medium for multi-mode data
CN117573870B (en) * 2023-11-20 2024-05-07 中国人民解放军国防科技大学 Text label extraction method, device, equipment and medium for multi-mode data

Similar Documents

Publication Publication Date Title
CN109325148A (en) The method and apparatus for generating information
CN109117777A (en) The method and apparatus for generating information
CN111259215B (en) Multi-mode-based topic classification method, device, equipment and storage medium
CN112632385A (en) Course recommendation method and device, computer equipment and medium
CN111090763B (en) Picture automatic labeling method and device
CN111708913B (en) Label generation method and device and computer readable storage medium
CN109034069A (en) Method and apparatus for generating information
CN113254711B (en) Interactive image display method and device, computer equipment and storage medium
CN111783712A (en) Video processing method, device, equipment and medium
CN113761253A (en) Video tag determination method, device, equipment and storage medium
Soltanian et al. Hierarchical concept score postprocessing and concept-wise normalization in CNN-based video event recognition
CN112015928A (en) Information extraction method and device of multimedia resource, electronic equipment and storage medium
CN108062416B (en) Method and apparatus for generating label on map
CN115131698A (en) Video attribute determination method, device, equipment and storage medium
CN111488813A (en) Video emotion marking method and device, electronic equipment and storage medium
CN115248855A (en) Text processing method and device, electronic equipment and computer readable storage medium
CN116955591A (en) Recommendation language generation method, related device and medium for content recommendation
CN116977701A (en) Video classification model training method, video classification method and device
CN112565903A (en) Video recommendation method and device, server and storage medium
CN116010545A (en) Data processing method, device and equipment
CN116955707A (en) Content tag determination method, device, equipment, medium and program product
WO2021147084A1 (en) Systems and methods for emotion recognition in user-generated video (UGV)
CN115114469A (en) Picture identification method, device and equipment and storage medium
CN115130453A (en) Interactive information generation method and device
CN111782762A (en) Method and device for determining similar questions in question answering application and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination