CN109344289A - Method and apparatus for generating information - Google Patents

Method and apparatus for generating information

Info

Publication number: CN109344289A
Authority: CN (China)
Prior art keywords: video, processed, image, audio, people
Legal status: Granted; currently active
Application number: CN201811110667.1A
Other languages: Chinese (zh)
Other versions: CN109344289B
Inventor: 陈日伟
Current Assignee: Douyin Vision Co Ltd; Douyin Vision Beijing Co Ltd
Original Assignee: Beijing ByteDance Network Technology Co Ltd
Application filed by Beijing ByteDance Network Technology Co Ltd
Priority to CN201811110667.1A
Publication of CN109344289A (application); publication of CN109344289B (grant)


Abstract

The embodiments of the present application disclose a method and apparatus for generating information. One specific embodiment of the method includes: obtaining a to-be-processed video; extracting the image sequence corresponding to the to-be-processed video; in response to determining that the images in the image sequence contain no person image, obtaining the audio corresponding to the to-be-processed video; and determining the category information of the to-be-processed video according to the audio, where the category information indicates whether the to-be-processed video is related to a target group. This embodiment helps improve the diversity and flexibility of the ways in which to-be-processed videos can be processed.

Description

Method and apparatus for generating information
Technical field
The embodiments of the present application relate to the field of computer technology, and in particular to a method and apparatus for generating information.
Background art
With the flourishing of multimedia information technology, countless pieces of image and video information are generated on the network every day. Generally, a multimedia information display platform detects and classifies the multimedia information to be reviewed. Detection methods for videos are typically based on detecting the content of each image in the image sequence corresponding to the video.
Summary of the invention
The embodiments of the present application propose a method and apparatus for generating information.
In a first aspect, an embodiment of the present application provides a method for generating information, the method comprising: obtaining a to-be-processed video; extracting the image sequence corresponding to the to-be-processed video; in response to determining that the images in the image sequence contain no person image, obtaining the audio corresponding to the to-be-processed video; and determining the category information of the to-be-processed video according to the audio, where the category information indicates whether the to-be-processed video is related to a target group.
In some embodiments, in response to determining that the images in the image sequence contain a person image and the contained person image does not meet a preset condition, the audio corresponding to the to-be-processed video is obtained, and the category information of the to-be-processed video is determined according to the audio.
In some embodiments, in response to determining that the images in the image sequence contain a person image and the contained person image meets the preset condition, the person images contained in the images of the image sequence corresponding to the to-be-processed video are extracted to obtain a person image set, and the category information of the to-be-processed video is determined according to the person image set.
In some embodiments, determining the category information of the to-be-processed video according to the audio comprises: inputting the audio into a pre-trained category detection model to obtain the probability that the person corresponding to the voice in the audio belongs to the target group, where the category detection model is used to characterize the correspondence between an audio and the probability that the person corresponding to the voice in the audio belongs to the target group; and in response to determining that the obtained probability is greater than a target probability threshold, determining category information indicating that the to-be-processed video is related to the target group as the category information of the to-be-processed video.
In some embodiments, determining the category information of the to-be-processed video according to the audio further comprises: in response to determining that the obtained probability is less than the target probability threshold, determining category information indicating that the to-be-processed video is unrelated to the target group as the category information of the to-be-processed video.
In some embodiments, the category detection model is obtained by training as follows: obtaining an initial category determination model, where the initial category determination model comprises an initial category detection model and an initial classification model connected to the initial category detection model, the initial classification model taking the output of the initial category detection model as input and taking annotation information indicating whether the person corresponding to the voice in the audio belongs to the target group as output; obtaining a training sample set, where each training sample comprises an audio and annotation information indicating whether the person corresponding to the voice in the audio belongs to the target group; using a machine learning method, taking the audio in each training sample of the training sample set as the input of the initial category determination model and the annotation information corresponding to the input audio as the desired output of the initial category determination model, and training to obtain a trained category determination model; and determining the initial category detection model comprised in the trained category determination model as the category detection model.
In a second aspect, an embodiment of the present application provides an apparatus for generating information, the apparatus comprising: a video obtaining unit configured to obtain a to-be-processed video; an image sequence extraction unit configured to extract the image sequence corresponding to the to-be-processed video; and a determination unit configured to, in response to determining that the images in the image sequence contain no person image, obtain the audio corresponding to the to-be-processed video, and determine the category information of the to-be-processed video according to the audio, where the category information indicates whether the to-be-processed video is related to a target group.
In some embodiments, the above determination unit is further configured to: in response to determining that the images in the image sequence contain a person image and the contained person image does not meet a preset condition, obtain the audio corresponding to the to-be-processed video, and determine the category information of the to-be-processed video according to the audio.
In some embodiments, the above determination unit is further configured to: in response to determining that the images in the image sequence contain a person image and the contained person image meets the preset condition, extract the person images contained in the images of the image sequence corresponding to the to-be-processed video to obtain a person image set, and determine the category information of the to-be-processed video according to the person image set.
In some embodiments, the above determination unit is further configured to: input the audio into a pre-trained category detection model to obtain the probability that the person corresponding to the voice in the audio belongs to the target group, where the category detection model is used to characterize the correspondence between an audio and the probability that the person corresponding to the voice in the audio belongs to the target group; and in response to determining that the obtained probability is greater than a target probability threshold, determine category information indicating that the to-be-processed video is related to the target group as the category information of the to-be-processed video.
In some embodiments, the above determination unit is further configured to: in response to determining that the obtained probability is less than the target probability threshold, determine category information indicating that the to-be-processed video is unrelated to the target group as the category information of the to-be-processed video.
In some embodiments, the category detection model is obtained by training as follows: obtaining an initial category determination model, where the initial category determination model comprises an initial category detection model and an initial classification model connected to the initial category detection model, the initial classification model taking the output of the initial category detection model as input and taking annotation information indicating whether the person corresponding to the voice in the audio belongs to the target group as output; obtaining a training sample set, where each training sample comprises an audio and annotation information indicating whether the person corresponding to the voice in the audio belongs to the target group; using a machine learning method, taking the audio in each training sample of the training sample set as the input of the initial category determination model and the annotation information corresponding to the input audio as the desired output of the initial category determination model, and training to obtain a trained category determination model; and determining the initial category detection model comprised in the trained category determination model as the category detection model.
In a third aspect, an embodiment of the present application provides an electronic device, the electronic device comprising: one or more processors; and a storage apparatus for storing one or more programs, where the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method described in any implementation of the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable medium on which a computer program is stored, where the computer program, when executed by a processor, implements the method described in any implementation of the first aspect.
The method and apparatus for generating information provided by the embodiments of the present application obtain a to-be-processed video; extract the image sequence corresponding to the to-be-processed video; in response to determining that the images in the image sequence contain no person image, obtain the audio corresponding to the to-be-processed video; and determine the category information of the to-be-processed video according to the audio, where the category information indicates whether the to-be-processed video is related to the target group. In this way, when person-related information cannot be extracted from the images corresponding to the to-be-processed video, whether the to-be-processed video is related to the target group can still be determined according to the corresponding audio, which helps improve the diversity and flexibility of the ways in which to-be-processed videos can be processed.
Brief description of the drawings
Other features, objects and advantages of the present application will become more apparent upon reading the following detailed description of non-restrictive embodiments with reference to the accompanying drawings:
Fig. 1 is an exemplary system architecture diagram to which an embodiment of the present application may be applied;
Fig. 2 is a flowchart of one embodiment of the method for generating information according to the present application;
Fig. 3 is a schematic diagram of an application scenario of the method for generating information according to an embodiment of the present application;
Fig. 4 is a flowchart of another embodiment of the method for generating information according to the present application;
Fig. 5 is a structural schematic diagram of one embodiment of the apparatus for generating information according to the present application;
Fig. 6 is a structural schematic diagram of a computer system adapted to implement the electronic device of an embodiment of the present application.
Detailed description of embodiments
The present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the related invention, rather than to limit the invention. It should also be noted that, for ease of description, only the parts relevant to the related invention are shown in the drawings.
It should be noted that, in the case of no conflict, the embodiments of the present application and the features in the embodiments may be combined with each other. The present application is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.
Fig. 1 shows an exemplary architecture 100 to which an embodiment of the method for generating information or of the apparatus for generating information of the present application may be applied.
As shown in Fig. 1, the system architecture 100 may include terminal devices 101, 102 and 103, a network 104 and a server 105. The network 104 serves as the medium providing communication links between the terminal devices 101, 102 and 103 and the server 105, and may include various connection types, such as wired or wireless communication links, or fiber optic cables.
The terminal devices 101, 102 and 103 interact with the server 105 through the network 104 to receive or send messages. Various client applications, such as camera applications, image processing applications and video applications, may be installed on the terminal devices 101, 102 and 103.
The terminal devices 101, 102 and 103 may be hardware or software. When they are hardware, they may be various electronic devices, including but not limited to smart phones, tablet computers, e-book readers, laptop portable computers and desktop computers. When they are software, they may be installed in the electronic devices listed above, and may be implemented as multiple pieces of software or software modules (for example, for providing distributed services) or as a single piece of software or software module, which is not specifically limited herein.
The server 105 may be a server providing various services, for example a processing server that analyzes the to-be-processed videos sent by the terminal devices 101, 102 and 103. The processing server may extract the image sequence corresponding to a to-be-processed video and process the images in the image sequence. Further, the processing server may also feed the processing result back to the terminal devices 101, 102 and 103.
It should be noted that the to-be-processed video may also be stored directly on the server 105, in which case the server 105 may directly extract and process the locally stored to-be-processed video; in this case, the terminal devices 101, 102 and 103 and the network 104 may be absent.
It should be noted that the method for generating information provided by the embodiments of the present application is generally executed by the server 105; correspondingly, the apparatus for generating information is generally provided in the server 105.
It should also be noted that a video processing application may be installed on the terminal devices 101, 102 and 103, and the terminal devices 101, 102 and 103 may also process the to-be-processed video based on the video processing application. In this case, the method for generating information may also be executed by the terminal devices 101, 102 and 103, and correspondingly the apparatus for generating information may also be provided in the terminal devices 101, 102 and 103. In this case, the server 105 and the network 104 may be absent from the exemplary system architecture 100.
It should be noted that the server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster composed of multiple servers or as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (for example, for providing distributed services) or as a single piece of software or software module, which is not specifically limited herein.
It should be understood that the numbers of terminal devices, networks and servers in Fig. 1 are merely illustrative. There may be any number of terminal devices, networks and servers according to implementation needs.
With continued reference to Fig. 2, it shows a flow 200 of one embodiment of the method for generating information according to the present application. The method for generating information includes the following steps:
Step 201: a to-be-processed video is obtained.
In this embodiment, the executing body of the method for generating information (for example, the server 105 shown in Fig. 1) may obtain the to-be-processed video from a local or other storage device by means of a wired or wireless connection. The to-be-processed video may be a video of any type and any content; it may also be a video specified by relevant personnel.
Step 202: the image sequence corresponding to the to-be-processed video is extracted.
In this embodiment, the image sequence corresponding to the to-be-processed video may further be extracted from the to-be-processed video. Generally, a video is composed of a series of images; the series of images composing the to-be-processed video is therefore the image sequence corresponding to the to-be-processed video. Specifically, various existing video processing applications may be used to process the to-be-processed video to obtain the corresponding image sequence. Since obtaining the image sequence corresponding to a video is a well-known technique that is widely studied and applied at present, details are not described herein.
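By way of illustration only (the patent does not prescribe any particular tool), the following minimal sketch extracts the image sequence with OpenCV; the function name and the optional sampling stride are assumptions of this sketch, not part of the disclosure:

```python
import cv2

def extract_image_sequence(video_path: str, stride: int = 1):
    """Decode a video file into its corresponding sequence of frames.

    `stride` (an illustrative assumption) keeps every n-th frame so that
    long videos do not produce unwieldy image sequences.
    """
    capture = cv2.VideoCapture(video_path)
    frames = []
    index = 0
    while True:
        ok, frame = capture.read()
        if not ok:  # end of the video stream
            break
        if index % stride == 0:
            frames.append(frame)
        index += 1
    capture.release()
    return frames
```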
Step 203: in response to determining that the images in the image sequence contain no person image, the audio corresponding to the to-be-processed video is obtained, and the category information of the to-be-processed video is determined according to the audio.
In this embodiment, a person image may refer to an image displaying a person. Specifically, an image displaying all or part of a person (for example, only the face) may be regarded as a person image. Generally, the audio corresponding to a video may refer to the sound carried by the video. The sound of a video may be the sound emitted, at the time of recording, by the recorded objects and the recording environment; it may also be the sound obtained by first recording the video and then processing and/or dubbing its sound.
Specifically, various existing multimedia processing software may be used to process the to-be-processed video to obtain the corresponding audio. Since obtaining the audio corresponding to a video is a well-known technique that is widely studied and applied at present, details are not described herein.
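As a similarly illustrative sketch (assuming the ffmpeg command-line tool is available; the patent names no specific multimedia software), the audio corresponding to the to-be-processed video can be demuxed as follows; the mono 16 kHz WAV output is an assumed choice common for speech models:

```python
import subprocess

def extract_audio(video_path: str, audio_path: str = "audio.wav") -> str:
    """Demux the audio track of a video into a mono 16 kHz WAV file."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path,
         "-vn",           # drop the video stream
         "-ac", "1",      # mix down to mono
         "-ar", "16000",  # 16 kHz sample rate
         audio_path],
        check=True,
    )
    return audio_path
```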
In this embodiment, the category information may be used to indicate whether the to-be-processed video is related to a target group. The target group may be a group composed of people pre-specified by technical personnel; it may also refer to a group of people meeting a preset condition, where the preset condition may include, for example, that everyone in the group shares one or more attributes (or attribute values).
For example, the preset condition may be that the age is less than 12 years old; the group composed of people under 12 may then be regarded as the target group. As another example, the preset condition may be that the age is less than 12 years old and the gender is female; the group composed of girls under 12 may then be regarded as the target group.
In this embodiment, when the information of the to-be-processed video is related to the target group, the to-be-processed video may be considered related to the target group. The information of the to-be-processed video includes but is not limited to the content contained in the to-be-processed video (such as the people, objects and environments appearing in the video) and the content corresponding to its audio (such as the semantics of the text corresponding to the audio). For example, when a person belonging to the target group appears in the to-be-processed video, the to-be-processed video may be considered related to the target group. As another example, when the person corresponding to the voice in the audio corresponding to the to-be-processed video belongs to the target group, the to-be-processed video may be considered related to the target group.
In this embodiment, the category information of the to-be-processed video may be determined using various methods according to the specific application requirements (for example, which group of people the target group refers to).
Optionally, the category information of the to-be-processed video may be determined according to its corresponding audio as follows:
Step 1: obtain a first audio set and a second audio set. The audios in the first audio set may be audios of people in the target group; the audios in the second audio set may be audios of people not in the target group. Specifically, audios of some people in the target group may be collected in advance, or generated using an audio application, to form the first audio set, and audios of some people not in the target group may be collected or generated in the same way to form the second audio set.
Step 2: select a target number of audios from the first audio set and the second audio set respectively to form a first sample audio set and a second sample audio set. The target number may be pre-specified by relevant personnel, or may be determined according to the actual situation (for example, as one tenth of the total number of audios in the audio set). The audios may be selected from an audio set at random or according to specific rules; for example, the audios in an audio set may be divided in advance into one or more audio subsets according to their lengths, so that the lengths of the audios in the same subset fall within the same length interval.
Step 3: calculate the similarity between the audio corresponding to the to-be-processed video and each audio in the first sample audio set, and select the maximum similarity as the first similarity. Similarly, calculate the similarity between the audio corresponding to the to-be-processed video and each audio in the second sample audio set, and select the maximum similarity as the second similarity.
Specifically, the similarity between two audios may be calculated using existing audio applications, and different calculation methods may be chosen for different application requirements. For example, to obtain the semantic similarity of two audios, the texts corresponding to the two audios may be obtained and the similarity between the two texts calculated. As another example, to obtain the similarity of the frequencies of two audios, the waveforms corresponding to the two audios may first be obtained and the similarity between the two waveforms calculated.
Step 4: select the greater of the first similarity and the second similarity as the target similarity, and determine the category information of the group corresponding to the target similarity as the category information of the to-be-processed video. When the first similarity is greater than the second similarity, the to-be-processed video may be determined to be related to the target group; when the second similarity is greater than the first similarity, the to-be-processed video may be determined to be unrelated to the target group.
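A minimal sketch of steps 1 to 4 above, under the assumptions that each audio has already been reduced to a fixed-length feature vector and that cosine similarity is the chosen measure (the patent leaves both the representation and the similarity calculation open):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def related_to_target_group(query: np.ndarray,
                            first_sample_audio_set: list,
                            second_sample_audio_set: list) -> bool:
    """Return True if the video's audio is most similar to the target-group samples."""
    # Step 3: maximum similarity against each sample audio set.
    first_similarity = max(cosine_similarity(query, s) for s in first_sample_audio_set)
    second_similarity = max(cosine_similarity(query, s) for s in second_sample_audio_set)
    # Step 4: the greater similarity decides the category information.
    return first_similarity > second_similarity
```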
Optionally, the category information of the to-be-processed video may also be determined according to its corresponding audio as follows:
Step 1: input the audio into a pre-trained category detection model to obtain the probability that the person corresponding to the voice in the audio belongs to the target group. The category detection model may be used to characterize the correspondence between an audio and the probability that the person corresponding to the voice in the audio belongs to the target group.
Step 2: compare the obtained probability with a target probability threshold. In response to determining that the obtained probability is greater than the target probability threshold, determine category information indicating that the to-be-processed video is related to the target group as the category information of the to-be-processed video. In response to determining that the obtained probability is less than the target probability threshold, determine category information indicating that the to-be-processed video is unrelated to the target group as the category information of the to-be-processed video. The target probability threshold may be preset by relevant personnel, or may be determined dynamically during actual processing (for example, calculated according to a preset formula).
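The decision of step 2 can be sketched as follows, assuming the category detection model is exposed as a callable returning a probability (an interface assumed for illustration; the 0.65 default mirrors the threshold used in the Fig. 3 scenario below):

```python
def determine_category_information(audio_features,
                                   category_detection_model,
                                   target_probability_threshold: float = 0.65) -> str:
    """Map the model's probability output to the video's category information."""
    probability = category_detection_model(audio_features)
    if probability > target_probability_threshold:
        return "related to target group"
    return "unrelated to target group"
```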
The category detection model in step 1 above may be obtained in advance through various training methods.
Optionally, the category detection model may be obtained by training as follows:
Step 1: obtain a training sample set. Each training sample includes an audio and a probability corresponding to the audio, the probability indicating how likely it is that the person corresponding to the voice in the audio belongs to the target group. Specifically, the probabilities in the training samples may be obtained by manual annotation of the audios.
Step 2: determine an initial category detection model. The initial category detection model may be any type of untrained or incompletely trained artificial neural network, such as a deep learning model. It may also be a model combining several untrained or incompletely trained artificial neural networks, for example a model combining an untrained convolutional neural network, an untrained recurrent neural network and an untrained fully connected layer.
Step 3: using a machine learning method, take the audio in each training sample of the training sample set as the input of the initial category detection model and the probability corresponding to the input audio as the desired output, and train to obtain the category detection model.
Specifically, the initial category detection model may be trained based on a preset loss function, whose value may indicate the degree of difference between the actual output of the initial category detection model and the probability in the training sample. The parameters of the initial category detection model may then be adjusted using back propagation based on the value of the loss function, and training terminates when a preset training termination condition is met. After training is completed, the trained initial category detection model is determined as the category detection model described above.
The preset training termination condition may include but is not limited to at least one of the following: the training time exceeds a preset duration, the number of training iterations exceeds a preset number, or the value of the loss function falls below a preset difference threshold.
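A compact training-loop sketch of steps 1 to 3, under details the patent does not fix and which are assumed here: PyTorch as the framework, a small fully connected network over 128-dimensional audio feature vectors, and mean squared error as the preset loss function measuring the difference between the actual output and the annotated probability:

```python
import torch
import torch.nn as nn

# Assumed architecture: audio features -> probability of belonging to the target group.
initial_category_detection_model = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 1), nn.Sigmoid(),
)
optimizer = torch.optim.Adam(initial_category_detection_model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()  # difference between actual output and annotated probability

def train(samples, max_epochs: int = 10, loss_threshold: float = 1e-3):
    """`samples` is a list of (audio_features, probability) tensor pairs."""
    for _epoch in range(max_epochs):           # termination: preset iteration count...
        for audio_features, probability in samples:
            optimizer.zero_grad()
            prediction = initial_category_detection_model(audio_features).squeeze()
            loss = loss_fn(prediction, probability)
            loss.backward()                    # back propagation
            optimizer.step()
            if loss.item() < loss_threshold:   # ...or loss below the preset threshold
                return
```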
Optionally, the category detection model may also be obtained by training as follows:
Step 1: obtain an initial category determination model. The initial category determination model includes an initial category detection model and an initial classification model connected to the initial category detection model. The initial classification model takes the output of the initial category detection model as input, and outputs annotation information indicating whether the person corresponding to the voice in the audio belongs to the target group. Specifically, the annotation information may be obtained by manual annotation in advance.
The initial category detection model may be any type of untrained or incompletely trained artificial neural network, such as a deep learning model, or a model combining several untrained or incompletely trained artificial neural networks, for example a model combining an untrained convolutional neural network, an untrained recurrent neural network and an untrained fully connected layer. The initial classification model may be a classifier for classifying the input information.
Step 2: obtain a training sample set. Each training sample may include an audio and annotation information indicating whether the person corresponding to the voice in the audio belongs to the target group. Specifically, the annotation information in the training samples may be obtained by manual annotation of the audios.
Step 3: using a machine learning method, take the audio in each training sample of the training sample set as the input of the initial category determination model and the annotation information corresponding to the input audio as the desired output of the initial category determination model, and train to obtain a trained category determination model.
Specifically, the initial category determination model may be trained based on a preset loss function, whose value may indicate whether the actual output of the initial category determination model is consistent with the annotation information in the training sample. For example, '0' may be used to indicate that the actual output of the initial category determination model is consistent with the annotation information in the training sample, and '1' to indicate that it is inconsistent. The parameters of the initial category determination model may then be adjusted using back propagation based on the value of the loss function, and training terminates when a preset training termination condition is met, yielding the trained category determination model.
The preset training termination condition may include but is not limited to at least one of the following: the training time exceeds a preset duration, the number of training iterations exceeds a preset number, or the value of the loss function falls below a preset difference threshold.
Step 4: determine the initial category detection model contained in the trained category determination model as the category detection model described above.
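The two-part structure of this alternative (an initial category detection model whose output feeds an initial classification model, with only the detection part kept after training) can be sketched as follows, again with assumed layer sizes and PyTorch as the framework:

```python
import torch.nn as nn

# Backbone: the initial category detection model (audio features -> probability).
initial_category_detection_model = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 1), nn.Sigmoid(),
)
# Head: the initial classification model, taking the backbone's output as input
# and producing two logits (belongs / does not belong to the target group).
initial_classification_model = nn.Linear(1, 2)
# The initial category determination model chains the two parts.
initial_category_determination_model = nn.Sequential(
    initial_category_detection_model,
    initial_classification_model,
)
# ... train initial_category_determination_model against the annotation information ...
# Step 4: keep only the trained backbone as the category detection model.
category_detection_model = initial_category_determination_model[0]
```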
With continued reference to Fig. 3, which is a schematic diagram of an application scenario of the method for generating information according to this embodiment. In the application scenario of Fig. 3, the executing body described above may obtain a to-be-processed video 301 and extract the image sequence 302 corresponding to the to-be-processed video 301. Image detection may then be performed on each image in the image sequence 302 to determine whether it contains a person image. If none of the images contains a person image, the audio 303 corresponding to the to-be-processed video 301 may further be obtained.
Afterwards, the audio 303 is input into the pre-trained category detection model 304 to obtain an output result 305. The output result 305 may indicate the probability that the person corresponding to the voice in the audio 303 is a child; as shown by reference numeral 305 in the figure, the output result includes an output probability, here 0.8.
Further, the output probability may be compared with a probability threshold 306, which, as shown by reference numeral 306, is 0.65 in this scenario. Since the probability 0.8 output by the category detection model 304 is greater than the probability threshold 0.65, the person corresponding to the voice in the input audio 303 may be considered to be a child. A detection result 307 of the category detection model 304 is thus obtained and taken as the detection result of the to-be-processed video 301 corresponding to the audio 303; that is, the to-be-processed video 301 may be considered related to children.
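Tying the Fig. 3 scenario together, an end-to-end sketch might look as follows; it reuses the illustrative helpers defined earlier, and the audio featurizer is a hypothetical placeholder:

```python
def generate_information(video_path: str,
                         category_detection_model,
                         extract_audio_features) -> str:
    """End-to-end flow of the Fig. 3 scenario for a video whose image
    sequence was found to contain no person image."""
    frames = extract_image_sequence(video_path)    # step 202
    # (Image detection on `frames` is assumed to have found no person image.)
    audio_path = extract_audio(video_path)         # obtain the corresponding audio
    features = extract_audio_features(audio_path)  # hypothetical featurizer
    return determine_category_information(
        features, category_detection_model, target_probability_threshold=0.65)
```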
The method provided by the above embodiment of the present application determines, when the images in the image sequence corresponding to a to-be-processed video contain no person image, whether the to-be-processed video is related to the target group according to the audio of the to-be-processed video. This effectively avoids the situation in which image-based video processing methods cannot handle to-be-processed videos containing no person image, and thus helps improve the stability and robustness of the processing of to-be-processed videos.
With further reference to Fig. 4, it shows a flow 400 of another embodiment of the method for generating information. The flow 400 of the method for generating information includes the following steps:
Step 401: a to-be-processed video is obtained.
Step 402: the image sequence corresponding to the to-be-processed video is extracted.
For the specific execution of steps 401 and 402, reference may be made to the descriptions of steps 201 and 202 in the embodiment corresponding to Fig. 2, and details are not repeated herein.
Step 403: determine whether the images in the image sequence contain a person image. If the images in the image sequence contain a person image, the following step 404 is executed; if they do not, the following step 405 is executed.
In this step, existing image detection methods (such as various pedestrian detection algorithms) may be used to detect the images in the image sequence, so as to determine whether the images in the image sequence contain a person image.
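As one concrete instance of an existing pedestrian detection algorithm (an illustrative choice, not one mandated by the patent), OpenCV's built-in HOG person detector can decide whether the image sequence contains a person image:

```python
import cv2

hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

def contains_person_image(frames) -> bool:
    """Return True if any frame in the image sequence displays a person."""
    for frame in frames:
        boxes, _weights = hog.detectMultiScale(frame)
        if len(boxes) > 0:
            return True
    return False
```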
Step 404: determine whether the person image contained in the images of the image sequence meets a preset condition. If it does, the following step 4041 is executed; if it does not, step 405 is executed.
In this step, the preset condition may be set by relevant personnel according to the specific application scenario and requirements. The preset condition may be a restriction on the quality of the person image contained in the images of the image sequence. The quality of a person image may be characterized by the degree to which the person image facilitates extracting the desired image features. For example, the quality of a person image may be indicated by information such as the resolution of the person image and the size of the image region of the person it displays.
In some cases, the resolution of the person image contained in an image of the image sequence is too low, or the person displayed by the person image offers few extractable features (for example, only a foot is included). In such cases, the accuracy of analyzing, based on the person image, whether the displayed person belongs to the target group is also greatly reduced.
Step 4041: extract the person images contained in the images of the image sequence corresponding to the to-be-processed video to obtain a person image set, and determine the category information of the to-be-processed video according to the person image set.
In this step, existing image detection methods may first be used to determine the image regions displaying a person in each image of the image sequence, and the person images corresponding to the images displaying a person may then be extracted.
Further, each person image in the person image set may be analyzed in turn to determine whether the person it displays belongs to the target group. If none of the people displayed by the person images belongs to the target group, the to-be-processed video may be considered unrelated to the target group. If at least one person image displays a person belonging to the target group, the to-be-processed video may be considered related to the target group.
Specifically, various existing image processing methods based on person images may be used to determine the category information of the to-be-processed video. For example, each person image may be analyzed and processed directly to determine whether the person it displays belongs to the target group, thereby obtaining the category information of the to-be-processed video.
For each person image, it is also possible to first analyze whether the person image contains a face image. If it does, various existing methods based on face detection (such as face-based attribute recognition methods) may be used to process the face image contained in the person image, so as to determine whether the face displayed by the face image is related to the target group and thereby obtain the category information of the to-be-processed video. If the person image contains no face image, various existing methods based on human body detection (such as body-based attribute recognition methods) may be used to analyze the human body image contained in the person image, so as to determine whether the displayed human body is related to the target group and thereby obtain the category information of the to-be-processed video.
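This two-branch dispatch can be sketched as follows, assuming face detection via an OpenCV Haar cascade; the face-based and body-based attribute recognition models are treated as externally supplied callables and are hypothetical placeholders:

```python
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def person_related_to_target_group(person_image,
                                   face_attribute_model,
                                   body_attribute_model) -> bool:
    """Route a person image to a face-based or a body-based attribute model."""
    gray = cv2.cvtColor(person_image, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray)
    if len(faces) > 0:                          # face image present: face-based path
        x, y, w, h = faces[0]
        return face_attribute_model(person_image[y:y + h, x:x + w])
    return body_attribute_model(person_image)   # otherwise: body-based path
```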
Step 405: the audio corresponding to the to-be-processed video is obtained, and the category information of the to-be-processed video is determined according to the audio.
For the specific execution of step 405, reference may be made to the description of step 203 in the embodiment corresponding to Fig. 2, and details are not repeated herein.
As can be seen from Fig. 4, compared with the embodiment corresponding to Fig. 2, the flow 400 of the method for generating information in this embodiment applies different processing to the images in the image sequence corresponding to the to-be-processed video depending on whether they contain a person image, and further subdivides the handled cases according to whether the contained person image meets the preset condition and whether it contains a face image. This makes it possible to apply different processing to various types of to-be-processed videos and, while processing to-be-processed videos effectively, further improves the accuracy of the processing results of to-be-processed videos.
With further reference to Fig. 5, as an implementation of the methods shown in the above figures, the present application provides an embodiment of an apparatus for generating information. This apparatus embodiment corresponds to the method embodiment shown in Fig. 2, and the apparatus may be specifically applied to various electronic devices.
As shown in Fig. 5, the apparatus 500 for generating information provided by this embodiment includes a video obtaining unit 501, an image sequence extraction unit 502 and a determination unit 503. The video obtaining unit 501 is configured to obtain a to-be-processed video; the image sequence extraction unit 502 is configured to extract the image sequence corresponding to the to-be-processed video; and the determination unit 503 is configured to, in response to determining that the images in the image sequence contain no person image, obtain the audio corresponding to the to-be-processed video and determine the category information of the to-be-processed video according to the audio, where the category information indicates whether the to-be-processed video is related to a target group.
In this embodiment, for the specific processing of the video obtaining unit 501, the image sequence extraction unit 502 and the determination unit 503 of the apparatus 500 for generating information, and for the technical effects they bring, reference may be made to the descriptions of steps 201, 202 and 203 in the embodiment corresponding to Fig. 2 respectively, and details are not described herein.
In some optional implementations of this embodiment, the determination unit 503 is further configured to: in response to determining that the images in the image sequence contain a person image and the contained person image does not meet a preset condition, obtain the audio corresponding to the to-be-processed video, and determine the category information of the to-be-processed video according to the audio.
In some optional implementations of this embodiment, the determination unit 503 is further configured to: in response to determining that the images in the image sequence contain a person image and the contained person image meets the preset condition, extract the person images contained in the images of the image sequence corresponding to the to-be-processed video to obtain a person image set, and determine the category information of the to-be-processed video according to the person image set.
In some optional implementations of this embodiment, the determination unit 503 is further configured to: input the audio into a pre-trained category detection model to obtain the probability that the person corresponding to the voice in the audio belongs to the target group, where the category detection model is used to characterize the correspondence between an audio and the probability that the person corresponding to the voice in the audio belongs to the target group; and in response to determining that the obtained probability is greater than a target probability threshold, determine category information indicating that the to-be-processed video is related to the target group as the category information of the to-be-processed video.
In some optional implementations of this embodiment, the determination unit 503 is further configured to: in response to determining that the obtained probability is less than the target probability threshold, determine category information indicating that the to-be-processed video is unrelated to the target group as the category information of the to-be-processed video.
In some optional implementations of this embodiment, the category detection model is obtained by training as follows: obtaining an initial category determination model, where the initial category determination model includes an initial category detection model and an initial classification model connected to the initial category detection model, the initial classification model taking the output of the initial category detection model as input and taking annotation information indicating whether the person corresponding to the voice in the audio belongs to the target group as output; obtaining a training sample set, where each training sample includes an audio and annotation information indicating whether the person corresponding to the voice in the audio belongs to the target group; using a machine learning method, taking the audio in each training sample of the training sample set as the input of the initial category determination model and the annotation information corresponding to the input audio as the desired output of the initial category determination model, and training to obtain a trained category determination model; and determining the initial category detection model contained in the trained category determination model as the category detection model.
In the apparatus provided by the above embodiment of the present application, the video obtaining unit obtains a to-be-processed video; the image sequence extraction unit extracts the image sequence corresponding to the to-be-processed video; and the determination unit, in response to determining that the images in the image sequence contain no person image, obtains the audio corresponding to the to-be-processed video and determines the category information of the to-be-processed video according to the audio, where the category information indicates whether the to-be-processed video is related to the target group. In this way, when person-related information cannot be extracted from the images corresponding to the to-be-processed video, whether the to-be-processed video is related to the target group can still be determined according to the corresponding audio, which helps improve the diversity and flexibility of the ways in which to-be-processed videos can be processed.
Referring now to Fig. 6, it shows a structural schematic diagram of a computer system 600 of an electronic device adapted to implement the embodiments of the present application. The electronic device shown in Fig. 6 is merely an example and should not impose any limitation on the functions and the scope of use of the embodiments of the present application.
As shown in Fig. 6, the computer system 600 includes a central processing unit (CPU) 601, which may execute various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded into a random access memory (RAM) 603 from a storage portion 608. The RAM 603 also stores various programs and data required for the operation of the system 600. The CPU 601, the ROM 602 and the RAM 603 are connected to each other through a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
The following components are connected to the I/O interface 605: an input portion 606 including a keyboard, a mouse and the like; an output portion 607 including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker and the like; a storage portion 608 including a hard disk and the like; and a communication portion 609 including a network interface card such as a LAN card or a modem. The communication portion 609 performs communication processing via a network such as the Internet. A driver 610 is also connected to the I/O interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk or a semiconductor memory, is mounted on the driver 610 as needed, so that a computer program read from it can be installed into the storage portion 608 as needed.
In particular, according to the embodiments of the present disclosure, the process described above with reference to the flowchart may be implemented as a computer software program. For example, an embodiment of the present disclosure includes a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for executing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication portion 609, and/or installed from the removable medium 611. When the computer program is executed by the central processing unit (CPU) 601, the above-mentioned functions defined in the method of the present application are executed.
It should be noted that the computer-readable medium of the present application may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example, but not limited to, an electric, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples of the computer-readable storage medium may include but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any appropriate combination of the above. In the present application, a computer-readable storage medium may be any tangible medium containing or storing a program that can be used by, or in connection with, an instruction execution system, apparatus or device. In the present application, a computer-readable signal medium may include a data signal propagated in a baseband or as part of a carrier wave, in which computer-readable program code is carried. The propagated data signal may take various forms, including but not limited to an electromagnetic signal, an optical signal, or any appropriate combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium; such a computer-readable medium can send, propagate or transmit a program for use by, or in connection with, an instruction execution system, apparatus or device. The program code contained on the computer-readable medium may be transmitted by any suitable medium, including but not limited to wireless, wire, optical cable, RF, or any appropriate combination of the above.
The flowcharts and block diagrams in the drawings illustrate the possible architectures, functions and operations of the systems, methods and computer program products according to various embodiments of the present application. In this regard, each box in a flowchart or block diagram may represent a module, a program segment or a part of code, and the module, program segment or part of code contains one or more executable instructions for implementing the specified logic functions. It should also be noted that, in some alternative implementations, the functions marked in the boxes may also occur in an order different from that marked in the drawings. For example, two boxes shown in succession may actually be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should also be noted that each box in the block diagrams and/or flowcharts, and combinations of boxes in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system executing the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units involved in the embodiments of the present application may be implemented by means of software or by means of hardware. The described units may also be provided in a processor, which may, for example, be described as: a processor comprising a video obtaining unit, an image sequence extraction unit and a determination unit. The names of these units do not in some cases constitute a limitation on the units themselves; for example, the video obtaining unit may also be described as 'a unit for obtaining a to-be-processed video'.
As another aspect, the present application also provides a computer-readable medium. The computer-readable medium may be included in the electronic device described in the above embodiments, or may exist alone without being assembled into the electronic device. The computer-readable medium carries one or more programs, and the one or more programs, when executed by the electronic device, cause the electronic device to: obtain a to-be-processed video; extract the image sequence corresponding to the to-be-processed video; in response to determining that the images in the image sequence contain no person image, obtain the audio corresponding to the to-be-processed video; and determine the category information of the to-be-processed video according to the audio, where the category information indicates whether the to-be-processed video is related to a target group.
The above description is only a preferred embodiment of the present application and an explanation of the applied technical principles. Those skilled in the art should understand that the scope of invention involved in the present application is not limited to technical solutions formed by the specific combination of the above technical features; it should also cover other technical solutions formed by any combination of the above technical features or their equivalent features without departing from the above inventive concept, for example, technical solutions formed by mutually replacing the above features with (but not limited to) technical features with similar functions disclosed in the present application.

Claims (14)

1. a kind of method for generating information, comprising:
Obtain video to be processed;
Extract the corresponding image sequence of the video to be processed;
Do not include people's image in response to the image determined in described image sequence, obtains the corresponding audio of the video to be processed; The classification information of the video to be processed is determined according to the audio, wherein the classification information is for indicating described to be processed Whether video is related to target group.
2. according to the method described in claim 1, wherein, the method also includes:
In response to determine described image sequence in image include people's image and comprising people's image do not meet preset condition, obtain The corresponding audio of the video to be processed;The classification information of the video to be processed is determined according to the audio.
3. according to the method described in claim 1, wherein, the method also includes:
In response to the image determined in described image sequence include people's image and comprising people's image meet preset condition, extract institute People's image that the image in the corresponding image sequence of video to be processed includes is stated, people's image collection is obtained;According to people's image Set, determines the classification information of the video to be processed.
4. described to determine that the classification of the video to be processed is believed according to the audio according to the method described in claim 1, wherein Breath, comprising:
By the audio input to classification detection model trained in advance, the corresponding Genus Homo of voice in the audio is obtained in institute State the probability of target group, wherein the classification detection model for characterize audio Genus Homo corresponding with the voice in audio in The corresponding relationship of the probability of the target group;
In response to determining that obtained probability is greater than destination probability threshold value, the video to be processed and the target person faciation will be indicated The classification information of pass is determined as the classification information of the video to be processed.
5. The method according to claim 4, wherein determining the classification information of the video to be processed according to the audio further comprises:
in response to determining that the obtained probability is less than the target probability threshold, determining classification information indicating that the video to be processed does not relate to the target group as the classification information of the video to be processed.
6. The method according to claim 4, wherein the category detection model is obtained by training as follows:
obtaining an initial category determination model, wherein the initial category determination model comprises an initial category detection model and an initial classification model connected to the initial category detection model, the initial classification model taking an output of the initial category detection model as input and taking annotation information indicating whether the person corresponding to the voice in the audio belongs to the target group as output;
obtaining a training sample set, wherein a training sample comprises audio and annotation information indicating whether the person corresponding to the voice in the audio belongs to the target group;
using a machine learning method, taking the audio in the training samples of the training sample set as input of the initial category determination model and the annotation information corresponding to the input audio as desired output of the initial category determination model, and training to obtain a trained initial category determination model;
determining the trained initial category detection model included in the trained initial category determination model as the category detection model.
7. An apparatus for generating information, comprising:
a video acquisition unit configured to obtain a video to be processed;
an image sequence extraction unit configured to extract an image sequence corresponding to the video to be processed;
a determination unit configured to, in response to determining that the images in the image sequence do not include a person image, obtain audio corresponding to the video to be processed, and determine classification information of the video to be processed according to the audio, wherein the classification information indicates whether the video to be processed relates to a target group of people.
8. The apparatus according to claim 7, wherein the determination unit is further configured to:
in response to determining that the images in the image sequence include a person image and the included person image does not meet a preset condition, obtain the audio corresponding to the video to be processed, and determine the classification information of the video to be processed according to the audio.
9. The apparatus according to claim 7, wherein the determination unit is further configured to:
in response to determining that the images in the image sequence include a person image and the included person image meets a preset condition, extract the person images included in the images of the image sequence corresponding to the video to be processed to obtain a person image set, and determine the classification information of the video to be processed according to the person image set.
10. The apparatus according to claim 7, wherein the determination unit is further configured to:
input the audio into a pre-trained category detection model to obtain a probability that a person corresponding to a voice in the audio belongs to the target group, wherein the category detection model characterizes a correspondence between audio and the probability that the person corresponding to the voice in the audio belongs to the target group;
in response to determining that the obtained probability is greater than a target probability threshold, determine classification information indicating that the video to be processed relates to the target group as the classification information of the video to be processed.
11. The apparatus according to claim 10, wherein the determination unit is further configured to:
in response to determining that the obtained probability is less than the target probability threshold, determine classification information indicating that the video to be processed does not relate to the target group as the classification information of the video to be processed.
12. The apparatus according to claim 10, wherein the category detection model is obtained by training as follows:
obtaining an initial category determination model, wherein the initial category determination model comprises an initial category detection model and an initial classification model connected to the initial category detection model, the initial classification model taking an output of the initial category detection model as input and taking annotation information indicating whether the person corresponding to the voice in the audio belongs to the target group as output;
obtaining a training sample set, wherein a training sample comprises audio and annotation information indicating whether the person corresponding to the voice in the audio belongs to the target group;
using a machine learning method, taking the audio in the training samples of the training sample set as input of the initial category determination model and the annotation information corresponding to the input audio as desired output of the initial category determination model, and training to obtain a trained initial category determination model;
determining the trained initial category detection model included in the trained initial category determination model as the category detection model.
13. An electronic device, comprising:
one or more processors; and
a storage device on which one or more programs are stored,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1 to 6.
14. A computer-readable medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method according to any one of claims 1 to 6.
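Read together, claims 1 to 3 describe a three-way dispatch between audio-based and person-image-based classification. The Python sketch below pictures that dispatch under stated assumptions: the unspecified preset condition is rendered as a minimum person-image size, and extract_frames, detect_person_images, classify_by_audio, and classify_by_person_images are hypothetical helpers, not names from the application.

```python
MIN_AREA = 64 * 64  # assumed preset condition: person image large enough to use


def classify_video(video_path, target_group):
    frames = extract_frames(video_path)  # image sequence of the video to be processed
    people = [p for frame in frames for p in detect_person_images(frame)]
    usable = [p for p in people if p.width * p.height >= MIN_AREA]
    if not people:   # claim 1: no person image at all, decide from the audio
        return classify_by_audio(video_path, target_group)
    if not usable:   # claim 2: person images present but preset condition not met
        return classify_by_audio(video_path, target_group)
    # Claim 3: preset condition met, decide from the person image set.
    return classify_by_person_images(usable, target_group)
```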
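Claims 4 and 5 reduce the audio branch to a single threshold test. A sketch follows, assuming the pre-trained category detection model is a callable that maps audio to the probability that the speaker belongs to the target group; the default threshold of 0.5 is illustrative. Because the claims use a strict greater-than and a strict less-than, the case where the probability exactly equals the threshold is left open.

```python
def decide_from_audio(audio, category_detection_model, threshold=0.5):
    probability = category_detection_model(audio)
    if probability > threshold:
        return "related to target group"      # claim 4
    if probability < threshold:
        return "not related to target group"  # claim 5
    return "undetermined"                     # equality is not covered by the claims
```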
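The training procedure of claims 6 and 12 attaches a small head (the initial classification model) to the initial category detection model, trains the combined model end to end on labeled audio, and then keeps only the detection part. The following rough PyTorch sketch makes assumptions about architecture, feature size, loss, and optimizer, none of which the claims fix; training_samples is a hypothetical iterable of (feature tensor, label tensor) pairs.

```python
import torch
import torch.nn as nn


class CategoryDetection(nn.Module):
    """Initial category detection model: audio features -> probability."""

    def __init__(self, n_feats=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_feats, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid())

    def forward(self, x):
        return self.net(x)


detection = CategoryDetection()
head = nn.Linear(1, 1)                    # initial classification model
model = nn.Sequential(detection, head)    # initial category determination model

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()          # the head's raw output is read as a logit

for features, label in training_samples:  # hypothetical labeled audio set
    optimizer.zero_grad()
    loss = loss_fn(model(features), label)
    loss.backward()
    optimizer.step()

# Per claim 6, only the trained detection part is retained as the
# category detection model used at inference time (claim 4).
category_detection_model = detection
```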
CN201811110667.1A 2018-09-21 2018-09-21 Method and apparatus for generating information Active CN109344289B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811110667.1A CN109344289B (en) 2018-09-21 2018-09-21 Method and apparatus for generating information

Publications (2)

Publication Number Publication Date
CN109344289A 2019-02-15
CN109344289B CN109344289B (en) 2020-12-11

Family

ID=65306500

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811110667.1A Active CN109344289B (en) 2018-09-21 2018-09-21 Method and apparatus for generating information

Country Status (1)

Country Link
CN (1) CN109344289B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103402088A * 2009-06-11 2013-11-20 Sony Corporation Image processing apparatus and image processing method
US20170091556A1 * 2012-06-04 2017-03-30 Comcast Cable Communications, LLC Data Recognition in Content
CN105075299A * 2013-03-13 2015-11-18 Qualcomm Incorporated Driver distraction detection and reporting
CN106203458A * 2015-04-29 2016-12-07 Hangzhou Hikvision Digital Technology Co., Ltd. Crowd video analysis method and system
CN106601243A * 2015-10-20 2017-04-26 Alibaba Group Holding Ltd. Video file identification method and device
CN106503687A * 2016-11-09 2017-03-15 Hefei University of Technology Surveillance video person identification system fusing multi-angle facial features and method thereof
US20180253954A1 * 2018-05-04 2018-09-06 Shiv Prakash Verma Web server based 24/7 care management system for better quality of life to alzheimer, dementia, autistic and assisted living people using artificial intelligent based smart devices

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHANG Jianjun et al., "Implementing Intelligent Content Review with Artificial Intelligence and Its Deployment at the World Cup", Modern Television Technology (《现代电视技术》) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113297420A * 2021-04-30 2021-08-24 Baiguoyuan Technology (Singapore) Co., Ltd. Video image processing method and device, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN109344289B (en) 2020-12-11

Similar Documents

Publication Title
CN108038469B Method and apparatus for detecting a human body
CN109117777A Method and apparatus for generating information
CN108446390A Method and apparatus for pushing information
CN109325148A Method and apparatus for generating information
CN107578017A Method and apparatus for generating an image
CN108898185A Method and apparatus for generating an image recognition model
CN109446990A Method and apparatus for generating information
CN109919244B Method and apparatus for generating a scene recognition model
CN109034069B Method and apparatus for generating information
CN109086719A Method and apparatus for outputting data
CN111275784B Method and device for generating an image
CN108960316A Method and apparatus for generating a model
CN108549848B Method and apparatus for outputting information
CN108509611A Method and apparatus for pushing information
CN108924381B Image processing method, image processing apparatus, and computer-readable medium
CN109977839A Information processing method and device
CN109241934A Method and apparatus for generating information
CN109389096A Detection method and device
CN109871791A Image processing method and device
CN107729928A Information acquisition method and device
CN107886559A Method and apparatus for generating a picture
CN110427915A Method and apparatus for outputting information
CN108509994A Person image clustering method and device
CN110059624A Method and apparatus for detecting a living body
CN109543068A Method and apparatus for generating comment information of a video

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: 100041 B-0035, 2 floor, 3 building, 30 Shixing street, Shijingshan District, Beijing
Patentee after: Tiktok vision (Beijing) Co.,Ltd.
Address before: 100041 B-0035, 2 floor, 3 building, 30 Shixing street, Shijingshan District, Beijing
Patentee before: BEIJING BYTEDANCE NETWORK TECHNOLOGY Co.,Ltd.

Address after: 100041 B-0035, 2 floor, 3 building, 30 Shixing street, Shijingshan District, Beijing
Patentee after: Douyin Vision Co.,Ltd.
Address before: 100041 B-0035, 2 floor, 3 building, 30 Shixing street, Shijingshan District, Beijing
Patentee before: Tiktok vision (Beijing) Co.,Ltd.