CN105512348A - Method and device for processing videos and related audios and retrieving method and device - Google Patents

Method and device for processing videos and related audios and retrieving method and device

Info

Publication number
CN105512348A
Authority
CN
China
Prior art keywords
face
audio
video
frequency
section
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610058764.5A
Other languages
Chinese (zh)
Other versions
CN105512348B (en)
Inventor
许欣然
印奇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Megvii Technology Co Ltd
Beijing Aperture Science and Technology Ltd
Original Assignee
Beijing Megvii Technology Co Ltd
Beijing Aperture Science and Technology Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Megvii Technology Co Ltd, Beijing Aperture Science and Technology Ltd filed Critical Beijing Megvii Technology Co Ltd
Priority to CN201610058764.5A priority Critical patent/CN105512348B/en
Publication of CN105512348A publication Critical patent/CN105512348A/en
Application granted granted Critical
Publication of CN105512348B publication Critical patent/CN105512348B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Studio Devices (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

Embodiments of the invention provide a method and device for processing videos and related audios, and a retrieving method and device. The method for processing videos and related audios comprises the following steps: obtaining a video including one or more faces of one or more objects; carrying out face detection on each video frame in the video to identify the one or more faces; obtaining audio which includes the voices of at least some of the one or more objects and was collected within the same time period as the video; for each of at least some of the one or more faces, determining the part of the audio corresponding to that face; and associating the face with the corresponding audio part, wherein the at least some faces belong to the at least some objects. According to the method and device provided by the invention, the speaking time and speaking content of an object can be determined by associating the object's face with its voice, which makes it convenient for a user to later review and retrieve what the object said.

Description

Method and apparatus for processing video and related audio, and retrieval method and apparatus
Technical field
The present invention relates to the technical field of face recognition, and more specifically to a method and apparatus for processing video and related audio and to a retrieval method and apparatus.
Background art
In many scenarios there is a need to record what people say and to know what each person specifically said. A conference scenario is used as an example below. Recording meetings is necessary in many situations, and typical recording methods include text, audio and video. Text records are the most convenient to search but the most costly to produce; audio or video records are cheaper to produce but difficult to store and search. Specifically, the drawback of the former is its very high labor cost, and when there are many participants it challenges the ability of the stenographer. With the latter, an entire meeting is usually recorded as one long stretch of audio or video; although the meeting is recorded completely, the correspondence between each stretch of speech and the speaker is unknown, so the record cannot be searched conveniently.
Summary of the invention
The present invention has been made in view of the above problems. The invention provides a method and apparatus for processing video and related audio, and a retrieval method and apparatus.
According to an aspect of the present invention, a method for processing video and related audio is provided. The method comprises:
obtaining a video that includes one or more faces of one or more objects;
carrying out face detection on each video frame in the video to identify the one or more faces;
obtaining audio, collected within the same time period as the video, that includes the voices of at least some of the one or more objects;
for each of at least some of the one or more faces,
determining the part of the audio that corresponds to that face; and
associating that face with the corresponding audio part,
wherein the at least some faces belong to the at least some objects, respectively.
Exemplarily, before determining, for each of at least some of the one or more faces, the part of the audio corresponding to that face, the method further comprises:
for each of the at least some faces,
segmenting the video according to the mouth actions of that face to obtain the initial video segments corresponding to that face;
segmenting the audio according to the voice features in the audio to obtain the initial audio segments corresponding to that face; and
obtaining, from the initial video segments and initial audio segments corresponding to that face, the effective video segments in the video and the effective audio segments in the audio corresponding to that face;
and determining, for each of at least some of the one or more faces, the part of the audio corresponding to that face comprises:
for each of the at least some faces, determining the effective audio segments corresponding to that face to be the audio part corresponding to that face.
Exemplarily, associating, for each of at least some of the one or more faces, that face with the corresponding audio part comprises:
for each of the at least some faces,
for each effective video segment corresponding to that face, selecting from all video frames of that effective video segment the video frame in which the face quality is best; and
associating the selected video frame with the effective audio segment corresponding to that effective video segment to form a video/audio combination (a sketch of the frame selection follows this list).
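As an illustration of the frame-selection step, the following is a minimal sketch that scores each candidate frame by the sharpness of the face region. The Laplacian-variance metric, the function name and the box format are assumptions for illustration; the patent does not fix a particular quality measure.

```python
import cv2

def best_face_frame(frames, face_boxes):
    """frames: list of BGR images; face_boxes: one (x, y, w, h) per frame."""
    best_idx, best_score = 0, -1.0
    for i, (frame, (x, y, w, h)) in enumerate(zip(frames, face_boxes)):
        face = cv2.cvtColor(frame[y:y + h, x:x + w], cv2.COLOR_BGR2GRAY)
        score = cv2.Laplacian(face, cv2.CV_64F).var()  # sharper face -> larger variance
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx  # index of the best-quality face frame in the segment
```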
Exemplarily, the method further comprises:
for the face corresponding to a particular video/audio combination, carrying out face feature extraction on the video frame in the particular video/audio combination to obtain a specific face feature, wherein the particular video/audio combination is one of all the video/audio combinations corresponding to the at least some faces;
carrying out voice feature extraction on the effective audio segment in the particular video/audio combination to obtain a specific voice feature;
for each of the remaining combinations among all the video/audio combinations,
calculating the face similarity between the specific face feature and the face feature corresponding to that video/audio combination;
calculating the voice similarity between the specific voice feature and the voice feature corresponding to that video/audio combination;
calculating the mean of the face similarity and the voice similarity between the particular video/audio combination and that video/audio combination to obtain the average similarity between the two combinations; and
if the average similarity between the particular video/audio combination and that video/audio combination is greater than a similarity threshold, attributing the two combinations to the same object (see the sketch after this list).
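A minimal sketch of this grouping rule follows: two video/audio combinations are attributed to the same object when the mean of their face similarity and voice similarity exceeds a threshold. Cosine similarity and the 0.75 threshold are assumptions; the text does not specify the similarity measure or the threshold value.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_object(combo_a, combo_b, threshold=0.75):
    """Each combo is a dict holding a 'face_feat' and a 'voice_feat' vector."""
    face_sim = cosine(combo_a["face_feat"], combo_b["face_feat"])
    voice_sim = cosine(combo_a["voice_feat"], combo_b["voice_feat"])
    return (face_sim + voice_sim) / 2.0 > threshold  # average-similarity rule
```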
Exemplarily, obtaining, for each of the at least some faces, the effective video segments in the video and effective audio segments in the audio corresponding to that face from the initial video segments and initial audio segments corresponding to that face comprises:
for each of the at least some faces, determining the initial video segments corresponding to that face as the effective video segments corresponding to that face, and determining the initial audio segments corresponding to that face as the effective audio segments corresponding to that face.
Exemplarily, obtaining, for each of the at least some faces, the effective video segments in the video and effective audio segments in the audio corresponding to that face from the initial video segments and initial audio segments corresponding to that face comprises:
for each of the at least some faces,
determining a unified segmentation time according to the segmentation times of the initial video segments and initial audio segments corresponding to that face; and
segmenting the video and the audio uniformly according to the unified segmentation time to obtain the effective video segments and effective audio segments corresponding to that face.
Exemplarily, the audio is collected by a unified microphone, and segmenting, for each of the at least some faces, the audio according to the voice features in the audio to obtain the initial audio segments corresponding to that face comprises:
segmenting the audio according to the voice features in the audio to obtain mixed audio segments; and
for each of the at least some faces, selecting from the mixed audio segments those whose collection times coincide with the initial video segments corresponding to that face as the initial audio segments corresponding to that face.
Exemplarily, the audio comprises one or more channels of audio collected respectively by one or more directional microphones, and before obtaining the audio, collected within the same time period as the video, that includes the voices of at least some of the one or more objects, the method further comprises:
controlling the one or more directional microphones to point respectively toward the at least some objects so as to collect the one or more channels of audio;
and segmenting, for each of the at least some faces, the audio according to the voice features in the audio to obtain the initial audio segments corresponding to that face comprises:
for each of the at least some faces, segmenting the channel of audio collected by the directional microphone pointed toward the object corresponding to that face, according to the voice features in that channel, to obtain the initial audio segments corresponding to that face.
Exemplarily, the number of directional microphones is equal to or greater than the number of the one or more faces.
Exemplarily, before controlling the one or more directional microphones to point respectively toward the at least some objects so as to collect the one or more channels of audio, the method further comprises:
determining the priority of each face according to the facial features and/or actions of the one or more faces; and
determining, according to the priority of each face, the objects toward which the one or more directional microphones are to point, as the at least some objects.
Exemplarily, segmenting, for each of the at least some faces, the video according to the mouth actions of that face is carried out according to the following rule:
for each of the at least some faces, if the mouth of that face changes from a closed state to an open state at a first moment and has remained closed throughout a first predetermined period before the first moment, the first moment is taken as a video segmentation start time; if the mouth of that face changes from an open state to a closed state at a second moment and remains closed throughout a second predetermined period after the second moment, the second moment is taken as a video segmentation end time,
wherein the parts of the video between adjacent video segmentation start times and video segmentation end times are the initial video segments.
Exemplarily, segmenting, for each of the at least some faces, the audio according to the voice features in the audio is carried out according to the following rule:
if the voice in the audio changes from a non-voiced state to a voiced state at a third moment and has remained non-voiced throughout a third predetermined period before the third moment, the third moment is taken as an audio segmentation start time; if the voice in the audio changes from a voiced state to a non-voiced state at a fourth moment and remains non-voiced throughout a fourth predetermined period after the fourth moment, the fourth moment is taken as an audio segmentation end time,
wherein the parts of the audio between adjacent audio segmentation start times and audio segmentation end times are the initial audio segments.
Exemplarily, after determining, for each of at least some of the one or more faces, the part of the audio corresponding to that face, the method further comprises:
for each of the at least some faces,
carrying out speech recognition on the audio part corresponding to that face to obtain text representing the audio part corresponding to that face; and
associating the text with that face.
Exemplarily, the method further comprises: outputting desired information,
wherein the desired information comprises one or more of the following items: the video, the audio, a video frame including a specific face among the one or more faces, the collection time of the video frame including the specific face, the audio part corresponding to the specific face, and the collection time of the audio part corresponding to the specific face.
According to another aspect of the present invention, a retrieval method is provided, comprising:
receiving a retrieval instruction for a target face;
searching a database for information related to the target face according to the retrieval instruction; and
outputting the information related to the target face;
wherein the database is used to store the video and audio processed according to the method for processing video and related audio described above and/or the audio parts corresponding to each of the at least some faces,
and wherein the information related to the target face comprises one or more of the following items: a video frame including the target face, the collection time of the video frame including the target face, the audio part corresponding to the target face, and the collection time of the audio part corresponding to the target face (a sketch of the lookup follows).
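A minimal sketch of this retrieval flow follows, under the assumption that the database is a list of records each keyed by a stored face feature vector and that the retrieval instruction supplies a feature vector for the target face; the record schema and the nearest-neighbor matching are illustrative, not mandated by the text.

```python
import numpy as np

def retrieve(target_feat, database, threshold=0.75):
    """Return the related information for the stored face most similar to the target."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    best = max(database, key=lambda rec: cosine(target_feat, rec["face_feat"]))
    if cosine(target_feat, best["face_feat"]) < threshold:
        return None  # no stored face matches the target face closely enough
    return {k: best[k] for k in ("video_frame", "frame_time",
                                 "audio_part", "audio_time")}
```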
According to another aspect of the present invention, an apparatus for processing video and related audio is provided. The apparatus comprises:
a first acquisition module for obtaining a video that includes one or more faces of one or more objects;
a face detection module for carrying out face detection on each video frame in the video to identify the one or more faces;
a second acquisition module for obtaining audio, collected within the same time period as the video, that includes the voices of at least some of the one or more objects;
an audio part determination module for determining, for each of at least some of the one or more faces, the part of the audio corresponding to that face, wherein the at least some faces belong to the at least some objects, respectively; and
an audio association module for associating, for each of the at least some faces, that face with the corresponding audio part.
Exemplarily, the apparatus further comprises:
a video segmentation module for segmenting, for each of the at least some faces, the video according to the mouth actions of that face to obtain the initial video segments corresponding to that face;
an audio segmentation module for segmenting, for each of the at least some faces, the audio according to the voice features in the audio to obtain the initial audio segments corresponding to that face; and
an effective video and audio acquisition module for obtaining, from the initial video segments and initial audio segments corresponding to that face, the effective video segments in the video and effective audio segments in the audio corresponding to that face;
and the audio part determination module comprises a determination submodule for determining, for each of the at least some faces, the effective audio segments corresponding to that face to be the audio part corresponding to that face.
Exemplarily, the audio association module comprises:
a video frame selection submodule for selecting, for each of the at least some faces and for each effective video segment corresponding to that face, from all video frames of that effective video segment the video frame in which the face quality is best; and
an association submodule for associating the selected video frame with the effective audio segment corresponding to that effective video segment to form a video/audio combination.
Exemplarily, the apparatus further comprises:
a face feature extraction module for carrying out, for the face corresponding to a particular video/audio combination, face feature extraction on the video frame in the particular video/audio combination to obtain a specific face feature, wherein the particular video/audio combination is one of all the video/audio combinations corresponding to the at least some faces;
a voice feature extraction module for carrying out voice feature extraction on the effective audio segment in the particular video/audio combination to obtain a specific voice feature;
a face similarity calculation module for calculating, for each of the remaining combinations among all the video/audio combinations, the face similarity between the specific face feature and the face feature corresponding to that video/audio combination;
a voice similarity calculation module for calculating, for each of the remaining video/audio combinations, the voice similarity between the specific voice feature and the voice feature corresponding to that video/audio combination;
an average similarity calculation module for calculating, for each of the remaining video/audio combinations, the mean of the face similarity and voice similarity between the particular video/audio combination and that video/audio combination to obtain the average similarity between the two combinations; and
a grouping module for attributing, for each of the remaining video/audio combinations, the particular video/audio combination and that video/audio combination to the same object if the average similarity between them is greater than a similarity threshold.
Exemplarily, described effective video and audio frequency acquisition module comprise:
Effective video section determines submodule, for for each in described at least part of face, the initial video section corresponding with this face is defined as the effective video section corresponding with this face; And
Effective audio section determination submodule, for for each in described at least part of face, is defined as the effective audio section corresponding with this face by the initial audio section corresponding with this face.
Exemplarily, described effective video and audio frequency acquisition module comprise:
Unified split time determination submodule, for for each in described at least part of face, the split time according to the initial video section corresponding with this face and initial audio section determines unified split time;
Unified segmentation submodule, for unifying segmentation according to described unified split time to described video and described audio frequency, to obtain the effective video section corresponding with this face and effective audio section.
Exemplarily, the audio is collected by a unified microphone, and the audio segmentation module comprises:
a first segmentation submodule for segmenting the audio according to the voice features in the audio to obtain mixed audio segments; and
an audio segment selection submodule for selecting, for each of the at least some faces, from the mixed audio segments those whose collection times coincide with the initial video segments corresponding to that face as the initial audio segments corresponding to that face.
Exemplarily, the audio comprises one or more channels of audio collected respectively by one or more directional microphones, and the apparatus further comprises:
a control module for controlling the one or more directional microphones to point respectively toward the at least some objects so as to collect the one or more channels of audio;
and the audio segmentation module comprises:
a second segmentation submodule for segmenting, for each of the at least some faces, the channel of audio collected by the directional microphone pointed toward the object corresponding to that face, according to the voice features in that channel, to obtain the initial audio segments corresponding to that face.
Exemplarily, the number of directional microphones is equal to or greater than the number of the one or more faces.
Exemplarily, the apparatus further comprises:
a priority determination module for determining the priority of each face according to the facial features and/or actions of the one or more faces; and
an object determination module for determining, according to the priority of each face, the objects toward which the one or more directional microphones are to point, as the at least some objects.
Exemplarily, the video segmentation module segments the video according to the following rule:
for each of the at least some faces, if the mouth of that face changes from a closed state to an open state at a first moment and has remained closed throughout a first predetermined period before the first moment, the first moment is taken as a video segmentation start time; if the mouth of that face changes from an open state to a closed state at a second moment and remains closed throughout a second predetermined period after the second moment, the second moment is taken as a video segmentation end time,
wherein the parts of the video between adjacent video segmentation start times and video segmentation end times are the initial video segments.
Exemplarily, the audio segmentation module segments the audio according to the following rule:
if the voice in the audio changes from a non-voiced state to a voiced state at a third moment and has remained non-voiced throughout a third predetermined period before the third moment, the third moment is taken as an audio segmentation start time; if the voice in the audio changes from a voiced state to a non-voiced state at a fourth moment and remains non-voiced throughout a fourth predetermined period after the fourth moment, the fourth moment is taken as an audio segmentation end time,
wherein the parts of the audio between adjacent audio segmentation start times and audio segmentation end times are the initial audio segments.
Exemplarily, the apparatus further comprises:
a speech recognition module for carrying out, for each of the at least some faces, speech recognition on the audio part corresponding to that face to obtain text representing the audio part corresponding to that face; and
a text association module for associating the text with that face.
Exemplarily, the apparatus further comprises an output module for outputting desired information,
wherein the desired information comprises one or more of the following items: the video, the audio, a video frame including a specific face among the one or more faces, the collection time of the video frame including the specific face, the audio part corresponding to the specific face, and the collection time of the audio part corresponding to the specific face.
According to another aspect of the present invention, a retrieval apparatus is provided, comprising:
a receiving module for receiving a retrieval instruction for a target face;
a search module for searching a database for information related to the target face according to the retrieval instruction; and
an output module for outputting the information related to the target face;
wherein the database is used to store the video and audio processed by the apparatus for processing video and related audio described above and/or the audio parts corresponding to each of the at least some faces,
and wherein the information related to the target face comprises one or more of the following items: a video frame including the target face, the collection time of the video frame including the target face, the audio part corresponding to the target face, and the collection time of the audio part corresponding to the target face.
According to the method and apparatus for processing video and related audio and the retrieval method and apparatus of the embodiments of the present invention, the speaking time and speaking content of an object can be determined by associating the object's face with its voice, which makes it convenient for a user to later review and retrieve what the object said.
Brief description of the drawings
The above and other objects, features and advantages of the present invention will become more apparent from the following detailed description of embodiments of the present invention taken in conjunction with the accompanying drawings. The drawings are provided for a further understanding of the embodiments of the present invention and form a part of the specification; together with the embodiments of the present invention, they serve to explain the present invention and are not to be construed as limiting it. In the drawings, like reference numerals generally denote like components or steps.
Fig. 1 shows a schematic block diagram of an exemplary electronic device for implementing the method and apparatus for processing video and related audio according to an embodiment of the present invention;
Fig. 2 shows a schematic flowchart of a method for processing video and related audio according to an embodiment of the present invention;
Fig. 3 shows a schematic flowchart of a method for processing video and related audio according to another embodiment of the present invention;
Fig. 4 shows a schematic flowchart of a grouping step according to an embodiment of the present invention;
Fig. 5 shows a schematic flowchart of a method for processing video and related audio according to another embodiment of the present invention;
Fig. 6 shows a schematic flowchart of a retrieval method according to an embodiment of the present invention;
Fig. 7 shows a schematic block diagram of an apparatus for processing video and related audio according to an embodiment of the present invention;
Fig. 8 shows a schematic block diagram of a retrieval apparatus according to an embodiment of the present invention; and
Fig. 9 shows a schematic block diagram of a system for processing video and related audio according to an embodiment of the present invention.
Detailed description
To make the objects, technical solutions and advantages of the present invention more apparent, example embodiments according to the present invention are described in detail below with reference to the accompanying drawings. Obviously, the described embodiments are only some rather than all of the embodiments of the present invention, and it should be understood that the present invention is not limited by the example embodiments described herein. All other embodiments obtained by those skilled in the art based on the embodiments described herein without creative effort shall fall within the protection scope of the present invention.
First, an exemplary electronic device 100 for implementing the method and apparatus for processing video and related audio according to an embodiment of the present invention is described with reference to Fig. 1.
As shown in Fig. 1, the electronic device 100 comprises one or more processors 102, one or more storage devices 104, an input device 106, an output device 108, a video capture device 110 and an audio capture device 114, which are interconnected by a bus system 112 and/or a connection mechanism of another form (not shown). It should be noted that the components and structure of the electronic device 100 shown in Fig. 1 are illustrative rather than restrictive, and the electronic device may also have other components and structures as required.
The processor 102 may be a central processing unit (CPU) or a processing unit of another form having data processing capability and/or instruction execution capability, and may control other components in the electronic device 100 to perform desired functions.
The storage device 104 may comprise one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, random access memory (RAM) and/or cache memory. The non-volatile memory may include, for example, read-only memory (ROM), a hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer-readable storage medium, and the processor 102 may run the program instructions to realize the client functions (as realized by the processor) and/or other desired functions in the embodiments of the present invention described below. Various application programs and various data, such as the data used and/or produced by the application programs, may also be stored on the computer-readable storage medium.
The input device 106 may be a device used by a user to input instructions, and may comprise one or more of a keyboard, a mouse, a microphone, a touch screen, etc.
The output device 108 may output various information (such as images and/or sounds) to the outside (e.g., to a user), and may comprise one or more of a display, a speaker, etc.
The video capture device 110 may capture desired video and store the captured video in the storage device 104 for use by other components. The video capture device 110 may be realized by any suitable equipment, such as a stand-alone video camera or the camera of a mobile terminal. The video capture device 110 is optional, and the electronic device 100 may not comprise it. The electronic device 100 may capture video with the video capture device 110, or may receive video transmitted by other equipment via a communication interface (not shown) between it and that equipment.
The audio capture device 114 may capture desired audio and store the captured audio in the storage device 104 for use by other components. The audio capture device 114 may be realized by any suitable recording equipment, such as a stand-alone microphone or the built-in microphone of a mobile terminal. The audio capture device 114 may also be the built-in microphone of a video camera; that is, the audio capture device 114 may be integrated with the video capture device 110. The audio capture device 114 is optional, and the electronic device 100 may not comprise it. The electronic device 100 may capture audio with the audio capture device 114, or may receive audio transmitted by other equipment via a communication interface (not shown) between it and that equipment.
Exemplarily, the exemplary electronic device for implementing the method and apparatus for processing video and related audio according to an embodiment of the present invention may be realized on equipment such as a personal computer or a remote server.
Below, a method for processing video and related audio according to an embodiment of the present invention is described with reference to Fig. 2. Fig. 2 shows a schematic flowchart of a method 200 for processing video and related audio according to an embodiment of the present invention. As shown in Fig. 2, the method 200 for processing video and related audio comprises the following steps.
In step S210, a video including one or more faces of one or more objects is obtained.
" object " as herein described can be any people needing to record its voice, the personnel etc. that the personnel of such as conference participation or behavior needs are monitored.Same target has same face, and same the position of face in different video frame, expression action may be different, can adopt face tracking technology in continuous print frame of video, follow the tracks of the face of same target.
Under conference scenario, camera (such as independently the shooting of video camera or mobile terminal is first-class) can be utilized to be captured in the video of the personnel in session meeting-place.It is desirable that, the video gathered comprises the face of all participants or at least comprises the face of the participant that all theory is exchanged words.The video collected can be sent to server end by camera in real time, is processed in real time by server end.Certainly, together with video acquisition end and end for process also can be implemented in.Wherein, end for process may be used for the video that process video acquisition end collects.
In step S220, face detection is carried out on each video frame in the video to identify the one or more faces.
In this step, it may be determined whether each video frame of the captured video contains a face, and when a frame does contain a face, the face region in that frame is located. A pre-trained face detector may be used to locate face regions in the captured video frames. For example, a face detection algorithm such as the Haar algorithm or the Adaboost algorithm may be used in advance to train a face detector on a large number of images; for a single captured video frame, the pre-trained detector can locate the face region quickly. In addition, for multiple consecutively captured video frames (i.e., a stretch of video), once the face region has been located in the first video frame, the position of the face region in the current frame can be traced in real time from its position in the previous frame, i.e., face tracking can be realized.
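As an illustration of per-frame detection, a minimal sketch using OpenCV's pretrained Haar cascade (one of the detector families the description mentions) follows; the cascade file and the detection parameters are assumptions, not values fixed by the patent.

```python
import cv2

# Load a pretrained Haar-cascade frontal-face detector shipped with OpenCV.
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_faces(frame_bgr):
    """Return a list of (x, y, w, h) face regions found in one video frame."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    return detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
```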
It should be appreciated that the present invention is not limited by the particular face detection method adopted; whether an existing face detection method or one developed in the future, it can be applied to the method for processing video and related audio according to the embodiments of the present invention and shall also fall within the protection scope of the present invention.
In step S230, audio collected within the same time period as the video and including the voices of at least some of the one or more objects is obtained.
In a conference scenario, a microphone (such as a stand-alone microphone or the microphone of a mobile terminal) may be used to capture the audio of the people in the conference venue, i.e., to record what they say — their voices. In this implementation, the audio and the video of the people in the venue are captured simultaneously; that is, the audio and video are collected in the same time period. Ideally, the captured audio includes the voices of all participants, or at least of all participants who speak. It should be appreciated that in some cases, for example when the number of microphones is insufficient or their quality is so poor that the captured audio is unclear, the voices of all participants (or all speaking participants) may not be obtainable. The microphone may transmit the captured audio to a server in real time, where it is processed in real time. Of course, the audio capture end and the processing end may also be implemented together, in which case the processing end is used to process the audio captured by the audio capture end.
In step S240, for each of at least some of the one or more faces, the part of the audio corresponding to that face is determined, wherein the at least some faces belong to the at least some objects, respectively.
The video has a time axis, and each video frame has a definite capture time. When a person in the venue speaks, the change of his face (mainly the mouth) can be detected in the video, so the time at which he speaks can be determined. Likewise, the audio has a time axis, and the capture time of the audio data is also known. When someone speaks, the change of the sound wave can be detected in the audio, so the time at which he speaks can also be determined. It can be understood that, by combining the video and audio data, the time at which someone speaks and his speaking content (i.e., his voice) can be determined relatively easily. Ideally, the faces and voices of all people in the venue are recorded, especially of those who have spoken, so that the user can later review or retrieve each speaker's speaking content. However, it is possible that the video includes the faces of all participants (or all speaking participants) while the audio does not include all of their voices, or conversely that the audio includes the voices of all participants (or all speaking participants) while the video does not include all of their faces; in such cases, the faces of some of the participants (or some of the speaking participants) in the venue and the corresponding audio parts can still be determined.
In step S250, for each of the at least some faces, that face is associated with the corresponding audio part.
After the audio part corresponding to a face has been determined, that face can be associated with the corresponding audio part. For example, if, in a video captured one morning, face detection finds that an object speaks in the video frames from 9:00 to 9:10, and voice changes are found from 9:00 to 9:10 in the audio captured at the same time, then the face image of the object (e.g., the whole video frame including the object's face, or an image containing only the face extracted from a video frame) can be associated with the audio segment captured from 9:00 to 9:10. In this way, when the user later reviews this meeting record, he can be informed that the object spoke from 9:00 to 9:10 and what the object said during that period. In addition, this association allows the user to search the meeting record very conveniently.
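A minimal sketch of such an association by shared capture time follows: a face's speaking interval, found from the video, is matched to the audio segment that overlaps it most on the common time axis. The interval representation and the record structure are assumptions for illustration.

```python
def overlap(a, b):
    """Length of the overlap between two (start, end) intervals in seconds."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def associate(face_id, face_interval, audio_segments):
    """audio_segments: list of (start, end) tuples on the shared time axis."""
    matched = max(audio_segments, key=lambda seg: overlap(face_interval, seg))
    return {"face": face_id, "spoke": face_interval, "audio_part": matched}
```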
It should be appreciated that the order of the steps shown in Fig. 2 is only exemplary and not restrictive, and steps S210 to S250 may be carried out in any suitable order. In one example, steps S210 to S250 may be carried out in real time. For example, audio and video capture and acquisition may start at the same time, i.e., steps S210 and S230 may be carried out simultaneously. More specifically, in a conference scenario, the camera continuously captures video frames of the participants and transmits the captured frames to a connected local processor or remote server, while the microphone continuously captures the participants' audio and transmits the captured audio data to the connected local processor or remote server. Whenever the local processor or remote server receives a new video frame (i.e., carries out step S210), it performs face detection on the frame (i.e., carries out step S220). While receiving new video frames, it also receives new audio data (i.e., carries out step S230). The local processor or remote server may determine the audio part corresponding to a face identified in step S220 (i.e., carry out step S240) and associate the face with the corresponding audio part (i.e., carry out step S250). The whole method 200 described above is carried out continuously and in real time. In another example, the camera may store the video of the whole meeting and the microphone may store the audio of the whole meeting. After the meeting ends, the camera and microphone may transmit the complete audio and video to a local processor or remote server, which then processes them. In this case, step S210 may be carried out before, after or simultaneously with step S230, and step S220 may be carried out before, after or simultaneously with step S230.
Exemplarily, the method for processing video and related audio according to the embodiments of the present invention may be realized in a unit or system having a memory and a processor.
The method for processing video and related audio according to the embodiments of the present invention may be deployed on a client. For example, in a conference scenario, the camera of a mobile terminal (i.e., the video capture device) may capture video of the participants and the microphone of the mobile terminal (i.e., the audio capture device) may capture their audio, after which the processor of the mobile terminal (i.e., the apparatus for processing video and related audio) processes the audio and video. In another conference scenario, the video capture device, the audio capture device and the apparatus for processing video and related audio are all deployed at the venue. For example, a stand-alone video camera (i.e., the video capture device) may capture video of the participants and a stand-alone microphone or the camera's built-in microphone (i.e., the audio capture device) may capture their audio; the camera and microphone then transmit the captured audio and video to a connected computer, whose processor (i.e., the apparatus for processing video and related audio) processes them.
Alternatively, the method for processing video and related audio according to the embodiments of the present invention may be deployed in a distributed manner across a server end (or cloud) and a client (e.g., a mobile terminal). For example, in a conference scenario, a camera (such as a stand-alone video camera or the camera of a mobile terminal) may capture video of the participants, and a microphone (such as a stand-alone microphone, a video camera's built-in microphone or the microphone of a mobile terminal) may capture the objects' audio; the camera and microphone transmit the captured audio and video to the server end (or cloud), which processes them.
According to the method for processing video and related audio of the embodiments of the present invention, the speaking time and speaking content of an object can be determined by associating the object's face with its voice, which makes it convenient for the user to later review and retrieve what the object said. The present invention is applicable to any suitable scenario in which an object's speech needs to be recorded, such as a conference scenario.
Fig. 3 shows a schematic flowchart of a method 300 for processing video and related audio according to another embodiment of the present invention. Steps S310, S320, S330 and S380 of the method 300 shown in Fig. 3 correspond respectively to steps S210, S220, S230 and S250 of the method 200 shown in Fig. 2. Those skilled in the art can understand these steps of Fig. 3 from Fig. 2 and the description above, so they are not repeated here for brevity. Step S370 shown in Fig. 3 is one implementation of step S240 shown in Fig. 2 and is described in detail below. According to the present embodiment, before step S370, the method 300 may further comprise the following steps.
In step S340, for each of the at least some faces, the video is segmented according to the mouth actions of that face to obtain the initial video segments corresponding to that face.
Face detection in step S320 can detect the contour of a face and locate the face region. Facial landmarks can then be located within the located face region. Facial landmarks generally include points of the face with strong discriminative power, such as the eyes, eye corners, eye centers, eyebrows, the nose, the nose tip, the mouth and the mouth corners. In the present invention, it is mainly the mouth landmarks that need to be located. A pre-trained landmark locator may be used to locate facial landmarks in the face region. For example, a cascaded regression method may be used in advance to train a landmark locator on a large number of manually annotated face images. Alternatively, a traditional facial landmark localization method based on a parametric shape model may be adopted: a parametric model is learned from the appearance features around the landmarks, and in use the landmark positions are optimized iteratively until the landmark coordinates are finally obtained.
As mentioned above, in the present invention it is mainly the mouth landmarks that need to be located; for example, the mouth contour can be located. The mouth action of a face can be judged from the size changes of the mouth contour of the same face over a period of time (i.e., across consecutive video frames). For example, if the mouth of the same face is gradually opening or closing over a period of time, the object corresponding to that face can be considered to be speaking. If the mouth of the same face remains closed over a period of time, the object corresponding to that face can be considered not to be speaking. Or, if over a period of time the mouth remains open while the mouth contour changes very little, the object can likewise be considered not to be speaking (it may, for example, be yawning). The video frames captured while the object speaks can thus be separated, according to the mouth actions, from those captured while the object is silent; that is, the video is segmented according to whether the object is speaking. An initial video segment corresponding to a face may be a stretch of video, determined from the face's mouth actions, that was captured while the object was in the speaking state.
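A minimal sketch of this mouth-action segmentation follows, assuming a per-frame mouth-openness ratio (e.g., mouth height over width from the located mouth landmarks) is already available; the threshold and hold period stand in for the predetermined periods in the segmentation rule and are illustrative values.

```python
def segment_by_mouth(openness, fps, open_thresh=0.2, hold_sec=1.0):
    """openness: per-frame mouth-opening ratio for one tracked face.
    Returns (start_frame, end_frame) initial video segments."""
    hold = int(hold_sec * fps)   # frames the mouth must stay closed
    segments, start = [], None
    closed_run = hold            # frames since the mouth was last open
    for i, v in enumerate(openness):
        if v > open_thresh:      # mouth counts as open in this frame
            if start is None and closed_run >= hold:
                start = i        # segmentation start time
            closed_run = 0
        else:
            closed_run += 1
            if start is not None and closed_run == hold:
                segments.append((start, i - hold + 1))  # segmentation end time
                start = None
    if start is not None:        # speech still ongoing at the last frame
        segments.append((start, len(openness)))
    return segments
```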
Although the whole video, or rather each of its frames, may contain multiple faces, the object corresponding to each face may speak only during certain periods. For example, suppose the whole video X records the face of object A and the face of object B, and objects A and B speak during period a and period b respectively. The face of object A can be tracked separately, and the whole video X can be segmented according to A's mouth actions to find the stretch of video captured during period a, i.e., the initial video segment corresponding to A's face. Similarly for object B, segmenting the whole video X according to B's mouth actions finds the stretch captured during period b, i.e., the initial video segment corresponding to B's face. In other words, the whole video can be segmented separately according to each face's mouth actions to obtain the initial video segments corresponding to each face.
In step S350, for each of the at least some faces, the audio is segmented according to the voice features in the audio to obtain the initial audio segments corresponding to that face.
Voice features may include voice changes. Voice changes in the audio are the fluctuations of the sound waves coming from the objects. It can be understood that when someone is speaking, voice fluctuations are present in the audio, whereas when nobody is speaking, the audio may contain only background noise and almost no voice fluctuation can be detected. Therefore, whether someone is speaking can be judged from the voice changes. Voice features may also include features of other types, such as voice content: if an object continuously utters meaningless filler words over a period of time, the object can be considered not to be speaking during that period.
The audio data captured while an object speaks can thus be separated from that captured while the object is silent; that is, the audio is segmented according to whether the object is speaking. An initial audio segment corresponding to a face may be a stretch of audio, determined from the voice features in the audio, that was captured while the object was in the speaking state.
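A minimal sketch of one such segmentation follows, using short-term energy as the voice feature (a simple voice-activity heuristic); the frame size, energy threshold and hold period are assumptions, and the same start/end rule as the mouth-based sketch above can be reused.

```python
import numpy as np

def voiced_frames(samples, sr, frame_ms=30, energy_thresh=1e-4):
    """Mark each short frame of mono float audio as voiced or not by its energy."""
    frame = int(sr * frame_ms / 1000)
    return [float(np.mean(samples[i:i + frame] ** 2)) > energy_thresh
            for i in range(0, len(samples) - frame + 1, frame)]

# The boolean frame sequence can be turned into initial audio segments with
# the same start/end-with-hold rule as segment_by_mouth above, e.g.:
#   flags = voiced_frames(samples, sr)
#   segments = segment_by_mouth([1.0 if v else 0.0 for v in flags],
#                               fps=1000 / 30, open_thresh=0.5)
```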
If the audio is collected by a single global microphone (which may be called a unified microphone), the voices of all objects may be mixed in the same channel of audio. In this case, after the audio has been segmented according to the voice features, it is further necessary to judge which object's face each segmented audio segment should correspond to, according to the capture time of that audio segment and the capture times of each object's initial video segments. If the audio is collected by directional microphones, the collected audio is divided into multiple channels, each containing the voice of only one object; in this case there is no need to judge the correspondence between audio segments and faces, because that correspondence is already determined when a directional microphone is assigned to an object. These embodiments are described in further detail below and are not repeated here.
In step S360, for each of the at least some faces, the effective video segments in the video and the effective audio segments in the audio corresponding to that face are obtained from the initial video segments and initial audio segments corresponding to that face.
An effective video segment is a video segment finally determined to have been captured while the object was in the speaking state, and an effective audio segment is an audio segment finally determined to have been captured while the object was in the speaking state. Effective video segments and effective audio segments may be used to determine the audio part corresponding to a face and to associate the face with the corresponding audio part.
In one example, step S360 may comprise: for each of the at least some faces, determining the initial video segments corresponding to that face as the effective video segments corresponding to that face, and determining the initial audio segments corresponding to that face as the effective audio segments corresponding to that face.
Usually, the mouth actions and the voice of an object are consistent: the object opens its mouth when speaking and keeps it shut when not speaking. Therefore, the initial video segments divided according to the face's mouth actions and the initial audio segments divided according to the voice features correspond substantially on the time axis. In this case, the initial video segments can be regarded directly as the effective video segments, and the initial audio segments as the effective audio segments. This approach determines the effective video and audio segments relatively quickly and simply.
In another example, step S360 may comprise: for each of the at least part of the faces, determining a unified segmentation time according to the segmentation times of the initial video segment and the initial audio segment corresponding to the face; and segmenting the video and the audio uniformly according to the unified segmentation time, to obtain the effective video segment and the effective audio segment corresponding to the face.
Because the video and the audio use different criteria for judging whether an object is speaking, the initial video segments and initial audio segments may not coincide exactly on the time line, and their accuracy in delimiting the speaking state is not necessarily the same. In some cases the initial video segment delimits the speaking state more accurately than the initial audio segment; in other cases the opposite holds. Therefore, the segmentation times of the initial video segment and the initial audio segment can be considered together to determine a more suitable unified segmentation time, which is used to uniformly determine whether the object is in the speaking state. The video and the audio are then re-segmented according to this unified segmentation time to obtain the effective video segment and the effective audio segment. This approach improves the accuracy with which the effective audio segments and effective video segments are delimited.
For example, suppose that, for an object A, the mouth action in the video shows that A is continuously in the speaking state during a first period from 9:10:20 to 9:11:30, and the video collected in this first period is taken as an initial video segment; in addition, the speech features in the audio show that A is continuously in the speaking state during a second period from 9:10:30 to 9:11:35, and the audio collected in this second period is taken as an initial audio segment. The first period and the second period can then be considered together to determine the unified segmentation time. For example, the period from 9:10:20 to 9:11:35 (which may be called a third period) may be regarded as the time during which object A is actually in the speaking state. In this case, 9:10:20 is taken as the unified segmentation start time and 9:11:35 as the unified segmentation end time; the video collected in the third period is the effective video segment and the audio collected in the third period is the effective audio segment. In this example, the third period is the union of the first and second periods. As another example, the period from 9:10:30 to 9:11:30 (which may be called a fourth period) may be regarded as the time during which object A is actually in the speaking state. In this case, 9:10:30 is taken as the unified segmentation start time and 9:11:30 as the unified segmentation end time; the video collected in the fourth period is the effective video segment and the audio collected in the fourth period is the effective audio segment. In this example, the fourth period is the intersection of the first and second periods. The above ways of determining the unified segmentation time are merely examples and not limiting; the unified segmentation time may be determined in any other suitable way, all of which fall within the protection scope of the present invention.
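As a non-limiting illustration of the union and intersection strategies in the example above, the unified segmentation period could be computed as follows. This is a minimal Python sketch; representing periods as (start, end) pairs in seconds and selecting the strategy with a `mode` argument are our assumptions:

```python
def unified_period(video_period, audio_period, mode="union"):
    """Combine the segmentation times of an initial video segment and an
    initial audio segment into one unified segmentation period.

    Periods are (start, end) pairs in seconds on a shared clock."""
    (vs, ve), (as_, ae) = video_period, audio_period
    if mode == "union":          # the third period, 9:10:20-9:11:35, above
        return min(vs, as_), max(ve, ae)
    if mode == "intersection":   # the fourth period, 9:10:30-9:11:30, above
        start, end = max(vs, as_), min(ve, ae)
        return (start, end) if start < end else None  # None: no overlap
    raise ValueError(mode)
```

With the example periods expressed in seconds, `mode="union"` reproduces the third period and `mode="intersection"` reproduces the fourth period.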
Compared with segmenting by a single cue alone, segmenting jointly by the mouth action of the face and the speech features yields a better segmentation result; for instance, situations such as whispering can be handled better with the help of the speech features.
In the embodiment shown in Fig. 3, step S370 may specifically comprise: for each of the at least part of the faces, determining the effective audio segment corresponding to the face as the audio portion corresponding to the face.
All the effective audio segments corresponding to a given face can be regarded as the audio portion, to be found, that corresponds to that face.
According to the embodiment shown in Fig. 3, the video and the audio are segmented according to whether the object is speaking, so that the audio portion corresponding to the face of the object can be found more accurately.
According to an embodiment of the present invention, step S380 may comprise: for each of the at least part of the faces, for each effective video segment corresponding to the face, selecting the video frame with the best face quality from all video frames of the effective video segment; and associating the selected video frame with the effective audio segment corresponding to the effective video segment, to form a video/audio combination. The video frame with the best face quality may be the video frame with the highest resolution among all the video frames, or the video frame in which the face is clearest.
The effective audio segment corresponding to an effective video segment is the effective audio segment whose acquisition time is the same as, or basically the same as, that of the effective video segment. Through steps S310 to S370, several effective video segments and their one-to-one corresponding effective audio segments can be obtained for each face. In one example, each effective video segment can be directly associated with its corresponding effective audio segment to form several video-audio pairs, each of which can be regarded as a video/audio combination. In another example, one or more representative video frames, i.e., the one or more video frames with the best face quality, can be selected from each effective video segment. The selected video frames are associated with the corresponding effective audio segments, finally forming several frame-audio pairs, each of which can also be regarded as a video/audio combination. It can be understood that a video frame is a face image and may contain several faces. A selected video frame may be the original video frame, in which the face corresponding to the effective video segment may be marked (for example with a box). Alternatively, the selected video frame may be a video frame that contains only the face corresponding to the effective video segment. In the latter case, an original video frame in the initial video segment corresponding to a desired face may be converted in step S340 into a new video frame containing only that desired face; or an original video frame in the effective video segment corresponding to the desired face may be so converted in step S360; or an original or selected video frame in the effective video segment corresponding to the desired face may be so converted in step S380. Ideally, each video/audio combination formed consists of a face image containing only one face matched with an effective audio segment corresponding to that face. In this way, under a conference scenario, when the user wishes to review the meeting record, the record can be presented as face images accompanied by effective audio segments, which is very intuitive and very easy to retrieve.
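By way of a non-limiting illustration, the pairing of a best-quality frame with its effective audio segment could be sketched as follows. The container `effective_pairs` and the scoring callable `face_quality` are illustrative names of our own, not part of this disclosure:

```python
def build_combinations(effective_pairs, face_quality):
    """Form one video/audio combination per effective video segment.

    `effective_pairs` maps a face id to a list of
    (effective_video_frames, effective_audio_segment) pairs, and
    `face_quality(frame, face_id)` scores how clear/large the given
    face is in a frame (e.g. resolution or sharpness)."""
    combos = {}
    for face_id, pairs in effective_pairs.items():
        combos[face_id] = [
            # best-quality frame paired with the matching audio segment
            (max(frames, key=lambda f: face_quality(f, face_id)), audio_seg)
            for frames, audio_seg in pairs
        ]
    return combos
```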
According to an embodiment of the present invention, in order to compensate for face detection errors, the method 200 (or 300) may further comprise a classifying step. Fig. 4 shows a schematic flowchart of the classifying step according to an embodiment of the present invention. As shown in Fig. 4, the classifying step may comprise the following steps.
In step S410, for the face corresponding to a particular video/audio combination, face feature extraction is performed on the video frame in the particular video/audio combination to obtain a specific face feature, where the particular video/audio combination is one of all the video/audio combinations corresponding to the at least part of the faces.
In one example, the video acquisition device and the audio acquisition device collect the video and the audio in real time, and the device for processing videos and related audios processes them in real time; that is, the video and the audio are processed while being collected. In this case, more and more video/audio combinations are obtained as time passes. Whenever a new video/audio combination is obtained, it can be compared with all previously obtained video/audio combinations, and if the currently obtained combination is found to belong to the same object as a previously obtained combination, the two are classified to the same object. For the currently obtained video/audio combination, the similarity of its face and its voice to those of each previously obtained video/audio combination can be calculated.
For the case where the video and the audio are first collected completely and then processed, any video/audio combination can be selected as the particular video/audio combination, and the similarities of its face and voice to those of the remaining video/audio combinations are calculated.
Step S410 mainly extracts the face feature of the specific face. For example, for a video/audio combination consisting of one video frame and one effective audio segment, the video frame may contain only the face corresponding to the combination, or may further contain other faces. When extracting face features, the extraction only needs to be performed on the face corresponding to the video/audio combination.
Face feature extraction, also called face representation, is the process of modelling the features of a face. It can be realized by two classes of methods: methods based on geometric features, and methods based on algebraic features or statistical learning. Geometric-feature methods mainly take the geometric shapes of, and geometric relationships between, the important parts of the face (such as the eyes, nose, mouth and chin) as the face feature. The positions of the eyes, nose, mouth, chin, etc., can be called feature points. From these feature points, feature components that measure the face feature can be constructed; typical feature components include the Euclidean distances, curvatures and angles between feature points. The face feature described herein may comprise such feature components. Methods based on algebraic features or statistical learning regard the video frame as a matrix and extract the statistical features of the face by matrix transformation or linear projection. This is a holistic approach: the whole video frame (i.e., the face image) is recognized as one pattern, so such methods are also template matching methods. The face feature described herein may also comprise such statistical features.
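As a non-limiting illustration of the geometric feature components just described, a few distances and an angle between feature points could be computed as follows. The landmark names and the particular components chosen are our own assumptions:

```python
import numpy as np

def geometric_features(landmarks):
    """Build illustrative geometric feature components from facial
    feature points (eyes, nose, mouth, chin).

    `landmarks` is a dict of 2-D points keyed by illustrative names."""
    eye_l, eye_r = np.array(landmarks["eye_l"]), np.array(landmarks["eye_r"])
    nose, mouth = np.array(landmarks["nose"]), np.array(landmarks["mouth"])
    chin = np.array(landmarks["chin"])
    d = np.linalg.norm
    eye_dist = d(eye_l - eye_r)
    v1, v2 = eye_l - nose, eye_r - nose
    # angle at the nose between the two eye directions
    angle = np.arccos(np.clip(v1 @ v2 / (d(v1) * d(v2)), -1.0, 1.0))
    # normalise distances by the inter-eye distance for scale invariance
    return np.array([d(nose - mouth) / eye_dist,
                     d(mouth - chin) / eye_dist,
                     angle])
```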
The above face feature extraction methods are merely examples and not limiting; any other known or future feasible face feature extraction method may be used to process the video frame in the particular video/audio combination to obtain the specific face feature.
In the above manner, the face feature of the face corresponding to the particular video/audio combination, i.e., the specific face feature, can be obtained.
In step S420, sound feature extraction is performed on the effective audio segment in the particular video/audio combination to obtain a specific sound feature.
Sound feature extraction can be realized by extracting and selecting acoustic or linguistic features of the speaker's voiceprint that have strong separability and high stability. The extracted sound features may include: (1) acoustic features related to the anatomical structure of the human vocal mechanism (such as spectrum, cepstrum, formants, pitch and reflection coefficients), nasality, breathy voice, hoarseness, laughter, etc.; (2) semantics, rhetoric, pronunciation and speech habits influenced by socioeconomic status, education level, birthplace, etc.; and (3) personal traits or features influenced by upbringing, such as prosody, rhythm, speaking speed, intonation and volume.
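By way of a non-limiting illustration, a cepstrum-type acoustic feature of the kind listed in item (1) could be extracted with an off-the-shelf library. The sketch below uses the third-party `librosa` package, which is our own choice of tool, and summarizes MFCCs into a fixed-length vector; a real speaker feature would typically add pitch, formants and other cues:

```python
import librosa
import numpy as np

def voice_feature(wav_path, n_mfcc=13):
    """Extract a fixed-length voiceprint-style feature from one
    effective audio segment: mean and standard deviation of MFCCs
    (a cepstrum feature). A sketch, not a full speaker model."""
    y, sr = librosa.load(wav_path, sr=None)   # keep the native sample rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])
```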
In step S430, for each of the remaining video/audio combinations among all the video/audio combinations, the face similarity between the specific face feature and the face feature corresponding to that video/audio combination is calculated.
The face feature corresponding to a video/audio combination is the face feature of the face corresponding to that combination. For the case where the video and the audio are collected and processed in real time, the face feature corresponding to each newly obtained video/audio combination can be calculated and stored in a storage device. Meanwhile, the face feature corresponding to the currently obtained video/audio combination (i.e., the specific face feature) can be compared with the stored face features corresponding to the previously obtained video/audio combinations, and the similarities between them are calculated.
For the case where the video and the audio are first collected completely and then processed, the face features of all video/audio combinations can be calculated at the same time; any one of the combinations is selected as the particular video/audio combination, and the face similarities between it and the remaining combinations are calculated.
In step S440, for each of the remaining video/audio combinations, the sound similarity between the specific sound feature and the sound feature corresponding to that video/audio combination is calculated.
For the case where the video and the audio are collected and processed in real time, the sound feature corresponding to each newly obtained video/audio combination can be calculated and stored in a storage device. Meanwhile, the sound feature corresponding to the currently obtained video/audio combination (i.e., the specific sound feature) can be compared with the stored sound features corresponding to the previously obtained video/audio combinations, and the similarities between them are calculated.
For the case where the video and the audio are first collected completely and then processed, the sound features of all video/audio combinations can be calculated at the same time; any one of the combinations is selected as the particular video/audio combination, and the sound similarities between it and the remaining combinations are calculated.
In step S450, for each of the remaining video/audio combinations, the mean of the face similarity and the sound similarity between the particular video/audio combination and that video/audio combination is calculated, to obtain the average similarity between the particular video/audio combination and that video/audio combination.
For example, a certain video/audio combination x has both a face similarity and a sound similarity with the particular video/audio combination y; suppose they are 80% and 90%, respectively. The mean of the two gives an average similarity of 85%; that is, the average similarity between video/audio combination x and the particular video/audio combination y is 85%.
In step S460, for each of the remaining video/audio combinations, if the average similarity between the particular video/audio combination and that video/audio combination is greater than a similarity threshold, the particular video/audio combination and that video/audio combination are classified to the same object.
The similarity threshold can be determined as needed and may be any suitable value; the present invention does not limit it. Suppose the similarity threshold is 90%; then the average similarity of 85% is below the threshold, and the above video/audio combination x and the particular video/audio combination y are considered not to belong to the same object. If face detection had previously, and mistakenly, associated combination x and combination y with the same face, this mechanism corrects the face detection error. Suppose instead that the similarity threshold is 80%; then the average similarity of 85% exceeds the threshold, and combination x and combination y are considered to belong to the same object. If face detection had previously, and mistakenly, associated them with different faces, this mechanism likewise corrects the face detection error.
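As a non-limiting illustration, steps S430 to S460 amount to an incremental grouping by average similarity, which could be sketched as follows. The two similarity callables, the data layout and the threshold value are our own assumptions:

```python
def classify(combos, face_sim, voice_sim, threshold=0.8):
    """Group video/audio combinations by object using the average of
    face similarity and sound similarity (cf. steps S430-S460).

    `combos` is a list of (face_feature, voice_feature) tuples;
    `face_sim` and `voice_sim` return similarities in [0, 1].
    Returns one object label per combination."""
    labels = [-1] * len(combos)
    next_label = 0
    for i, (f_i, v_i) in enumerate(combos):
        for j in range(i):                      # compare with earlier combos
            f_j, v_j = combos[j]
            avg = (face_sim(f_i, f_j) + voice_sim(v_i, v_j)) / 2.0
            if avg > threshold:                 # same object as combo j
                labels[i] = labels[j]
                break
        if labels[i] == -1:                     # no match found: new object
            labels[i] = next_label
            next_label += 1
    return labels
```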
Through the classifying operation, the faces of the same object in the video can be grouped together, which significantly improves the classification accuracy of the audio. In addition, the classifying operation makes the method for processing videos and related audios described herein more robust to situations such as the same object speaking in different tones or at different volumes, reducing the chance that one object speaking in different moods is mistakenly classified as multiple objects.
It should be understood that Fig. 4 is merely an example and not limiting; steps S410 to S460 may be executed in any reasonable order and are not restricted to the order shown in Fig. 4.
According to an embodiment of the present invention, the audio is collected by a unified microphone, and step S350 may comprise: segmenting the audio according to the speech features in the audio to obtain mixed audio segments; and, for each of the at least part of the faces, selecting from the mixed audio segments the mixed audio segments whose acquisition time coincides with that of the initial video segments corresponding to the face, as the initial audio segments corresponding to the face.
Under a conference scenario, one microphone (or possibly several, i.e., a unified microphone) may be used to collect the voices of all participants. In this case the voices of all participants are contained in the same audio channel. After the audio is segmented according to the speech features, the obtained audio segments (i.e., mixed audio segments) may correspond to different objects. The initial video segments reveal when each object spoke. For example, suppose object A has three initial video segments. Combining the segmentation times of these initial video segments, three mixed audio segments whose acquisition times coincide with the initial video segments can be found; these three mixed audio segments are exactly the required initial audio segments corresponding to the face of object A. It should be noted that coinciding acquisition times as described herein may include the case where the acquisition times are synchronous or basically synchronous, and should not be understood as requiring the acquisition times to be strictly identical.
When a unified microphone is used to collect the audio, finding the corresponding voice information with the help of the face information in this way is simple and efficient.
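By way of a non-limiting illustration, matching mixed audio segments to a face's initial video segments by acquisition time could be sketched as follows. Representing segments as (start, end) pairs on a shared clock and using a fixed tolerance for "basically synchronous" are our assumptions:

```python
def match_mixed_segments(initial_video_segs, mixed_audio_segs, tol=0.5):
    """For one face, pick the mixed audio segments whose acquisition
    time coincides (within `tol` seconds) with that face's initial
    video segments. Segments are (start, end) pairs in seconds."""
    matched = []
    for vs, ve in initial_video_segs:
        for as_, ae in mixed_audio_segs:
            if abs(as_ - vs) <= tol and abs(ae - ve) <= tol:
                matched.append((as_, ae))   # this audio belongs to the face
    return matched
```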
According to an embodiment of the present invention, the audio comprises one or more channels of audio collected respectively by one or more directional microphones, and before step S330 the method 300 may further comprise: controlling the one or more directional microphones to point respectively at the at least part of the objects so as to collect the one or more channels of audio.
A directional microphone may be a directional microphone mounted on a pan-tilt head. A directional microphone collects the voice of the object it points at much more clearly, while picking up almost none of the voices of other objects; it therefore enables audio collection with a high signal-to-noise ratio.
Under a conference scenario, a video containing the faces of all participants can be collected first. Face detection is then performed in real time, and directional microphones are assigned to participants according to the detected faces. Preferably, the number of directional microphones is equal to or greater than the number of the one or more faces mentioned above. In this way, when the one or more objects are all the objects in the venue, it can be ensured that every object in the venue is assigned a directional microphone, so that the voices of all objects are recorded and no speech is missed. If the number of directional microphones is smaller than the number of the one or more faces, the directional microphones can be assigned flexibly. Usually only one object speaks at any given moment; when an object currently assigned a directional microphone falls silent, the microphone can be reassigned to the next object that starts speaking. These operations can be implemented based on the face detection results.
Of course, directional microphones may also be fixedly assigned to objects; in that case, if the number of directional microphones is smaller than the number of the one or more faces, only the voices of some of the objects can be collected. Specifically, after the faces in the venue are detected, the one or more directional microphones are assigned to at least part of the objects corresponding to at least part of the faces. The number of objects that can be assigned a directional microphone depends on the number of directional microphones. Each directional microphone collects one channel of audio, so one or more channels of audio can be obtained.
In the present embodiment, step S350 may comprise: for each of the at least part of the faces, segmenting the channel of audio collected by the directional microphone pointing at the object corresponding to the face, according to the speech features in that channel, to obtain the initial audio segments corresponding to the face.
Since the object a directional microphone points at is known, the correspondence between each audio channel and each face is known. For example, suppose directional microphone m points at object A; then the channel of audio from microphone m contains only the voice of object A, and segmenting that channel directly yields the initial audio segments corresponding to object A. Of course, when the pointing of the directional microphones is adjusted flexibly, the correspondence between audio channels and faces may change; but this change is also known, so the correspondence between each audio channel and each face can be determined period by period, and the initial audio segments corresponding to each object can then be determined.
Through the cooperation of face detection and pan-tilt directional microphones, audio that is much clearer than that of a wide-range microphone (such as the unified microphone mentioned above) can be obtained, which brings a considerable gain to subsequent steps such as audio segmentation, the classifying operation and speech recognition.
According to an embodiment of the present invention, before controlling the one or more directional microphones to point respectively at the at least part of the objects to collect the one or more channels of audio, the method 300 may further comprise: determining the priority of each face according to the face features and/or actions of the one or more faces; and determining, according to the priority of each face, the objects the one or more directional microphones are to point at, as the at least part of the objects.
Directional microphones can be assigned to objects according to priority, which is particularly useful when the number of directional microphones is smaller than the number of the one or more faces. The priority can be determined according to the face features and/or actions of a face. The face features may include the size of the facial contour. For example, a directional microphone may be placed together with the camera; when a face captured by the camera is large, the object corresponding to that face can be considered close to the directional microphone, so the priority of that face can be raised and a directional microphone preferentially assigned to the corresponding object. The face features may also include the mouth action of the face. For example, if the mouth actions in several consecutive video frames show that the object corresponding to a first face stops speaking while the object corresponding to a second face starts speaking, the priority of the first face can be lowered and that of the second face raised, so that the directional microphone originally assigned to the object of the first face can be reassigned to the object of the second face. The actions of a face may include whether the face is stable. For example, if several consecutive video frames show that the object corresponding to a face is relatively stable and does not move about, the priority of that face can be raised and a directional microphone preferentially assigned to the corresponding object.
With priorities, the pointing of the directional microphones can be adjusted more flexibly, ensuring that the voices of as many objects as possible are collected when the number of directional microphones is insufficient.
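By way of a non-limiting illustration, a priority-based assignment along the lines of the three cues above could be sketched as follows. The scoring weights and the face representation are entirely illustrative; the disclosure does not prescribe a particular scoring function:

```python
def assign_microphones(faces, n_mics):
    """Rank faces by priority and point the available directional
    microphones at the highest-priority objects.

    Each face is a dict such as
    {"id": 3, "size": 0.12, "talking": True, "stable": True},
    where `size` is the facial-contour size, `talking` reflects the
    current mouth action and `stable` whether the face moves about."""
    def priority(face):
        return (2.0 * face["size"]                     # closer faces first
                + (1.0 if face["talking"] else 0.0)    # speakers first
                + (0.5 if face["stable"] else 0.0))    # stable faces first
    ranked = sorted(faces, key=priority, reverse=True)
    return [f["id"] for f in ranked[:n_mics]]  # objects that get a microphone
```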
According to an embodiment of the present invention, step S340 may be implemented according to the following rule: for each of the at least part of the faces, if the mouth of the face changes from a closed state to an open state at a first moment and was continuously in the closed state during a first predetermined period before the first moment, the first moment is taken as a video segmentation start time; if the mouth of the face changes from the open state to the closed state at a second moment and remains continuously in the closed state during a second predetermined period after the second moment, the second moment is taken as a video segmentation end time. The portion of the video between an adjacent video segmentation start time and video segmentation end time is an initial video segment.
Any two of the first predetermined period, the second predetermined period, and the third and fourth predetermined periods described below may be identical or different; they can be determined as needed, and the present invention does not limit them.
If the mouth of an object suddenly opens after having been closed for the first predetermined period, the object can be considered to start speaking, and that time point can be regarded as the video segmentation start time. If the mouth of an object suddenly closes and stays closed for the second predetermined period, the object can be considered to stop speaking, and that time point can be regarded as the video segmentation end time.
Of course, it can be understood that the video may also be segmented according to the mouth action under other rules, or segmented according to other face features, all of which fall within the protection scope of the present invention.
According to an embodiment of the present invention, step S350 may be implemented according to the following rule: if the voice in the audio changes from a non-sounding state to a sounding state at a third moment and was continuously in the non-sounding state during a third predetermined period before the third moment, the third moment is taken as an audio segmentation start time; if the voice in the audio changes from the sounding state to the non-sounding state at a fourth moment and remains continuously in the non-sounding state during a fourth predetermined period after the fourth moment, the fourth moment is taken as an audio segmentation end time. The portion of the audio between an adjacent audio segmentation start time and audio segmentation end time is an initial audio segment.
Similarly to the video segmentation, if the voice in the audio suddenly sounds after the non-sounding state has lasted for the third predetermined period, an object can be considered to start speaking, and that time point can be regarded as the audio segmentation start time. If the voice in the audio suddenly stops sounding and the silence lasts for the fourth predetermined period, the object can be considered to stop speaking, and that time point can be regarded as the audio segmentation end time.
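Since the video rule of step S340 and the audio rule of step S350 share the same structure, both can be illustrated by one sketch over a boolean activity track (mouth open for video, voice sounding for audio). This is a non-limiting illustration; using a single `min_quiet` period here is a simplification, while the disclosure allows a different predetermined period for each of the four cases:

```python
def segment(active, t0, dt, min_quiet):
    """Split a boolean activity track into initial segments using the
    rule of steps S340/S350: a segment starts when activity begins
    after at least `min_quiet` seconds of inactivity, and ends at the
    moment activity stops, confirmed once the inactivity has lasted
    `min_quiet` seconds. `active[k]` is the state at time t0 + k*dt."""
    segs, start, quiet, end_candidate = [], None, float("inf"), None
    for k, a in enumerate(active):
        t = t0 + k * dt
        if a:
            if start is None and quiet >= min_quiet:
                start = t                       # segmentation start time
            quiet = 0.0
            end_candidate = None                # the pause was too short
        else:
            if end_candidate is None:
                end_candidate = t               # moment activity stopped
            quiet += dt
            if start is not None and quiet >= min_quiet:
                segs.append((start, end_candidate))  # confirmed end time
                start = None
    if start is not None:                       # still active at track end
        segs.append((start, t0 + len(active) * dt))
    return segs
```

Note that a short pause (shorter than `min_quiet`) neither ends the current segment nor starts a new one, which matches the "continuously in the state during the predetermined period" condition of the rule.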
Of course, it can be understood that the audio may also be segmented according to the speech features under other rules, all of which fall within the protection scope of the present invention.
Fig. 5 shows a schematic flowchart of a method 500 for processing videos and related audios according to another embodiment of the present invention. Steps S510 to S550 of the method 500 shown in Fig. 5 correspond respectively to steps S210 to S250 of the method 200 shown in Fig. 2; those skilled in the art can understand these steps of Fig. 5 from Fig. 2 and the description above, and for brevity they are not repeated here. In the present embodiment, after step S550, the method 500 may further comprise the following steps.
In step S560, for each of the at least part of the faces, speech recognition is performed on the audio portion corresponding to the face, to obtain text representing the audio portion corresponding to the face.
In step S570, for each of the at least part of the faces, the text is associated with the face.
For a given face, speech recognition can be carried out after the audio portion corresponding to the face is obtained. The speech recognition can be realized with conventional techniques and is not described here. The recognized text is the speaking content of the object expressed in written form, and it can be associated with the speaking object. It can be understood that, in embodiments comprising the classifying step, an effective audio segment originally associated with one object may be reclassified to another object; in that case, speech recognition can be performed on the reclassified effective audio segment and the recognized text associated with the correct object.
Speech recognition converts the speaking content of an object into text, which facilitates the storage of the speech and allows the user to conveniently retrieve the speech by keyword.
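By way of a non-limiting illustration, steps S560 and S570 could be sketched as follows. Using the open-source `speech_recognition` package (with its Google Web Speech backend) is our own illustrative choice of ASR engine; the disclosure only requires some conventional speech recognition technique, and the `index` structure is an assumption:

```python
import speech_recognition as sr  # third-party package, one possible ASR backend

def transcribe_segment(wav_path, face_id, index):
    """Run speech recognition on one audio portion and associate the
    resulting text with its face (cf. steps S560/S570)."""
    recognizer = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        audio = recognizer.record(source)        # read the whole segment
    try:
        text = recognizer.recognize_google(audio)  # any ASR backend would do
    except sr.UnknownValueError:
        text = ""                                # unintelligible segment
    index.setdefault(face_id, []).append(text)   # text associated with the face
    return text
```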
According to an embodiment of the present invention, the method 200 (300 or 500) may further comprise: outputting expected information. The expected information comprises one or more of the following items: the video, the audio, a video frame containing a specific face among the one or more faces, the acquisition time of the video frame containing the specific face, the audio portion corresponding to the specific face, and the acquisition time of the audio portion corresponding to the specific face.
The specific face may be, for example, one of the at least part of the faces mentioned above. For example, under a conference scenario, after the video and audio collected during the whole meeting are processed, the participants who spoke during the meeting and their speaking content can be determined. The face images of the participants who spoke, together with their speaking content (in audio or text form), can be output and presented to a user who wishes to review the meeting information. Of course, the face images of all participants may also be output together with the speaking content of those who spoke. In addition, the whole video or audio collected during the meeting can also be output.
In one example, the related information of the specific face can be output with an output device such as the output device 108 shown in Fig. 1. For example, the output device 108 may be an output interface at the server end, which outputs the expected information to the user's client. As another example, the output device 108 may be one or more of a display, a loudspeaker, etc., which displays or plays the expected information. When the expected information is displayed, it may be organized by time and/or by the faces of the objects. For example, under a conference scenario, the face images of all participants or of the participants who spoke, their speaking times and/or their speaking content, etc., may be displayed.
By outputting the expected information, the user can learn in time which objects spoke and what they said; for example, under a conference scenario, the user can learn the situation of the whole meeting.
According to another aspect of the present invention, a retrieval method is provided. Fig. 6 shows a schematic flowchart of a retrieval method 600 according to an embodiment of the present invention. As shown in Fig. 6, the retrieval method 600 comprises the following steps.
In step S610, a retrieval instruction for a target face is received.
The retrieval instruction may come from a user who wishes to review the recorded audio and/or video. For example, under a conference scenario, the face images of the participants who spoke during the whole meeting can be presented to the user; the user clicks a face image via an interactive interface, thereby inputting a retrieval instruction for that face. The retrieval method may be implemented at the server end, for example on the electronic device 100 mentioned above, and the user may input the retrieval instruction via the input device 106. In another example, the user's mobile terminal sends the retrieval instruction to the server end, and the server end sends the retrieved information (such as the audio portion corresponding to a face) back to the user's mobile terminal. In yet another example, the retrieval method may be implemented at the client, for example on the user's mobile terminal: the server may store, in a storage device, the video and audio processed by the method for processing videos and related audios described above together with other information, such as the audio portion corresponding to each face and/or the association between faces and audio portions, and send the stored information to the user's mobile terminal, where the user retrieves the needed information. In a further example, the method for processing videos and related audios and the retrieval method may both be implemented at the client; this case is similar to implementing both at the server end and is not described again.
In step S620, related information of the target face is looked up in a database according to the retrieval instruction, where the database stores the video and audio processed by the method for processing videos and related audios described above and/or the audio portion corresponding to each of the at least part of the faces, and where the related information of the target face comprises one or more of the following items: a video frame containing the target face, the acquisition time of the video frame containing the target face, the audio portion corresponding to the target face, and the acquisition time of the audio portion corresponding to the target face.
As described above, under a conference scenario, the face images of the participants who spoke during the whole meeting can be presented to the user, and the user clicks a face image via an interactive interface to input a retrieval instruction for that face. After the user clicks a face, the video portions and audio portions of that face during the meeting can be looked up in the database. The video frame containing the face may be a single video frame or consecutive video frames (i.e., a video segment). In addition, the database may also store the text representing the audio portion corresponding to each of the at least part of the faces, that is, the speaking content of the objects during the meeting expressed as text. In this way, the related information of the target face may also comprise the text corresponding to the target face.
In step S630, the related information of the target face is output.
The video frames containing the target face can be output via a display interface (such as a display), and the audio portion corresponding to the target face can be output via a sound playing device (such as a loudspeaker). By outputting the needed video frames or audio portions, the user is provided with information about the objects who spoke during the meeting and their speaking content.
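By way of a non-limiting illustration, the lookup of step S620 could be sketched as follows; the dictionary-based `face_db` and its field names are our own assumed data layout, not a structure prescribed by this disclosure:

```python
def retrieve(face_db, target_face_id):
    """Look up the related information of a target face
    (cf. steps S610-S630). `face_db` maps a face id to everything
    stored for it by the processing method."""
    record = face_db.get(target_face_id)
    if record is None:
        return None                              # unknown face
    return {
        "frames": record["frames"],              # video frames containing the face
        "frame_times": record["frame_times"],    # their acquisition times
        "audio_parts": record["audio_parts"],    # audio portions of this face
        "audio_times": record["audio_times"],    # their acquisition times
        "texts": record.get("texts", []),        # optional recognized text
    }
```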
For video and audio processed according to the method for processing videos and related audios described above, the association between each face and its voice is known, so the voice corresponding to a face can be retrieved quickly and efficiently.
It should be noted that the present invention is not limited to the above retrieval method; any other suitable retrieval method is also feasible. For example, the target face, the video frames containing the target face, the audio portion corresponding to the target face, etc., may also be retrieved by time.
Fig. 7 shows a schematic block diagram of a device 700 for processing videos and related audios according to an embodiment of the present invention.
As shown in Fig. 7, the device 700 for processing videos and related audios according to the embodiment of the present invention comprises a first acquisition module 710, a face detection module 720, a second acquisition module 730, an audio portion determination module 740 and an audio association module 750.
The first acquisition module 710 is configured to obtain a video containing one or more faces of one or more objects. The first acquisition module 710 can be realized by the processor 102 of the electronic device shown in Fig. 1 running the program instructions stored in the storage device 104.
The face detection module 720 is configured to perform face detection on each video frame in the video to identify the one or more faces. The face detection module 720 can be realized by the processor 102 of the electronic device shown in Fig. 1 running the program instructions stored in the storage device 104.
The second acquisition module 730 is configured to obtain audio, collected in the same time period as the video, that contains the voices of at least part of the one or more objects. The second acquisition module 730 can be realized by the processor 102 of the electronic device shown in Fig. 1 running the program instructions stored in the storage device 104.
The audio portion determination module 740 is configured to determine, for each of at least part of the one or more faces, the audio portion in the audio corresponding to that face, where the at least part of the faces belong respectively to the at least part of the objects. The audio portion determination module 740 can be realized by the processor 102 of the electronic device shown in Fig. 1 running the program instructions stored in the storage device 104.
The audio association module 750 is configured to associate, for each of the at least part of the faces, that face with the corresponding audio portion. The audio association module 750 can be realized by the processor 102 of the electronic device shown in Fig. 1 running the program instructions stored in the storage device 104.
Exemplarily, the device 700 for processing videos and related audios may further comprise: a video segmentation module configured to segment, for each of the at least part of the faces, the video according to the mouth action of that face to obtain the initial video segments corresponding to that face; an audio segmentation module configured to segment, for each of the at least part of the faces, the audio according to the speech features in the audio to obtain the initial audio segments corresponding to that face; and an effective video and audio obtaining module configured to obtain, according to the initial video segment and initial audio segment corresponding to that face, the effective video segment in the video corresponding to that face and the effective audio segment in the audio corresponding to that face. The audio portion determination module 740 may comprise a determination submodule configured to determine, for each of the at least part of the faces, the effective audio segment corresponding to that face as the audio portion corresponding to that face.
Exemplarily, the audio association module 750 may comprise: a video frame selection submodule configured to select, for each of the at least part of the faces and for each effective video segment corresponding to that face, the video frame with the best face quality from all video frames of the effective video segment; and an association submodule configured to associate the selected video frame with the effective audio segment corresponding to the effective video segment, to form a video/audio combination.
Exemplarily, the device 700 for processing videos and related audios may further comprise: a face feature extraction module configured to perform, for the face corresponding to a particular video/audio combination, face feature extraction on the video frame in the particular video/audio combination to obtain a specific face feature, where the particular video/audio combination is one of all the video/audio combinations corresponding to the at least part of the faces; a sound feature extraction module configured to perform sound feature extraction on the effective audio segment in the particular video/audio combination to obtain a specific sound feature; a face similarity calculation module configured to calculate, for each of the remaining video/audio combinations among all the video/audio combinations, the face similarity between the specific face feature and the face feature corresponding to that video/audio combination; a sound similarity calculation module configured to calculate, for each of the remaining video/audio combinations, the sound similarity between the specific sound feature and the sound feature corresponding to that video/audio combination; an average similarity calculation module configured to calculate, for each of the remaining video/audio combinations, the mean of the face similarity and the sound similarity between the particular video/audio combination and that video/audio combination, to obtain the average similarity between the two; and a classifying module configured to classify, for each of the remaining video/audio combinations, the particular video/audio combination and that video/audio combination to the same object if the average similarity between them is greater than a similarity threshold.
Exemplarily, the effective video and audio obtaining module may comprise: an effective video segment determination submodule configured to determine, for each of the at least part of the faces, the initial video segment corresponding to that face as the effective video segment corresponding to that face; and an effective audio segment determination submodule configured to determine, for each of the at least part of the faces, the initial audio segment corresponding to that face as the effective audio segment corresponding to that face.
Exemplarily, the effective video and audio obtaining module may comprise: a unified segmentation time determination submodule configured to determine, for each of the at least part of the faces, a unified segmentation time according to the segmentation times of the initial video segment and initial audio segment corresponding to that face; and a unified segmentation submodule configured to segment the video and the audio uniformly according to the unified segmentation time, to obtain the effective video segment and effective audio segment corresponding to that face.
Exemplarily, the audio is collected by a unified microphone, and the audio segmentation module comprises: a first segmentation submodule configured to segment the audio according to the speech features in the audio to obtain mixed audio segments; and an audio segment selection submodule configured to select, for each of the at least part of the faces, from the mixed audio segments the mixed audio segments whose acquisition time coincides with that of the initial video segments corresponding to that face, as the initial audio segments corresponding to that face.
Exemplarily, the audio comprises one or more channels of audio collected respectively by one or more directional microphones, and the device 700 for processing videos and related audios may further comprise a control module configured to control the one or more directional microphones to point respectively at the at least part of the objects to collect the one or more channels of audio; the audio segmentation module may comprise a second segmentation submodule configured to segment, for each of the at least part of the faces, the channel of audio collected by the directional microphone pointing at the object corresponding to that face, according to the speech features in that channel, to obtain the initial audio segments corresponding to that face.
Exemplarily, the number of directional microphones is equal to or greater than the number of the one or more faces.
Exemplarily, the device 700 for processing videos and related audios may further comprise: a priority determination module configured to determine the priority of each face according to the face features and/or actions of the one or more faces; and an object determination module configured to determine, according to the priority of each face, the objects the one or more directional microphones are to point at, as the at least part of the objects.
Exemplarily, the video segmentation module segments the video according to the following rule: for each of the at least part of the faces, if the mouth of that face changes from a closed state to an open state at a first moment and was continuously in the closed state during a first predetermined period before the first moment, the first moment is taken as a video segmentation start time; if the mouth of that face changes from the open state to the closed state at a second moment and remains continuously in the closed state during a second predetermined period after the second moment, the second moment is taken as a video segmentation end time; the portion of the video between an adjacent video segmentation start time and video segmentation end time is an initial video segment.
Exemplarily, the audio segmentation module segments the audio according to the following rule: if the voice in the audio changes from a non-sounding state to a sounding state at a third moment and was continuously in the non-sounding state during a third predetermined period before the third moment, the third moment is taken as an audio segmentation start time; if the voice in the audio changes from the sounding state to the non-sounding state at a fourth moment and remains continuously in the non-sounding state during a fourth predetermined period after the fourth moment, the fourth moment is taken as an audio segmentation end time; the portion of the audio between an adjacent audio segmentation start time and audio segmentation end time is an initial audio segment.
Exemplarily, the device 700 for processing videos and related audios may further comprise: a speech recognition module configured to perform, for each of the at least part of the faces, speech recognition on the audio portion corresponding to that face to obtain text representing the audio portion corresponding to that face; and a text association module configured to associate the text with that face.
Exemplarily, the device 700 for processing videos and related audios may further comprise an output module configured to output expected information, where the expected information comprises one or more of the following items: the video, the audio, a video frame containing a specific face among the one or more faces, the acquisition time of the video frame containing the specific face, the audio portion corresponding to the specific face, and the acquisition time of the audio portion corresponding to the specific face.
According to another aspect of the present invention, a retrieval device is provided. Fig. 8 shows a schematic block diagram of a retrieval device 800 according to an embodiment of the present invention. The retrieval device 800 comprises a receiving module 810, a lookup module 820 and an output module 830.
The receiving module 810 is configured to receive a retrieval instruction for a target face.
The lookup module 820 is configured to look up related information of the target face in a database according to the retrieval instruction, where the database stores the video and audio processed by the device for processing videos and related audios described above and/or the audio portion corresponding to each of the at least part of the faces, and where the related information of the target face comprises one or more of the following items: a video frame containing the target face, the acquisition time of the video frame containing the target face, the audio portion corresponding to the target face, and the acquisition time of the audio portion corresponding to the target face.
The output module 830 is configured to output the related information of the target face.
The embodiments of the retrieval method 600 have been described above; those skilled in the art can understand the structure, operation and advantages of the retrieval device 800 from the above description in conjunction with Fig. 6, so they are not repeated here.
Those of ordinary skill in the art will recognize that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, or in a combination of computer software and electronic hardware. Whether these functions are executed in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled artisans may implement the described functions in different ways for each particular application, but such implementations should not be considered to go beyond the scope of the present invention.
Fig. 9 shows a schematic block diagram of a system 900 for processing videos and related audios according to an embodiment of the present invention. The system 900 for processing videos and related audios comprises a video acquisition device 910, an audio acquisition device 920, a storage device 930 and a processor 940.
The video acquisition device 910 is configured to collect a video containing the faces of objects. The audio acquisition device 920 is configured to collect audio containing the voices of the objects.
The storage device 930 stores program code for realizing the corresponding steps of the method for processing videos and related audios according to the embodiments of the present invention.
The processor 940 is configured to run the program code stored in the storage device 930, to perform the corresponding steps of the method for processing videos and related audios according to the embodiments of the present invention, and to realize the first acquisition module 710, the face detection module 720, the second acquisition module 730, the audio portion determination module 740 and the audio association module 750 in the device 700 for processing videos and related audios according to the embodiments of the present invention.
In one embodiment, the following steps are performed when the program code is run by the processor 940: obtaining a video containing one or more faces of one or more objects; performing face detection on each video frame in the video to identify the one or more faces; obtaining audio, collected in the same time period as the video, that contains the voices of at least part of the one or more objects; determining, for each of at least part of the one or more faces, the audio portion in the audio corresponding to that face; and associating that face with the corresponding audio portion, where the at least part of the faces belong respectively to the at least part of the objects.
In one embodiment, the following steps are also performed when the program code is run by the processor 940: for each of the at least part of the faces, segmenting the video according to the mouth action of that face to obtain the initial video segment corresponding to that face; segmenting the audio according to the speech features in the audio to obtain the initial audio segment corresponding to that face; and obtaining, according to the initial video segment and initial audio segment corresponding to that face, the effective video segment in the video corresponding to that face and the effective audio segment in the audio corresponding to that face. The step, performed when the program code is run by the processor 940, of determining for each of the at least part of the faces the audio portion in the audio corresponding to that face comprises: for each of the at least part of the faces, determining the effective audio segment corresponding to that face as the audio portion corresponding to that face.
In one embodiment, the step, performed when the program code is run by the processor 940, of associating each of the at least part of the faces with the corresponding audio portion comprises: for each of the at least part of the faces, for each effective video segment corresponding to that face, selecting the video frame with the best face quality from all video frames of the effective video segment; and associating the selected video frame with the effective audio segment corresponding to the effective video segment, to form a video/audio combination.
In one embodiment, the following steps are also performed when the program code is run by the processor 940: for the face corresponding to a particular video/audio combination, performing face feature extraction on the video frame in the particular video/audio combination to obtain a specific face feature, where the particular video/audio combination is one of all the video/audio combinations corresponding to the at least part of the faces; performing sound feature extraction on the effective audio segment in the particular video/audio combination to obtain a specific sound feature; for each of the remaining video/audio combinations among all the video/audio combinations, calculating the face similarity between the specific face feature and the face feature corresponding to that video/audio combination; calculating the sound similarity between the specific sound feature and the sound feature corresponding to that video/audio combination; calculating the mean of the face similarity and the sound similarity between the particular video/audio combination and that video/audio combination, to obtain the average similarity between the two; and, if the average similarity between the particular video/audio combination and that video/audio combination is greater than a similarity threshold, classifying the particular video/audio combination and that video/audio combination to the same object.
In one embodiment, performedly when described program code is run by described processor 940 obtain in video for each at least part of face according to initial video section corresponding with this face and initial audio section, in the effective video section corresponding with this face and audio frequency, the step of the effective audio section corresponding with this face comprises: for each at least part of face, the initial video section corresponding with this face is defined as the effective video section corresponding with this face, and the initial audio section corresponding with this face is defined as the effective audio section corresponding with this face.
In one embodiment, when described program code is run by described processor 940, the performed step obtaining in video, corresponding with this face effective video section and audio frequency, corresponding with this face effective audio section according to initial video section corresponding with this face and initial audio section for each at least part of face comprises: for each at least part of face, and the split time according to the initial video section corresponding with this face and initial audio section is determined to unify split time; Segmentation is unified to Audio and Video, to obtain the effective video section corresponding with this face and effective audio section according to unified split time.
In one embodiment, audio frequency is gathered by unified microphone, performedly when described program code is run by described processor 940 according to the phonetic feature in audio frequency, segmentation is carried out to audio frequency for each at least part of face and comprise with the step obtaining the initial audio section corresponding with this face: according to the phonetic feature in audio frequency, segmentation is carried out to audio frequency, to obtain mixed audio piece; And for each at least part of face, from mixed audio piece, select mixed audio piece that initial video section on acquisition time and corresponding with this face is consistent as the initial audio section corresponding with this face.
In one embodiment, the audio comprises one or more audio channels respectively collected by one or more directional microphones, and the following step is also performed when the program code is run by the processor 940: controlling the one or more directional microphones to respectively face the at least part of the objects so as to collect the one or more audio channels. The step, performed when the program code is run by the processor 940, of segmenting, for each of the at least part of the faces, the audio according to the voice features in the audio to obtain the initial audio segment corresponding to this face comprises: for each of the at least part of the faces, segmenting the audio channel collected by the directional microphone facing the object corresponding to this face according to the voice features in this channel, to obtain the initial audio segment corresponding to this face.
In one embodiment, the number of the directional microphones is equal to or greater than the number of the one or more faces.
In one embodiment, the following steps are also performed when the program code is run by the processor 940: determining the priority of each face according to the face features and/or actions of the one or more faces; and determining, according to the priority of each face, the objects that the one or more directional microphones are to face as the at least part of the objects.
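A possible reading of this embodiment, sketched below: faces are ranked by a caller-supplied score and the available microphones are pointed at the top-ranked ones. The score_face function is hypothetical, since the embodiment leaves the exact priority criterion (face features and/or actions) open.

```python
def assign_microphones(faces, num_mics, score_face):
    """Point the available directional microphones at the highest-priority faces.

    score_face: caller-supplied callable rating a face by its features
    and/or actions (e.g. mouth movement); the scoring is an assumption.
    """
    ranked = sorted(faces, key=score_face, reverse=True)  # priority order
    targets = ranked[:num_mics]                           # faces that get a mic
    return {mic_id: face for mic_id, face in enumerate(targets)}
```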
In one embodiment, the step, performed when the program code is run by the processor 940, of segmenting, for each of the at least part of the faces, the video according to the mouth action of this face is implemented according to the following rule: for each of the at least part of the faces, if the mouth of this face changes from a closed state to an open state at a first moment and has remained in the closed state throughout a first predetermined period before the first moment, the first moment is taken as a video segmentation start time; if the mouth of this face changes from the open state to the closed state at a second moment and remains in the closed state throughout a second predetermined period after the second moment, the second moment is taken as a video segmentation end time, wherein the part of the video between an adjacent video segmentation start time and video segmentation end time is the initial video segment.
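The rule reads as a small state machine over per-frame mouth states. A minimal sketch, assuming a boolean mouth_open sequence (for example from landmark-based mouth-opening detection) and 0.5 s defaults for the first and second predetermined periods; the durations are not fixed by the embodiment.

```python
def segment_by_mouth(mouth_open, fps, pre_closed=0.5, post_closed=0.5):
    """Return (start, end) initial video segments in seconds."""
    pre_n = int(pre_closed * fps)    # first predetermined period, in frames
    post_n = int(post_closed * fps)  # second predetermined period, in frames
    segments, start = [], None
    for i in range(1, len(mouth_open)):
        if mouth_open[i] and not mouth_open[i - 1]:
            # closed -> open; require closure throughout the preceding period
            if i >= pre_n and not any(mouth_open[i - pre_n:i]):
                start = i
        elif not mouth_open[i] and mouth_open[i - 1] and start is not None:
            # open -> closed; require closure throughout the following period
            if not any(mouth_open[i:i + post_n]):
                segments.append((start / fps, i / fps))
                start = None
    return segments
```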
In one embodiment, the step, performed when the program code is run by the processor 940, of segmenting, for each of the at least part of the faces, the audio according to the voice features in the audio is implemented according to the following rule: if the voice in the audio changes from an unvoiced state to a voiced state at a third moment and has remained in the unvoiced state throughout a third predetermined period before the third moment, the third moment is taken as an audio segmentation start time; if the voice in the audio changes from the voiced state to the unvoiced state at a fourth moment and remains in the unvoiced state throughout a fourth predetermined period after the fourth moment, the fourth moment is taken as an audio segmentation end time, wherein the part of the audio between an adjacent audio segmentation start time and audio segmentation end time is the initial audio segment.
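Analogously, a minimal sketch of the audio rule, using short-time frame energy as a stand-in voice activity detector; the energy threshold, the 20 ms frame size, and the 0.5 s predetermined periods are all assumptions.

```python
import numpy as np

def segment_by_voicing(samples, sr, frame_ms=20, energy_thresh=1e-3,
                       pre_silence=0.5, post_silence=0.5):
    """Return (start, end) initial audio segments in seconds."""
    hop = int(sr * frame_ms / 1000)
    frames = [samples[i:i + hop] for i in range(0, len(samples) - hop, hop)]
    voiced = [float(np.mean(np.square(f))) > energy_thresh for f in frames]
    pre_n = int(pre_silence * 1000 / frame_ms)    # third predetermined period
    post_n = int(post_silence * 1000 / frame_ms)  # fourth predetermined period
    segments, start = [], None
    for i in range(1, len(voiced)):
        if voiced[i] and not voiced[i - 1]:
            # unvoiced -> voiced; require silence throughout the prior period
            if i >= pre_n and not any(voiced[i - pre_n:i]):
                start = i
        elif not voiced[i] and voiced[i - 1] and start is not None:
            # voiced -> unvoiced; require silence throughout the next period
            if not any(voiced[i:i + post_n]):
                segments.append((start * hop / sr, i * hop / sr))
                start = None
    return segments
```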
In one embodiment, the following steps are also performed when the program code is run by the processor 940: for each of the at least part of the faces, performing speech recognition on the audio portion corresponding to this face to obtain text representing the audio portion corresponding to this face; and associating the text with this face.
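A minimal sketch of the transcription-and-association step; the transcribe callable is a hypothetical hook, as the embodiment does not mandate any particular speech recognizer.

```python
def associate_text(face_audio: dict, transcribe) -> dict:
    """Associate each face with the text of its audio portion.

    face_audio: {face_id: audio_portion} mapping built in earlier steps.
    transcribe: any speech-to-text callable (hypothetical hook); plug in
    whatever ASR engine the deployment actually uses.
    """
    return {
        face_id: {"audio": portion, "text": transcribe(portion)}
        for face_id, portion in face_audio.items()
    }
```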
In one embodiment, the following step is also performed when the program code is run by the processor 940: outputting desired information, wherein the desired information comprises one or more of the following items: the video, the audio, the video frames containing a specific face among the one or more faces, the acquisition times of the video frames containing the specific face, the audio portion corresponding to the specific face, and the acquisition time of the audio portion corresponding to the specific face.
In addition, according to an embodiment of the present invention, there is also provided a storage medium storing program instructions which, when run by a computer or processor, perform the corresponding steps of the method for processing video and related audio of the embodiment of the present invention, and implement the corresponding modules in the device for processing video and related audio according to the embodiment of the present invention. The storage medium may comprise, for example, a memory card of a smart phone, a storage unit of a tablet computer, a hard disk of a personal computer, a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a portable compact disc read-only memory (CD-ROM), a USB memory, or any combination of the above storage media.
In one embodiment, the computer program instructions, when run by a computer, can implement the functional modules of the device for processing video and related audio according to the embodiment of the present invention, and/or can perform the method for processing video and related audio according to the embodiment of the present invention.
In one embodiment, the computer program instructions, when run by a computer, perform the following steps: obtaining a video containing one or more faces of one or more objects; performing face detection on each video frame in the video to identify the one or more faces; obtaining an audio, collected in the same time period as the video, that contains the voices of at least part of the one or more objects; for each of at least part of the one or more faces, determining the audio portion in the audio corresponding to this face; and associating this face with the corresponding audio portion, wherein the at least part of the faces respectively belong to the at least part of the objects.
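Putting these steps together, the overall flow can be sketched as follows; detect_faces and audio_portion_for_face are assumed callbacks (a detector/tracker that yields stable face IDs, and the segment-matching logic illustrated after the other embodiments), not components fixed by the text.

```python
def process_video_and_audio(video_frames, audio, detect_faces,
                            audio_portion_for_face):
    """Associate each detected face with its audio portion."""
    # Steps 1-2: face detection on every video frame of the video.
    detections = {}  # face_id -> list of (frame_index, bounding_box)
    for idx, frame in enumerate(video_frames):
        for face_id, box in detect_faces(frame):
            detections.setdefault(face_id, []).append((idx, box))
    # Steps 3-5: for each face, find and associate its audio portion.
    associations = {}
    for face_id in detections:
        portion = audio_portion_for_face(face_id, detections[face_id], audio)
        if portion is not None:
            associations[face_id] = portion
    return associations
```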
In one embodiment, the computer program instructions, when run by the computer, also perform the following steps: for each of the at least part of the faces, segmenting the video according to the mouth action of this face to obtain the initial video segment corresponding to this face; segmenting the audio according to the voice features in the audio to obtain the initial audio segment corresponding to this face; and obtaining the effective video segment in the video and the effective audio segment in the audio corresponding to this face according to the initial video segment and initial audio segment corresponding to this face. The step, performed when the computer program instructions are run by the computer, of determining, for each of at least part of the one or more faces, the audio portion in the audio corresponding to this face comprises: for each of the at least part of the faces, determining the effective audio segment corresponding to this face as the audio portion corresponding to this face.
In one embodiment, the step, performed when the computer program instructions are run by the computer, of associating, for each of at least part of the one or more faces, this face with the corresponding audio portion comprises: for each of the at least part of the faces, and for each effective video segment corresponding to this face, selecting the video frame with the best face quality from all video frames of this effective video segment; and associating the selected video frame with the effective audio segment corresponding to this effective video segment to form one video-audio combination.
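A minimal sketch of the best-face-quality selection, using Laplacian-variance sharpness weighted by face area as the quality score; the metric is an assumption, since the embodiment does not prescribe how face quality is measured (OpenCV is assumed available).

```python
import cv2  # OpenCV

def face_quality(frame, box):
    """Score a face crop: sharper and larger faces score higher (assumed metric)."""
    x, y, w, h = box
    crop = cv2.cvtColor(frame[y:y + h, x:x + w], cv2.COLOR_BGR2GRAY)
    sharpness = cv2.Laplacian(crop, cv2.CV_64F).var()  # blur measure
    return sharpness * (w * h)

def best_face_frame(frames_with_boxes):
    """frames_with_boxes: list of (frame, box) inside one effective segment."""
    return max(frames_with_boxes, key=lambda fb: face_quality(*fb))
```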
In one embodiment, the computer program instructions, when run by the computer, also perform the following steps: for the face corresponding to a specific video-audio combination, performing face feature extraction on the video frame in the specific video-audio combination to obtain a specific face feature, wherein the specific video-audio combination is one of all the video-audio combinations corresponding to the at least part of the faces; performing voice feature extraction on the effective audio segment in the specific video-audio combination to obtain a specific voice feature; for each of the remaining video-audio combinations among all the video-audio combinations: calculating the face similarity between the specific face feature and the face feature corresponding to this video-audio combination; calculating the voice similarity between the specific voice feature and the voice feature corresponding to this video-audio combination; calculating the mean of the face similarity and the voice similarity between the specific video-audio combination and this video-audio combination to obtain the average similarity between the two; and, if the average similarity between the specific video-audio combination and this video-audio combination is greater than a similarity threshold, attributing the specific video-audio combination and this video-audio combination to the same object.
In one embodiment, the step, performed when the computer program instructions are run by the computer, of obtaining, for each of the at least part of the faces, the effective video segment in the video and the effective audio segment in the audio corresponding to this face according to the initial video segment and initial audio segment corresponding to this face comprises: for each of the at least part of the faces, determining the initial video segment corresponding to this face as the effective video segment corresponding to this face, and determining the initial audio segment corresponding to this face as the effective audio segment corresponding to this face.
In one embodiment, the step, performed when the computer program instructions are run by the computer, of obtaining, for each of the at least part of the faces, the effective video segment in the video and the effective audio segment in the audio corresponding to this face according to the initial video segment and initial audio segment corresponding to this face comprises: for each of the at least part of the faces, determining a unified segmentation time according to the segmentation times of the initial video segment and the initial audio segment corresponding to this face; and segmenting the video and the audio uniformly according to the unified segmentation time to obtain the effective video segment and the effective audio segment corresponding to this face.
In one embodiment, the audio is collected by a single common microphone, and the step, performed when the computer program instructions are run by the computer, of segmenting, for each of the at least part of the faces, the audio according to the voice features in the audio to obtain the initial audio segment corresponding to this face comprises: segmenting the audio according to the voice features in the audio to obtain mixed audio segments; and, for each of the at least part of the faces, selecting, from the mixed audio segments, the mixed audio segment whose acquisition time is consistent with that of the initial video segment corresponding to this face as the initial audio segment corresponding to this face.
In one embodiment, the audio comprises one or more audio channels respectively collected by one or more directional microphones, and the following step is also performed when the computer program instructions are run by the computer: controlling the one or more directional microphones to respectively face the at least part of the objects so as to collect the one or more audio channels. The step, performed when the computer program instructions are run by the computer, of segmenting, for each of the at least part of the faces, the audio according to the voice features in the audio to obtain the initial audio segment corresponding to this face comprises: for each of the at least part of the faces, segmenting the audio channel collected by the directional microphone facing the object corresponding to this face according to the voice features in this channel, to obtain the initial audio segment corresponding to this face.
In one embodiment, the number of the directional microphones is equal to or greater than the number of the one or more faces.
In one embodiment, the following steps are also performed when the computer program instructions are run by the computer: determining the priority of each face according to the face features and/or actions of the one or more faces; and determining, according to the priority of each face, the objects that the one or more directional microphones are to face as the at least part of the objects.
In one embodiment, the step, performed when the computer program instructions are run by the computer, of segmenting, for each of the at least part of the faces, the video according to the mouth action of this face is implemented according to the following rule: for each of the at least part of the faces, if the mouth of this face changes from a closed state to an open state at a first moment and has remained in the closed state throughout a first predetermined period before the first moment, the first moment is taken as a video segmentation start time; if the mouth of this face changes from the open state to the closed state at a second moment and remains in the closed state throughout a second predetermined period after the second moment, the second moment is taken as a video segmentation end time, wherein the part of the video between an adjacent video segmentation start time and video segmentation end time is the initial video segment.
In one embodiment, the step, performed when the computer program instructions are run by the computer, of segmenting, for each of the at least part of the faces, the audio according to the voice features in the audio is implemented according to the following rule: if the voice in the audio changes from an unvoiced state to a voiced state at a third moment and has remained in the unvoiced state throughout a third predetermined period before the third moment, the third moment is taken as an audio segmentation start time; if the voice in the audio changes from the voiced state to the unvoiced state at a fourth moment and remains in the unvoiced state throughout a fourth predetermined period after the fourth moment, the fourth moment is taken as an audio segmentation end time, wherein the part of the audio between an adjacent audio segmentation start time and audio segmentation end time is the initial audio segment.
In one embodiment, the following steps are also performed when the computer program instructions are run by the computer: for each of the at least part of the faces, performing speech recognition on the audio portion corresponding to this face to obtain text representing the audio portion corresponding to this face; and associating the text with this face.
In one embodiment, the following step is also performed when the computer program instructions are run by the computer: outputting desired information, wherein the desired information comprises one or more of the following items: the video, the audio, the video frames containing a specific face among the one or more faces, the acquisition times of the video frames containing the specific face, the audio portion corresponding to the specific face, and the acquisition time of the audio portion corresponding to the specific face.
The modules in the system for processing video and related audio according to the embodiment of the present invention can be realized by running, on the processor of the electronic device for processing video and related audio according to the embodiment of the present invention, computer program instructions stored in a memory, or can be realized when computer instructions stored in the computer-readable storage medium of the computer program product according to the embodiment of the present invention are run by a computer.
According to the method and device for processing video and related audio, the retrieval method and device, the system for processing video and related audio, and the storage medium of the embodiments of the present invention, by associating the face of an object with the object's voice, the speaking time and speaking content of the object can be determined, which facilitates the user's later review and retrieval of the object's speaking content.
Although example embodiments have been described here with reference to the drawings, it should be understood that the above example embodiments are merely exemplary and are not intended to limit the scope of the present invention thereto. Those of ordinary skill in the art can make various changes and modifications therein without departing from the scope and spirit of the present invention. All such changes and modifications are intended to be included within the scope of the present invention as required by the appended claims.
Those of ordinary skill in the art can recognize that the units and algorithm steps of the examples described in conjunction with the embodiments disclosed herein can be realized with electronic hardware or with a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the particular application and design constraints of the technical solution. Skilled persons may use different methods for each particular application to realize the described functions, but such realization should not be considered as going beyond the scope of the present invention.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method can be realized in other ways. For example, the apparatus embodiments described above are merely schematic; for instance, the division of the units is merely a logical functional division, and other division manners are possible in actual implementation; for example, multiple units or components may be combined or integrated into another device, or some features may be ignored or not performed.
Many details are described in the specification provided herein. However, it can be understood that embodiments of the invention can be practiced without these details. In some instances, well-known methods, structures, and techniques are not shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to simplify the present invention and to help understand one or more of the various inventive aspects, in the description of exemplary embodiments of the present invention, the features of the present invention are sometimes grouped together into a single embodiment, figure, or description thereof. However, this method of disclosure should not be construed as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the corresponding claims reflect, the inventive point lies in solving the corresponding technical problem with fewer than all features of a single disclosed embodiment. Therefore, the claims following the detailed description are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the present invention.
It will be appreciated by those skilled in the art that, except where such features are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or apparatus so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by alternative features serving the same, equivalent, or similar purpose.
In addition, those skilled in the art will understand that, although some embodiments described herein include some features included in other embodiments rather than other features, combinations of features of different embodiments are meant to be within the scope of the present invention and to form different embodiments. For example, in the claims, any one of the claimed embodiments can be used in any combination.
The component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art should understand that a microprocessor or a digital signal processor (DSP) may be used in practice to realize some or all of the functions of some modules in the apparatus according to the embodiment of the present invention. The present invention may also be implemented as a device program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the present invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments describe rather than limit the present invention, and those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The present invention can be realized by means of hardware including several different elements and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices can be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order; these words may be interpreted as names.
The above is only the specific embodiments of the present invention or an explanation thereof, and the protection scope of the present invention is not limited thereto. Any person familiar with the technical field can readily conceive of changes or substitutions within the technical scope disclosed by the present invention, and all such changes or substitutions shall be encompassed within the protection scope of the present invention. The protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (30)

1. A method for processing video and related audio, comprising:
obtaining a video containing one or more faces of one or more objects;
performing face detection on each video frame in the video to identify the one or more faces;
obtaining an audio, collected in the same time period as the video, that contains the voices of at least part of the one or more objects;
for each of at least part of the one or more faces,
determining the audio portion in the audio corresponding to this face;
associating this face with the corresponding audio portion,
wherein the at least part of the faces respectively belong to the at least part of the objects.
2. The method of claim 1, wherein,
before the determining, for each of at least part of the one or more faces, the audio portion in the audio corresponding to this face, the method further comprises:
for each of the at least part of the faces,
segmenting the video according to the mouth action of this face to obtain the initial video segment corresponding to this face;
segmenting the audio according to the voice features in the audio to obtain the initial audio segment corresponding to this face; and
obtaining the effective video segment in the video and the effective audio segment in the audio corresponding to this face according to the initial video segment and the initial audio segment corresponding to this face;
and wherein the determining, for each of at least part of the one or more faces, the audio portion in the audio corresponding to this face comprises:
for each of the at least part of the faces, determining the effective audio segment corresponding to this face as the audio portion corresponding to this face.
3. The method of claim 2, wherein the associating, for each of at least part of the one or more faces, this face with the corresponding audio portion comprises:
for each of the at least part of the faces,
for each effective video segment corresponding to this face, selecting the video frame with the best face quality from all video frames of this effective video segment;
associating the selected video frame with the effective audio segment corresponding to this effective video segment to form one video-audio combination.
4. The method of claim 3, wherein the method further comprises:
for the face corresponding to a specific video-audio combination, performing face feature extraction on the video frame in the specific video-audio combination to obtain a specific face feature, wherein the specific video-audio combination is one of all the video-audio combinations corresponding to the at least part of the faces;
performing voice feature extraction on the effective audio segment in the specific video-audio combination to obtain a specific voice feature;
for each of the remaining video-audio combinations among all the video-audio combinations,
calculating the face similarity between the specific face feature and the face feature corresponding to this video-audio combination;
calculating the voice similarity between the specific voice feature and the voice feature corresponding to this video-audio combination;
calculating the mean of the face similarity and the voice similarity between the specific video-audio combination and this video-audio combination to obtain the average similarity between the specific video-audio combination and this video-audio combination;
if the average similarity between the specific video-audio combination and this video-audio combination is greater than a similarity threshold, attributing the specific video-audio combination and this video-audio combination to the same object.
5. The method of any one of claims 2 to 4, wherein
the obtaining, for each of the at least part of the faces, the effective video segment in the video and the effective audio segment in the audio corresponding to this face according to the initial video segment and initial audio segment corresponding to this face comprises:
for each of the at least part of the faces, determining the initial video segment corresponding to this face as the effective video segment corresponding to this face, and determining the initial audio segment corresponding to this face as the effective audio segment corresponding to this face.
6. The method of any one of claims 2 to 4, wherein the obtaining, for each of the at least part of the faces, the effective video segment in the video and the effective audio segment in the audio corresponding to this face according to the initial video segment and initial audio segment corresponding to this face comprises:
for each of the at least part of the faces,
determining a unified segmentation time according to the segmentation times of the initial video segment and the initial audio segment corresponding to this face;
segmenting the video and the audio uniformly according to the unified segmentation time to obtain the effective video segment and the effective audio segment corresponding to this face.
7. The method of claim 2, wherein the audio is collected by a single common microphone,
and the segmenting, for each of the at least part of the faces, the audio according to the voice features in the audio to obtain the initial audio segment corresponding to this face comprises:
segmenting the audio according to the voice features in the audio to obtain mixed audio segments; and
for each of the at least part of the faces, selecting, from the mixed audio segments, the mixed audio segment whose acquisition time is consistent with that of the initial video segment corresponding to this face as the initial audio segment corresponding to this face.
8. The method of claim 2, wherein the audio comprises one or more audio channels respectively collected by one or more directional microphones,
before the obtaining of the audio, collected in the same time period as the video, that contains the voices of at least part of the one or more objects, the method further comprises:
controlling the one or more directional microphones to respectively face the at least part of the objects so as to collect the one or more audio channels;
and the segmenting, for each of the at least part of the faces, the audio according to the voice features in the audio to obtain the initial audio segment corresponding to this face comprises:
for each of the at least part of the faces, segmenting the audio channel collected by the directional microphone facing the object corresponding to this face according to the voice features in this channel, to obtain the initial audio segment corresponding to this face.
9. The method of claim 8, wherein the number of the directional microphones is equal to or greater than the number of the one or more faces.
10. The method of claim 8 or 9, wherein, before the controlling of the one or more directional microphones to respectively face the at least part of the objects so as to collect the one or more audio channels, the method further comprises:
determining the priority of each face according to the face features and/or actions of the one or more faces; and
determining, according to the priority of each face, the objects that the one or more directional microphones are to face as the at least part of the objects.
11. The method of claim 2, wherein the segmenting, for each of the at least part of the faces, of the video according to the mouth action of this face is implemented according to the following rule:
for each of the at least part of the faces, if the mouth of this face changes from a closed state to an open state at a first moment and has remained in the closed state throughout a first predetermined period before the first moment, the first moment is taken as a video segmentation start time; if the mouth of this face changes from the open state to the closed state at a second moment and remains in the closed state throughout a second predetermined period after the second moment, the second moment is taken as a video segmentation end time,
wherein the part of the video between an adjacent video segmentation start time and video segmentation end time is the initial video segment.
12. The method of claim 2, wherein the segmenting, for each of the at least part of the faces, of the audio according to the voice features in the audio is implemented according to the following rule:
if the voice in the audio changes from an unvoiced state to a voiced state at a third moment and has remained in the unvoiced state throughout a third predetermined period before the third moment, the third moment is taken as an audio segmentation start time; if the voice in the audio changes from the voiced state to the unvoiced state at a fourth moment and remains in the unvoiced state throughout a fourth predetermined period after the fourth moment, the fourth moment is taken as an audio segmentation end time,
wherein the part of the audio between an adjacent audio segmentation start time and audio segmentation end time is the initial audio segment.
13. The method of claim 1, wherein, after the determining, for each of at least part of the one or more faces, the audio portion in the audio corresponding to this face, the method further comprises:
for each of the at least part of the faces,
performing speech recognition on the audio portion corresponding to this face to obtain text representing the audio portion corresponding to this face;
associating the text with this face.
14. The method of claim 1, wherein the method further comprises: outputting desired information,
wherein the desired information comprises one or more of the following items: the video, the audio, the video frames containing a specific face among the one or more faces, the acquisition times of the video frames containing the specific face, the audio portion corresponding to the specific face, and the acquisition time of the audio portion corresponding to the specific face.
15. A retrieval method, comprising:
receiving a retrieval instruction for a target face;
searching a database for related information of the target face according to the retrieval instruction; and
outputting the related information of the target face;
wherein the database is used to store the video and the audio processed by the method for processing video and related audio according to any one of claims 1 to 14 and/or the audio portions corresponding to each of the at least part of the faces,
and wherein the related information of the target face comprises one or more of the following items: the video frames containing the target face, the acquisition times of the video frames containing the target face, the audio portion corresponding to the target face, and the acquisition time of the audio portion corresponding to the target face.
16. A device for processing video and related audio, comprising:
a first obtaining module for obtaining a video containing one or more faces of one or more objects;
a face detection module for performing face detection on each video frame in the video to identify the one or more faces;
a second obtaining module for obtaining an audio, collected in the same time period as the video, that contains the voices of at least part of the one or more objects;
an audio portion determination module for determining, for each of at least part of the one or more faces, the audio portion in the audio corresponding to this face, wherein the at least part of the faces respectively belong to the at least part of the objects; and
an audio association module for associating, for each of the at least part of the faces, this face with the corresponding audio portion.
17. The device of claim 16, wherein
the device further comprises:
a video segmentation module for segmenting, for each of the at least part of the faces, the video according to the mouth action of this face to obtain the initial video segment corresponding to this face;
an audio segmentation module for segmenting, for each of the at least part of the faces, the audio according to the voice features in the audio to obtain the initial audio segment corresponding to this face; and
an effective video and audio obtaining module for obtaining the effective video segment in the video and the effective audio segment in the audio corresponding to this face according to the initial video segment and initial audio segment corresponding to this face;
and the audio portion determination module comprises a determination submodule for determining, for each of the at least part of the faces, the effective audio segment corresponding to this face as the audio portion corresponding to this face.
18. The device of claim 17, wherein the audio association module comprises:
a video frame selection submodule for selecting, for each of the at least part of the faces and for each effective video segment corresponding to this face, the video frame with the best face quality from all video frames of this effective video segment; and
an association submodule for associating the selected video frame with the effective audio segment corresponding to this effective video segment to form one video-audio combination.
19. The device of claim 18, wherein the device further comprises:
a face feature extraction module for performing, for the face corresponding to a specific video-audio combination, face feature extraction on the video frame in the specific video-audio combination to obtain a specific face feature, wherein the specific video-audio combination is one of all the video-audio combinations corresponding to the at least part of the faces;
a voice feature extraction module for performing voice feature extraction on the effective audio segment in the specific video-audio combination to obtain a specific voice feature;
a face similarity calculation module for calculating, for each of the remaining video-audio combinations among all the video-audio combinations, the face similarity between the specific face feature and the face feature corresponding to this video-audio combination;
a voice similarity calculation module for calculating, for each of the remaining video-audio combinations, the voice similarity between the specific voice feature and the voice feature corresponding to this video-audio combination;
an average similarity calculation module for calculating, for each of the remaining video-audio combinations, the mean of the face similarity and the voice similarity between the specific video-audio combination and this video-audio combination to obtain the average similarity between the specific video-audio combination and this video-audio combination;
a classification module for attributing, for each of the remaining video-audio combinations, the specific video-audio combination and this video-audio combination to the same object if the average similarity between the specific video-audio combination and this video-audio combination is greater than a similarity threshold.
20. The device of any one of claims 17 to 19, wherein the effective video and audio obtaining module comprises:
an effective video segment determination submodule for determining, for each of the at least part of the faces, the initial video segment corresponding to this face as the effective video segment corresponding to this face; and
an effective audio segment determination submodule for determining, for each of the at least part of the faces, the initial audio segment corresponding to this face as the effective audio segment corresponding to this face.
21. The device of any one of claims 17 to 19, wherein the effective video and audio obtaining module comprises:
a unified segmentation time determination submodule for determining, for each of the at least part of the faces, a unified segmentation time according to the segmentation times of the initial video segment and the initial audio segment corresponding to this face;
a unified segmentation submodule for segmenting the video and the audio uniformly according to the unified segmentation time to obtain the effective video segment and the effective audio segment corresponding to this face.
22. The device of claim 17, wherein the audio is collected by a single common microphone,
and the audio segmentation module comprises:
a first segmentation submodule for segmenting the audio according to the voice features in the audio to obtain mixed audio segments; and
an audio segment selection submodule for selecting, for each of the at least part of the faces, from the mixed audio segments, the mixed audio segment whose acquisition time is consistent with that of the initial video segment corresponding to this face as the initial audio segment corresponding to this face.
23. The device of claim 17, wherein the audio comprises one or more audio channels respectively collected by one or more directional microphones,
the device further comprises:
a control module for controlling the one or more directional microphones to respectively face the at least part of the objects so as to collect the one or more audio channels;
and the audio segmentation module comprises:
a second segmentation submodule for segmenting, for each of the at least part of the faces, the audio channel collected by the directional microphone facing the object corresponding to this face according to the voice features in this channel, to obtain the initial audio segment corresponding to this face.
24. The device of claim 23, wherein the number of the directional microphones is equal to or greater than the number of the one or more faces.
25. The device of claim 23 or 24, wherein the device further comprises:
a priority determination module for determining the priority of each face according to the face features and/or actions of the one or more faces; and
an object determination module for determining, according to the priority of each face, the objects that the one or more directional microphones are to face as the at least part of the objects.
26. The device of claim 17, wherein the video segmentation module segments the video according to the following rule:
for each of the at least part of the faces, if the mouth of this face changes from a closed state to an open state at a first moment and has remained in the closed state throughout a first predetermined period before the first moment, the first moment is taken as a video segmentation start time; if the mouth of this face changes from the open state to the closed state at a second moment and remains in the closed state throughout a second predetermined period after the second moment, the second moment is taken as a video segmentation end time,
wherein the part of the video between an adjacent video segmentation start time and video segmentation end time is the initial video segment.
27. The device of claim 17, wherein the audio segmentation module segments the audio according to the following rule:
if the voice in the audio changes from an unvoiced state to a voiced state at a third moment and has remained in the unvoiced state throughout a third predetermined period before the third moment, the third moment is taken as an audio segmentation start time; if the voice in the audio changes from the voiced state to the unvoiced state at a fourth moment and remains in the unvoiced state throughout a fourth predetermined period after the fourth moment, the fourth moment is taken as an audio segmentation end time,
wherein the part of the audio between an adjacent audio segmentation start time and audio segmentation end time is the initial audio segment.
28. The device of claim 16, wherein the device further comprises:
a speech recognition module for performing, for each of the at least part of the faces, speech recognition on the audio portion corresponding to this face to obtain text representing the audio portion corresponding to this face; and
a text association module for associating the text with this face.
29. The device of claim 16, wherein the device further comprises an output module for outputting desired information,
wherein the desired information comprises one or more of the following items: the video, the audio, the video frames containing a specific face among the one or more faces, the acquisition times of the video frames containing the specific face, the audio portion corresponding to the specific face, and the acquisition time of the audio portion corresponding to the specific face.
30. A retrieval device, comprising:
a receiving module for receiving a retrieval instruction for a target face;
a search module for searching a database for related information of the target face according to the retrieval instruction; and
an output module for outputting the related information of the target face;
wherein the database is used to store the video and the audio processed by the device for processing video and related audio according to any one of claims 16 to 29 and/or the audio portions corresponding to each of the at least part of the faces,
and wherein the related information of the target face comprises one or more of the following items: the video frames containing the target face, the acquisition times of the video frames containing the target face, the audio portion corresponding to the target face, and the acquisition time of the audio portion corresponding to the target face.
CN201610058764.5A 2016-01-28 2016-01-28 Method and device for processing video and related audio, and retrieval method and device Active CN105512348B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610058764.5A CN105512348B (en) Method and device for processing video and related audio, and retrieval method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610058764.5A CN105512348B (en) Method and device for processing video and related audio, and retrieval method and device

Publications (2)

Publication Number Publication Date
CN105512348A true CN105512348A (en) 2016-04-20
CN105512348B CN105512348B (en) 2019-03-26

Family

ID=55720328

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610058764.5A Active CN105512348B (en) Method and device for processing video and related audio, and retrieval method and device

Country Status (1)

Country Link
CN (1) CN105512348B (en)

Cited By (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106445654A (en) * 2016-08-31 2017-02-22 北京康力优蓝机器人科技有限公司 Method and device for determining response priorities of control commands
CN106782545A (en) * 2016-12-16 2017-05-31 广州视源电子科技股份有限公司 A kind of system and method that audio, video data is changed into writing record
CN107094139A (en) * 2017-04-12 2017-08-25 黄晓咏 A kind of videoconference communication system
CN107818785A (en) * 2017-09-26 2018-03-20 平安普惠企业管理有限公司 A kind of method and terminal device that information is extracted from multimedia file
CN108924617A (en) * 2018-07-11 2018-11-30 北京大米科技有限公司 The method of synchronizing video data and audio data, storage medium and electronic equipment
CN109087651A (en) * 2018-09-05 2018-12-25 广州势必可赢网络科技有限公司 A kind of vocal print identification method, system and equipment based on video and sound spectrograph
CN109450953A (en) * 2018-12-29 2019-03-08 北京三快在线科技有限公司 A kind of authorization method and device, electronic equipment and computer readable storage medium
CN109582823A (en) * 2018-11-21 2019-04-05 平安科技(深圳)有限公司 Video information chain type storage method, device, computer equipment and storage medium
CN109905764A (en) * 2019-03-21 2019-06-18 广州国音智能科技有限公司 Target person voice intercept method and device in a kind of video
CN110196914A (en) * 2019-07-29 2019-09-03 上海肇观电子科技有限公司 A kind of method and apparatus by face information input database
CN110475093A (en) * 2019-08-16 2019-11-19 北京云中融信网络科技有限公司 A kind of activity scheduling method, device and storage medium
CN110517295A (en) * 2019-08-30 2019-11-29 上海依图信息技术有限公司 A kind of the real-time face trace tracking method and device of combination speech recognition
CN110516755A (en) * 2019-08-30 2019-11-29 上海依图信息技术有限公司 A kind of the body track method for real time tracking and device of combination speech recognition
CN110544270A (en) * 2019-08-30 2019-12-06 上海依图信息技术有限公司 method and device for predicting human face tracking track in real time by combining voice recognition
CN110544491A (en) * 2019-08-30 2019-12-06 上海依图信息技术有限公司 Method and device for real-time association of speaker and voice recognition result thereof
CN110717067A (en) * 2019-12-16 2020-01-21 北京海天瑞声科技股份有限公司 Method and device for processing audio clustering in video
CN111010608A (en) * 2019-12-20 2020-04-14 维沃移动通信有限公司 Video playing method and electronic equipment
CN111078010A (en) * 2019-12-06 2020-04-28 智语科技(江门)有限公司 Man-machine interaction method and device, terminal equipment and readable storage medium
CN111079720A (en) * 2020-01-20 2020-04-28 杭州英歌智达科技有限公司 Face recognition method based on cluster analysis and autonomous relearning
CN111259194A (en) * 2018-11-30 2020-06-09 百度在线网络技术(北京)有限公司 Method and apparatus for determining duplicate video
CN111556254A (en) * 2020-04-10 2020-08-18 早安科技(广州)有限公司 Method, system, medium and intelligent device for video cutting by using video content
CN111613227A (en) * 2020-03-31 2020-09-01 平安科技(深圳)有限公司 Voiceprint data generation method and device, computer device and storage medium
CN111753769A (en) * 2020-06-29 2020-10-09 歌尔科技有限公司 Terminal audio acquisition control method, electronic equipment and readable storage medium
CN111818385A (en) * 2020-07-22 2020-10-23 Oppo广东移动通信有限公司 Video processing method, video processing device and terminal equipment
CN111899743A (en) * 2020-07-31 2020-11-06 斑马网络技术有限公司 Method and device for acquiring target sound, electronic equipment and storage medium
WO2020233068A1 (en) * 2019-05-21 2020-11-26 深圳壹账通智能科技有限公司 Conference audio control method, system, device and computer readable storage medium
CN112153321A (en) * 2019-06-28 2020-12-29 华为技术有限公司 Conference recording method, device and system
CN112487978A (en) * 2020-11-30 2021-03-12 清华珠三角研究院 Method and device for positioning speaker in video and computer storage medium
CN112653902A (en) * 2019-10-10 2021-04-13 阿里巴巴集团控股有限公司 Speaker recognition method and device and electronic equipment
WO2021120190A1 (en) * 2019-12-20 2021-06-24 深圳市欢太科技有限公司 Data processing method and apparatus, electronic device, and storage medium
CN113362849A (en) * 2020-03-02 2021-09-07 阿里巴巴集团控股有限公司 Voice data processing method and device
CN113490058A (en) * 2021-08-20 2021-10-08 云知声(上海)智能科技有限公司 Intelligent subtitle matching system applied to later stage of movie and television
CN113573151A (en) * 2021-09-23 2021-10-29 深圳佳力拓科技有限公司 Digital television playing method and device based on focusing degree value
CN114299953A (en) * 2021-12-29 2022-04-08 湖北微模式科技发展有限公司 Speaker role distinguishing method and system combining mouth movement analysis
TWI764020B (en) * 2019-07-24 2022-05-11 圓展科技股份有限公司 Video conference system and method thereof
CN114640826A (en) * 2022-03-23 2022-06-17 北京有竹居网络技术有限公司 Data processing method, data processing device, readable medium and electronic equipment
CN115396627A (en) * 2022-08-24 2022-11-25 易讯科技股份有限公司 Positioning management method and system for screen recording video conference

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1783998A (en) * 2004-10-30 2006-06-07 微软公司 Automatic face extraction for use in recorded meetings timelines
CN101199208A (en) * 2005-04-13 2008-06-11 皮克索尔仪器公司 Method, system, and program product for measuring audio video synchronization
US20090044155A1 (en) * 2006-03-17 2009-02-12 International Business Machines Corporation Method and system for designing an electronic circuit
CN102572372A (en) * 2011-12-28 2012-07-11 中兴通讯股份有限公司 Extraction method and device for conference summary
CN103299319A (en) * 2011-12-28 2013-09-11 华为技术有限公司 Method and device for analysing video file
CN102752540A (en) * 2011-12-30 2012-10-24 新奥特(北京)视频技术有限公司 Automatic categorization method based on face recognition technology
CN105224925A (en) * 2015-09-30 2016-01-06 努比亚技术有限公司 Video process apparatus, method and mobile terminal

Cited By (57)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106445654A (en) * 2016-08-31 2017-02-22 北京康力优蓝机器人科技有限公司 Method and device for determining response priorities of control commands
CN106445654B (en) * 2016-08-31 2019-06-11 北京康力优蓝机器人科技有限公司 Determine the method and device of responsing control command priority
CN106782545A (en) * 2016-12-16 2017-05-31 广州视源电子科技股份有限公司 A kind of system and method that audio, video data is changed into writing record
CN106782545B (en) * 2016-12-16 2019-07-16 广州视源电子科技股份有限公司 A kind of system and method that audio, video data is converted to writing record
WO2018107605A1 (en) * 2016-12-16 2018-06-21 广州视源电子科技股份有限公司 System and method for converting audio/video data into written records
CN107094139A (en) * 2017-04-12 2017-08-25 黄晓咏 A kind of videoconference communication system
CN107818785A (en) * 2017-09-26 2018-03-20 平安普惠企业管理有限公司 A kind of method and terminal device that information is extracted from multimedia file
CN108924617A (en) * 2018-07-11 2018-11-30 北京大米科技有限公司 The method of synchronizing video data and audio data, storage medium and electronic equipment
CN108924617B (en) * 2018-07-11 2020-09-18 北京大米科技有限公司 Method of synchronizing video data and audio data, storage medium, and electronic device
WO2020010883A1 (en) * 2018-07-11 2020-01-16 北京大米科技有限公司 Method for synchronising video data and audio data, storage medium, and electronic device
CN109087651A (en) * 2018-09-05 2018-12-25 广州势必可赢网络科技有限公司 A kind of vocal print identification method, system and equipment based on video and sound spectrograph
CN109582823A (en) * 2018-11-21 2019-04-05 平安科技(深圳)有限公司 Video information chain type storage method, device, computer equipment and storage medium
WO2020103447A1 (en) * 2018-11-21 2020-05-28 平安科技(深圳)有限公司 Link-type storage method and apparatus for video information, computer device and storage medium
CN111259194B (en) * 2018-11-30 2023-06-23 百度在线网络技术(北京)有限公司 Method and apparatus for determining duplicate video
CN111259194A (en) * 2018-11-30 2020-06-09 百度在线网络技术(北京)有限公司 Method and apparatus for determining duplicate video
CN109450953A (en) * 2018-12-29 2019-03-08 北京三快在线科技有限公司 Authorization method and device, electronic equipment and computer-readable storage medium
CN109905764B (en) * 2019-03-21 2021-08-24 广州国音智能科技有限公司 Method and device for capturing voice of target person in video
CN109905764A (en) * 2019-03-21 2019-06-18 广州国音智能科技有限公司 Method and device for capturing the voice of a target person in video
WO2020233068A1 (en) * 2019-05-21 2020-11-26 深圳壹账通智能科技有限公司 Conference audio control method, system, device and computer readable storage medium
US11974067B2 (en) 2019-06-28 2024-04-30 Huawei Technologies Co., Ltd. Conference recording method and apparatus, and conference recording system
CN112153321A (en) * 2019-06-28 2020-12-29 华为技术有限公司 Conference recording method, device and system
CN112153321B (en) * 2019-06-28 2022-04-05 华为技术有限公司 Conference recording method, device and system
TWI764020B (en) * 2019-07-24 2022-05-11 圓展科技股份有限公司 Video conference system and method thereof
CN110196914B (en) * 2019-07-29 2019-12-27 上海肇观电子科技有限公司 Method and device for inputting face information into database
US10922570B1 (en) 2019-07-29 2021-02-16 NextVPU (Shanghai) Co., Ltd. Entering of human face information into database
CN110196914A (en) * 2019-07-29 2019-09-03 上海肇观电子科技有限公司 Method and apparatus for entering face information into a database
WO2021017096A1 (en) * 2019-07-29 2021-02-04 上海肇观电子科技有限公司 Method and installation for entering facial information into database
CN110475093A (en) * 2019-08-16 2019-11-19 北京云中融信网络科技有限公司 Activity scheduling method, device and storage medium
CN110517295A (en) * 2019-08-30 2019-11-29 上海依图信息技术有限公司 Real-time face trajectory tracking method and device combined with speech recognition
CN110516755A (en) * 2019-08-30 2019-11-29 上海依图信息技术有限公司 Real-time body trajectory tracking method and device combined with speech recognition
CN110544491A (en) * 2019-08-30 2019-12-06 上海依图信息技术有限公司 Method and device for real-time association of speaker and voice recognition result thereof
CN110544270A (en) * 2019-08-30 2019-12-06 上海依图信息技术有限公司 Method and device for predicting face tracking trajectories in real time in combination with speech recognition
CN112653902A (en) * 2019-10-10 2021-04-13 阿里巴巴集团控股有限公司 Speaker recognition method and device and electronic equipment
CN111078010A (en) * 2019-12-06 2020-04-28 智语科技(江门)有限公司 Man-machine interaction method and device, terminal equipment and readable storage medium
CN110717067A (en) * 2019-12-16 2020-01-21 北京海天瑞声科技股份有限公司 Method and device for processing audio clustering in video
WO2021120190A1 (en) * 2019-12-20 2021-06-24 深圳市欢太科技有限公司 Data processing method and apparatus, electronic device, and storage medium
CN111010608A (en) * 2019-12-20 2020-04-14 维沃移动通信有限公司 Video playing method and electronic equipment
CN111010608B (en) * 2019-12-20 2021-09-10 维沃移动通信有限公司 Video playing method and electronic equipment
CN111079720A (en) * 2020-01-20 2020-04-28 杭州英歌智达科技有限公司 Face recognition method based on cluster analysis and autonomous relearning
CN113362849A (en) * 2020-03-02 2021-09-07 阿里巴巴集团控股有限公司 Voice data processing method and device
CN111613227A (en) * 2020-03-31 2020-09-01 平安科技(深圳)有限公司 Voiceprint data generation method and device, computer device and storage medium
WO2021196390A1 (en) * 2020-03-31 2021-10-07 平安科技(深圳)有限公司 Voiceprint data generation method and device, and computer device and storage medium
CN111556254B (en) * 2020-04-10 2021-04-02 早安科技(广州)有限公司 Method, system, medium and intelligent device for video cutting by using video content
CN111556254A (en) * 2020-04-10 2020-08-18 早安科技(广州)有限公司 Method, system, medium and intelligent device for video cutting by using video content
CN111753769A (en) * 2020-06-29 2020-10-09 歌尔科技有限公司 Terminal audio acquisition control method, electronic equipment and readable storage medium
CN111818385A (en) * 2020-07-22 2020-10-23 Oppo广东移动通信有限公司 Video processing method, video processing device and terminal equipment
CN111899743A (en) * 2020-07-31 2020-11-06 斑马网络技术有限公司 Method and device for acquiring target sound, electronic equipment and storage medium
CN112487978A (en) * 2020-11-30 2021-03-12 清华珠三角研究院 Method and device for positioning speaker in video and computer storage medium
CN112487978B (en) * 2020-11-30 2024-04-16 清华珠三角研究院 Method and device for positioning speaker in video and computer storage medium
CN113490058A (en) * 2021-08-20 2021-10-08 云知声(上海)智能科技有限公司 Intelligent subtitle matching system for film and television post-production
CN113573151A (en) * 2021-09-23 2021-10-29 深圳佳力拓科技有限公司 Digital television playing method and device based on focusing degree value
CN113573151B (en) * 2021-09-23 2021-11-23 深圳佳力拓科技有限公司 Digital television playing method and device based on focusing degree value
CN114299953A (en) * 2021-12-29 2022-04-08 湖北微模式科技发展有限公司 Speaker role distinguishing method and system combining mouth movement analysis
CN114299953B (en) * 2021-12-29 2022-08-23 湖北微模式科技发展有限公司 Speaker role distinguishing method and system combining mouth movement analysis
CN114640826B (en) * 2022-03-23 2023-11-03 北京有竹居网络技术有限公司 Data processing method, device, readable medium and electronic equipment
CN114640826A (en) * 2022-03-23 2022-06-17 北京有竹居网络技术有限公司 Data processing method, data processing device, readable medium and electronic equipment
CN115396627A (en) * 2022-08-24 2022-11-25 易讯科技股份有限公司 Positioning management method and system for screen recording video conference

Also Published As

Publication number Publication date
CN105512348B (en) 2019-03-26

Similar Documents

Publication Publication Date Title
CN105512348A (en) Method and device for processing videos and related audios and retrieving method and device
US20240038218A1 (en) Speech model personalization via ambient context harvesting
US10621991B2 (en) Joint neural network for speaker recognition
CN107799126B (en) Voice endpoint detection method and device based on supervised machine learning
WO2021082941A1 (en) Video figure recognition method and apparatus, and storage medium and electronic device
US10235994B2 (en) Modular deep learning model
US11854550B2 (en) Determining input for speech processing engine
Fisher et al. Speaker association with signal-level audiovisual fusion
US10133538B2 (en) Semi-supervised speaker diarization
CN112997186A (en) Detection system for "viability"
CN103700370A (en) Broadcast television voice recognition method and system
Sahoo et al. Emotion recognition from audio-visual data using rule based decision level fusion
Ohi et al. Deep speaker recognition: Process, progress, and challenges
Ding et al. Audio-visual keyword spotting based on multidimensional convolutional neural network
CN111126084B (en) Data processing method, device, electronic equipment and storage medium
KR20170086233A (en) Method for incremental training of acoustic and language model using life speech and image logs
CN116758451A (en) Audio-visual emotion recognition method and system based on multi-scale and global cross attention
CN107180629B (en) Voice acquisition and recognition method and system
CN113889081A (en) Speech recognition method, medium, device and computing equipment
CN114492579A (en) Emotion recognition method, camera device, emotion recognition device and storage device
WO2024140434A1 (en) Text classification method based on multi-modal knowledge graph, and device and storage medium
Weninger et al. Speaker trait characterization in web videos: Uniting speech, language, and facial features
Kumagai et al. Speech shot extraction from broadcast news videos
CN112820274B (en) Voice information recognition correction method and system
WO2024140430A1 (en) Text classification method based on multimodal deep learning, device, and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
CB02 Change of applicant information

Address after: 100190 Beijing, Haidian District Academy of Sciences, South Road, No. 2, block A, No. 313

Applicant after: MEGVII INC.

Applicant after: Beijing maigewei Technology Co., Ltd.

Address before: 100190 Beijing, Haidian District Academy of Sciences, South Road, No. 2, block A, No. 313

Applicant before: MEGVII INC.

Applicant before: Beijing aperture Science and Technology Ltd.

GR01 Patent grant
GR01 Patent grant