CN105095316A - Video content marking and searching method

Video content marking and searching method

Info

Publication number
CN105095316A
CN105095316A
Authority
CN
China
Prior art keywords
sound
fragment
audio
file
video content
Prior art date
Legal status (assumed; not a legal conclusion)
Pending
Application number
CN201410219768.8A
Other languages
Chinese (zh)
Inventor
黄又勋
Current Assignee (the listed assignee may be inaccurate)
Ali Corp
Original Assignee
Ali Corp
Priority date (assumed; not a legal conclusion)
Filing date
Publication date
Application filed by Ali Corp filed Critical Ali Corp
Priority to CN201410219768.8A
Publication of CN105095316A

Landscapes

  • Information retrieval; database structures and file system structures therefor (AREA)

Abstract

A method for searching audio and video content in an embodiment of the present invention comprises: (A) dividing an audio track into sound segments; (B) marking each sound segment obtained in step (A) and recording the corresponding start time and end time of each segment in the audio track; (C) generating a correspondence table of sound, subtitles and images, the table containing the text and image recognition features corresponding to the subtitles, together with the corresponding start time and end time of each sound segment in the audio track; and (D) the user searching the correspondence table of sound, subtitles and images generated in step (C) with the desired search string or speech.

Description

Video content marking and searching method
Technical field
The invention relates to a video content retrieval method, and more particularly to a search method that uses audio to assist in searching video.
Background art
As early as the end of the twentieth century, scientists predicted that the 21st century would be the age of information, and the development of computer and network technology has greatly extended the reach and speed of information transmission. The early, laborious practice of consulting large numbers of paper documents in a library is gradually being replaced by electronic retrieval: a single large storage server can hold more documents than a sizable library. At the same time, as technology has developed, information is no longer limited to text; much of it now takes the more intuitive form of audio and video files, a video file being itself a combination of an audio track and image frames. The prior art, however, cannot convert the sound in an audio or video file into text.
Existing search methods are fast and convenient for retrieving text, but helpless when the target is an audio or video file among a large collection of such files. For example, if a user needs to locate a particular sound segment (a sentence or a few words) within an audio or video file several hours long, existing methods offer no help; the user can only waste a great deal of time playing through the entire file. If the user does not even know which audio or video file contains the desired segment, searching a large collection of audio and video files is all the more hopeless. Because many search methods require processing times far beyond what practical applications can accept when handling video content, this time bottleneck has made progress on fast video content retrieval slow. Mature techniques for fast video content retrieval are therefore extremely rare, and products that can be applied directly in production and daily life are rarer still.
Most current video content retrieval techniques are based on classical image processing and pattern recognition, and can be roughly divided into the following categories.
Some techniques extract shots, scenes, shot key frames, scene key frames, key-frame image information, face information and the like from the video, all of it carried in image form. Retrieving video content is then equivalent to retrieving video frames, which is essentially picture retrieval. When the number of frames is small, this approach works well, but videos generally have many frames; surveillance video in particular runs to terabytes, and at 10-20 frames per second the frame count is massive. Compared with text and numeric retrieval, image retrieval takes far longer, so when the frame count is large, video search based on frame processing runs into a severe time bottleneck.
Summary of the invention
The invention provides a method for searching audio and video content that can increase the speed of video search.
A method for marking audio and video content in an embodiment of the invention comprises: (1) dividing the audio track of an audio or video file into sound segments; (2) marking each sound segment obtained in step (1) and obtaining the corresponding start time and end time of each segment in the audio track.
In an embodiment of the invention, step (1) is specifically: using silence detection, each sentence in the audio or video file is divided into one sound segment.
In an embodiment of the invention, after step (2) the method further comprises marking each sound segment at its corresponding time point on the timeline, so that the video file can be made to jump to any marked time point.
A method for searching audio and video content in an embodiment of the invention comprises: (A) dividing an audio track into sound segments; (B) marking each sound segment obtained in step (A) and recording the corresponding start time and end time of each segment in the audio track; (C) generating a correspondence table of sound, subtitles and images, the table containing the text and image recognition features corresponding to the subtitles, together with the corresponding start time and end time of each sound segment in the audio track; (D) the user searching the correspondence table of sound, subtitles and images generated in step (C) with the desired search string or speech.
In an embodiment of the invention, step (A) is specifically: using silence detection, each sentence in the audio or video file is divided into one sound segment.
In an embodiment of the invention, after step (B) the method further comprises marking each sound segment at its corresponding time point on the timeline.
In an embodiment of the invention, step (C) is specifically: generating a correspondence table of sound, subtitles and images, the table containing all sound segments of the audio track, the segment time points and image features corresponding to the subtitles, and the start time and end time of each sound segment in the audio track.
In an embodiment of the invention, step (D) is specifically: after the generated correspondence table of sound, subtitles and images has been searched, the video file can be made to jump to the marked time point that best matches the search result.
To make the above features and advantages of the invention more apparent, embodiments are described in detail below together with the accompanying drawings.
Brief description of the drawings
Fig. 1 is a flowchart of a method for marking audio and video content according to an embodiment of the invention.
Fig. 2 is a flowchart of a method for searching audio and video content according to an embodiment of the invention.
Description of the embodiments
Reference will now be made in detail to the preferred embodiments of the invention, examples of which are illustrated in the accompanying drawings. Wherever possible, the same reference numbers are used in the drawings and the description to refer to the same or like parts.
Please refer to Fig. 1, a flowchart of the marking method for audio and video content according to an embodiment of the invention. The invention proposes a marking method 100 for audio and video content, in which, while a film is playing, the playback device uses its speech/subtitle/image recognition functions to mark corresponding time points on the timeline.
In step 110, when the playback device receives or plays film data, it can analyze the obtained film data and perform speech/subtitle/image recognition. The invention is not limited to extraction from feature films; any data such as audio/video files and streams can be collected and searched with the method provided by the invention.
In step 120, speech/subtitle/image recognition is performed on the film data. When recognition succeeds, step 130 stores the recognized subtitles, speech, images and accompanying data (textual information about the video) in a multimedia database and records their time points. When recognition fails, step 130 is still performed: the time point of the failure is recorded, and the flow returns to step 110 to keep trying to recognize the relevant information. Even when recognition fails, the start time of the sound segment can still be recorded; it may simply be that the speech content within that segment cannot be recognized.
In the recognition step, the audio track of the audio or video file is divided into sound segments, each segment is marked, and the corresponding start time and end time of each segment in the audio track are obtained. Step 120 may also be carried out as follows: using silence detection, each sentence in the audio or video file is divided into one sound segment, and the corresponding start time and end time of each segment in the file are recorded. Silence detection is a common existing method for segmenting audio or video files: it detects pauses in the sound, and when a pause exceeds a preset interval, a sentence is considered to have ended. In this way, every sentence in the file can be divided into its own sound segment. The prior art is not limited to silence detection for splitting sound segments; there are many other methods, which are not repeated here one by one.
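The patent gives no implementation, but a minimal sketch of such silence-based segmentation, operating on per-frame RMS energies, might look like the following; the frame length, energy threshold and minimum pause length are illustrative assumptions, not values from the patent.

```python
# Minimal sketch of silence-based sentence segmentation (illustrative, not
# the patent's implementation). Input: mono PCM samples in [-1, 1] and a
# sample rate. Output: (start_sec, end_sec) pairs, one per sound segment.

def split_on_silence(samples, sample_rate,
                     frame_ms=20,            # analysis frame length (assumed)
                     energy_threshold=0.01,  # RMS below this counts as silence (assumed)
                     min_pause_ms=500):      # pause longer than this ends a sentence (assumed)
    frame_len = int(sample_rate * frame_ms / 1000)
    min_pause_frames = min_pause_ms // frame_ms

    segments = []
    seg_start = None   # frame index where the current segment began
    silent_run = 0     # consecutive silent frames seen so far

    n_frames = len(samples) // frame_len
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = (sum(s * s for s in frame) / frame_len) ** 0.5
        if rms >= energy_threshold:          # voiced frame
            if seg_start is None:
                seg_start = i
            silent_run = 0
        elif seg_start is not None:          # silent frame inside a segment
            silent_run += 1
            if silent_run >= min_pause_frames:
                # Pause exceeded the preset interval: the sentence has ended.
                end = i - silent_run + 1
                segments.append((seg_start * frame_ms / 1000.0,
                                 end * frame_ms / 1000.0))
                seg_start, silent_run = None, 0

    if seg_start is not None:                # close a segment at end of file
        segments.append((seg_start * frame_ms / 1000.0,
                         n_frames * frame_ms / 1000.0))
    return segments
```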
Please refer to Fig. 2, a flowchart of the searching method for audio and video content according to an embodiment of the invention. A searching method 200 for audio and video content comprises: dividing the audio track into sound segments; marking each resulting sound segment and recording the corresponding start time and end time of each segment in the audio track; generating a correspondence table of sound, subtitles and images, the table containing the text and image recognition features corresponding to the subtitles, together with the corresponding start time and end time of each sound segment in the audio track; and the user searching the generated correspondence table with the desired search string or speech.
In step 210, when the playback device receives or plays film data, it can analyze the obtained film data and perform speech/subtitle/image recognition. The audio track is divided into sound segments, and the corresponding start time and end time of each segment in the audio or video file are recorded. In this way, the exact location of each sound segment within the file is obtained.
In step 220, each sound segment obtained in step 210 is marked, and the corresponding start time and end time of each segment in the audio track are recorded. Silence detection divides each sentence in the audio or video file into one sound segment, and each segment is further marked at its corresponding time point on the timeline. Speech recognition software then converts each sound segment into the corresponding text; alternatively, a stenographic method can be used to transcribe each segment. To ensure accuracy, the recognized text can be proofread afterwards.
In step 230, the successfully recognized subtitles, speech, images and accompanying data (textual information about the video) are stored in a multimedia database, and a correspondence table of sound, subtitles and images is generated. The table contains the text and image recognition features corresponding to the subtitles, all sound segments of the audio track, and the corresponding start time and end time of each segment in the audio track.
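As a concrete, non-authoritative illustration of what one row of such a correspondence table might hold, the following sketch reuses `split_on_silence` from above; the field names and the `recognize_speech`/`extract_image_features` callbacks are hypothetical stand-ins for whatever ASR and image-recognition components are used.

```python
from dataclasses import dataclass, field

@dataclass
class SegmentEntry:
    """One row of the sound/subtitle/image correspondence table (illustrative)."""
    start_sec: float                 # segment start time in the audio track
    end_sec: float                   # segment end time in the audio track
    subtitle_text: str = ""          # recognized text; "" when recognition failed (step 130)
    image_features: list = field(default_factory=list)  # image-recognition features for the span

def build_table(samples, sample_rate, recognize_speech, extract_image_features):
    """Build the correspondence table of sound, subtitles and images.

    `recognize_speech(samples, start_sec, end_sec)` and
    `extract_image_features(start_sec, end_sec)` are hypothetical callbacks.
    """
    table = []
    for start, end in split_on_silence(samples, sample_rate):
        text = recognize_speech(samples, start, end) or ""  # may fail; times still recorded
        table.append(SegmentEntry(start, end, text,
                                  extract_image_features(start, end)))
    return table
```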
In step 240, the user searches the correspondence table of sound, subtitles and images generated in step 230 with the desired search string or speech. Finally, after the table has been searched, the video file is made to jump to the marked time point that best matches the search result.
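A correspondingly simple search over the sketched table might look as follows; substring matching and the earliest-first ordering are assumptions, since the patent does not specify a matching strategy, and a spoken query would first be converted to text by speech recognition.

```python
def search_table(table, query):
    """Return (start_sec, subtitle_text) for every segment whose recognized
    text contains `query`, earliest first. The patent does not specify a
    ranking, so 'best match' is naively the first hit here."""
    query = query.lower()
    return [(e.start_sec, e.subtitle_text)
            for e in table
            if query in e.subtitle_text.lower()]

# Usage sketch: locate a line of dialogue, then seek the player to it.
# hits = search_table(table, "breaking news")
# if hits:
#     player.seek(hits[0][0])  # `player.seek` is a hypothetical playback API
```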
In summary, the invention provides a method for searching audio and video content. With a playback device supporting this method, a user watching the news who hears an unclear sentence can rewind to the start of that sentence with a single key and listen to it again. The user can also enter key lines of dialogue; the playback device compares them with the generated correspondence table of sound, subtitles and images and lists the time points of the matching subtitles for the user to browse quickly. Fast-forwarding and rewinding in units of the recorded time points (for example, the start of each line of dialogue) further improves the efficiency of browsing and searching. A webcam or pictures can also be used to find the time points at which a particular object or frame appears.
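The one-key replay described above reduces to a lookup over the marked segment start times; a hypothetical sketch, again over the table structure assumed earlier:

```python
def rewind_to_sentence_start(table, current_sec):
    """One-key replay: return the start time of the sentence currently
    playing (the latest marked segment start at or before `current_sec`)."""
    starts = [e.start_sec for e in table if e.start_sec <= current_sec]
    return max(starts) if starts else 0.0
```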
Although the invention has been disclosed above by way of embodiments, these are not intended to limit the invention. Anyone with ordinary knowledge in the relevant art may make some changes and refinements without departing from the spirit and scope of the invention; the protection scope of the invention shall therefore be determined by the appended claims.

Claims (8)

1. A method for marking audio and video content, comprising:
(1) dividing the audio track in an audio file or video file into sound segments;
(2) marking each sound segment obtained in step (1), and obtaining the corresponding start time and end time of each sound segment in the audio track.
2. The method for marking audio and video content according to claim 1, wherein step (1) is specifically: using silence detection, each sentence in the audio or video file is divided into one sound segment.
3. The method for marking audio and video content according to claim 1, further comprising, after step (2), marking each sound segment at its corresponding time point on the timeline, so that the video file can be made to jump to any marked time point.
4. A method for searching audio and video content, comprising:
(A) dividing an audio track into sound segments;
(B) marking each sound segment obtained in step (A), and recording the corresponding start time and end time of each sound segment in the audio track;
(C) generating a correspondence table of sound, subtitles and images, the table containing the text and image recognition features corresponding to the subtitles, together with the corresponding start time and end time of each sound segment in the audio track;
(D) the user searching the correspondence table of sound, subtitles and images generated in step (C) with the desired search string or speech.
5. The method for searching audio and video content according to claim 4, wherein step (A) is specifically: using silence detection, each sentence in the audio or video file is divided into one sound segment.
6. The method for searching audio and video content according to claim 4, further comprising, after step (B), marking each sound segment at its corresponding time point on the timeline.
7. The method for searching audio and video content according to claim 4, wherein step (C) is specifically: generating a correspondence table of sound, subtitles and images, the table containing all sound segments of the audio track, the segment time points and image features corresponding to the subtitles, and the start time and end time of each sound segment in the audio track.
8. The method for searching audio and video content according to claim 4, wherein step (D) is specifically: after the generated correspondence table of sound, subtitles and images has been searched, the video file can be made to jump to the marked time point that best matches the search result.
CN201410219768.8A 2014-05-22 2014-05-22 Video content marking and searching method Pending CN105095316A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410219768.8A CN105095316A (en) 2014-05-22 2014-05-22 Video content marking and searching method


Publications (1)

Publication Number Publication Date
CN105095316A 2015-11-25

Family

ID=54575765

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410219768.8A Pending CN105095316A (en) 2014-05-22 2014-05-22 Video content marking and searching method

Country Status (1)

Country Link
CN (1) CN105095316A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107124648A (en) * 2017-04-17 2017-09-01 浙江德塔森特数据技术有限公司 The method that advertisement video is originated is recognized by intelligent terminal



Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20151125