CN102111601A - Content-based adaptive multimedia processing system and method - Google Patents

Content-based adaptive multimedia processing system and method Download PDF

Info

Publication number
CN102111601A
Authority
CN
China
Prior art keywords
content
analysis
medium data
video
multimedia
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2009102636148A
Other languages
Chinese (zh)
Other versions
CN102111601B (en)
Inventor
寇世斌
倪嗣尧
蓝元宗
林仲毅
陈翊玮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Gorilla Technology Uk Ltd
Original Assignee
Gorilla Technology Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Gorilla Technology Inc filed Critical Gorilla Technology Inc
Priority to CN2009102636148A priority Critical patent/CN102111601B/en
Publication of CN102111601A publication Critical patent/CN102111601A/en
Application granted granted Critical
Publication of CN102111601B publication Critical patent/CN102111601B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Landscapes

  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The invention provides a content-based adaptive multimedia processing system and method that integrate the results of video analysis, audio analysis and text analysis into a decision process, so that multimedia content comprising video, audio and subtitles is converted, in a manner that takes all three into account, into multimedia content suited to viewing on different playback devices. With this system, when the processed multimedia content is played on different playback devices, such as mobile devices with different display sizes or computer program windows with different aspect ratios, the video content the user is interested in is preserved as far as possible, so that important details in the picture are not lost because the screen is smaller or its aspect ratio has changed; at the same time, the important parts of the audio are made prominent, and the display position and display mode of the subtitles are adjusted.

Description

Content-adaptive multimedia processing system and processing method
Technical field
The present invention relates to a content-adaptive multimedia processing system and processing method, and in particular to a method that, according to the results of content analysis, performs content editing, format conversion and compressed multimedia encoding on multimedia content comprising video, audio and subtitles, so that the resulting multimedia content achieves a better viewing experience on different playback devices.
Background technology
With advances in technology, films and television are increasingly watched on mobile devices. Multimedia sources are typically high-resolution and multi-channel, while mobile devices vary widely, with screens of different sizes and aspect ratios. To play such content on a mobile device screen, and in view of screen size, device playback capability, network bandwidth and multimedia storage space, the multimedia stream or multimedia file must undergo several conversions to fit the device's screen and maintain smooth playback.
For the video portion, current practice is to scale the whole video frame down uniformly to fit the mobile device screen. However, constrained by the small screen, the user often cannot obtain a viewing experience comparable to that of a television or computer screen. For example, after uniform scaling of the whole frame, the details of a key object in the picture cannot be preserved on the mobile device screen, and its original prominence is lost.
For the audio portion, the original multimedia audio must also be converted appropriately to suit the audio playback capability of the mobile device. Current practice is to downmix the multi-channel audio directly into stereo or mono. Because mobile devices are used on the move, the audio is easily masked by ambient noise, making important audio content hard to hear clearly. Moreover, because mobile-device loudspeakers are limited in size and power and have a poorer frequency response, background effects such as explosions may be noticeably distorted, or background effects may be too loud relative to foreground audio such as dialogue, forcing the user to adjust the volume frequently.
As for subtitles, constrained by the screen size of the mobile device, if the full subtitle text is shown it must be scaled down with the picture, which makes the text too small and crowded to read. If the subtitle text is scaled at a different ratio from the picture, the text becomes too large and covers too much of the picture, or the subtitle line is too long and runs beyond the displayable area.
Summary of the invention
Because the prior art loses detail when scaling multimedia video, audio and subtitles, the present invention proposes a system, a computer program and related methods that integrate the results of video analysis, audio analysis and text analysis, use a decision unit to evaluate the integrated data, and automatically adjust the video, audio and subtitle content, so as to produce multimedia content that retains the best viewing experience when played on different playback devices, and especially on mobile devices.
In an embodiment, video analysis, audio analysis and text analysis are performed on the multimedia video, audio and subtitle content respectively; the analysis results are then evaluated and processed by the decision step according to the settings of different playback devices, and suitable multimedia content is output.
The flow of the invention comprises content analysis, decision and content processing. After the original multimedia data containing video, audio and subtitle text enters the system, the system first performs content analysis on the multimedia data, then integrates the analysis results and makes a decision, and finally, after processing, outputs suitable multimedia data. The original multimedia data and the output multimedia data may each be a file or a stream comprising video, audio and subtitle text. The content-analysis flow comprises video analysis, audio analysis, text analysis and manual analysis. The results of video analysis, audio analysis, text analysis and manual analysis then pass through the decision flow, which applies environment parameters and predefined rules to determine how the multimedia content is processed, for example the degree of image scaling, the degree of background-sound suppression, or the subtitle placement. The content-processing flow then, according to the processing scheme determined by the decision flow, actually processes and integrates the video, audio and subtitle content and outputs the final result.
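The analyze-decide-process flow described above can be pictured with a short sketch. The following Python fragment is purely illustrative; the class names, rule thresholds and parameter names are assumptions for this example, not the patent's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Plan:
    video_scale: float = 1.0      # image-scaling degree
    suppress_bgm: bool = False    # background-sound suppression
    subtitle_pos: str = "bottom"  # subtitle placement

def decide(analysis: dict, env: dict) -> Plan:
    """Toy decision step: combine content-analysis results with
    environment parameters using fixed rules, then emit a plan."""
    plan = Plan()
    if env.get("screen_width", 1920) < 640:
        plan.video_scale = env["screen_width"] / 1920
    if env.get("noisy") and analysis.get("has_speech"):
        plan.suppress_bgm = True
    if analysis.get("object_at_bottom"):
        plan.subtitle_pos = "top"
    return plan

plan = decide({"has_speech": True, "object_at_bottom": True},
              {"screen_width": 480, "noisy": True})
print(plan)  # Plan(video_scale=0.25, suppress_bgm=True, subtitle_pos='top')
```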
An object of the present invention is to provide a system, a computer program and related methods for processing multimedia content, so that when the content is played on different playback devices, for example a small mobile-device screen or a computer window with a different aspect ratio, a better viewing experience is still obtained. Through content analysis, the video content the user is interested in is preserved as far as possible, so that important details are not lost because the picture is shrunk or its aspect ratio changes; at the same time, the important parts of the audio are emphasized and the display position and display mode of the subtitles are adjusted, producing, under a processing scheme that takes video, audio and text into account, multimedia data suited to viewing on different playback devices.
In one application, a server re-encodes a multimedia source, in real time or as a pre-processing step, into multimedia content fit for different devices according to the proposed method, and the user views the content on the playback device's screen by real-time streaming or by non-real-time download of the multimedia file. In another application, the server produces pre-generated description instructions according to the proposed method and hands them to a multimedia processing system that then generates the content. The scope of application includes users playing local or remote multimedia files on mobile devices, personal computers or other devices, with player software, a web browser or other programs. Applicable playback devices include mobile phones, personal digital assistants (PDAs), laptop computers and video players, but are not limited to the devices mentioned above.
Description of drawings
Fig. 1 is an architecture diagram of an embodiment of the present invention;
Fig. 2 is an architecture diagram of another embodiment of the flow of the present invention;
Fig. 3 is an architecture diagram of an embodiment of the video analysis subunit of the present invention;
Fig. 4 is an application embodiment of the present invention concerning the attention model;
Fig. 5 is an application embodiment of the present invention concerning retargeting;
Fig. 6 is an application embodiment of the present invention concerning eye/gaze tracking;
Fig. 7 is an architecture diagram of an embodiment of the audio analysis subunit of the present invention;
Fig. 8 is an application embodiment of the present invention concerning the speech and music detection module;
Fig. 9 is an architecture diagram of an embodiment of the text analysis subunit of the present invention;
Fig. 10 is an application embodiment in which subtitle text is displayed as keywords;
Fig. 11 is an application embodiment in which subtitle text is segmented and displayed in short, quickly shown pieces;
Fig. 12 is an application embodiment concerning the choice of subtitle display position;
Fig. 13 is an embodiment of the decision unit of the present invention.
[Description of main component symbols]
Content analysis unit 11; decision unit 13; multimedia processing unit 15; original multimedia data 10;
video analysis subunit 111; audio analysis subunit 113; text analysis subunit 115; manual analysis 140;
environment parameters 160; processed multimedia data 18;
content analysis unit 21; decision unit 23; multimedia processing unit 25; original multimedia data 20;
video analysis subunit 211; audio analysis subunit 213; text analysis subunit 215; manual analysis 240;
environment parameters 260; description instruction set 28; processed multimedia data 29;
analysis-and-decision subsystem 2; multimedia processing subsystem 3;
video analysis subunit 32; decision unit 34; shot detection module 321; attention model module 323;
retargeting module 325; eye/gaze tracking module 327; original video data 30;
audio analysis result 341; text analysis result 343; manual analysis result 345;
environment parameters 347, 747, 947;
display content 401, 403, 405; attention model module 41; decision unit 43; mobile device 47;
display content 501, 503, 505, 507; decision unit 53; multimedia processing unit 55;
mobile device 57; retargeting module 51;
display content 601, 603, 605; eye/gaze tracking module 61; decision unit 63;
audio analysis subunit 72; decision unit 74; speech and music detection module 721;
speaker recognition module 723; original audio data 701;
video analysis result 741; text analysis result 743; manual analysis result 745;
text analysis subunit 92; semantic tagging module 921; text segmentation module 923;
original subtitle data 901; decision unit 94;
video analysis result 941; audio analysis result 943; manual analysis result 945
Embodiment
Please refer to Fig. 1, an architecture diagram of an embodiment of the content-adaptive multimedia processing system provided by the present invention. This embodiment mainly comprises a content analysis unit 11, a decision unit 13 and a multimedia processing unit 15. The content analysis unit 11 analyzes the content of the multimedia data; the decision unit 13 decides the processing scheme for the multimedia content and uses the multimedia processing unit 15 to perform content editing, format conversion and compressed encoding on the original multimedia data 10, outputting processed multimedia data 18 suited to playback on different playback devices or in different playback environments.
The content analysis unit 11 receives the original multimedia data 10, analyzes it, and sends the analysis results to the decision unit 13 for further handling. The content analysis unit 11 comprises a video analysis subunit 111, an audio analysis subunit 113 and a text analysis subunit 115, which analyze the multimedia video content, audio content and subtitle text content respectively. The decision unit 13 receives the analysis results from the content analysis unit 11 together with the result of manual analysis 140, and can also accept environment parameters 160. After receiving the analysis results, the decision unit 13 determines the processing scheme for the multimedia content according to the environment parameters 160 and predefined rules. The multimedia processing unit 15 then, according to the processing scheme decided by the decision unit 13, performs content editing, format conversion and compressed encoding on the original multimedia data 10 and outputs processed multimedia data 18 that meets the viewing requirements of the playback device.

The environment parameters 160 include player-related parameters such as display capability (resolution), audio playback capability (number of channels, for example mono or stereo) and decoding/playback capability; they may also include the conditions of the playback environment at the time of playback. An example of environment parameters 160 is: a personal digital assistant (PDA) with a 3.5-inch VGA (640 x 480) screen and stereo earphones, used to watch the designated multimedia content in a noisy environment. Manual analysis 140 is carried out by hand: the important or interesting parts of the multimedia content are chosen by subjective judgment, and the data obtained from manual analysis 140 are entered into the decision unit 13 in the input format it requires.
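The environment-parameter example in the preceding paragraph (a PDA with a 3.5-inch VGA screen, stereo earphones, noisy surroundings) could be encoded as a simple structure such as the following Python sketch; the key names are assumptions made for illustration, not a format defined by the patent.

```python
env_params = {
    "display": {"width": 640, "height": 480, "diagonal_inches": 3.5},
    "audio":   {"channels": 2},                  # stereo earphones
    "decode":  {"max_resolution": (640, 480)},   # decoding/playback capability
    "ambient": {"noisy": True},                  # playback-environment condition
}
```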
Please refer to Fig. 2, an architecture diagram of another embodiment of the flow of the content-adaptive multimedia processing system provided by the present invention. This embodiment mainly comprises a content analysis unit 21, a decision unit 23 and a multimedia processing unit 25, where the content analysis unit 21 analyzes the content of the multimedia data; the decision unit 23 decides the processing scheme for the multimedia content; and the multimedia processing unit 25 performs content editing, format conversion and compressed encoding on the original multimedia data 20 and outputs multimedia data suited to playback on different playback devices.
After the original multimedia data 20 has been analyzed by the content analysis unit 21, the analysis results are sent to the decision unit 23 for further handling. The content analysis unit 21 likewise comprises a video analysis subunit 211, an audio analysis subunit 213 and a text analysis subunit 215, which analyze the multimedia video content, audio content and subtitle text content respectively. The decision unit 23 receives the analysis results from the content analysis unit 21 together with the result of manual analysis 240, and can also accept environment parameters 260.

After receiving the analysis results, the decision unit 23 determines the processing scheme for the multimedia content according to the environment parameters 260 and predefined rules; different environment parameters 260 produce different processing schemes. The processing scheme decided by the decision unit 23 is then expressed as a description instruction set 28.
For one and the same piece of multimedia data, different description instruction sets 28 representing different processing schemes can be produced by supplying different environment parameters 260. The environment parameters represent the various playback environments, including the hardware configurations of different playback devices, so the decision unit 23 produces one specific description instruction set for each corresponding playback environment. That is, depending on the playback device, the system selects the appropriate description instruction set 28 and sends it to the multimedia processing unit 25, which, according to the incoming description instruction set 28, performs content editing, format conversion and compressed encoding on the original multimedia data 20 and outputs processed multimedia data 29 that meets the viewing requirements of the playback device.

In a preferred embodiment, the content analysis unit 21 and the decision unit 23 form an analysis-and-decision subsystem 2, while the multimedia processing unit 25 constitutes its own multimedia processing subsystem 3. The analysis-and-decision subsystem 2 can produce different description instruction sets 28 for different environment parameters 260. According to the playback device selected, the multimedia processing subsystem 3 chooses the suitable description instruction set 28 and performs content editing, format conversion and compressed encoding on the original multimedia data 20 to produce suitable processed multimedia data 29. In one embodiment, the multimedia processing subsystem 3 may be a non-linear editing (NLE) system that edits the original multimedia data 20 into suitable content according to the description instruction set 28.

The description instruction set 28 records, for the corresponding multimedia data and environment parameters, the regions of each image that require processing and the specific manner of processing, the processing required for each audio passage, and the keywords and positions with which the subtitles should be presented, among other things. The description instruction set 28 may be recorded in Extensible Markup Language (XML), recording hierarchically the data and the way each medium should be handled; in combination with a non-linear editing (NLE) system, the description instruction set 28 may also adopt the record format specified by that system.
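As a concrete illustration of such a hierarchical XML record, the following sketch builds and parses a hypothetical description instruction set; all element and attribute names here are invented for this example and are not a format defined by the patent.

```python
import xml.etree.ElementTree as ET

instruction_xml = """
<instructions media="movie.mp4" env="pda_vga_stereo">
  <video start="00:01:10" end="00:01:25" crop="120,40,480,360" scale="1.0"/>
  <audio start="00:01:10" end="00:01:25" speech_gain="+6dB" music_gain="-9dB"/>
  <subtitle start="00:01:12" keywords="bridge,collapse" position="top-left"/>
</instructions>
"""

root = ET.fromstring(instruction_xml)
for op in root:
    print(op.tag, op.attrib)   # one processing instruction per element
```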
Please refer to Fig. 3, an architecture diagram of an embodiment of the video analysis subunit of the present invention. It mainly shows the video analysis subunit 32 and the decision unit 34, where the video analysis subunit 32 includes at least a shot detection module 321, an attention model module 323, a retargeting module 325 and an eye/gaze tracking module 327.

The shot detection module 321 segments the video data: video content belonging to the same shot is grouped, after analysis by the shot detection module 321, into the same passage. Through the shot detection module 321, the points in the film at which the scene changes can be found, and this information can be supplied to the other analysis methods as auxiliary evidence.
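A common way to realize such shot detection, shown below as a generic sketch rather than the patent's specific algorithm, is to flag a cut wherever the intensity histograms of consecutive frames differ sharply.

```python
import numpy as np

def shot_boundaries(frames, threshold=0.5):
    """frames: iterable of grayscale frames as 2-D uint8 arrays.
    Returns the frame indices where a cut is detected."""
    cuts, prev_hist = [], None
    for i, frame in enumerate(frames):
        hist, _ = np.histogram(frame, bins=64, range=(0, 256))
        hist = hist / hist.sum()
        # L1 distance between normalized histograms lies in [0, 2]
        if prev_hist is not None and np.abs(hist - prev_hist).sum() > threshold:
            cuts.append(i)
        prev_hist = hist
    return cuts
```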
The attention model module 323 uses the distribution of various features in the image, together with empirical rules and the characteristics of human visual perception, to build an attention model and thereby pick out the relatively important parts of the film, that is, the parts most likely to attract the audience.

The retargeting module 325 uses the distribution of energy in the image to decide whether the subject belongs to the foreground or the background of the image, and can further rank the importance of pixel sets to the image as a whole.
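The energy-distribution idea behind retargeting can be sketched with gradient magnitude: pixels with high gradient energy are likely to belong to the subject and should be preserved at full detail. This is a generic technique offered for illustration, not the patent's exact method.

```python
import numpy as np

def energy_map(gray):
    """Gradient-magnitude energy of a grayscale image (2-D float array)."""
    gy, gx = np.gradient(gray.astype(float))
    return np.abs(gx) + np.abs(gy)

def foreground_mask(gray, quantile=0.9):
    """Mark the top-energy pixels as likely foreground/subject."""
    e = energy_map(gray)
    return e >= np.quantile(e, quantile)
```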
The eye/gaze tracking module 327 records the position track of a viewer's eyes and pupils and thereby derives, for a viewer watching the film, which regions of the film the viewer's gaze dwells on.

After the original video data 30 of the multimedia content enters the video analysis subunit 32, video analysis is carried out by the methods of shot detection, the attention model, retargeting and eye/gaze tracking respectively. The analysis results are then sent to the decision unit 34, which integrates them with the audio analysis result 341, the text analysis result 343 and the manual analysis result 345 according to the environment parameters 347 and determines the final processing scheme. In this embodiment, the shot detection module 321 finds the shot-change points so as to guarantee that cropping and scaling do not occur across a shot change, where inappropriate scaling would cause viewer discomfort. The attention model, retargeting, eye/gaze tracking and manual analysis all use their particular means to select the regions of the video content the audience is likely to attend to, so that the subsequent multimedia processing does not cover, modify or even crop them, preventing important parts from being altered when the content is laid out again and the viewing enjoyment from being impaired.

For the attention model module, the retargeting module and the eye/gaze tracking module, please see the detailed embodiments below. In the present embodiment, the video analysis subunit may comprise any one of the above modules or any combination of several of them. The video analysis subunit is not limited to the modules mentioned in this embodiment and may comprise other analysis modules that achieve the object of the invention; in one embodiment, the video analysis subunit may also comprise other video analysis methods such as face detection and motion detection.
The attention model module:
Please refer to Fig. 4, which shows an application embodiment of the attention model module of the present invention. The top-left of Fig. 4 is the original video display content 401; analysis by the attention model module 41 produces the display content 403 at the top right, in which a dashed box marks the subject that captures attention. The bottom-left of Fig. 4 is the display content 405 produced after processing by the decision unit 43 and the multimedia processing unit 45, suited to playback on the mobile device 47.
In this embodiment, after the original image 401 is analyzed by the attention model module 41, the part of the image most likely to attract the audience (the part inside the dashed box of image 403) is selected. This analysis result is sent to the decision unit 43, which, according to the environment parameters and after integrating the results of video analysis, audio analysis, text analysis and manual analysis, decides that the processing of the video content shall adopt the analysis result of the attention model module 41; finally the multimedia processing unit 45 performs the processing and produces the video content 405 shown on the mobile device 47 at the bottom of Fig. 4.
For the attention model method used here, reference may be made to United States Patent No. 7,260,261, Systems and Methods for Enhanced Image Adaptation; persons in the relevant technical field can, by consulting it, understand attention model methods applicable to the video analysis subunit of the present invention.
The retargeting module:
Please refer to Fig. 5, an application embodiment of the retargeting module of the present invention. The top-left of Fig. 5 is the original video display content 501; the right-hand image of Fig. 5 is an image 503 of the retargeting module's analysis result; the bottom-left of Fig. 5 is the display content 507 produced after processing by the decision unit 53 and the multimedia processing unit 55, suited to playback on the mobile device 57.
In this embodiment, after the original image 501 is analyzed by the retargeting module 51, the important object 505 and the background 503 in the image are identified. This analysis result is sent to the decision unit 53, which, according to the environment parameters and after integrating the results of video analysis, audio analysis, text analysis and manual analysis, decides that the processing of the video content shall adopt the analysis result of the retargeting module 51; finally the multimedia processing unit 55 performs the processing and produces the video content 507 at the bottom left of Fig. 5.
As can be seen in this embodiment (compare image 403 of Fig. 4), the object 505 in the image still keeps its original size after processing by the system, while the background of the image is shrunk to fit the screen size of the mobile device 57. Scaling the object and the background at different ratios prevents the details and prominence of the object from being lost when the image is shrunk.
For the retargeting method above, reference may be made to United States Patent Application Publication No. 2007/0025637, Retargeting Images for Small Displays; by consulting it, the retargeting methods applicable to the video analysis subunit of the present invention can be understood.
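The unequal-scaling result of Fig. 5 can be sketched as follows: the background is shrunk fully to the target size, while the key object is pasted back at a larger scale so its details survive. The sketch uses the Pillow imaging library; the object box, target size and the 1.8x factor are illustrative assumptions.

```python
from PIL import Image

def retarget_frame(frame: Image.Image, obj_box, target=(320, 240)):
    """Shrink the background uniformly but keep the key object larger."""
    x, y, w, h = obj_box
    obj = frame.crop((x, y, x + w, y + h))    # key object at full detail
    background = frame.resize(target)         # background shrinks fully
    sx, sy = target[0] / frame.width, target[1] / frame.height
    # Scale the object less aggressively than the uniform ratio would.
    obj = obj.resize((int(w * sx * 1.8), int(h * sy * 1.8)))
    background.paste(obj, (int(x * sx), int(y * sy)))
    return background
```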
The eye/gaze tracking module:
Fig. 6 shows an application embodiment of the eye/gaze tracking module of the present invention. The top-left of Fig. 6 is the original video display content 601; the right-hand image of Fig. 6 is the result 603 of analysis by the eye/gaze tracking module 61; the bottom-left of Fig. 6 is the display content 605 produced after processing by the decision unit 63 and the multimedia processing unit 65, suited to playback on the mobile device 67.
In this embodiment, after the original image 601 is analyzed by the eye/gaze tracking module 61, the region of the image on which the audience's gaze dwells is found. This analysis result is sent to the decision unit 63, which, according to the environment parameters and after integrating the results of video analysis, audio analysis, text analysis and manual analysis, decides that the processing of the video content shall adopt the analysis result of the eye/gaze tracking module 61; finally the multimedia processing unit 65 performs the processing and produces the video content 605 at the bottom left of Fig. 6.
Reference may be made to United States Patent No. 7,259,785, Digital Imaging Method and Apparatus Using Eye-Tracking Control; persons in the technical field of the present invention can, by consulting it, understand eye/gaze tracking methods applicable to the video analysis subunit of the present invention.
Please refer to Fig. 7, an architecture diagram of an embodiment of the audio analysis subunit of the present invention.
As the figure shows, the architecture mainly includes the audio analysis subunit 72 and the decision unit 74. The audio analysis subunit 72 includes a speech and music detection module 721, which classifies and separates the speech and the music in the audio, and a speaker recognition module 723, which distinguishes the different speakers in the audio.
After the original audio data 701 of the multimedia content enters the audio analysis subunit 72, audio analysis is carried out by the methods provided by the speech and music detection module 721 and the speaker recognition module 723 respectively. The analysis results are then sent to the decision unit 74, which integrates them with the video analysis result 741, the text analysis result 743 and the manual analysis result 745 according to the environment parameters 747 and determines the final processing scheme; the multimedia processing unit then performs audio processing on the original audio data, so that the processed multimedia content, when played on the playback device in a noisy environment, still retains good listening quality.
The speech and music detection module 721 separates the original audio, according to different content characteristics such as speech, background music and sound effects, or according to the person speaking, into individually distinguishable independent audio tracks. Its analysis result, after being judged by the decision unit 74, produces a decision result, for example lowering the volume of an over-loud alarm-bell effect in a given period while raising the volume of the spoken dialogue in the same period. The analysis result of the speaker recognition module 723, after the decision unit 74 integrates it with the video analysis result 741, the text analysis result 743, the manual analysis result 745 and the environment parameters 747 and judges them comprehensively, produces a decision result such as presenting only the video picture of the current speaker.
In another embodiment, the decision unit 74 can integrate the audio analysis result with the analysis result of a face detection module (not shown) in the video analysis subunit: when a face is detected, the background audio can be suppressed or the speech signal strengthened to make the spoken dialogue prominent; otherwise the audio data need not be modified.
In one embodiment, the audio analysis subunit 72 may also comprise an explosion-sound detection module, for presenting in full the video and the audio effects produced during an explosion; according to the explosion-detection result of the audio analysis subunit 72, the decision unit 74 can decide whether to treat the explosion audio as the main audio data and keep the original video picture uncropped.
For speech and music detection methods and speaker recognition methods, the following documents may be consulted:
Abdullah I. Al-Shoshan, "Speech and Music Classification and Separation: A Review", Journal of King Saud University, Engineering Sciences, Volume 19, No. 1, 2007.
Joseph P. Campbell, Jr., "Speaker Recognition: A Tutorial", Proceedings of the IEEE, Volume 85, Issue 9, September 1997, pp. 1437-1462.
Persons in the field related to the present invention can, from the above references, understand several audio-content analysis methods applicable to the present invention. In this embodiment, the audio analysis subunit 72 may comprise any one of the above modules or any combination of several of them; it is not limited to the modules mentioned in this embodiment and may comprise other audio analysis modules that achieve the object of the invention. In one embodiment, the audio analysis subunit 72 may comprise other audio analysis methods such as speech recognition, silence detection, keyword detection and explosion-sound detection.
Fig. 8 shows an application embodiment of the speech and music detection module of the present invention. Part (A) of Fig. 8 shows the original content before the music and speech signals are processed: the speech signal (solid line) is mixed with the music signal (dotted line), and when listened to on a mobile device the important spoken content is easily missed because the background music interferes with the audio.
Parts (B) and (C) of Fig. 8 are the analysis results of the speech and music detection module: after detection, the speech and the music are captured and separated into distinct signals. Part (D) of Fig. 8 is the enhanced audio result: based on the analysis, the speech signal is amplified and the music signal attenuated, and after judgment by the decision unit and processing by the multimedia processing unit, the speech and music signals are merged into the final result. Because the speech signal has been strengthened, the processed audio suffers less interference from background music when the multimedia is enjoyed on a mobile device.
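The enhancement-and-remix step of Fig. 8(D) amounts to applying gains to the separated tracks before mixing them back together, as in this sketch; the gain values are illustrative, not taken from the patent.

```python
import numpy as np

def remix(speech, music, speech_gain_db=6.0, music_gain_db=-9.0):
    """speech, music: float sample arrays of equal length in [-1, 1].
    Boost speech, attenuate music, then merge (Fig. 8 (B)+(C) -> (D))."""
    g_speech = 10 ** (speech_gain_db / 20)
    g_music = 10 ** (music_gain_db / 20)
    mixed = g_speech * speech + g_music * music
    return np.clip(mixed, -1.0, 1.0)   # keep the merged signal in range
```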
Fig. 9 shows an architecture diagram of an embodiment of the text analysis subunit of the present invention. The embodiment discloses a text analysis subunit 92 comprising at least a semantic tagging module 921 and a text segmentation module 923. The semantic tagging module 921 tags the meaning of the subtitle text, and the text segmentation module 923 analyzes the relations between the subtitle words and phrases and divides them into segments.
After the original subtitle data 901 of the multimedia content is analyzed by the semantic tagging module 921 and the text segmentation module 923, an analysis result comprising a number of keywords and subtitle break points is obtained. This result is passed to the decision unit 94, which, according to the environment parameters 947 and integrating the video analysis result 941, the audio analysis result 943 and the manual analysis result 945, decides the display mode and display position of the subtitle text under defined rules such as avoiding covering important video regions, keeping the subtitles synchronized with the audio, and deciding the subtitle size according to the display capability of the player. For this part, reference may be made to the embodiments of Fig. 10 to Fig. 12 to further understand the text analysis provided by the present invention.
For the semantic tagging and text segmentation methods of the present invention, reference may be made to United States Patent No. 6,311,152, System for Chinese Tokenization and Named Entity Recognition; persons in the technical field of the present invention can, by consulting it, understand semantic tagging and text segmentation methods applicable to the present invention. The text analysis subunit 92 of the present invention is not limited to the modules mentioned in this embodiment and may comprise other text analysis modules that achieve the object of the invention; in one embodiment, the text analysis subunit may comprise keyword detection.
Please refer to Fig. 10, which shows an embodiment of the present invention applying text analysis to subtitle text. After the subtitle text is analyzed by the text analysis unit, several keywords are extracted from it, and the subtitles are presented by displaying the keywords; a keyword here may be a single keyword or a keyword group. Part (A) of Fig. 10 shows the original image with its full subtitles, and part (B) of Fig. 10 shows the improved subtitle result. As the figure shows, when keywords are not used and the whole subtitle is displayed, the displayed subtitle text is too small and crowded to read easily. When the present invention is adopted and the subtitle keywords are displayed in place of the complete subtitle, the font of the subtitle text can be enlarged and recognized easily, and the user can readily infer the full original meaning of the subtitle from the keywords.
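A toy stand-in for this keyword display is sketched below: it keeps only the most informative words of a long subtitle line using a stop-word filter, whereas the patent's own pipeline would rely on the semantic tagging module. The word list and the example line are invented.

```python
STOPWORDS = {"the", "a", "an", "to", "of", "and", "is", "in", "on", "it"}

def subtitle_keywords(line, max_words=4):
    """Reduce a subtitle line to a short keyword display."""
    words = [w.strip(".,!?").lower() for w in line.split()]
    content = [w for w in words if w and w not in STOPWORDS]
    return content[:max_words]

print(subtitle_keywords("The bridge is about to collapse, get everyone out!"))
# ['bridge', 'about', 'collapse', 'get']
```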
Fig. 11 shows another embodiment of the present invention applying text analysis to subtitle text: after analysis by the text analysis unit, the subtitle text is cut according to its meaning and presented segment by segment in short, quickly shown pieces.
Part (A) of Fig. 11 shows the original subtitle display, and parts (B), (C), (D) and (E) of Fig. 11 show the improved subtitle results. As the figure shows, if a subtitle line is too long, it covers too much of the picture and the text is too small and crowded to read easily. The present invention instead uses text analysis to cut the originally long subtitle into several segments, which are displayed one by one with the pictures they relate to.
Please refer to Fig. 12, another embodiment of the present invention applying text analysis to subtitle text: after the subtitle text is analyzed by the text analysis unit, the decision unit combines the result with the video analysis result to judge the display position of the subtitle text. Part (A) of Fig. 12 shows the baseline subtitle display, and part (B) of Fig. 12 shows the improved result.
As Fig. 12(A) shows, a font enlarged enough to be read clearly on a mobile-device screen often covers important objects in the picture. This embodiment therefore combines shot detection and face detection from the video analysis and, according to predefined rules, avoids faces and moving objects: the display position of the subtitle text, instead of being fixed, is changed so that the subtitles rest on still objects or on the background, as in Fig. 12(B), reducing the subtitles' impact on the image to a minimum.
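The placement rule of Fig. 12(B) can be sketched as trying candidate regions and picking the first one that does not overlap any detected face box; the box format and the candidate list below are assumptions for illustration.

```python
def overlaps(a, b):
    """Axis-aligned boxes as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def place_subtitle(frame_w, frame_h, faces, strip_h=60):
    candidates = [
        (0, frame_h - strip_h, frame_w, strip_h),  # traditional bottom strip
        (0, 0, frame_w, strip_h),                  # top strip
        (0, (frame_h - strip_h) // 2, frame_w // 2, strip_h),  # mid-left
    ]
    for c in candidates:
        if not any(overlaps(c, f) for f in faces):
            return c
    return candidates[0]  # fall back to the bottom strip

print(place_subtitle(640, 480, faces=[(200, 400, 150, 80)]))
# the bottom strip overlaps the face, so the top strip (0, 0, 640, 60) wins
```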
Please refer to Fig. 13 for an embodiment of the decision unit of the present invention.
In this embodiment, the analysis results of the content analysis unit and the environment parameters 100 are fed into the decision unit 130; after data integration (step S131), each signal is given its own decision handling according to the defined rules.
For the video part, it is judged whether the condition of a special clip is met (step S132), for example whether the object and the background differ greatly in scale, or whether the user's gaze focuses on only part of the image.
If the special-clip condition is not met (No), the scaling ratio is decided according to the environment parameters (step S136); otherwise (Yes), clip arrangement is carried out (step S135), for example scaling the object and the background at different ratios, or trimming the edges of the background video.
For the subtitle part, the subtitle text size is first decided according to the predefined rules (step S133), and it is then judged whether the subtitle display position overlaps an object in the video, to decide whether the display position should be rearranged (step S137).
If no object is affected (No), the subtitles are shown in the traditional position (step S139), which is usually below the screen; if an object is affected (Yes), the video analysis result is consulted and the subtitles are shown at a particular position in the image (step S138), for example where the background is.
For the audio part, it is judged according to the predefined rules whether speech enhancement is to be carried out (step S134).
If no speech enhancement is needed (No), the audio part is left unmodified; otherwise (Yes), the decision unit 130 decides the scaling factors for speech and music and issues different scaling instructions for the speech and the music parts (step S140).
The results produced by the above procedures for video, subtitles and audio are integrated into description instructions (step S141), and the final result is output 120 in the form of a description instruction set.
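The decision flow of steps S131 to S141 can be condensed into a rule function that emits description instructions rather than applying them, as in the following sketch; the rule thresholds, field names and instruction format are invented for illustration.

```python
def decide_instructions(video, subtitles, audio, env):
    ops = []
    # S132 / S135 / S136: special clip -> clip arrangement, else uniform scale
    if video.get("special_clip"):
        ops.append({"op": "clip_arrange", "mode": "unequal_scale"})
    else:
        ops.append({"op": "scale", "factor": env["screen_width"] / video["width"]})
    # S133 / S137-S139: subtitle size first, then placement
    ops.append({"op": "subtitle_size", "pt": 18 if env["screen_width"] < 640 else 28})
    where = "background_region" if subtitles.get("overlaps_object") else "bottom"
    ops.append({"op": "subtitle_pos", "where": where})
    # S134 / S140: speech enhancement with separate speech/music gains
    if audio.get("needs_speech_boost"):
        ops.append({"op": "gain", "speech_db": +6, "music_db": -9})
    return ops  # S141: to be integrated into a description instruction set

print(decide_instructions({"special_clip": True, "width": 1920},
                          {"overlaps_object": True},
                          {"needs_speech_boost": True},
                          {"screen_width": 480}))
```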
In summary, the content-adaptive multimedia processing system provided by the present invention combines video analysis, audio analysis and text analysis to process multimedia content for a specific playback device, so that the multimedia content produced keeps the viewing enjoyment of the original multimedia content while losing no key content, keeping the sound steady, and keeping the subtitle text easy to read.
The above is only a preferred embodiment of the present invention. It should be pointed out that those skilled in the art can make further improvements and modifications without departing from the principle of the invention, and such improvements and modifications should also be regarded as falling within the scope of protection of the present invention.

Claims (10)

1. A content-adaptive multimedia processing system, characterized in that the system comprises:
a content analysis unit, which receives multimedia data and analyzes the content of the multimedia data, the content analysis unit comprising:
a video analysis subunit for analyzing the video content of the multimedia data;
a text analysis subunit for analyzing the text content of the multimedia data;
an audio analysis subunit for analyzing the audio content of the multimedia data;
a decision unit, which determines a processing scheme according to the analysis results of the content analysis unit for the multimedia data; and
a multimedia processing unit, which, according to the processing scheme, performs content editing, format conversion and compressed multimedia encoding on the original multimedia data to turn it into multimedia data suited to a particular playback environment;
wherein the content analysis unit transmits the video analysis result, the audio analysis result and the text analysis result to the decision unit, and the decision unit integrates the video analysis result, the audio analysis result and the text analysis result to determine the processing scheme.
2. The content-adaptive multimedia processing system of claim 1, characterized in that the decision unit also receives the result of a manual analysis and determines the processing scheme together with the analysis results of the content analysis unit.
3. The content-adaptive multimedia processing system of claim 1, characterized in that the decision unit also receives an environment parameter and determines the processing scheme together with the analysis results of the content analysis unit, wherein the environment parameter comprises the display size and resolution of a playback device, the audio playback capability of a playback device and the decoding/playback capability of a playback device, and the processing scheme determined by the decision unit is presented as a description instruction set.
4. The content-adaptive multimedia processing system of claim 3, characterized in that the multimedia processing unit, according to the incoming description instruction set, performs content editing, format conversion and compressed multimedia encoding on the original multimedia data to turn it into suitable content, and outputs multimedia data that meets the particular playback environment.
5. The content-adaptive multimedia processing system of claim 1, characterized in that the video analysis subunit comprises one of the following modules or a combination of several of them:
a shot detection module, which segments the video data of the multimedia data, video content of the same shot being grouped, after analysis by the shot detection module, into the same passage;
an attention model module, which uses the distribution of various features in the images of the multimedia data to build an attention model according to characteristics of human visual perception, and thereby picks out the relatively important parts of the multimedia data;
a retargeting module, which uses the distribution of energy in the images of the multimedia data to decide whether the subject belongs to the foreground or the background of the multimedia data;
an eye/gaze tracking module, which records the position track of a viewer's eyes and pupils and thereby derives the regions on which the viewer's gaze dwells while watching the multimedia data; and
a face detection module, which detects the faces in the video, the result providing the decision unit with a processing scheme for the multimedia data.
6. The content-adaptive multimedia processing system of claim 1, characterized in that the text analysis subunit comprises one of the following modules or a combination of several of them:
a semantic tagging module, which tags the meaning of the subtitle text in the multimedia data; and
a text segmentation module, which analyzes the relations between the subtitle words and phrases and divides them into segments;
wherein, after the subtitle data of the multimedia content is analyzed by the semantic tagging module and the text segmentation module, an analysis result comprising a number of keywords and subtitle break points is obtained and passed to the decision unit, which determines the display mode and display position of the subtitle text in the multimedia data.
7. The content-adaptive multimedia processing system of claim 1, characterized in that the audio analysis subunit comprises one of the following modules or a combination of several of them:
a speech and music detection module, which classifies and separates the speech and the music in the audio of the multimedia data, wherein the analysis result of the speech and music detection module, after being judged by the decision unit, causes the multimedia processing unit to strengthen the speech signal in the original multimedia data or weaken the music signal; and
a speaker recognition module, which distinguishes the different speakers in the audio, wherein the analysis result of the speaker recognition module, after being judged by the decision unit, causes the multimedia processing unit to switch the focus of the video in the original multimedia data between different speakers.
8. A content-adaptive multimedia processing method, characterized in that the method comprises:
inputting the analysis results of the content analysis unit and an environment parameter to the decision unit, so that different processing schemes are produced according to the defined rules;
judging whether the video part of the multimedia data meets the condition of a special clip;
if the special-clip condition is not met, deciding the scaling ratio according to the environment parameter;
if the special-clip condition is met, carrying out a clip arrangement;
deciding, according to predefined rules, the size of the text in the subtitle part of the multimedia data;
judging whether the subtitle display position overlaps an object in the video of the multimedia data, to decide whether the display position should be rearranged;
if it is judged that no object is affected, showing the subtitles in the traditional position;
if it is judged that an object is affected, showing the subtitles at a particular position in the image with reference to the video analysis result;
whereby the results produced by the processing procedures for the video part, the audio part and the subtitle part of the multimedia data are each integrated into a description instruction and output in the form of a description instruction set.
9. The content-adaptive multimedia processing method of claim 8, characterized in that the multimedia processing method further comprises:
judging, according to predefined rules, whether the audio part of the multimedia data needs speech enhancement;
if no speech enhancement is needed, leaving the audio part unmodified;
if speech enhancement is needed, the decision unit deciding the scaling factors for speech and music and issuing different scaling instructions for the speech and the music parts.
10. The content-adaptive multimedia processing method of claim 8, characterized in that the clip arrangement comprises:
scaling the object and the background in the images of the multimedia data at different ratios; and
trimming the edges of the background video in the images of the multimedia data.
CN2009102636148A 2009-12-23 2009-12-23 Content-based adaptive multimedia processing system and method Active CN102111601B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2009102636148A CN102111601B (en) 2009-12-23 2009-12-23 Content-based adaptive multimedia processing system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2009102636148A CN102111601B (en) 2009-12-23 2009-12-23 Content-based adaptive multimedia processing system and method

Publications (2)

Publication Number Publication Date
CN102111601A true CN102111601A (en) 2011-06-29
CN102111601B CN102111601B (en) 2012-11-28

Family

ID=44175607

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2009102636148A Active CN102111601B (en) 2009-12-23 2009-12-23 Content-based adaptive multimedia processing system and method

Country Status (1)

Country Link
CN (1) CN102111601B (en)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103927767A (en) * 2014-04-18 2014-07-16 北京智谷睿拓技术服务有限公司 Image processing method and device
CN104092957A (en) * 2014-07-16 2014-10-08 浙江航天长峰科技发展有限公司 Method for generating screen video integrating image with voice
CN104363505A (en) * 2014-11-17 2015-02-18 天脉聚源(北京)传媒科技有限公司 Method and device for displaying playing interface
CN104469178A (en) * 2013-09-25 2015-03-25 联想(北京)有限公司 Image display method and electronic device
CN107302721A (en) * 2017-08-03 2017-10-27 深圳Tcl数字技术有限公司 Adjusting method, television set and the readable storage medium storing program for executing of video dialogue track frequency
CN107948206A (en) * 2018-01-02 2018-04-20 联想(北京)有限公司 A kind of multi-medium data download/or the method and system uploaded
CN108021544A (en) * 2016-10-31 2018-05-11 富士通株式会社 The method, apparatus and electronic equipment classified to the semantic relation of entity word
CN108305308A (en) * 2018-01-12 2018-07-20 北京蜜枝科技有限公司 It performs under the line of virtual image system and method
CN108366305A (en) * 2018-02-07 2018-08-03 深圳佳力拓科技有限公司 A kind of code stream without subtitle shows the method and system of subtitle by speech recognition
CN108924636A (en) * 2018-06-29 2018-11-30 北京优酷科技有限公司 Caption presentation method and device
CN109120950A (en) * 2018-09-30 2019-01-01 北京金山安全软件有限公司 Video splicing method and device, terminal equipment and storage medium
CN110351605A (en) * 2019-08-15 2019-10-18 海信电子科技(深圳)有限公司 Method for processing caption and device
CN110719436A (en) * 2019-10-17 2020-01-21 浙江同花顺智能科技有限公司 Conference document information acquisition method and device and related equipment
CN111491176A (en) * 2020-04-27 2020-08-04 百度在线网络技术(北京)有限公司 Video processing method, device, equipment and storage medium
WO2020244553A1 (en) * 2019-06-06 2020-12-10 北京字节跳动网络技术有限公司 Subtitle border-crossing processing method and apparatus, and electronic device
CN113297824A (en) * 2021-05-11 2021-08-24 北京字跳网络技术有限公司 Text display method and device, electronic equipment and storage medium
WO2022088776A1 (en) * 2020-10-30 2022-05-05 北京达佳互联信息技术有限公司 Video displaying method and video displaying device

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7260261B2 (en) * 2003-02-20 2007-08-21 Microsoft Corporation Systems and methods for enhanced image adaptation
US7259785B2 (en) * 2003-04-28 2007-08-21 Hewlett-Packard Development Company, L.P. Digital imaging method and apparatus using eye-tracking control
CN101021857A (en) * 2006-10-20 2007-08-22 鲍东山 Video searching system based on content analysis
CN101242474A (en) * 2007-02-09 2008-08-13 中国科学院计算技术研究所 A dynamic video browse method for phone on small-size screen
JP2009260818A (en) * 2008-04-18 2009-11-05 Nec Corp Server apparatus, content distribution method, and program
CN101526952A (en) * 2009-01-19 2009-09-09 北京跳网无限科技发展有限公司 UA adaptation technology for identifying attribute of mobile phone terminal
CN101539929B (en) * 2009-04-17 2011-04-06 无锡天脉聚源传媒科技有限公司 Method for indexing TV news by utilizing computer system

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104469178A (en) * 2013-09-25 2015-03-25 联想(北京)有限公司 Image display method and electronic device
US10123024B2 (en) 2014-04-18 2018-11-06 Beijing Zhigu Rui Tuo Tech Co., Ltd Image processing methods and image processing apparatuses
CN103927767A (en) * 2014-04-18 2014-07-16 北京智谷睿拓技术服务有限公司 Image processing method and device
CN103927767B (en) * 2014-04-18 2018-05-04 北京智谷睿拓技术服务有限公司 Image processing method and image processing apparatus
CN104092957A (en) * 2014-07-16 2014-10-08 浙江航天长峰科技发展有限公司 Method for generating screen video integrating image with voice
CN104092957B (en) * 2014-07-16 2017-07-11 浙江航天长峰科技发展有限公司 A kind of screen video generation method for merging portrait and voice
CN104363505A (en) * 2014-11-17 2015-02-18 天脉聚源(北京)传媒科技有限公司 Method and device for displaying playing interface
CN104363505B (en) * 2014-11-17 2017-05-31 天脉聚源(北京)传媒科技有限公司 A kind of method and device for showing broadcast interface
CN108021544B (en) * 2016-10-31 2021-07-06 富士通株式会社 Method and device for classifying semantic relation of entity words and electronic equipment
CN108021544A (en) * 2016-10-31 2018-05-11 富士通株式会社 The method, apparatus and electronic equipment classified to the semantic relation of entity word
CN107302721A (en) * 2017-08-03 2017-10-27 深圳Tcl数字技术有限公司 Adjusting method, television set and the readable storage medium storing program for executing of video dialogue track frequency
CN107302721B (en) * 2017-08-03 2020-10-02 深圳Tcl数字技术有限公司 Video-to-white audio track frequency adjusting method, television and readable storage medium
CN107948206A (en) * 2018-01-02 2018-04-20 联想(北京)有限公司 A kind of multi-medium data download/or the method and system uploaded
CN108305308A (en) * 2018-01-12 2018-07-20 北京蜜枝科技有限公司 It performs under the line of virtual image system and method
CN108366305A (en) * 2018-02-07 2018-08-03 深圳佳力拓科技有限公司 A kind of code stream without subtitle shows the method and system of subtitle by speech recognition
CN108924636A (en) * 2018-06-29 2018-11-30 北京优酷科技有限公司 Caption presentation method and device
CN109120950A (en) * 2018-09-30 2019-01-01 北京金山安全软件有限公司 Video splicing method and device, terminal equipment and storage medium
WO2020244553A1 (en) * 2019-06-06 2020-12-10 北京字节跳动网络技术有限公司 Subtitle border-crossing processing method and apparatus, and electronic device
US11924520B2 (en) 2019-06-06 2024-03-05 Beijing Bytedance Network Technology Co., Ltd. Subtitle border-crossing processing method and apparatus, and electronic device
CN110351605B (en) * 2019-08-15 2021-05-25 海信电子科技(深圳)有限公司 Subtitle processing method and device
CN110351605A (en) * 2019-08-15 2019-10-18 海信电子科技(深圳)有限公司 Method for processing caption and device
CN110719436A (en) * 2019-10-17 2020-01-21 浙江同花顺智能科技有限公司 Conference document information acquisition method and device and related equipment
CN111491176A (en) * 2020-04-27 2020-08-04 百度在线网络技术(北京)有限公司 Video processing method, device, equipment and storage medium
WO2022088776A1 (en) * 2020-10-30 2022-05-05 北京达佳互联信息技术有限公司 Video displaying method and video displaying device
CN113297824A (en) * 2021-05-11 2021-08-24 北京字跳网络技术有限公司 Text display method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN102111601B (en) 2012-11-28

Similar Documents

Publication Publication Date Title
CN102111601B (en) Content-based adaptive multimedia processing system and method
US8326623B2 (en) Electronic apparatus and display process method
US10542323B2 (en) Real-time modifiable text captioning
US20020140718A1 (en) Method of providing sign language animation to a monitor and process therefor
EP3100459B1 (en) Methods and apparatus to synchronize second screen content with audio/video programming using closed captioning data
CN107193841A (en) Media file accelerates the method and apparatus played, transmit and stored
CN106021496A (en) Video search method and video search device
US9767825B2 (en) Automatic rate control based on user identities
CN102055941A (en) Video player and video playing method
JP5637930B2 (en) Interest section detection device, viewer interest information presentation device, and interest section detection program
US10782928B2 (en) Apparatus and method for providing various audio environments in multimedia content playback system
US8150168B2 (en) Electronic apparatus and image display control method of the electronic apparatus
JP2014123818A (en) Viewer image display control apparatus, viewer image display control method, and viewer image display control program
CN110087127A (en) Metadata associated with currently playing TV programme is identified using audio stream
US9558784B1 (en) Intelligent video navigation techniques
KR20190083532A (en) System for learning languages using the video selected by the learners and learning contents production method thereof
TW201102836A (en) Content adaptive multimedia processing system and method for the same
Duarte et al. Multimedia accessibility
Umamaheswaran et al. Caption positioning structure for hard of hearing people using deep learning method
KR20140084463A (en) Apparatus and method for displaying image of narrator information and, server for editing video data
CN107016949A (en) Information displaying method, device and its equipment
Meza et al. Between science popularization and motivational infotainment: Visual production, discursive patterns and viewer perception of TED Talks videos
US20100289953A1 (en) System and method for processing multimedia data using an audio-video link
CN111415635B (en) Large-screen display method, device, medium and electronic equipment
Friedland et al. Using artistic markers and speaker identification for narrative-theme navigation of seinfeld episodes

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20231028

Address after: Unit 2, Section Nine, Lyon Road Bride Brice, Middlesex, UK

Patentee after: Gorilla Technology (UK) Ltd.

Address before: Taiwan, Taipei, China

Patentee before: GORILLA TECHNOLOGY Inc.

TR01 Transfer of patent right