WO2016107965A1 - An apparatus, a method, a circuitry, a multimedia communication system and a computer program product for selecting field-of-view of interest - Google Patents

An apparatus, a method, a circuitry, a multimedia communication system and a computer program product for selecting field-of-view of interest Download PDF

Info

Publication number
WO2016107965A1
WO2016107965A1 (PCT/FI2015/050861)
Authority
WO
WIPO (PCT)
Prior art keywords
audio
visual
salient
keyword
visual content
Prior art date
Application number
PCT/FI2015/050861
Other languages
French (fr)
Inventor
Marja Salmimaa
Toni JÄRVENPÄÄ
Miikka Vilermo
Arto Lehtiniemi
Original Assignee
Nokia Technologies Oy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nokia Technologies Oy filed Critical Nokia Technologies Oy
Publication of WO2016107965A1 publication Critical patent/WO2016107965A1/en

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/167 Position within a video image, e.g. region of interest [ROI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83 Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/84 Generation or processing of descriptive data, e.g. content descriptors
    • H04N21/8405 Generation or processing of descriptive data, e.g. content descriptors represented by keywords
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/44 Event detection
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L2015/088 Word spotting

Definitions

  • An apparatus for selecting field-of-view of interest
  • Various embodiments relate to imaging devices, audio-visual capture and multimedia representation.
  • Portable imaging devices have become more common recently. Such devices include, for example, game consoles, personal computers, tablet computers and smart phones. However, capturing images and video streams with these devices is not always easy. Controlling them may be slow and cumbersome, so that when a user is trying to take an image or a video, it is focused on the wrong place or the result is otherwise not optimal. This is evident when shooting video of an event where the action is fast and can change place very quickly; many sport events are of this kind.
  • rudimentary scaling of the media resolution to the display resolution does not maintain the subjective quality.
  • rudimentary scaling of the media resolution to the display resolution may result in a content where essential parts are cropped.
  • the known art does not provide an efficient way to recognize and label salient highlights in a media stream.
  • a method for detecting a salient highlight in the captured content according to an embodiment uses an apparatus comprising a camera, a microphone and an arrangement for detecting the movement and the direction of the apparatus.
  • the method comprises steps where an audio-visual content is captured and ambient sound is detected from said audio-visual content.
  • the method further comprises steps where at least one feature is searched and recognized from said ambient sound, and correspondence between said feature and at least one keyword is searched, at least one visual pattern is searched and recognized from said audiovisual content, and a correspondence between said pattern and the at least one keyword is searched and movement and directional information of the apparatus used for capturing said audio-visual content is gathered and analyzed.
  • the method further comprises a step of determining whether there is a salient highlight in the audio-visual content based on at least one of: contextual data relating to the situation where the content is to be captured, movement of the apparatus, directional information of the apparatus, the at least one keyword corresponding to features found from the ambient sound, or the at least one visual pattern. If the salient highlight is found and recognized, said audio-visual content is labelled based on the at least one keyword connected to the recognized salient highlight event.
  • the said labelled salient highlight of audio-visual content is encoded and/or compressed using different rules than the audio-visual content that is not labelled.
  • the ambient audio feature comprises speech, music, rhythm, musical genre, timbre, instrument, loudness, spectrum, zero-crossing rate or other musical or acoustic characteristics, or a change of sound pressure, or any combination thereof.
  • the visual pattern comprises at least one stationary construction of the area of the interest or the visual pattern is comprised of moving objects or the visual pattern is a combination of both.
  • the certain predetermined movement and directional information of the apparatus indicates a possibility of the salient highlight event.
  • the target area of imaging comprises an object or objects that correspond to a certain keyword.
  • the contextual data comprises at least one of the following: information of the event, location, time or preference of the user.
  • An apparatus comprises a camera, a microphone, an accelerometer or a movement sensing sensor for detecting the movement of the apparatus, an arrangement for detecting the direction of the apparatus, at least one memory and at least one processor, and the apparatus is arranged to capture an audio-visual content with the camera and the microphone.
  • the apparatus further comprises an arrangement for detecting ambient sound and an arrangement for detecting at least one visual pattern recognized from said audio-visual content.
  • the apparatus is arranged to identify at least one keyword that corresponds to the detected ambient sound and at least one visual pattern; contextual data relating to the situation where the content is to be captured is arranged to be stored in the memory of the apparatus; and the apparatus is arranged, based on at least one of the contextual data, the movement and directional information, the at least one keyword corresponding to features found from the ambient sound or the at least one visual pattern, to determine whether there is the salient highlight in the audio-visual content. If the salient highlight is found and recognized, it is labelled based on the at least one keyword connected to the recognized salient highlight event.
  • the movement sensing sensor may be a 9 degrees-of-freedom sensor.
  • the arrangement for detecting ambient sound comprises, at least partly, computer program instructions configured to, with the at least one processor, cause the apparatus to detect the ambient sound.
  • the arrangement for detecting ambient sound is arranged to recognize at least one of the following: speech, music, rhythm, musical genre, timbre, instrument, loudness, spectrum, zero-crossing rate or other musical or acoustic characteristics, or a change of sound pressure, or any combination thereof.
  • the arrangement for detecting at least one visual pattern comprises, at least partly, computer program instructions configured to, with the at least one processor, cause the apparatus to detect visual patterns.
  • the arrangement for detecting at least one visual pattern is arranged to recognize visual patterns that are stationary or moving or a combination of both.
  • the arrangement for detecting the direction of the apparatus comprises at least one of the following: a compass, a magnetometer, a visual direction finder, a beacon data receiver or another direction finder device.
  • the apparatus is arranged to focus or zoom to a particular point of the area of the interest when capturing said audio-visual content based on the recognized salient highlight event that relates to said point.
  • the apparatus comprises a portable computing device comprising a smart phone, a tablet computer or a gaming console.
  • a salient highlight detection circuitry is configured as follows.
  • the salient highlight detection circuitry is configured to detect when an audio-visual content is captured, detect an ambient sound from said audio-visual content, search and recognize at least one feature from said ambient sound and search a correspondence between said at least one feature and at least one keyword, and search and recognize at least one visual pattern from said audio-visual content and search a correspondence between said pattern and the at least one keyword.
  • the salient highlight detection circuitry is further configured to gather and analyze movement and directional information of the apparatus that has captured said audio-visual content and to store contextual data relating to the situation where the content is to be captured.
  • the salient highlight detection circuitry is further configured to determine whether there is the salient highlight in the audio-visual content based on at least one of the contextual data, the movement and directional information of the apparatus, and the at least one keyword corresponding to features found from the ambient sound and the at least one visual pattern, and, if the salient highlight is found and recognized, to label it based on the at least one keyword connected to the recognized salient highlight event.
  • a multimedia communication system comprises a source, which is arranged to provide a signal captured by an imaging sensor and a microphone; an encoder, which is arranged to encode the captured signal into a coded media bit stream; a sender, which is arranged to send the coded media bit stream using a communication protocol stack; a receiver, which is arranged to modify the transferred coded media bit stream into a coded media stream; a decoder, which is arranged to process the coded media stream into one or more uncompressed media streams; and a renderer, which is arranged to render the uncompressed media stream.
  • the multimedia communication system is arranged to perform the method steps according to the method defined earlier between the source and the encoder, or partly or wholly in the encoder.
  • the coded bit stream is arranged to be stored in a storage.
  • the multimedia communication system comprises a gateway which is arranged to perform functions for transferring the coded media bit stream between the sender and the receiver.
  • the multimedia communication system comprises a recording storage where the coded media stream is stored.
  • the format of the coded media bit stream is an elementary self-contained bit stream, a packet stream format or one or more coded media bit streams encapsulated into a container file.
  • the encoder is arranged to process the captured signal by using two or more different bitrates thus producing two or more different coded media bit streams.
  • different portions from the said coded media bit streams are chosen based on the labeled data i.e. the salient highlight recognition.
  • the renderer is arranged to process the uncompressed media stream based on the labeled data i.e. the salient highlight.
  • the processing includes at least one of the following: focusing, choosing the field of view or zooming.
  • a computer program product comprises program instructions which, when executed on an apparatus comprising at least one processor and at least one memory, cause the apparatus to capture an audio-visual content and detect an ambient sound from said audio-visual content.
  • the computer program product further comprises computer program instructions which cause the apparatus to search and recognize a feature or features from said ambient sound and search a correspondence between said feature and keywords, search and recognize a visual pattern or patterns from said audio-visual content and search a correspondence between said pattern and keywords, and choose the keyword or keywords based on the correspondence between said ambient sound and patterns and keywords.
  • the computer program product further comprises computer program instructions which cause the apparatus to receive and store contextual data relating to the situation where the content is to be captured and to gather movement and directional information of an apparatus that is arranged to capture said audio-visual content.
  • the computer program product further comprises computer program instructions which cause the apparatus to determine whether there is a salient highlight in the audio-visual content based on at least one of the contextual data, movement and directional information of the apparatus, at least one keyword corresponding to features found from the ambient sound or at least one keyword corresponding to visual patterns found from the audio-visual content.
  • a non-transitory computer readable storage medium comprises computer-executable components.
  • the computer-executable components comprise a computer readable code for capturing an audio-visual content, a computer readable code for detecting an ambient sound from said audio-visual content, computer readable code for searching and recognizing at least one feature from said ambient sound, and searching correspondence between said at least one feature and at least one keyword, and computer readable code for searching and recognizing at least one visual pattern from said audio-visual content, and searching correspondence between said pattern and at least one keyword.
  • the computer-executable components further comprise computer readable code for choosing at least one keyword based on the correspondence between said ambient sound and patterns and keywords, computer readable code for receiving and storing contextual data relating to the situation where the content is to be captured, and computer readable code for gathering movement and directional information of an apparatus that is arranged to capture said audio-visual content.
  • the computer-executable components further comprise computer readable code for determining whether there is a salient highlight in the audio-visual content based on at least one of the contextual data, the movement and directional information of the apparatus, the at least one keyword corresponding to features found from the ambient sound or the at least one keyword corresponding to visual patterns found from the audio-visual content.
  • An apparatus comprises at least one processor and at least one memory comprising computer program code.
  • the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus to: capture an audio-visual content, detect ambient sound from said audio-visual content, search and recognize at least one feature from said ambient sound, search a correspondence between said at least one feature and at least one keyword, search and recognize at least one visual pattern from said audio-visual content, search a correspondence between said at least one pattern and the at least one keyword.
  • the at least one memory and the computer program code are further configured to, with the at least one processor, cause the apparatus to: gather and analyze movement and directional information of an apparatus used for capturing said audio-visual content; determine whether there is the salient highlight in the audio-visual content, based on at least one of contextual data relating to a situation where the content is to be captured, movement of the apparatus, directional information of the apparatus, the at least one keyword corresponding to features found from the ambient sound, or the at least one visual pattern; and, if the salient highlight is found and recognized, label it based on the at least one keyword connected to the recognized salient highlight event.
  • FIG. 1 shows a system, in accordance with an embodiment
  • FIG. 2 shows an apparatus, in accordance with an embodiment
  • Figure 3 shows an example of an apparatus according to another embodiment
  • Figure 4 shows an example of a computing unit according to figure 3
  • Figure 5 shows a flow chart according to an embodiment
  • Figure 6 shows a second example of a method according to an embodiment as a flow chart
  • Figure 7 shows an example of a multimedia communication system according to an embodiment. Detailed description of Figures
  • Figure 1 shows a system in accordance with an embodiment.
  • In the system there is a camera 101 and a microphone that produce an audio-visual content.
  • the audio-visual content is pre-processed in the phase 102 and post-processed in the phase 103.
  • the method according to an embodiment is performed in the pre-processing phase 102.
  • the audiovisual content is processed based on the information about the salient highlights.
  • an additional data 104 is used.
  • in figure 1, some examples of what said additional data could contain are presented.
  • the additional data may comprise metadata, camera parameters, speech recognition results, audio recognition results, pattern recognition results, device movement and direction, and contextual data. These data could be used in both processing phases.
  • Metadata and camera parameters relate to the audio-visual content produced by the camera. They may vary depending, for example, on what format is used. Speech recognition results are recognized words. Audio recognition results could be music, a song or another ambient sound such as cheering or similar. Audio recognition results could also be rhythm, musical genre, timbre, instrument, loudness, spectrum, zero-crossing rate or other characteristics. Also changes in the sound pressure can be detected. Pattern recognition results are visual features that relate to the event that is imaged by the camera. These features may include, for example, elements and patterns relating to the event such as lines, structures, players, speed of objects, position of a player, etc. It must be noted that these visual features may be stationary or moving objects or a combination of both. Device movement and direction are obtained with an accelerometer or other sensors. Also positioning methods and a compass may be used. Contextual data relates to the event which is imaged. It could include all kinds of information about the event, location, time, user preferences and other information. Usually the contextual data is not extracted from the audio-visual content but is obtained by some other means.
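  • As a minimal sketch (assuming a Python implementation; all field names are illustrative, not defined in the application), the additional data 104 described above could be grouped into a single structure that both processing phases consume:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class AdditionalData:
    """Illustrative container for the additional data 104; names are hypothetical."""
    metadata: Dict[str, str] = field(default_factory=dict)             # e.g. container format, timestamps
    camera_parameters: Dict[str, float] = field(default_factory=dict)  # e.g. focal length, exposure
    speech_keywords: List[str] = field(default_factory=list)           # recognized words
    audio_events: List[str] = field(default_factory=list)              # e.g. 'goal song', 'cheering'
    visual_patterns: List[str] = field(default_factory=list)           # e.g. 'goal', 'line', 'player'
    device_motion: List[Tuple[float, float, float]] = field(default_factory=list)  # accelerometer samples
    device_direction: Tuple[float, float] = (0.0, 0.0)                 # azimuth, elevation in degrees
    contextual_data: Dict[str, str] = field(default_factory=dict)      # event, location, time, preferences
```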
  • In FIG. 2, an apparatus in accordance with an embodiment is shown.
  • An imaging device 201 produces a video stream which is processed digitally 203.
  • a microphone 202 produces an audio stream which is processed digitally 204.
  • the block 205 in figure 2 describes the gathering and processing of the data that is used for determining whether there is a salient highlight in the audio-visual content.
  • the salient highlights are determined in the block 215.
  • the recognized segment, for example, the highlight can be further processed in a block 216.
  • the block 205 could be a combination of a processor and a memory and the processor can be configured to perform the functions performed by the block 205. This applies to the block 215 and the further processing block 216.
  • the block 205 for gathering and processing the data that is used for determining a salient highlight comprises a visual feature determination block 206, an ambient audio feature determination block 208, a contextual feature determination block 212, a movement and directional information determination block 213 and a keyword identification block 214.
  • the keyword could be a word or a combination of words or a sentence.
  • the keywords represent some incidence, object, happening, situation or some other entity.
  • the keyword may or may not directly describe the entity it relates to i.e. a goal may have a keyword 'goal' but it is not necessary.
  • the keyword could also refer to some abstract situation.
  • the visual feature determination block 206 comprises a pattern recognition block 207.
  • the pattern recognition is used to recognize places or incidents.
  • A visual pattern is a stationary construction or constructions of the area of the interest, or the visual pattern is comprised of moving objects, or the visual pattern is a combination of both. Patterns could be colors, changes of contrast, or structures, e.g. goals, lines, players, performers. Recognized patterns are connected to a keyword or keywords that describe the pattern.
  • the pattern recognition could be a learning process, or a searching of correspondences between an imaged pattern and a database pattern, or some combination of them. Other methods could also be applied; a sketch follows after this item.
  • in the area that is the target of imaging there is an object or objects that correspond to a certain keyword.
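  • One way to search a correspondence between an imaged pattern and a database pattern, as mentioned above, is plain template matching; the sketch below uses OpenCV only for illustration (the application does not prescribe a specific algorithm), and the template image, keyword and threshold are assumptions:

```python
import cv2  # OpenCV, used here only as an illustrative pattern-matching backend

def recognize_pattern(frame_gray, template_gray, keyword, threshold=0.8):
    """Return the keyword and location if the database pattern (template) is found in the frame."""
    result = cv2.matchTemplate(frame_gray, template_gray, cv2.TM_CCOEFF_NORMED)
    _, max_val, _, max_loc = cv2.minMaxLoc(result)
    if max_val >= threshold:
        return keyword, max_loc          # e.g. ('goal', (x, y))
    return None

# Hypothetical usage: a stored grayscale image of a goal structure acts as the database pattern.
# frame = cv2.cvtColor(cv2.imread('frame.png'), cv2.COLOR_BGR2GRAY)
# goal_template = cv2.cvtColor(cv2.imread('goal_template.png'), cv2.COLOR_BGR2GRAY)
# hit = recognize_pattern(frame, goal_template, keyword='goal')
```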
  • the ambient audio feature determination block 208 detects ambient sounds from the captured audio-visual content.
  • the ambient audio feature determination comprises a speech recognition block 209, a music recognition block 210 and a sound pressure recognition block 211.
  • the analyzed sounds are ambient sounds, so there is not necessarily any direct correspondence to the captured video, i.e. the source of the sound could be outside the field of view of the imaging device.
  • the speech recognition block 209 finds and extracts words from an ambient speech that could be for example commentary during the event or similar.
  • the extracted words are used as keywords, or keywords are searched from the speech.
  • some recognized words could correspond to a keyword; for example, in a sport event some player could have a nickname that corresponds to a keyword that is the real name of the player.
  • the music recognition block 210 recognizes an ambient music which could be song or tune.
  • the name of song or tune could be a keyword.
  • some word from the lyrics could be a keyword.
  • Some song or tune could correspond to a certain keyword, for example when a certain song is played whenever a goal is scored. In that case the keyword for the song could be 'goal'.
  • the sound pressure recognition block 211 recognizes changes in sound pressure. An increasing sound pressure at an event, and especially at a sporting event, usually means that something interesting is occurring. These kinds of incidents could be, for example, the cheering of the audience or the passing of racing cars.
  • the sound pressure change could correspond to a keyword.
  • the keyword in that case could be an indication that something interesting is going on.
  • the ambient audio feature, in addition to the speech, the music or the change of sound pressure, could be any combination of those.
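  • A minimal sketch of how some of the ambient audio features above (loudness as a sound-pressure proxy, and zero-crossing rate) could be computed per frame; the frame length and rise threshold are assumptions, not values from the application:

```python
import numpy as np

def frame_features(samples, frame_len=2048):
    """Yield (rms, zero_crossing_rate) per frame of a mono audio signal scaled to [-1, 1]."""
    samples = np.asarray(samples, dtype=float)
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = float(np.sqrt(np.mean(frame ** 2)))                   # loudness / sound-pressure proxy
        zcr = float(np.mean(np.abs(np.diff(np.sign(frame)))) / 2)   # zero-crossing rate
        yield rms, zcr

def sound_pressure_rise(samples, rise_ratio=2.0):
    """Flag frames whose RMS exceeds the running average by rise_ratio (e.g. crowd cheering)."""
    history = []
    for rms, _ in frame_features(samples):
        avg = np.mean(history) if history else rms
        history.append(rms)
        yield len(history) > 1 and rms > rise_ratio * avg
```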
  • the contextual feature determination block 212 searches the contextual data information that is relevant when determining if there is a salient highlight in the captured audio-visual content.
  • the contextual data comprises at least one of the following: information of the event, location, time or preference of the user.
  • the location could be the location of the event or the location of the user i.e. where the user is situated in the spectator stand or both.
  • the contextual data can be used for selecting the keywords to be used.
  • the corresponding keyword or keywords are predetermined based on the contextual data.
  • the movement and directional information determination block 213 acquires its data from an apparatus that contains the imaging device.
  • an accelerometer or a sensor like a 9 degrees-of-freedom (9DOF) sensor is used for detecting the movements of the imaging device.
  • other equipment can be used.
  • a compass or a similar device can be used.
  • use of other methods indicating direction is possible.
  • the user swings the camera (i.e. the imaging apparatus) from one end of the field to the other end and back as the ends of the field often contain important places for action like for example goals in ice hockey, football and soccer or baskets in basketball.
  • Video analysis or contextual information may be used to verify the type of the sports event.
  • the apparatus or the system for determining the salient highlight can be arranged to learn the long-term movement of the apparatus, and thus it learns the size of the bounding box (limits to device tilt azimuth and elevation angles) that contains the important action in the sport, and the ends of the bounding box that represent the directions of the goals. This information may then be used for post-processing the video in many ways, taking advantage of the knowledge of the current shooting angle; a sketch of such bounding-box learning follows.
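  • A sketch, under the assumption that the device orientation is available as azimuth and elevation angles (the class name and smoothing factor are illustrative, not from the application), of how long-term movement could be accumulated into the bounding box described above, with the azimuth extremes taken as the goal directions:

```python
class BoundingBoxLearner:
    """Learns the tilt azimuth/elevation bounding box of the action from long-term device motion."""

    def __init__(self, smoothing=0.01):
        self.min_az = self.max_az = None
        self.min_el = self.max_el = None
        self.smoothing = smoothing

    def update(self, azimuth, elevation):
        if self.min_az is None:
            self.min_az = self.max_az = azimuth
            self.min_el = self.max_el = elevation
            return
        # Expand immediately, shrink slowly, so occasional glances outside the field decay away.
        self.min_az = min(azimuth, self.min_az + self.smoothing * (azimuth - self.min_az))
        self.max_az = max(azimuth, self.max_az + self.smoothing * (azimuth - self.max_az))
        self.min_el = min(elevation, self.min_el + self.smoothing * (elevation - self.min_el))
        self.max_el = max(elevation, self.max_el + self.smoothing * (elevation - self.max_el))

    def goal_directions(self):
        """The azimuth extremes of the bounding box approximate the directions of the goals."""
        return self.min_az, self.max_az
```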
  • certain predetermined movement and directional information of the apparatus indicates a possibility of a salient highlight event.
  • the keyword identification block 214 searches a correspondence between features recognized from the audio-visual content and keywords. If a correspondence is found, the keyword is applied or connected to the feature.
  • the keyword could be a word or an expression that is extracted from the feature like a word from a song or a name of the visual pattern that was recognized like 'goal' or 'circle'.
  • the used keyword or keywords are predetermined based on the contextual data. For example an ice hockey game and a baseball game may use different keywords.
  • the salient highlights are determined in the block 215. It receives the data produced in the block 205 and compares the extracted keywords and other information such as the movement and direction data and the contextual data. Based on these it determines whether there is a salient highlight. In the decision, different weighting functions can be used for different situations; a sketch of such a weighted decision follows. For example, if the keyword 'goal' is extracted and the direction of the apparatus (i.e. the direction of the imaging device) is towards a goal, the part is recognized as a salient highlight. For some other incident more data (keywords and other indications) may be needed for recognizing a part of the audio-visual content as a salient highlight. In some examples the keywords are predetermined based on the contextual data.
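  • As a sketch only (the cue names, weights and threshold are assumptions, not values given in the application), the weighted decision described above could combine the extracted indications as follows:

```python
def is_salient_highlight(indications, weights=None, threshold=1.0):
    """indications: dict of detected cues, e.g. {'keyword_goal': True, 'towards_goal': True}."""
    weights = weights or {
        'keyword_goal': 0.7,         # keyword 'goal' extracted from speech or song
        'towards_goal': 0.5,         # device direction inside a goal end of the bounding box
        'goal_song': 0.6,            # recognized goal song
        'sound_pressure_rise': 0.3,  # a rise alone is not enough on its own
        'visual_goal_pattern': 0.5,  # goal structure recognized in the frame
    }
    score = sum(weights.get(name, 0.0) for name, present in indications.items() if present)
    return score >= threshold

# A 'goal' keyword plus the device pointing towards a goal crosses the threshold (0.7 + 0.5),
# while a sound-pressure rise alone (0.3) does not.
```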
  • the determining process may change based on what event is the object of the imaging and what are the preferences of the user.
  • the recognized salient highlight is labelled.
  • An example technique for labelling is to mark the recognized part as a salient highlight with some marking. This could be done by modifying the metadata of the audio-visual content. It could also be executed by creating a separate description file that is transferred with the audio-visual content to the further processing.
  • the separate description file contains the information about the recognized salient highlights.
  • when labelling the recognized salient highlight, it could be given a descriptive name. This could include, for example, the name, place and time of the event and what happens in the highlight. For example, some highlight could be named 'first goal at the Arena 1st December'.
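  • One labelling technique mentioned above is a separate description file transferred with the content; a minimal sketch (the JSON layout is an assumption, not a format defined in the application):

```python
import json

def write_highlight_description(path, highlights):
    """highlights: list of dicts such as
    {'start_s': 312.4, 'end_s': 327.9, 'keywords': ['goal'],
     'name': 'first goal at the Arena 1st December'}"""
    with open(path, 'w', encoding='utf-8') as f:
        json.dump({'salient_highlights': highlights}, f, indent=2)
```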
  • an embodiment provides an efficient way to recognize and label salient highlights in a media stream. For example, this provides better results for the user when imaging an event having a lot of action at different places.
  • the recognized segment, for example the highlight, can be further processed in the block 216. This could include adaptive data compression, smart rendering, focusing or zooming.
  • labelled salient highlight of audio-visual content is encoded and/or compressed or processed using different rules than the audio-visual content that is not labelled according to embodiments.
  • a processor or a combination of processors can be configured to perform the functions performed by the previously described blocks.
  • a special processor may be assigned to do some functions that are common for several blocks. This kind of processor could for example be a signal processing processor.
  • if the apparatus containing the imaging device points to a goal and a keyword 'goal' or ambient sound interpreted as 'cheering' is detected, the reconstructed media stream may focus or zoom in to the goal. If the apparatus points to midfield, the apparatus may zoom the video out. If the apparatus points outside the bounding box, i.e. most likely towards the audience, and the ambient voice 'cheering' is detected, the apparatus may search for faces in the audience and zoom to those consecutively.
  • the reconstructed media stream focuses on the goal area of the rink or field. If the ambient voice 'cheering' is detected and a visual feature or pattern in individual or team games or sports is recognized, the focus and field of view of the reconstructed media stream may be adjusted to the visual feature or pattern.
  • the focus and field of view of the reconstructed media stream may be adjusted accordingly, for example as in the sketch below.
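  • A sketch of the rendering-side choice described above (all rules, names and the angular margin are illustrative assumptions): depending on where the device points relative to the learned bounding box and which cues are present, the reconstructed stream is zoomed to the goal, zoomed out, or panned over faces in the audience:

```python
def choose_rendering_action(azimuth, bbox, cues):
    """bbox: (min_az, max_az) learned goal directions in degrees; cues: set of detected cue names."""
    min_az, max_az = bbox
    near_goal = abs(azimuth - min_az) < 10 or abs(azimuth - max_az) < 10   # assumed 10 degree margin
    inside_field = min_az <= azimuth <= max_az
    if near_goal and ({'goal', 'cheering'} & cues):
        return 'zoom_to_goal'
    if inside_field:
        return 'zoom_out_midfield'
    if 'cheering' in cues:
        return 'pan_faces_in_audience'
    return 'no_change'
```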
  • the camera motion is tracked.
  • the motion typically goes back and forth from one goal to the other.
  • the turning points are where the goals are.
  • An average of both the turning points is calculated.
  • the zooming may be done to the center of the average turning point direction.
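  • A sketch of the turning-point tracking described above (function names are assumptions): turning points are taken where the pan direction reverses, and the zoom centre at each end is the average of the turning points at that end:

```python
def turning_points(azimuth_samples):
    """Return azimuth values where the pan direction reverses (candidate goal directions)."""
    points = []
    for prev, cur, nxt in zip(azimuth_samples, azimuth_samples[1:], azimuth_samples[2:]):
        if (cur - prev) * (nxt - cur) < 0:   # sign change of the angular velocity
            points.append(cur)
    return points

def zoom_centres(points):
    """Average the turning points at each end of the swing to obtain the two zoom centres."""
    if not points:
        return None
    mid = (max(points) + min(points)) / 2
    left = [p for p in points if p < mid]
    right = [p for p in points if p >= mid]
    return (sum(left) / len(left) if left else None,
            sum(right) / len(right) if right else None)
```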
  • the apparatus according to figure 2 executes a method that comprises capturing an audio-visual content, detecting ambient sound from said audio-visual content, searching and recognizing at least one feature from said ambient sound, searching a correspondence between said at least one feature and at least one keyword, and searching and recognizing at least one visual pattern from said audio-visual content.
  • the method further comprises searching a correspondence between said at least one visual pattern and the at least one keyword, and gathering and analyzing movement and directional information of an apparatus used for capturing said audio-visual content.
  • the method further comprises determining whether there is a salient highlight in the audio-visual content based on at least one of contextual data, movement of the apparatus, directional information of the apparatus, the at least one keyword corresponding to features found from the ambient sound, or the at least one visual pattern. If the salient highlight is found and recognized, the method comprises labelling based on the at least one keyword connected to the recognized salient highlight event.
  • the said contextual data relates to a situation, where the content is captured.
  • the apparatus comprises a camera 301, a microphone 302, a computing unit 303, a means 306 for detecting and storing the movement and the direction of the apparatus, a means 307 for receiving and storing the contextual data and a means for labelling the salient highlights 308.
  • the computing unit 303 comprises at least one processor 304 and at least one memory 305.
  • the camera and the microphone are arranged to capture audio-visual content.
  • the means for detecting the movement and/or the direction of the apparatus produces information on how the apparatus is moved and/or what the direction of the field of view is.
  • Examples of means for detecting the movement and/or the direction of the apparatus include, but are not limited to, accelerometers, sensors such as position/orientation sensors, a camera together with a processing pipeline able to detect recognizable shapes and changes in the audio-visual content (visual tracking), or magnetic sensors.
  • the direction information can be obtained, for example, by a compass or a similar device, or the information can be calculated or estimated from some additional data that could be, for example, magnetic data, visual data or beacon data.
  • the means for receiving and storing the contextual data is arranged to receive at least part of the contextual data relating to the event to be imaged before or during the event.
  • the contextual data could be at least partly inputted manually and/or the apparatus is arranged to fetch at least part of the contextual data from some outside source.
  • a processor, a memory and a communication device of the apparatus could be configured to perform the functions of the means for receiving and storing the contextual data.
  • the computing unit 303 is arranged to receive information from the camera 301 and the microphone 302.
  • the computing unit 303 is also arranged to receive information from the means 306 for detecting and storing the movement and the direction of the apparatus and the means 307 for receiving and storing the contextual data. The needed data and the instructions are stored in the memory 305.
  • the instructions are performed in the processor and the apparatus then executes the method according to an embodiment.
  • the salient highlights are determined and labelled in the means for labelling the salient highlights 308.
  • a processor could be configured to perform the functions of the means for labelling the salient highlights.
  • the said processor could be the processor 304 but, of course, it could be a separate entity.
  • the apparatus is arranged to focus or zoom to a particular point of the area of interest when capturing said audio-visual content based on the recognized salient highlight event that relates to said point.
  • the zooming based on the salient highlights decreases the possibility of cropping essential parts of content.
  • determining the salient highlights can be used for maintaining the subjective quality of the reproduced content.
  • the apparatus is a portable computing device e.g. a smart phone, a tablet computer or a gaming console or a wearable device.
  • Figure 4 shows an example of a computing unit 303 according to figure 3.
  • the computing unit comprises at least one processor 304 and a memory 305.
  • the computing unit further comprises a means for detecting ambient sound 401, a means for detecting a visual pattern or patterns 402 and a means for keyword identification 403.
  • the means for detecting ambient sound 401 is arranged to detect and recognize from the ambient sound at least one of the following: speech, music or a change of sound pressure. In speech recognition the result is a word or words.
  • the recognized word could be for example 'goal' or 'penalty'.
  • the result could be the name of the music, some word or words from the lyrics or some meaning of the played music.
  • the meaning of a certain song could be some incident like when a certain team makes a goal then the song is played.
  • in the sound pressure detection, usually a growing sound pressure indicates that something is happening.
  • the cause of the growing sound pressure could, for example, be a cheering audience or the sound of a racing car motor.
  • the means for detecting a visual pattern or patterns 402 is arranged to search and recognize visual patterns from the video data.
  • the visual patterns are stationary or moving or a combination of both.
  • the recognized pattern relates to some definition which could be a word or combination of words. This definition could be a keyword or it could relate to a keyword.
  • the functions performed by the means for detecting ambient sound 401, the means for detecting a visual pattern or patterns 402 and the means for keyword identification 403 can be performed by a processor or a combination of processors. In some embodiments the functions of said means could be performed at least partly by the processor 304.
  • the keyword identification 403 is arranged to search a correspondence between features recognized from the audio-visual content and keywords. If a correspondence is found, the keyword is applied or connected to the feature.
  • A list of the used keywords is stored in the memory 305. The list could be a part of the instructions, i.e. a computer program product, or the list is based on the contextual data, or the list is part of the contextual data, or any combination of these.
  • the computing unit 303 can be realized with a circuitry arrangement comprising at least one processor and one memory. Naturally there could be many different ways to realize the circuitry according to an embodiment. Of course, other parts needed to realize the embodiment could be included in this circuitry.
  • circuitry refers to all of the following: (a) hardware-only circuit implementations (such as implementations in only analog and/or digital circuitry) and
  • (b) circuits, such as a microprocessor(s) or a portion of a microprocessor(s), that require software or firmware for operation, even if the software or firmware is not physically present.
  • This definition of 'circuitry' applies to all uses of this term in this application, including in any claims.
  • the term “circuitry” would also cover an implementation of merely a processor (or multiple processors) or portion of a processor and its (or their) accompanying software and/or firmware.
  • the term “circuitry” would also cover, for example and if applicable to the particular claim element, a baseband integrated circuit or applications processor integrated circuit for a mobile phone, or a similar integrated circuit in a server, a cellular network device, or another network device.
  • Figure 5 shows a flow chart according to an embodiment. Hereby the method is described step by step. The method for recognizing the salient highlight is started at the step 501.
  • the contextual data is read at the step 502.
  • the contextual data could contain information of the event, location, time or preference of the user.
  • an audio-visual content is captured.
  • the movement and directional data of the apparatus comprising the imaging device that is used for capturing the said content is gathered.
  • ambient sound is detected at the step 504.
  • the ambient sound is recognized at the step 507.
  • From the ambient sound, speech, music or a change of sound pressure, or any combination of those, is recognized.
  • An indication for a salient highlight could be a recognized word or words from speech, the name of the music or a word or words from the lyrics, or a change of sound pressure, or any combination of those.
  • Visual patterns are searched from the visual content at the step 505.
  • Visual patterns are recognized at the step 508. Movement and direction are analysed at the step 506.
  • An indication for a salient highlight could be a target or a form or a specific place which is recognized.
  • An indication for a salient highlight could be a specific movement or direction of the apparatus that is used for capturing the audio-visual content.
  • the decision process uses results acquired at the steps 506, 507 and 508. If there is enough indication, for example a word 'goal' is detected and the apparatus is directed towards a goal, it is decided that a salient highlight is occurring or has occurred.
  • the decision may be weighted, i.e. some results are more important than others. For example, an increased sound pressure may not be enough to select a segment of the audio-visual content as a salient highlight, but if there is in addition a recognized goal song, the segment is selected as a salient highlight.
  • if a segment is a salient highlight, it is marked as a salient highlight at the step 510.
  • the labelled segment or segments can be processed depending on the way they are to be used.
  • at the step 511 it is checked whether the audio-visual content capturing is stopped. If the answer is 'YES', the method is stopped at the step 512. If the answer is 'NO', the capturing continues at the step 503. Because pre-existing parts of the apparatus can be used in many cases, the method according to this embodiment can easily be applied to portable devices.
  • figure 6 shows a second example of a method according to an embodiment as a flow chart.
  • in this flow chart, an example of a detailed recognition flow according to an embodiment is shown, together with how it results in a labelled multimedia bit stream.
  • the method is described step by step.
  • the method is divided in three parts. There is an input part 601 , a salient highlight determination part 605 and an output part 624.
  • the input part 601 comprises steps where input is gathered and fed to the salient highlight determination part 605.
  • contextual data is read and transferred to the determination part.
  • an audio-visual content is captured. Necessary processing is done to the audio-visual content and then it is transferred to the determination part.
  • movement of the device that is used for capturing the audio-visual content is detected. Information about the bounding box is calculated or estimated from these movements. Also direction information can be obtained in this step. The information gathered at the step 604 is transferred to the determination part.
  • an ambient sound is detected from the inputted audio-visual content.
  • An audio feature or features are searched and recognized from said ambient sound.
  • if there is no identified song at the step 609, then at the step 610 overall ambient voices are detected. If there are none, the recognition is rejected at the step 608. This means that no ambient voices were detected that had any indication of a salient highlight. If at the step 610 ambient voices were detected that may indicate a salient highlight, then at the step 614 it is checked whether there are any keywords identified by speech recognition. If no keywords were found, the sound pressure level is determined at the step 615. At the step 618 it is checked whether the sound pressure is over a threshold, and the sound pressure information is then moved to the step 622. If at the step 614 a keyword is identified, then at the step 617 the identified keyword is extracted. At the step 619 the context and the identified keyword are compared.
  • the identified keyword is extracted, and at the step 619 the context and the identified keyword are compared.
  • at the step 620 it is checked whether there is a correspondence between the comparisons made at the step 616 and/or the step 619. If a comparison produces no correspondence, the recognition is rejected. This means that the context and the extracted keyword do not seem to relate to each other. For example, if the extracted keyword is 'goal' but there is no increased action near either goal, then there is no correspondence. If a correspondence is found at the step 620, then at the step 622 the confidence data is modified. The modification of the confidence data is based on the time elapsed between an identified song, a keyword, an exceeded ambient sound pressure threshold, recognized pattern occurrences, and any combination of these. The confidence data indicates whether there is a salient highlight in the audio-visual content.
  • the confidence data may be based on a combined confidence or strength of the detected features such as the amount of change in the sound pressure level, the number of times a keyword is recognized from a song or from ambient audio.
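  • A sketch of the confidence modification at the step 622 (the decay constant and the strength scaling are assumptions): each supporting detection raises the confidence by its strength, and the accumulated confidence decays with the time elapsed since the previous supporting detection:

```python
import math

class HighlightConfidence:
    """Accumulates confidence from songs, keywords, sound-pressure peaks and patterns.
    The decay constant and increments are illustrative only."""

    def __init__(self, decay_s=10.0):
        self.decay_s = decay_s
        self.confidence = 0.0
        self.last_t = None

    def add_detection(self, t, strength):
        """t: timestamp in seconds; strength: e.g. amount of SPL change or number of times a
        keyword was recognized, scaled to roughly 0..1."""
        if self.last_t is not None:
            # Detections close together in time reinforce each other more strongly.
            self.confidence *= math.exp(-(t - self.last_t) / self.decay_s)
        self.confidence += strength
        self.last_t = t
        return self.confidence
```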
  • at the step 607 it is checked whether there is a visual feature or pattern in the inputted audio-visual content. The checking is done by some pattern recognition method. The found visual features are extracted and they may relate to a keyword that describes the pattern.
  • at the step 611, the extracted visual feature or pattern is compared to the keyword that was extracted at the step 617. The correspondence between the features compared at the step 611 is checked at the step 612. If there is no relation between the compared features, the recognition process is rejected at the step 608.
  • the confidence data is modified.
  • the confidence data is modified based on whether the extracted keyword is determined to correspond to the recognized pattern. If the confidence data indicates that there is a salient highlight in the audio-visual content, it is labeled accordingly.
  • the labeled data is transferred to the output part 624.
  • the labeled audio-visual content is, at the step 625, sent to an application which, for example, displays the content according to instructions.
  • the labeled audio-visual content could, for example, be zoomed.
  • the labeled audio-visual content could also be sent to an encoder at the step 626.
  • the encoder encodes the audio-visual content into a coded media bit stream.
  • by keyword identification it is meant here that there is a keyword that relates to the recognized audio feature or visual pattern.
  • by keyword extraction it is meant here that the identified keyword is stored for use in determining a salient highlight. It must be noted that the flow chart presented in figure 6 is an example of one salient highlight determination process. In another process the relations between the steps may be different.
  • the multimedia communication system can be one device, or it can be decentralized, i.e. the parts of the system can be in different devices.
  • the multimedia communication system 700 is presented as a block diagram. In this example it comprises nine blocks: a source 701 , an encoder 702, a storage 703, a sender 704, a gateway 705, a receiver 706, a recording storage 707, a decoder 708 and a renderer 709.
  • the source 701 provides a source signal, i.e. an audio-visual content, captured by an imaging device and a microphone.
  • the encoder 702 encodes the captured source signal into a coded media bit stream.
  • the format of the coded media bit stream may be an elementary self-contained bit stream, a packet stream format, or one or more coded media bit stream encapsulated into a container file. If the container file approach is used, a file generator (not shown in the figure) is needed to store the coded media bit streams and create file format metadata.
  • the method according to an embodiment is arranged to be performed between the source 701 and the encoder 702, or partly or wholly in the encoder.
  • the encoder therefore receives a labelled audio-visual content or the encoder labels audio-visual content.
  • the encoder may process the captured content by using two or more different bitrates. From these streams different portions may be chosen to be encapsulated in the container file. The selection can be done based on the labeled data, i.e. based on an understanding of the salient highlights.
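  • A sketch of the selection described above (the segment structure and stream names are assumptions): when the encoder has produced the same content at several bitrates, labelled highlight segments can be taken from the high-bitrate stream and the rest from the low-bitrate stream before encapsulation into the container file:

```python
def select_portions(segments, labels, high='high_bitrate', low='low_bitrate'):
    """segments: {'high_bitrate': [...], 'low_bitrate': [...]} with one entry per time segment;
    labels: list of booleans, True where the segment is a labelled salient highlight."""
    return [segments[high][i] if is_highlight else segments[low][i]
            for i, is_highlight in enumerate(labels)]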
  • the coded media bit stream is transferred to the storage 703.
  • the sender 704 sends the coded media bit stream using a communication protocol stack e.g. Internet Protocol (IP), User Datagram Protocol (UDP), Real-Time Transport Protocol (RTP). If the media content is encapsulated into a container file the sender may include or be attached to a sending file parser (not in the figure).
  • the sender 704 is connected to the gateway 705.
  • the gateway may perform different types of functions, such as packet stream translations, merging and dividing data streams, and manipulation of the data stream according to the downlink and/or receiver capabilities. An uncompressed / loss-less / high bit rate encoded media stream with the metadata is manipulated by the gateway 705 according to the downlink / receiver capabilities and the labeling information.
  • the receiver 706 receives, de-modulates, and de-capsulates the transmitted signal into a coded media stream, which is moved to the recording storage 707.
  • the format of the coded media stream may be an elementary self-contained bit stream, a packet stream format, or one or more coded media streams encapsulated into a container file. Alternatively, the system may operate live, or maintain only the most recent excerpts of the recorded stream in the recording storage.
  • the decoder 708 receives the media bit stream next.
  • the coded media bit stream may consist of an audio stream and a video stream associated with each other and encapsulated into a container file, or a single media bit stream encapsulated into a container file.
  • a file parser (not visualized in the block diagram) may be used for the de-capsulation.
  • the coded media stream is usually processed further in the decoder, which produces one or more uncompressed media streams.
  • A high bit rate media stream is processed further by the decoder according to the labeling information to enable focus, field of view and zoom of the content based on the salient highlights. This helps in efficient delivery and representation of the media bit stream. Also, this can be used for providing information to the gateway for adaptation if the media bit stream needs to be processed to meet the receiver's capabilities.
  • An embodiment can be used for adjusting the network throughput.
  • the renderer 709 renders the uncompressed media bit stream.
  • multimedia communication system according to an embodi- ment can be realized in a different way.

Abstract

In the invention, salient highlights are detected in audio-visual content. The invention uses an apparatus comprising a camera, a microphone and a means for detecting the movement of the apparatus. The content is captured with said apparatus and ambient sound is detected from the audio-visual content. Sound features are searched from the ambient sound, and a correspondence between said features and keywords is searched. Visual patterns are searched from said content, and a correspondence between said patterns and keywords is searched. Movement information of the apparatus is gathered. Based on contextual data relating to the content capturing situation, the movement information of the apparatus, and the keywords corresponding to features found from the ambient sound and the visual patterns, it is determined whether there is a salient highlight in the audio-visual content. When a salient highlight is found, it is labelled based on the keywords connected to the recognized highlight event.

Description

An apparatus, a method, a circuitry, a multimedia communication system and a computer program product for selecting field-of-view of interest
Technical field
Various embodiments relate to imaging devices, audio-visual capture and multimedia representation.
Background
Portable imaging devices have become more common recently. Such devices include, for example, game consoles, personal computers, tablet computers and smart phones. However, capturing images and video streams with these devices is not always easy. Controlling them may be slow and cumbersome, so that when a user is trying to take an image or a video, it is focused on the wrong place or the result is otherwise not optimal. This is evident when shooting video of an event where the action is fast and can change place very quickly; many sport events are of this kind.
It is known in the art how focusing can be based on the movement and pattern of, e.g., players on a field, and how content is labelled afterwards. Also, it is known in the art that video content may be displayed on a device so as to provide a zoomed view of the content in a way that the zooming window follows an object, which could be for example a ball or a person. However, these methods may demand active participation from the user and may not necessarily react quickly to changing situations. In the known art it is a problem that the capability or preference of the receivers to decode a full-resolution video is not known, and therefore efficient delivery and representation of the video may require, for example, an adaptation in a gateway to meet the receiver's capabilities or an adjustment to the network throughput. Also, it is a problem that rudimentary scaling of the media resolution to the display resolution does not maintain the subjective quality. Also, rudimentary scaling of the media resolution to the display resolution may result in content where essential parts are cropped. Also, the known art does not provide an efficient way to recognize and label salient highlights in a media stream.
Summary Various embodiments are achieved through a method, an apparatus, a circuitry, a multimedia communication system, a computer program product and a non-transitory computer readable storage medium characterized in what is disclosed in the independent claims. Some embodiments of the invention are disclosed in the dependent claims.
A method for detecting a salient highlight in the captured content according to an embodiment uses an apparatus comprising a camera, a microphone and an arrangement for detecting the movement and the direction of the apparatus. According to an embodiment, the method comprises steps where an audio-visual content is captured and ambient sound is detected from said audio-visual content. The method further comprises steps where at least one feature is searched and recognized from said ambient sound, and a correspondence between said feature and at least one keyword is searched; at least one visual pattern is searched and recognized from said audio-visual content, and a correspondence between said pattern and the at least one keyword is searched; and movement and directional information of the apparatus used for capturing said audio-visual content is gathered and analyzed. The method further comprises a step of determining whether there is a salient highlight in the audio-visual content based on at least one of contextual data relating to a situation where the content is to be captured, movement of the apparatus, directional information of the apparatus, the at least one keyword corresponding to features found from the ambient sound, or the at least one visual pattern. If the salient highlight is found and recognized, said audio-visual content is labelled based on the at least one keyword connected to the recognized salient highlight event.
In an embodiment, the determining of whether there is the salient highlight in the audio-visual content corresponds to at least one keyword that is predetermined based on the contextual data. The user may also add keywords. In a second embodiment, the said labelled salient highlight of the audio-visual content is encoded and/or compressed using different rules than the audio-visual content that is not labelled.
In a third embodiment, the ambient audio feature comprises speech, music, rhythm, musical genre, timbre, instrument, loudness, spectrum, zero-crossing rate or other musical or acoustic characteristics, or a change of sound pressure, or any combination thereof.
In a fourth embodiment, the visual pattern comprises at least one stationary construction of the area of the interest or the visual pattern is comprised of moving objects or the visual pattern is a combination of both. In a fifth embodiment, the certain predetermined movement and directional information of the apparatus indicates a possibility of the salient highlight event.
In a sixth embodiment, the target area of imaging comprises an object or objects that corresponds to a certain key word. In a seventh embodiment, the contextual data comprises at least one of the following: information of the event, location, time or preference of the user.
An apparatus according to an embodiment comprises a camera, a microphone, an accelerometer or a movement sensing sensor for detecting the movement of the apparatus, an arrangement for detecting the direction of the apparatus, at least one memory and at least one processor, and the apparatus is arranged to capture an audio-visual content with the camera and the microphone. According to an embodiment, the apparatus further comprises an arrangement for detecting ambient sound and an arrangement for detecting at least one visual pattern recognized from said audio-visual content. The apparatus is arranged to identify at least one keyword that corresponds to the detected ambient sound and at least one visual pattern, and contextual data relating to the situation where the content is to be captured is arranged to be stored in the memory of the apparatus. The apparatus is arranged, based on at least one of contextual data, the movement and directional information, and the at least one keyword corresponding to features found from the ambient sound or the at least one visual pattern, to determine if there is a salient highlight in the audio-visual content. If the salient highlight is found and recognized, it is labelled based on the at least one keyword connected to the recognized salient highlight event. The movement sensing sensor may be a 9 degrees-of-freedom sensor.
In an embodiment, the arrangement for detecting ambient sound comprises at least partly computer program instructions configured to, with the at least one processor, cause the apparatus to detect the ambient sound.
In a second embodiment, the arrangement for detecting ambient sound is arranged to recognize at least one of the following: speech, music, rhythm, musical genre, timbre, instrument, loudness, spectrum, zero-crossing rate or other musical or acoustic characteristics, or a change of sound pressure, or any combination thereof.
In a third embodiment, the arrangement for detecting at least one visual pattern comprises at least partly computer program instructions configured to, with the at least one processor, cause the apparatus to detect visual patterns. In a fourth embodiment, the arrangement for detecting at least one visual pattern is arranged to recognize visual patterns that are stationary or moving or a combination of both.
In a fifth embodiment, the arrangement for detecting the direction of the apparatus comprises at least one of the following: a compass, a magnetometer, a visual direction finder, a beacon data receiver or other direction finder device.
In a sixth embodiment, the apparatus is arranged to focus or zoom to a particular point of the area of interest when capturing said audio-visual content, based on the recognized salient highlight event that relates to said point. In a seventh embodiment, the apparatus comprises a portable computing device comprising a smart phone, a tablet computer or a gaming console.
A salient highlight detection circuitry according to an embodiment is configured as follows. The salient highlight detection circuitry is configured to detect when an audio-visual content is captured, detect an ambient sound from said audio-visual content, search and recognize at least one feature from said ambient sound and search a correspondence between said at least one feature and at least one keyword, and search and recognize at least one visual pattern from said audio-visual content and search a correspondence between said pattern and the at least one keyword. The salient highlight detection circuitry is further configured to gather and analyze movement and directional information of the apparatus that has captured said audio-visual content and to store contextual data relating to the situation where the content is to be captured. The salient highlight detection circuitry is further configured to determine if there is a salient highlight in the audio-visual content based on at least one of contextual data, the movement and directional information of the apparatus, and the at least one keyword corresponding to features found from the ambient sound or the at least one visual pattern, and, if the salient highlight is found and recognized, to label it based on the at least one keyword connected to the recognized salient highlight event.
A multimedia communication system according to an embodiment comprises a source, which is arranged to provide a signal captured by an imaging sensor and a microphone, an encoder, which is arranged to encode the captured signal into a coded media bit stream, a sender, which is arranged to send the coded media bit stream using a communication protocol stack, a receiver, which is arranged to modify the transferred coded media bit stream into a coded media stream, a decoder, which is arranged to process the coded media stream into one or more uncompressed media streams, and a renderer, which is arranged to render the uncompressed media stream. According to an embodiment, the multimedia communication system is arranged to perform the method steps according to the method defined earlier between the source and the encoder, or partly or wholly in the encoder. In an embodiment, the coded bit stream is arranged to be stored in a storage.
In a second embodiment, the multimedia communication system comprises a gateway which is arranged to perform functions for transferring the coded media bit stream between the sender and the receiver.
In a third embodiment, the multimedia communication system comprises a recording storage where the coded media stream is stored.
In a fourth embodiment, the format of the coded media bit stream is an elementary self-contained bit stream, a packet stream format or one or more coded media bit streams encapsulated into a container file.
In a fifth embodiment, the encoder is arranged to process the captured signal by using two or more different bitrates, thus producing two or more different coded media bit streams.
In a sixth embodiment, different portions from the said coded media bit streams are chosen based on the labeled data, i.e. the salient highlight recognition.
In a seventh embodiment, the renderer is arranged to process the uncompressed media stream based on the labeled data, i.e. the salient highlight.
In an eighth embodiment, the processing includes at least one of the following: focusing, choosing a field of view or zooming.
A computer program product comprises program instructions which, when executed on an apparatus comprising at least one processor and at least one memory, cause the apparatus to capture an audio-visual content and detect an ambient sound from said audio-visual content. The computer program product further comprises computer program instructions which cause the apparatus to search and recognize a feature or features from said ambient sound and search a correspondence between said feature and keywords, search and recognize a visual pattern or patterns from said audio-visual content and search a correspondence between said pattern and keywords, and choose the keyword or keywords based on the correspondence between said ambient sound and patterns and keywords. The computer program product further comprises computer program instructions which cause the apparatus to receive and store contextual data relating to the situation where the content is to be captured, and gather movement and directional information of an apparatus that is arranged to capture said audio-visual content. The computer program product further comprises computer program instructions which cause the apparatus to determine if there is a salient highlight in the audio-visual content based on at least one of the contextual data, movement and directional information of the apparatus, at least one keyword corresponding to features found from the ambient sound, or at least one keyword corresponding to visual patterns found from the audio-visual content.
A non-transitory computer readable storage medium according to an embodiment comprises computer-executable components. According to an embodiment, the computer-executable components comprise a computer readable code for capturing an audio-visual content, a computer readable code for detecting an ambient sound from said audio-visual content, computer readable code for searching and recognizing at least one feature from said ambient sound and searching a correspondence between said at least one feature and at least one keyword, and computer readable code for searching and recognizing at least one visual pattern from said audio-visual content and searching a correspondence between said pattern and at least one keyword. The computer-executable components further comprise a computer readable code for choosing at least one keyword based on the correspondence between said ambient sound and patterns and keywords, computer readable code for receiving and storing contextual data relating to the situation where the content is to be captured, and a computer readable code for gathering movement and directional information of an apparatus that is arranged to capture said audio-visual content. The computer-executable components further comprise a computer readable code for determining if there is a salient highlight in the audio-visual content based on at least one of contextual data, the movement and directional information of the apparatus, the at least one keyword corresponding to features found from the ambient sound, or the at least one keyword corresponding to visual patterns found from the audio-visual content.
An apparatus according to an embodiment comprises at least one processor and at least one memory comprising computer program code. The at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus to: capture an audio-visual content, detect ambient sound from said audio-visual content, search and recognize at least one feature from said ambient sound, search a correspondence between said at least one feature and at least one keyword, search and recognize at least one visual pattern from said audio-visual content, and search a correspondence between said at least one pattern and the at least one keyword. The at least one memory and the computer program code are further configured to, with the at least one processor, cause the apparatus to: gather and analyze movement and directional information of an apparatus used for capturing said audio-visual content, determine if there is a salient highlight in the audio-visual content based on at least one of contextual data relating to a situation where the content is to be captured, movement of the apparatus, directional information of the apparatus, the at least one keyword corresponding to features found from the ambient sound, or the at least one visual pattern, and, if the salient highlight is found and recognized, label the audio-visual content based on the at least one keyword connected to the recognized salient highlight event.
Various embodiments will become apparent from the detailed description given hereafter. However, it should be understood that the detailed description and specific examples, while indicating preferred embodiments of the invention, are given by way of illustration only, since various changes and modifications within the spirit and scope of the invention will become apparent to those skilled in the art from this detailed description.
Description of the figures
In the following, embodiments will be described in detail. In the description, reference is made to the enclosed drawings, in which:
Figure 1 shows a system, in accordance with an embodiment;
Figure 2 shows an apparatus, in accordance with an embodiment;
Figure 3 shows an example of an apparatus according to another embodiment;
Figure 4 shows an example of a computing unit according to figure 3;
Figure 5 shows a flow chart according to an embodiment;
Figure 6 shows a second example of a method according to an embodiment as a flow chart;
Figure 7 shows an example of a multimedia communication system according to an embodiment.
Detailed description of Figures
The embodiments in the following description are given as examples only, and a person skilled in the art may realise the basic idea of the invention also in some other way than what is described in the description. Though the description may refer to a certain embodiment or embodiments in several places, this does not mean that the reference would be directed towards only one described embodiment or that the described characteristic would be usable only in one described embodiment. The individual characteristics of two or more embodiments may be combined, and new embodiments of the invention may thus be provided.

Figure 1 shows a system in accordance with an embodiment. The system comprises a camera 101 and a microphone that produce an audio-visual content. To achieve a reproduced content 105 that utilizes information about salient highlights of the audio-visual content, the audio-visual content is pre-processed in the phase 102 and post-processed in the phase 103. In this example the method according to an embodiment is performed in the pre-processing phase 102. In the post-processing phase 103 the audio-visual content is processed based on the information about the salient highlights. In this example additional data 104 is used in the salient highlight selection process. Figure 1 presents some examples of what said additional data may contain. The additional data may comprise metadata, camera parameters, speech recognition results, audio recognition results, pattern recognition results, device movement and direction, and contextual data. These data could be used in both processing phases.
Metadata and camera parameters relate to the audio-visual content produced by the camera. These may vary depending, for example, on what format is used. Speech recognition results are recognized words. Audio recognition results could be music, a song or other ambient sound like cheering or similar. Audio recognition results could also be rhythm, musical genre, timbre, instrument, loudness, spectrum, zero-crossing rate or other characteristics. Changes in the sound pressure can also be detected. Pattern recognition results are visual features that relate to the event that is imaged by the camera. These features may include for example elements and patterns relating to events, like lines, structures, players, speed of objects, position of a player etc. It must be noted that these visual features may be stationary or moving objects or a combination of both. Device movement and direction are obtained with an accelerometer or other sensors. Positioning methods and a compass may also be used. Contextual data relates to the event which is imaged. It could include all kinds of information about the event, location, time, user preference and other information. Usually the contextual data is not extracted from the audio-visual content but is obtained by some other means.
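Purely as an illustration of how this additional data 104 could be gathered into one record, a minimal sketch follows; the field names and types are hypothetical and not prescribed by this description:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class AdditionalData:
    """Illustrative container for the additional data 104 (field names are assumptions)."""
    metadata: dict = field(default_factory=dict)              # container or file format metadata
    camera_parameters: dict = field(default_factory=dict)     # e.g. focal length, exposure
    speech_keywords: List[str] = field(default_factory=list)  # words from speech recognition
    audio_events: List[str] = field(default_factory=list)     # e.g. recognized song, cheering
    visual_patterns: List[str] = field(default_factory=list)  # e.g. goal structure, lines
    device_motion: Optional[dict] = None                      # accelerometer / compass samples
    contextual_data: Optional[dict] = None                    # event, location, time, preferences
```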
How this additional data 104 is obtained and how it is used in various embodiments is explained in more detail in the following examples.

Figure 2 shows an apparatus in accordance with an embodiment. An imaging device 201 produces a video stream which is processed digitally 203. A microphone 202 produces an audio stream which is processed digitally 204. The block 205 in figure 2 describes gathering and processing the data that is used for determining if there is a salient highlight in the audio-visual content. The salient highlights are determined in the block 215. The recognized segment, for example the highlight, can be further processed in a block 216. In some embodiments the block 205 could be a combination of a processor and a memory, and the processor can be configured to perform the functions performed by the block 205. This applies to the block 215 and the further processing block 216. There can, of course, be several processors. The block 205 for gathering and processing the data that is used for determining a salient highlight comprises a visual feature determination block 206, an ambient audio feature determination block 208, a contextual feature determination block 212, a movement and directional information determination block 213 and a keyword identification block 214. The keyword could be a word or a combination of words or a sentence. The keywords represent some incident, object, happening, situation or some other entity. The keyword may or may not directly describe the entity it relates to, i.e. a goal may have a keyword 'goal' but it is not necessary. The keyword could also refer to some abstract situation.
The visual feature determination block 206 comprises a pattern recognition block 207. The pattern recognition is used to recognize places or incidents. A visual pattern is a stationary construction or constructions of the area of interest, or the visual pattern is comprised of moving objects, or the visual pattern is a combination of both. Patterns could be colors, changes of contrast, or structures, e.g. goals, lines, players, performers. Recognized patterns are connected to a keyword or keywords that describe the pattern. The pattern recognition could be a learning process, or searching correspondences between an imaged pattern and a database pattern, or some combination of them. Other methods could also be applied. In one example embodiment, in the area that is the target of imaging there is an object or objects that correspond to a certain keyword. The ambient audio feature determination block 208 detects ambient sounds from the captured audio-visual content. The ambient audio feature determination comprises a speech recognition block 209, a music recognition block 210 and a sound pressure recognition block 211. The analyzed sounds are ambient sounds, so there is not necessarily any direct correspondence to the captured video, i.e. the source of the sound could be outside the field of view of the imaging device.
The speech recognition block 209 finds and extracts words from an ambient speech that could be, for example, commentary during the event or similar. The extracted words are used as keywords, or keywords are searched from the speech. Also some recognized words could correspond to a keyword; for example, in a sport event some player could have a nickname that corresponds to a keyword that is the real name of the player.
The music recognition block 210 recognizes an ambient music which could be a song or a tune. The name of the song or tune could be a keyword. Also some word from the lyrics could be a keyword. Some song or tune could correspond to a certain keyword, for example when a certain song is played when a goal is scored. In that case the keyword for the song could be 'goal'.
The sound pressure recognition block 211 recognizes changes in sound pressure. Increasing sound pressure in an event, and especially in a sporting event, usually means that something interesting is occurring. These kinds of incidents could be, for example, the cheering of the audience or the passing of racing cars. The sound pressure change could correspond to a keyword. The keyword in that case could be an indication that something interesting is going on.
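A minimal sketch of how such a sound pressure trigger could be implemented follows; the frame-based RMS measure, the window length and the rise threshold are illustrative assumptions, not values given in this description:

```python
import math

def rms_level_db(frame):
    """Return the RMS level of one PCM frame in dBFS (samples are floats in [-1, 1])."""
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return 20.0 * math.log10(max(rms, 1e-9))

def sound_pressure_rises(frames, window=50, rise_db=6.0):
    """Yield frame indices where the level rises more than rise_db above the
    running average of the previous `window` frames, i.e. a possible indication
    that something interesting is going on."""
    history = []
    for i, frame in enumerate(frames):
        level = rms_level_db(frame)
        if history and level - sum(history) / len(history) > rise_db:
            yield i
        history.append(level)
        if len(history) > window:
            history.pop(0)
```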
It must be noted that the ambient audio feature could be, in addition to the speech, the music or the change of sound pressure, any combination of those.
The contextual feature determination block 212 searches the contextual data information that is relevant when determining if there is a salient highlight in the captured audio-visual content. The contextual data comprises at least one of the following: information of the event, location, time or preference of the user. The location could be the location of the event or the location of the user, i.e. where the user is situated in the spectator stand, or both. The contextual data can be used for selecting the keywords to be used. The corresponding keyword or keywords are predetermined based on the contextual data. The movement and directional information determination block 213 acquires its data from an apparatus that contains the imaging device 201. Usually an accelerometer or a sensor like a 9 degrees-of-freedom (9DOF) sensor is used for detecting the movements of the imaging device. Other equipment can also be used. For direction determination a compass or a similar device can be used. The use of other methods indicating direction is also possible.
Typically, when taking video of many sports events the user swings the camera (i.e. the imaging apparatus) from one end of the field to the other end and back, as the ends of the field often contain important places for action, like for example goals in ice hockey, football and soccer or baskets in basketball. Video analysis or contextual information may be used to verify the type of the sports event. The apparatus or the system for determining the salient highlight can be arranged to learn the long term movement of the apparatus, and thus it learns the size of the bounding box (limits to device tilt azimuth and elevation angles) that contains the important action in the sports and the ends of the bounding box that represent the directions of the goals. This information may then be used for post-processing the video in many ways, taking advantage of the knowledge of the current shooting angle, i.e. the tilt of the imaging device. In one example of an embodiment a certain predetermined movement and directional information of the apparatus indicates a possibility of a salient highlight event.

The keyword identification block 214 searches a correspondence between features recognized from the audio-visual content and keywords. If a correspondence is found, the keyword is applied or connected to the feature. The keyword could be a word or an expression that is extracted from the feature, like a word from a song, or the name of the visual pattern that was recognized, like 'goal' or 'circle'. In one embodiment the used keyword or keywords are predetermined based on the contextual data. For example, an ice hockey game and a baseball game may use different keywords.
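A keyword correspondence of this kind could be illustrated with a simple lookup table; the entries and event types below are hypothetical and, as noted above, could be predetermined from the contextual data:

```python
# Illustrative keyword tables per event type; the entries are assumptions only.
KEYWORD_TABLES = {
    "ice hockey": {
        "home team goal song": "goal",   # music recognition result
        "goal": "goal",                  # word from speech recognition
        "goal structure": "goal",        # recognized visual pattern
        "cheering": "interesting",       # ambient audio event
    },
    "baseball": {
        "home run call": "home run",
        "cheering": "interesting",
    },
}

def identify_keywords(recognized_features, event_type, tables=KEYWORD_TABLES):
    """Map recognized audio features and visual patterns to keywords."""
    table = tables.get(event_type, {})
    return {table[f] for f in recognized_features if f in table}
```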
The salient highlights are determined in the block 215. It receives the data produced in the block 205. It compares the extracted keywords and other information, like the movement and direction data and the contextual data. Based on these it determines if there is a salient highlight. In the decision, different weighting functions can be used for different situations. For example, if the keyword 'goal' is extracted and the direction of the apparatus (i.e. the direction of the imaging device) is towards a goal, the part is recognized as a salient highlight. For some other incident more data (keywords and other indications) may be needed for recognizing a part of the audio-visual content as a salient highlight. In some examples the keywords are predetermined based on the contextual data. The determining process may change based on what event is the object of the imaging and what the preferences of the user are.

The recognized salient highlight is labelled. An example technique for labelling is to mark the recognized part as a salient highlight with some marking. This could be done by modifying the metadata of the audio-visual content. It could also be executed by creating a separate description file that is transferred with the audio-visual content to the further processing. The separate description file contains the information about the recognized salient highlights. When labelling the recognized salient highlight, it could be given a descriptive name. This could include, for example, the name, place and time of the event and what happens in the highlight. For example, some highlight could be named as 'first goal at the Arena 1st December'. Of course, there are many possibilities when labelling the recognized salient highlights, depending on how the highlights are to be used. Thus an embodiment provides an efficient way to recognize and label salient highlights in a media stream. For example, this provides a better result for the user when imaging an event having a lot of action at different places.

The recognized segment, for example the highlight, can be further processed in the block 216. This could include adaptive data compression, smart rendering, or focusing or zooming. Usually the labelled salient highlight of the audio-visual content is encoded and/or compressed or processed using different rules than the audio-visual content that is not labelled, according to embodiments. In some embodiments a processor or a combination of processors can be configured to perform the functions performed by the previously described blocks. In some embodiments a special processor may be assigned to do some functions that are common for several blocks. This kind of processor could for example be a signal processing processor. In the following, some examples of how various embodiments can be applied are presented. It should be noted that these examples are simplified.
If the apparatus containing the imaging device points to a goal and a keyword 'goal' or ambient sound interpreted as 'cheering' is detected, the reconstructed media stream may focus or zoom in to the goal. If the apparatus points to midfield, the apparatus may zoom the video out. If the apparatus points outside the bounding box, i.e. most likely towards the audience, and the ambient voice 'cheering' is detected, the apparatus may search for faces in the audience and zoom to those consecutively.
If the goal song and the ambient voice 'cheering' are detected, the reconstructed media stream focuses on the goal area of the rink or field. If the ambient voice 'cheering' is detected and a visual feature or pattern in individual or team games or sports is recognized, the focus and field of view of the reconstructed media stream may be adjusted to the visual feature or pattern.
If a keyword or keywords in a song are detected and correspond to a detected visual feature, the focus and field of view of the reconstructed media stream may be adjusted accordingly.
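These simplified examples could be expressed as rules mapping detections to rendering actions. The following sketch is purely illustrative; the detection names and action names are assumptions and not an exhaustive or normative list:

```python
def choose_rendering_action(pointing, detections):
    """Map detections and pointing direction to a rendering action (sketch only).

    pointing: 'goal', 'midfield' or 'outside', relative to the learned bounding box.
    detections: set of detection names, e.g. {'keyword_goal', 'cheering', 'goal_song'}.
    """
    if pointing == "goal" and ({"keyword_goal", "cheering", "goal_song"} & detections):
        return "zoom_in_to_goal"
    if pointing == "midfield":
        return "zoom_out"
    if pointing == "outside" and "cheering" in detections:
        return "zoom_to_audience_faces"
    return "keep_current_view"
```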
The camera motion is tracked. The motion typically goes back and forth from one goal to the other. The turning points are where the goals are. An average of both the turning points is calculated. The zooming may be done to the center of the average turning point direction.
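A sketch of this turning-point tracking follows, assuming the camera azimuth is sampled in degrees (for example from the compass and 9DOF data mentioned earlier); splitting the turning points into two ends at their midpoint is an assumption made only for illustration:

```python
def turning_points(azimuths):
    """Return azimuth values where the pan direction reverses (candidate goal directions)."""
    points = []
    for prev, cur, nxt in zip(azimuths, azimuths[1:], azimuths[2:]):
        if (cur - prev) * (nxt - cur) < 0:  # sign change of the pan movement
            points.append(cur)
    return points

def goal_directions(azimuths):
    """Average the turning points separately for each end of the field; together
    they also give the azimuth extent of the bounding box described earlier."""
    points = turning_points(azimuths)
    if not points:
        return None, None
    mid = (min(points) + max(points)) / 2.0
    left = [p for p in points if p < mid] or [mid]
    right = [p for p in points if p >= mid] or [mid]
    return sum(left) / len(left), sum(right) / len(right)
```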
In an embodiment, the apparatus according to figure 2 executes a method that comprises capturing an audio-visual content, detecting ambient sound from said audio-visual content, searching and recognizing at least one feature from said ambient sound, searching a correspondence between said at least one feature and at least one keyword, and searching and recognizing at least one visual pattern from said audio-visual content. The method further comprises searching a correspondence between said at least one visual pattern and the at least one keyword, and gathering and analyzing movement and directional information of an apparatus used for capturing said audio-visual content. The method further comprises determining if there is a salient highlight in the audio-visual content based on at least one of contextual data, movement of the apparatus, directional information of the apparatus, the at least one keyword corresponding to features found from the ambient sound, or the at least one visual pattern. If the salient highlight is found and recognized, the method comprises labelling based on the at least one keyword connected to the recognized salient highlight event. Said contextual data relates to a situation where the content is captured.
Figure 3 describes an example of an apparatus 300 according to another embodiment. The apparatus comprises a camera 301, a microphone 302, a computing unit 303, a means 306 for detecting and storing the movement and the direction of the apparatus, a means 307 for receiving and storing the contextual data and a means 308 for labelling the salient highlights. The computing unit 303 comprises at least one processor 304 and at least one memory 305. The camera and the microphone are arranged to capture audio-visual content. The means for detecting the movement and/or the direction of the apparatus produces information on how the apparatus is moved and/or what the direction of the field of view is. Examples of means for detecting the movement and/or the direction of the apparatus include, but are not limited to, accelerometers or sensors like position/orientation sensors, or a camera together with a processing pipeline able to detect recognizable shapes and changes in the audio-visual content (visual tracking), or magnetic sensors. The direction information can be achieved for example by a compass or a similar device, or the information can be calculated or estimated from some additional data that could be for example magnetic data, visual data or beacon data. The means for receiving and storing the contextual data is arranged to receive at least part of the contextual data relating to the event to be imaged before or during the event. The contextual data could be at least partly inputted manually and/or the apparatus is arranged to fetch at least part of the contextual data from some outside source. This could be done automatically or the user may initialize the data fetching. In some embodiments a processor, a memory and a communication device of the apparatus could be configured to perform the functions of the means for receiving and storing the contextual data.

The computing unit 303 is arranged to receive information from the camera 301 and the microphone 302. The computing unit 303 is also arranged to receive information from the means 306 for detecting and storing the movement and the direction of the apparatus and the means 307 for receiving and storing the contextual data. The needed data and the instructions are stored in the memory 305. The instructions are performed in the processor and the apparatus then executes the method according to an embodiment. The salient highlights are determined and labelled in the means 308 for labelling the salient highlights. It receives the data produced in the computing unit 303. It then compares the extracted keywords and other information, like the movement and direction data and the contextual data. Based on these it determines if the audio-visual content comprises a salient highlight. If a salient highlight is found, the means for labelling the salient highlights then labels the salient highlight. The apparatus could then send the salient highlight segment for further processing and then, for example, either save or display the reproduced content. In some embodiments a processor could be configured to perform the functions of the means for labelling the salient highlights. The said processor could be the processor 304 but, of course, it could be a separate entity. In one example the apparatus is arranged to focus or zoom to a particular point of the area of interest when capturing said audio-visual content, based on the recognized salient highlight event that relates to said point.
The zooming based on the salient highlights decreases the possibility of cropping essential parts of content. Also, determining the salient highlights can be used for maintaining the subjective quality of the reproduced content.
In this example the apparatus is a portable computing device, e.g. a smart phone, a tablet computer, a gaming console or a wearable device.

Figure 4 describes an example of a computing unit 303 according to figure 3. The computing unit comprises at least one processor 304 and a memory 305. The computing unit further comprises a means for detecting ambient sound 401, a means for detecting a visual pattern or patterns 402 and a means for keyword identification 403. The means for detecting ambient sound 401 is arranged to detect and recognize from the ambient sound at least one of the following: speech, music or a change of sound pressure. In the speech recognition the result is a word or words. The recognized word could be for example 'goal' or 'penalty'. In the music recognition the result could be the name of the music, some word or words from the lyrics, or some meaning of the played music. The meaning of a certain song could be some incident, for example the song being played when a certain team scores a goal. In the sound pressure detection, usually a growing sound pressure indicates that something is happening. The cause of the growing sound pressure could for example be a cheering audience or the sound of a racing car motor. The means for detecting a visual pattern or patterns 402 is arranged to search and recognize visual patterns from the video data. The visual patterns are stationary or moving or a combination of both. The recognized pattern relates to some definition which could be a word or a combination of words. This definition could be a keyword or it could relate to a keyword.
In some embodiments the functions performed by the means for detecting ambient sound 401, the means for detecting visual pattern or patterns 402 and the means for keyword identification 403 can be performed by a processor or a combination of processors. In some embodiments the functions of said means could be performed at least partly by the processor 304.
The keyword identification 403 is arranged to search a correspondence between features recognized from the audio-visual content and keywords. If a correspondence is found, the keyword is applied or connected to the feature. A list of used keywords is stored in the memory 305. The list could be a part of the instructions, i.e. a computer program product, or the list is based on the contextual data, or the list is part of the contextual data, or any combination of these. The computing unit 303 can be realized with a circuitry arrangement comprising at least one processor and one memory. Naturally, there could be many different ways to realize the circuitry according to an embodiment. Of course, other parts needed to realize the embodiment could be included in this circuitry.
As used in this application, the term 'circuitry' refers to all of the following: (a) hardware-only circuit implementations (such as implementations in only analog and/or digital circuitry) and
(b) to combinations of circuits and software (and/or firmware), such as (as applicable): (i) to a combination of processor(s) or (ii) to portions of processor(s)/software (including digital signal processor(s)), software, and memory(ies) that work together to cause an apparatus, such as a mobile phone or server, to perform various functions) and
(c) to circuits, such as a microprocessor(s) or a portion of a microprocessor(s), that require software or firmware for operation, even if the software or firmware is not physically present. This definition of 'circuitry' applies to all uses of this term in this application, including in any claims. As a further example, as used in this application, the term "circuitry" would also cover an implementation of merely a processor (or multiple processors) or portion of a processor and its (or their) accompanying software and/or firmware. The term "circuitry" would also cover, for example and if applicable to the particular claim element, a baseband integrated circuit or applications processor integrated circuit for a mobile phone or a similar integrated circuit in a server, a cellular network device, or other network device.
Figure 5 shows a flow chart according to an embodiment. Hereby the method is described step by step. The method for recognizing the salient highlight is started at the step 501.
In this example the contextual data is read at the step 502. The contextual data could contain information of the event, location, time or preference of the user. At the step 503 an audio-visual content is captured. At this phase, the movement and directional data of the apparatus comprising the imaging device that is used for capturing said content are also gathered.
From the captured audio-visual content, ambient sound is detected at the step 504. The ambient sound is recognized at the step 507. From the ambient sound, speech, music or a change of sound pressure, or any combination of those, is recognized. An indication of a salient highlight could be a recognized word or words from speech, the name of the music or a word or words from the lyrics, or a change of sound pressure, or any combination of those. Visual patterns are searched from the visual content at the step 505. Visual patterns are recognized at the step 508. Movement and direction are analysed at the step 506. An indication of a salient highlight could be a target or a form or a specific place which is recognized. An indication of a salient highlight could also be a specific movement or direction of the apparatus that is used for capturing the audio-visual content.
At the step 509 it is decided if there is a salient highlight. The decision process uses the results acquired at the steps 506, 507 and 508. If there is enough indication, for example a word 'goal' is detected and the apparatus is directed towards a goal, it is decided that a salient highlight is occurring or has occurred. The decision may be weighted, i.e. some results are more important than others. For example, an increased sound pressure may not be enough to select a segment of the audio-visual content as a salient highlight, but if there is in addition a recognized goal song, the segment is selected as a salient highlight.
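A minimal sketch of such a weighted decision follows; the indication names, weights and threshold are hypothetical values chosen only to illustrate the idea that some results count more than others:

```python
# Hypothetical weights; in practice they could depend on the event type and
# the preferences of the user (the contextual data).
WEIGHTS = {
    "keyword_goal": 0.6,
    "device_towards_goal": 0.4,
    "goal_song": 0.5,
    "sound_pressure_rise": 0.2,
    "visual_pattern_goal": 0.3,
}
THRESHOLD = 0.8

def is_salient_highlight(indications, weights=WEIGHTS, threshold=THRESHOLD):
    """Return True when the weighted sum of detected indications reaches the threshold.

    indications: set of indication names detected for a segment,
    e.g. {"sound_pressure_rise", "goal_song", "device_towards_goal"}.
    """
    return sum(weights.get(name, 0.0) for name in indications) >= threshold
```

With these illustrative values, a rise in sound pressure alone (0.2) is not enough, but combined with a recognized goal song and the device pointing towards a goal the threshold is exceeded.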
If there is enough indication that a segment is a salient highlight, it is marked as a salient highlight at the step 510. The labelled segment or segments can be processed depending on the way they are to be used.
At the step 511 it is checked if the audio-visual content capturing is stopped. If the answer is 'YES', the method is stopped at the step 512. If the answer is 'NO', the capturing continues at the step 503. Because pre-existing parts of the apparatus could be used in many cases, the method according to this embodiment can be applied easily to portable devices.
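The control flow of figure 5 could be summarized in code roughly as follows; the device interface and the recognizer, decision and labelling callables are placeholders for the blocks described above and are not defined by this description:

```python
def capture_and_label(device, context, recognizers, decide, describe):
    """Sketch of the capture loop of figure 5 (steps 503-511)."""
    segments = []
    while device.is_capturing():                          # step 511: stop check
        segment = device.capture_segment()                # step 503: audio-visual + motion data
        indications = set()
        for recognize in recognizers:                     # steps 504-508: audio, visual, motion
            indications |= recognize(segment)
        if decide(indications):                           # step 509: e.g. the weighted decision above
            segment.label = describe(indications, context)  # step 510: labelling
        segments.append(segment)
    return segments
```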
Figure 6 describes a second example of a method according to an embodiment as a flow chart. This flow chart shows an example of a detailed recognition flow according to an embodiment and how it results in a labelled multimedia bit stream. Hereby the method is described step by step.
In an embodiment the method is divided into three parts. There is an input part 601, a salient highlight determination part 605 and an output part 624.
The input part 601 comprises steps where input is gathered and fed to the salient highlight determination part 605. At the step 602 contextual data is read and transferred to the determination part. At the step 603 an audio-visual content is captured. Necessary processing is done to the audio-visual content and then it is transferred to the determination part. At the step 604 movement of the device that is used for capturing the audio-visual content is detected. Information about the bounding box is calculated or estimated from these movements. Direction information can also be obtained in this step. The information gathered at the step 604 is transferred to the determination part.
The following steps are situated in the determination part 605.
At the step 605 an ambient sound is detected from the inputted audio-visual content. An audio feature or features are searched and recognized from said ambient sound. At the step 609 it is checked if there is a song that could be identified. If the song is identified, then at the step 613 it is checked if there is a keyword to find. If no keyword is found, at the step 616 the context is compared to the identified song, for example whether the song is a so-called goal song.
If there is no identified song at the step 609, then at the step 610 overall ambient voices are detected. If there are none, the recognition is rejected at the step 608. This means that no ambient voices were detected that had any indication of a salient highlight. If at the step 610 ambient voices were detected that may have an indication of a salient highlight, then at the step 614 it is checked if there are any keywords identified by speech recognition. If no keywords were found, the sound pressure level is determined at the step 615. At the step 618 it is checked if the sound pressure is over a threshold. Then the sound pressure information is moved to the step 622. If at the step 614 a keyword is identified, then at the step 617 the identified keyword is extracted. At the step 619 the context and the identified keyword are compared.
If at the step 613 there is an identified keyword in the identified song, then at the step 617 the identified keyword is extracted, and at the step 619 the context and the identified keyword are compared.
At the step 620 it is checked if there is a correspondence between the comparisons made at the step 616 and/or the step 619. If a comparison produces no correspondence, the recognition is rejected. This means that the context and the extracted keyword do not seem to relate to each other. For example, if the extracted keyword is 'goal' but there is no increased action near either goal, then there is no correspondence. If a correspondence is found at the step 620, then at the step 622 the confidence data is modified. The modification of the confidence data is based on the time elapsed between the identified song, the keyword, the exceeded ambient sound pressure threshold, the recognized pattern occurrences, or any combination of these. The confidence data indicates if there is a salient highlight in the audio-visual content. Additionally or alternatively, the confidence data may be based on a combined confidence or strength of the detected features, such as the amount of change in the sound pressure level or the number of times a keyword is recognized from a song or from ambient audio.

At the step 607 it is checked if there is a visual feature or pattern in the inputted audio-visual content. The checking is done by some pattern recognition method. The found visual features are extracted and they may relate to a keyword that describes the pattern. At the step 611 the extracted visual feature or pattern is compared to the keyword that was extracted at the step 617. The correspondence between the compared features at the step 611 is checked at the step 612. If there is no relation between the compared features, the recognition process is rejected at the step 608. If a correspondence is found at the step 612, then at the step 623 the confidence data is modified. The confidence data is modified based on whether the extracted keyword is determined to correspond to the recognized pattern. If the confidence data indicates that there is a salient highlight in the audio-visual content, it is labeled accordingly. The labeled data is transferred to the output part 624.
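A sketch of how such confidence data could be modified follows; the time constant and the way the detection strengths are combined are assumptions made for illustration, not values given in this description:

```python
def update_confidence(confidence, detections, now, max_gap_s=10.0):
    """Raise the confidence that a salient highlight is present.

    detections: list of (timestamp_s, strength) pairs for the identified song,
    extracted keywords, exceeded sound pressure threshold and recognized
    patterns.  Detections that occur close together in time contribute more
    than isolated ones; strength models e.g. the amount of change in the sound
    pressure level or how many times a keyword was recognized.
    """
    for timestamp, strength in detections:
        age = now - timestamp
        if 0.0 <= age <= max_gap_s:
            confidence += strength * (1.0 - age / max_gap_s)
    return min(confidence, 1.0)
```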
The labeled audio-visual content is sent at the step 625 to an application which, for example, displays the content according to instructions. The labeled audio-visual content could, for example, be zoomed. The labeled audio-visual content could also be sent to an encoder at the step 626. The encoder encodes the audio-visual content into a coded media bit stream.
By keyword identification it is meant here checking whether there is a keyword that relates to the recognized audio feature or visual pattern. By keyword extraction it is meant here that the identified keyword is stored for use in determining a salient highlight. It must be noted that the flow chart presented in figure 6 is an example of one salient highlight determination process. In another process the relations between the steps may be different.
Figure 7 presents an example of a multimedia communication system according to an embodiment. The multimedia communication system can be a single device or it can be decentralized, i.e. the parts of the system can be in different devices.
The multimedia communication system 700 is presented as a block diagram. In this example it comprises nine blocks: a source 701 , an encoder 702, a storage 703, a sender 704, a gateway 705, a receiver 706, a recording storage 707, a decoder 708 and a renderer 709.
The source 701 provides a source signal, i.e. an audio-visual content, captured by an imaging device and a microphone. The encoder 702 encodes the captured source signal into a coded media bit stream. The format of the coded media bit stream may be an elementary self-contained bit stream, a packet stream format, or one or more coded media bit stream encapsulated into a container file. If the container file approach is used, a file generator (not shown in the figure) is needed to store the coded media bit streams and create file format metadata.
The method according to an embodiment is arranged to be performed between the source 701 and the encoder 702, or partly or wholly in the encoder. The encoder therefore receives a labelled audio-visual content or the encoder labels the audio-visual content. The encoder may process the captured content by using two or more different bitrates. From these streams different portions may be chosen to be encapsulated in the container file. The selection can be done based on the labeled data, i.e. based on the understanding of the salient highlights.
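A sketch of such a label-driven selection between two encoded variants follows; the fixed segment duration and the two-variant (high/low bitrate) split are assumptions made only for illustration:

```python
def select_portions(highlight_intervals, segment_duration, total_duration):
    """Choose which encoded variant each segment is taken from.

    highlight_intervals: list of (start, end) times, in seconds, labelled as
    salient highlights.  Segments overlapping a highlight are taken from the
    high-bitrate stream, the rest from the low-bitrate stream.
    """
    choices = []
    t = 0.0
    while t < total_duration:
        seg_end = min(t + segment_duration, total_duration)
        is_highlight = any(start < seg_end and end > t
                           for start, end in highlight_intervals)
        choices.append(((t, seg_end), "high_bitrate" if is_highlight else "low_bitrate"))
        t = seg_end
    return choices
```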
The coded media bit stream is transferred to the storage 703.
The sender 704 sends the coded media bit stream using a communication protocol stack, e.g. Internet Protocol (IP), User Datagram Protocol (UDP) or Real-Time Transport Protocol (RTP). If the media content is encapsulated into a container file, the sender may include or be attached to a sending file parser (not in the figure). The sender 704 is connected to the gateway 705. The gateway may perform different types of functions, like packet stream translations, merging and dividing data streams, and manipulation of the data stream according to the downlink and/or receiver capabilities. The uncompressed / loss-less / high bit rate encoded media stream with the metadata is manipulated by the gateway 705 according to the downlink / receiver capabilities and the labeling information.
The receiver 706 receives, de-modulates, and de-capsulates the transmitted signal into a coded media stream, which is moved to the recording storage 707. The format of the coded media stream may be an elementary self-contained bit stream, a packet stream format, or one or more coded media streams encapsulated into a container file. Alternatively, the system may operate live, or maintain only the most recent excerpts of the recorded stream in the recording storage.

The decoder 708 receives the media bit stream next. The coded media bit stream may consist of an audio stream and a video stream associated with each other and encapsulated into a container file, or a single media bit stream encapsulated into a container file. In those cases a file parser (not visualized in the block diagram) may be used for the de-capsulation. The coded media stream is usually processed further in the decoder, which produces one or more uncompressed media streams. A high bit rate media stream is processed further by the decoder according to the labeling information to enable focus, field of view and zoom of the content based on the salient highlights. This helps in efficient delivery and representation of the media bit stream. Also, this can be used for providing information to the gateway for adaptation if the media bit stream needs to be processed to meet the receiver's capabilities. An embodiment can be used for adjusting the network throughput.
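As a small illustration of the focus / field of view / zoom adjustment driven by the labelling information, a crop window could be computed for labelled frames along the following lines; the salient point, the zoom factor and the pixel coordinates are hypothetical:

```python
def crop_window(frame_w, frame_h, salient_point, zoom=2.0):
    """Return (x, y, w, h) of a crop rectangle centred on the salient point.

    salient_point: (x, y) in pixels, e.g. the goal area indicated by the label.
    Unlabelled frames would simply keep the full frame.
    """
    w, h = frame_w / zoom, frame_h / zoom
    x = min(max(salient_point[0] - w / 2.0, 0.0), frame_w - w)
    y = min(max(salient_point[1] - h / 2.0, 0.0), frame_h - h)
    return int(x), int(y), int(w), int(h)
```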
The renderer 709 renders the uncompressed media bit stream.
It must be noted that the multimedia communication system according to an embodiment can be realized in a different way.
Above, some preferred embodiments according to the invention have been described. The invention is not limited to the solutions described above, but the inventive idea can be applied in numerous ways within the scope of the claims.

Claims

1. A method comprising:
- capturing an audio-visual content;
- detecting ambient sound from said audio-visual content;
- searching and recognizing at least one feature from said ambient sound;
- searching a correspondence between said at least one feature and at least one keyword;
- searching and recognizing at least one visual pattern from said audio-visual content;
- searching a correspondence between said at least one visual pattern and the at least one keyword;
- gathering and analyzing movement and directional information of an apparatus used for capturing said audio-visual content;
- determining if there is a salient highlight in the audio-visual content based on at least one of contextual data relating to a situation where the content is to be captured, movement of the apparatus, directional information of the apparatus, the at least one keyword corresponding to features found from the ambient sound, or the at least one visual pattern; and
- labelling said audio-visual content based on the at least one keyword connected to the recognized salient highlight event if the salient highlight is found and recognized.
2. The method according to claim 1, further comprising determining if there is the salient highlight in the audio-visual content corresponding to at least one keyword that is predetermined based on the contextual data.
3. The method according to any of the claims 1-2, wherein the said labelled salient highlight of audio-visual content is encoded and/or compressed using different rules than the audio-visual content that is not labelled.
4. The method according to any of the claims 1-3, wherein the ambient sound feature comprises speech, music, rhythm, musical genre, timbre, instrument, loudness, spectrum, zero-crossing rate or other musical or acoustic characteristics, or a change of sound pressure, or any combination thereof.
5. The method according to any of the claims 1-4, wherein the visual pattern comprises at least one stationary construction of the area of interest, or the visual pattern is comprised of moving objects, or the visual pattern is a combination of both.
6. The method according to any of the claims 1-5, wherein the certain predetermined movement and directional information of the apparatus indicates a possibility of the salient highlight event.
7. The method according to any of the claims 1-6, wherein the target area of imaging comprises an object or objects that correspond to a certain keyword.
8. The method according to any of the claims 1-7, wherein the contextual data comprises at least one of the following: information of the event, location, time or preference of the user.
9. An apparatus comprising:
- a camera;
- a microphone;
- an accelerometer or a movement sensing sensor for detecting the movement of the apparatus;
- an arrangement for detecting the direction of the apparatus;
- at least one memory;
- at least one processor;
- an arrangement for detecting ambient sound;
- an arrangement for detecting at least one visual pattern recognized from said audio-visual content; and
the apparatus is arranged to identify at least one keyword that corresponds to the detected ambient sound and the at least one visual pattern, and contextual data relating to the situation where the content is to be captured is arranged to be stored in the memory of the apparatus;
the apparatus is arranged, based on at least one of contextual data, the movement and directional information and the at least one keyword corresponding to features found from the ambient sound or the at least one visual pattern, to determine if there is the salient highlight in the audio-visual content; and
if the salient highlight is found and recognized, the apparatus is arranged to label the salient highlight based on the at least one keyword connected to the recognized salient highlight event.
10. The apparatus according to claim 9, wherein the arrangement for detecting ambient sound comprises at least partly computer program instructions configured to, with the at least one processor, cause the apparatus to detect the ambient sound.
11. The apparatus according to any of the claims 9-10, wherein the arrangement for detecting ambient sound is arranged to recognize at least one of the following: speech, music, rhythm, musical genre, timbre, instrument, loudness, spectrum, zero-crossing rate or other musical or acoustic characteristics, or a change of sound pressure, or any combination thereof.
12. The apparatus according to any of the claims 9-11, wherein the arrangement for detecting at least one visual pattern comprises at least partly computer program instructions configured to, with the at least one processor, cause the apparatus to detect visual patterns.
13. The apparatus according to any of the claims 9-12, wherein the arrangement for detecting at least one visual pattern is arranged to recognize visual patterns that are stationary or moving or a combination of both.
14. The apparatus according to any of the claims 9-13, wherein the arrangement for detecting the direction of the apparatus comprises at least one of the following: a compass, a magnetometer, a visual direction finder, a beacon data receiver or other direction finder device.
15. The apparatus according to any of the claims 9-14, wherein the apparatus is arranged to focus or zoom to a particular point of the area of interest when capturing said audio-visual content based on the recognized salient highlight event that relates to said point.
16. The apparatus according to any of the claims 9-15, wherein the apparatus comprises a portable computing device comprising a smart phone, a tablet computer or a gaming console.
17. A salient highlight detection circuitry configured to:
- detect when an audio-visual content is captured;
- detect an ambient sound from said audio-visual content;
- search and recognize at least one feature from said ambient sound, and search correspondence between said at least one feature and at least one keyword;
- search and recognize at least one visual pattern from said audio-visual content, and search correspondence between said pattern and the at least one keyword;
- gather and analyze movement and directional information of the apparatus that has captured said audio-visual content;
- store contextual data relating to the situation where the content is to be captured;
- determine if there is the salient highlight in the audio-visual content based on at least one of contextual data, the movement and directional information of the apparatus, and the at least one keyword corresponding to features found from the ambient sound or the at least one visual pattern; and
- if the salient highlight is found and recognized, label it based on the at least one keyword connected to the recognized salient highlight event.
18. A multimedia communication system, which comprises
- a source, which is arranged to provide a signal captured by an imaging sensor and a microphone;
- an encoder, which is arranged to encode the captured signal into a coded media bit stream;
- a sender, which is arranged to send the coded media bit stream using a com- munication protocol stack;
- a receiver, which is arranged to modify the transferred coded media bit stream to a coded media stream;
- a decoder, which is arranged to process the coded media stream to one or more uncompressed media stream;
- a renderer, which is arranged to render the uncompressed media stream; and
- said multimedia communication system is arranged to perform the method steps according to any of the claims 1 to 8 between the source and the encoder or partly or wholly in the encoder.
19. The multimedia communication system according to claim 18 wherein the coded bit stream is arranged to be stored in a storage.
20. The multimedia communication system according to any of the claims 18-19, wherein the multimedia communication system comprises a gateway which is arranged to perform functions for transferring the coded media bit stream between the sender and the receiver.
21. The multimedia communication system according to any of the claims 18-20, wherein the multimedia communication system comprises a recording storage where the coded media stream is stored.
22. The multimedia communication system according to any of the claims 18-21 wherein the format of the coded media bit stream is an elementary self-contained bit stream, a packet stream format or one or more coded media bit stream encapsulated into a container file.
23. The multimedia communication system according to any of the claims 18-22 wherein the encoder is arranged to process the captured signal by using two or more different bitrates thus producing two or more different coded media bit streams.
24. The multimedia communication system according to claim 23 wherein from the said coded media bit streams different portions are chosen based on the labeled data i.e. the salient highlight recognition.
25. The multimedia communication system according to any of the claims 18-24 wherein the renderer is arranged to process the uncompressed media stream based on the labeled data i.e. the salient highlight.
26. The multimedia communication system according to claim 25 wherein the processing includes focusing, choosing a field of view or zooming.
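The processing named in claim 26 could, for instance, amount to cropping the decoded frame around the labelled highlight. The sketch below assumes a NumPy-style frame array and a pixel-coordinate centre for the highlight, neither of which is prescribed by the claims.

    # Sketch of claims 25-26: choosing a field of view around a labelled
    # salient highlight; the frame layout (height x width x channels) and the
    # region-of-interest format are assumptions, not taken from the claims.
    def crop_to_highlight(frame, centre, zoom=2.0):
        """frame: decoded frame as a NumPy-style array; centre: (cx, cy) pixel
        coordinates of the salient highlight; zoom: magnification factor."""
        h, w = frame.shape[0], frame.shape[1]
        crop_h, crop_w = int(h / zoom), int(w / zoom)
        cx, cy = centre
        x0 = min(max(cx - crop_w // 2, 0), w - crop_w)
        y0 = min(max(cy - crop_h // 2, 0), h - crop_h)
        return frame[y0:y0 + crop_h, x0:x0 + crop_w]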
27. A computer program product comprising program instructions which, when executed on an apparatus comprising at least one processor and at least one memory, cause the apparatus to:
- capture an audio-visual content;
- detect an ambient sound from said audio-visual content;
- search and recognize at least one feature from said ambient sound, and search a correspondence between said at least one feature and at least one keyword;
- search and recognize at least one visual pattern from said audio-visual content, and search a correspondence between said at least one pattern and the at least one keyword;
- choose the at least one keyword based on the correspondences found between said ambient sound and said visual patterns and the keywords;
- receive and store contextual data relating to the situation where the content is to be captured;
- gather movement and directional information of an apparatus that is arranged to capture said audio-visual content; and
- determine if there is a salient highlight in the audio-visual content based on at least one of the contextual data, the movement and directional information of the apparatus, the features found from the ambient sound corresponding to the at least one keyword, or the visual patterns found from the audio-visual content corresponding to the at least one keyword.
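The correspondence searches recited in claim 27 can be pictured as a lookup of recognized feature and pattern names in a keyword vocabulary; the vocabulary below (whistle, crowd cheer, ball in net, and so on) and the recognizer output names are purely illustrative assumptions.

    # Sketch of the correspondence searches of claim 27: recognized audio
    # features and visual patterns are looked up in a keyword vocabulary.
    KEYWORD_VOCABULARY = {
        "whistle": "goal attempt",      # ambient sound feature -> keyword
        "crowd_cheer": "goal",
        "ball_in_net": "goal",          # visual pattern -> keyword
        "scoreboard_change": "goal",
    }

    def choose_keywords(audio_features, visual_patterns, vocabulary=KEYWORD_VOCABULARY):
        """Return the set of keywords for which a correspondence was found."""
        found = set()
        for name in list(audio_features) + list(visual_patterns):
            if name in vocabulary:
                found.add(vocabulary[name])
        return found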
28. A non-transitory computer readable storage medium having computer-executable components comprising:
- computer readable code for capturing an audio-visual content;
- computer readable code for detecting an ambient sound from said audio-visual content;
- computer readable code for searching and recognizing at least one feature from said ambient sound, and searching a correspondence between said feature and at least one keyword;
- computer readable code for searching and recognizing at least one visual pattern from said audio-visual content, and searching a correspondence between said pattern and the at least one keyword;
- computer readable code for choosing the at least one keyword based on the correspondences found between said ambient sound and said at least one pattern and the keywords;
- computer readable code for receiving and storing contextual data relating to the situation where the content is to be captured;
- computer readable code for gathering movement and directional information of an apparatus that is arranged to capture said audio-visual content; and
- computer readable code for determining if there is a salient highlight in the audio-visual content based on at least one of the contextual data, the movement and directional information of the apparatus, the features found from the ambient sound corresponding to the at least one keyword, and the visual patterns found from the audio-visual content corresponding to the at least one keyword.
An apparatus comprising:
- at least one processor; and
- at least one memory comprising computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus to:
- capture an audio-visual content;
- detect ambient sound from said audio-visual content;
- search and recognize at least one feature from said ambient sound;
- search a correspondence between said at least one feature and at least one keyword;
- search and recognize at least one visual pattern from said audio-visual content;
- search a correspondence between said at least one pattern and the at least one keyword;
- gather and analyze movement and directional information of an apparatus used for capturing said audio-visual content;
- determine if there is a salient highlight in the audio-visual content, based on at least one of contextual data relating to a situation where the content is to be captured, movement of the apparatus, directional information of the apparatus, the features found from the ambient sound corresponding to the at least one keyword, or the at least one visual pattern; and
- if the salient highlight is found and recognized, label it based on the at least one keyword connected to the recognized salient highlight event.
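Finally, the gathering and analysis of movement and directional information referred to throughout the claims might, for example, score how strongly gyroscope and compass readings suggest the device is panning to follow an event; the sensor choice, thresholds and weights below are illustrative assumptions only.

    # Sketch of gathering and analysing movement and directional information of
    # the capturing apparatus; sensors, thresholds and weights are illustrative.
    import math

    def motion_score(gyro_rates, compass_headings):
        """gyro_rates: angular rates in rad/s; compass_headings: headings in
        radians over the same window. Returns a 0..1 score for how strongly
        the motion suggests the device is panning to follow an event."""
        if not gyro_rates or len(compass_headings) < 2:
            return 0.0
        mean_rate = sum(abs(r) for r in gyro_rates) / len(gyro_rates)
        # Net heading change over the window (wrap-around ignored for brevity).
        heading_change = abs(compass_headings[-1] - compass_headings[0])
        panning = min(mean_rate / 0.5, 1.0)        # ~0.5 rad/s taken as a steady pan
        direction = min(heading_change / math.pi, 1.0)
        return 0.5 * panning + 0.5 * direction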
PCT/FI2015/050861 2014-12-31 2015-12-08 An apparatus, a method, a circuitry, a multimedia communication system and a computer program product for selecting field-of-view of interest WO2016107965A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB1423368.8A GB2533924A (en) 2014-12-31 2014-12-31 An apparatus, a method, a circuitry, a multimedia communication system and a computer program product for selecting field-of-view of interest
GB1423368.8 2014-12-31

Publications (1)

Publication Number Publication Date
WO2016107965A1 true WO2016107965A1 (en) 2016-07-07

Family

ID=52471648

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/FI2015/050861 WO2016107965A1 (en) 2014-12-31 2015-12-08 An apparatus, a method, a circuitry, a multimedia communication system and a computer program product for selecting field-of-view of interest

Country Status (2)

Country Link
GB (1) GB2533924A (en)
WO (1) WO2016107965A1 (en)

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080138029A1 (en) * 2004-07-23 2008-06-12 Changsheng Xu System and Method For Replay Generation For Broadcast Video
US9141860B2 (en) * 2008-11-17 2015-09-22 Liveclips Llc Method and system for segmenting and transmitting on-demand live-action video in real-time

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5841470A (en) * 1992-06-29 1998-11-24 British Telecommunications Public Limited Company Coding and decoding video signals
US5764803A (en) * 1996-04-03 1998-06-09 Lucent Technologies Inc. Motion-adaptive modelling of scene content for very low bit rate model-assisted coding of video sequences
US20070127879A1 (en) * 2005-12-06 2007-06-07 Bellsouth Intellectual Property Corporation Audio/video reproducing systems, methods and computer program products that modify audio/video electrical signals in response to specific sounds/images
US20100023485A1 (en) * 2008-07-25 2010-01-28 Hung-Yi Cheng Chu Method of generating audiovisual content through meta-data analysis
US20110102619A1 (en) * 2009-11-04 2011-05-05 Niinami Norikatsu Imaging apparatus

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
IZADINIA, H. ET AL.: "Multimodal Analysis for Identification and Segmentation of Moving-Sounding Objects", IEEE TR. MULTIMEDIA, vol. 15, no. 2, February 2013 (2013-02-01), pages 378 - 390, Retrieved from the Internet <URL:http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6357311&> [retrieved on 20160322] *
KANG, Y-L. ET AL.: "Goal detection in soccer video using audio/visual keywords", INT. CONF. IMAGE PROCESSING (ICIP, vol. 3, 2004, pages 1629 - 1632, Retrieved from the Internet <URL:http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=1421381&> [retrieved on 20160322] *
MAVLANKAR, A. ET AL.: "Video Streaming with Interactive Pan/Tilt/Zoom", preprint of a chapter in the book "High Quality Visual Experience: Creation, Processing and Interactivity of High-Resolution and High-Dimensional Video Signals", 2010, pages 431 - 455, Retrieved from the Internet <URL:http://web.stanford.edu/~bgirod/pdfs/Mavlankar_Girod_Chapter_Oct09.pdf> *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10990831B2 (en) 2018-01-05 2021-04-27 Pcms Holdings, Inc. Method to create a VR event by evaluating third party information and re-providing the processed information in real-time
WO2020101362A1 (en) 2018-11-14 2020-05-22 Samsung Electronics Co., Ltd. Method for recording multimedia file and electronic device thereof
CN113039779A (en) * 2018-11-14 2021-06-25 三星电子株式会社 Method for recording multimedia file and electronic equipment thereof
EP3850830A4 (en) * 2018-11-14 2021-12-08 Samsung Electronics Co., Ltd. Method for recording multimedia file and electronic device thereof
US11606493B2 (en) 2018-11-14 2023-03-14 Samsung Electronics Co., Ltd. Method for recording multimedia file and electronic device thereof
GB2587627A (en) * 2019-10-01 2021-04-07 Sony Interactive Entertainment Inc Apparatus and method for generating a recording
GB2587627B (en) * 2019-10-01 2023-05-03 Sony Interactive Entertainment Inc Apparatus and method for generating a recording
US11783007B2 (en) 2019-10-01 2023-10-10 Sony Interactive Entertainment Inc. Apparatus and method for generating a recording
CN114822512A (en) * 2022-06-29 2022-07-29 腾讯科技(深圳)有限公司 Audio data processing method and device, electronic equipment and storage medium
CN114822512B (en) * 2022-06-29 2022-09-02 腾讯科技(深圳)有限公司 Audio data processing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
GB201423368D0 (en) 2015-02-11
GB2533924A (en) 2016-07-13

Similar Documents

Publication Publication Date Title
US11461904B2 (en) Determining one or more events in content
US11810597B2 (en) Video ingestion and clip creation
JP7033587B2 (en) How and system to automatically create video highlights
Vanderplaetse et al. Improved soccer action spotting using both audio and video streams
WO2010125962A1 (en) Display control device, display control method, and program
JP4577774B2 (en) Sports video classification device and log generation device
WO2016107965A1 (en) An apparatus, a method, a circuitry, a multimedia communication system and a computer program product for selecting field-of-view of interest
WO2015041915A1 (en) Channel program recommendation on a display device
JPWO2006025272A1 (en) Video classification device, video classification program, video search device, and video search program
JP6649231B2 (en) Search device, search method and program
CN114449252B (en) Method, device, equipment, system and medium for dynamically adjusting field video based on explanation audio
CN103945234A (en) Video-related information providing method and device
KR20210038537A (en) Information processing apparatus and method, and program
JP2011205599A (en) Signal processing apparatus
CN111741325A (en) Video playing method and device, electronic equipment and computer readable storage medium
US8768945B2 (en) System and method of enabling identification of a right event sound corresponding to an impact related event
Gade et al. Audio-visual classification of sports types
US11736775B1 (en) Artificial intelligence audio descriptions for live events
Nieto et al. An automatic system for sports analytics in multi-camera tennis videos
Schroth et al. Exploiting prior knowledge in mobile visual location recognition
CN114666457A (en) Video and audio program broadcasting guide method, device, equipment, system and medium
CN110969133B (en) Intelligent data acquisition method for table tennis game video
JP2016010102A (en) Information presentation system
CN112287771A (en) Method, apparatus, server and medium for detecting video event
JP2006333189A (en) Image summary generator

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15875301

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15875301

Country of ref document: EP

Kind code of ref document: A1