CN107534797B - Method and system for enhancing media recording - Google Patents

Method and system for enhancing media recording

Info

Publication number
CN107534797B
CN107534797B
Authority
CN
China
Prior art keywords
visual content
recording
screen
media
original version
Prior art date
Legal status
Active
Application number
CN201680023726.4A
Other languages
Chinese (zh)
Other versions
CN107534797A
Inventor
H.M. Stokking
M. Prins
O.A. Niamut
R. Koenen
E. Thomas
Current Assignee
Nederlandse Organisatie voor Toegepast Natuurwetenschappelijk Onderzoek TNO
Koninklijke KPN NV
Original Assignee
Nederlandse Organisatie voor Toegepast Natuurwetenschappelijk Onderzoek TNO
Koninklijke KPN NV
Priority date
Filing date
Publication date
Application filed by Nederlandse Organisatie voor Toegepast Natuurwetenschappelijk Onderzoek TNO and Koninklijke KPN NV
Publication of CN107534797A
Application granted
Publication of CN107534797B

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/47 End-user applications
    • H04N 21/472 End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content
    • H04N 21/4728 End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content, for selecting a Region Of Interest [ROI], e.g. for requesting a higher resolution version of a selected region
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 5/00 Details of television systems
    • H04N 5/222 Studio circuitry; Studio devices; Studio equipment
    • H04N 5/262 Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
    • H04N 5/272 Means for inserting a foreground image in a background image, i.e. inlay, outlay

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Television Signal Processing For Recording (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Studio Devices (AREA)

Abstract

Systems and methods are provided for enhancing a media recording that comprises a camera recording of a scene, where the scene includes a screen displaying visual content. In the camera recording, the visual content as displayed on the screen is often of poor quality. An enhanced media recording is obtained by analyzing the camera recording, accessing the original version of the visual content, and replacing, in the camera recording, the visual content displayed on the screen with the original version of the visual content. The enhanced media recording thereby avoids the "digital-to-light-to-digital" conversion of the visual content, which is at least one reason why the visual content has poor quality in camera recordings.

Description

Method and system for enhancing media recording
Technical Field
The present invention relates to a system and method for enhancing media recording. The invention also relates to a transmitter device or a receiver device for use in a system. The invention also relates to a computer program product comprising instructions for causing a processor system to perform the method.
Background
Due to the popularity of digital cameras and screens, it often happens that a camera recording of a scene includes a screen displaying visual content, which thereby becomes part of the camera recording. This may happen incidentally. For example, when a home video is recorded in a person's living room using a digital video camera, there may be a television in the background showing a television program. The home video thus includes a camera recording of the television and of the visual content that was playing on the television at the time of recording.
Media recordings may also, in a more structural manner, include a camera recording of a screen displaying visual content. Here and in the following, the term "screen" refers to displays such as those included in televisions, monitors, tablet devices, smart phones, etc., including two-dimensional, three-dimensional, light-field and holographic displays, but also to projection screens and other types of surfaces on which visual content may be rendered, and to other types of visual renderings of visual content.
Non-limiting examples of such more structural recordings of screens displaying visual content can be found in the field of video conferencing systems and mobile video communication applications (e.g., Skype, Lync, WebRTC, FaceTime), which allow remotely located people to have real-time conversations by recording audio via a microphone and video via a camera and sending the resulting media recording to the participants. Initially, video conferencing systems focused on recording only the people involved in the conversation, as people would typically be sitting in front of the camera. Advances in camera recording technology, such as increased resolution and larger viewing angles, make it possible to record far more than just a person; the camera may also record his or her environment, such as a living room or office cubicle, including any screens that may be present, such as a television screen showing television content or a tablet device showing visual media. In addition, video conferencing technology is increasingly being used for shared experiences, where participants use video conferencing to share their activities and environment for others to see and join. For example, in a social television experience, participants share their experience of watching a television content item, enabling others to see their room and their television screen. As another example, users may also intentionally record their television screen in order to comment on the content being displayed and share the resulting recording with other users.
Therefore, today's camera recordings often include screens displaying visual content. A significant disadvantage, however, is that in such camera recordings the visual content displayed on the screen is often poorly represented; other parts of the scene generally look better, or even much better.
There may be various reasons for this, including but not limited to:
- interference between the sensor raster of the camera and the raster of the screen, causing a moiré effect (spatial interference);
- a mismatch between the refresh rate of the visual content on the screen and the sampling rate of the camera (temporal interference);
- the dynamic range of the scene and the lighting conditions (indoors, the screen is usually much brighter than the environment, which leads to overexposure; outdoors during the day, the opposite may happen, i.e., underexposure);
- movement of the camera relative to the screen;
- the quality of the camera used for the camera recording;
- recording artifacts (tearing, aliasing, interlacing);
- encoder settings, in case the media recording is encoded;
- the viewing angle of the camera relative to the screen.
To improve the quality of the visual content in the camera recording, one may choose to increase the quality of the camera recording, for example by increasing the recording resolution, frame rate and/or video quality. Disadvantageously, this may result in a much larger camera recording in terms of data size. This may be undesirable or intolerable due to bandwidth or storage limitations, and may not be possible with commonly available recording devices, such as smart phones or tablets that do not offer such high-quality camera functionality. Further, even when feasible, an increase in recording quality cannot solve all problems, such as the dynamic range problem.
Disclosure of Invention
It would be advantageous to obtain a system or method for enhancing a media recording comprising a camera recording of a scene to obtain an enhanced media recording, wherein the scene comprises a screen displaying visual content.
The following aspects of the invention relate to replacing visual content shown on the screen with an originally recorded or generated version in the camera recording. Thus, a "digital-to-light-to-digital" conversion step can be avoided, which is at least one reason for the poor quality of the visual content in the camera recording. That is, in camera recording, the visual content is shown after conversion from the digital domain to the optical domain by display and then back into the digital domain by camera recording.
According to a first aspect of the present invention, there may be provided a method for enhancing media recording, the method may comprise:
-accessing the media recording, the media recording comprising a camera recording of a scene, the scene comprising a screen displaying visual content;
-analyzing the camera recording to determine coordinates of the screen in the camera recording;
-accessing an original version of the visual content; and
-replacing, in the camera recording and using the coordinates of the screen, the visual content displayed on the screen with the original version of the visual content, thereby obtaining an enhanced media recording.
According to another aspect of the invention, a computer program for causing a processor system to perform the method may be provided.
According to another aspect of the invention, a system for enhancing media recording may be provided, the system may comprise:
-a first input interface for accessing the media recording, the media recording comprising a camera recording of a scene, the scene comprising a screen displaying visual content;
-an analysis subsystem for analyzing the camera recording to determine coordinates of the screen in the camera recording;
-a second input interface for accessing an original version of the visual content; and
-a replacement subsystem for replacing, in the camera recording and using the coordinates of the screen, the visual content displayed on the screen with the original version of the visual content, thereby obtaining an enhanced media recording.
According to other aspects of the invention, a transmitter device and a receiver device may be provided for use in the system.
The above measures involve accessing a media recording comprising at least a camera recording of a scene. For example, a media stream may be accessed that represents an encoded version of the media recording. Another example is that still images made by a camera may be accessed. The camera recording is of a scene including a screen displaying visual content. The camera recording may, for example, show the screen displaying the visual content only intermittently, or show only a portion of it, e.g., if the screen is only partially included in the recording frame of the camera recording, or if a portion of the screen is covered by another object in the scene.
The camera recording may be analyzed to determine the location of the screen in the camera recording. The location may be expressed as coordinates. For example, in the case of a rectangular screen, the coordinates may represent one or more corners of the screen. The coordinates may take any suitable form, such as image grid coordinates (column number, row number) or normalized image coordinates.
The original version of the visual content may then be accessed. Herein, the term "original version" refers to a version obtained without the indirection of a camera recording of a screen displaying the visual content. Rather, the original version represents the version as originally recorded or generated. A non-limiting example is that, if the visual content shown on the screen is obtained by the playout of a media stream, the same media stream is accessed. As another example, a television may show a particular television channel, and a TV signal containing that same television channel, or a recorded version of that television channel, may be accessed as the original version of the visual content. Yet another example is that, if the visual content shown on the screen represents a slide from a presentation, the computer file of the presentation is accessed. The original version of the content may be of higher quality compared to the camera recording of the visual content, since one or more of the reasons for the poor quality of the visual content in the media recording, as listed in the background section, may be avoided. In particular, the original version may avoid the "digital-to-light-to-digital" conversion step of converting the visual content from the digital domain to the optical domain by display and then back into the digital domain by camera recording.
The visual content displayed on the screen may then be replaced in the camera recording with the original version of the visual content. For this purpose, the coordinates of the screen can be used. For example, the original version of the visual content may be overlaid on top of the screen in the camera recording, thereby replacing the recorded version of the visual content in the camera recording. An enhanced media recording may thus be obtained, since the original version of the visual content may be better in quality than the visual content shown in the camera recording.
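By way of illustration only, the following Python/OpenCV sketch shows one possible way of performing such a coordinate-based replacement; the frame variables, the corner ordering and the hard replacement (rather than blending) are assumptions, and the claimed method does not prescribe this particular implementation.

    import cv2
    import numpy as np

    def replace_screen(camera_frame, original_frame, screen_corners):
        """screen_corners: 4x2 array of pixel coordinates, ordered TL, TR, BR, BL."""
        h, w = original_frame.shape[:2]
        src = np.float32([[0, 0], [w - 1, 0], [w - 1, h - 1], [0, h - 1]])
        dst = np.float32(screen_corners)

        # Perspective transform mapping the original frame onto the screen area.
        M = cv2.getPerspectiveTransform(src, dst)
        warped = cv2.warpPerspective(original_frame, M,
                                     (camera_frame.shape[1], camera_frame.shape[0]))

        # Mask covering the screen quadrilateral in the camera frame.
        mask = np.zeros(camera_frame.shape[:2], dtype=np.uint8)
        cv2.fillConvexPoly(mask, dst.astype(np.int32), 255)

        # Replace the recorded screen content with the warped original version.
        enhanced = camera_frame.copy()
        enhanced[mask == 255] = warped[mask == 255]
        return enhanced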
The present inventors have recognized that, with ever-increasing digitization, when a camera recording of a screen displaying visual content is obtained, the original version of the visual content is typically available in digital form and can be accessed. Such an original version may be used to replace the visual content as shown on the screen in the camera recording. By replacing the camera-recorded visual content with the original version of the visual content, the quality of the visual content may be improved. Another advantage is that it may not be necessary to otherwise improve the quality of the camera recording in order to better capture the visual content shown on the screen. Yet another advantage of replacing the visual content in the camera recording is that the original version may not need to be displayed in a separate window, for example as an inserted picture-in-picture or side-by-side with the camera recording, which might otherwise affect the composition of the scene. For example, if the camera recording shows a presenter pointing at the visual content, such pointing is preserved, whereas it would be lost if the visual content were shown separately. Yet another advantage may be that one or more, or even all, of the problems associated with recording screens as listed in the background section may be avoided.
In an embodiment, accessing the original version of the visual content may comprise:
-identifying visual content displayed on the screen;
-based on the displayed visual content having been identified, identifying a resource location comprising an original version of the visual content; and
-accessing an original version of the visual content from the resource location.
While there are several possibilities for accessing the original version of the visual content, it may sometimes be necessary or desirable to identify the visual content displayed on the screen in order to access the original version of the visual content. For example, if there are multiple media streams available at the resource location, each representing different visual content, the appropriate media stream may be retrieved after the visual content displayed on the screen has been identified. Thus, the visual content may be identified first, and based thereon, a resource location may be identified that includes an original version of the visual content. Herein, the term "resource" may refer to a server, a storage medium, a broadcast channel, etc., while the "resource location" may represent information that allows access to the resource, such as an internet address, e.g., a Uniform Resource Locator (URL).
In an embodiment, identifying the visual content displayed on the screen may include:
-identifying content data of a camera recording associated with visual content displayed on the screen;
-applying an automatic content recognition technique to the content data to identify the visual content.
Visual content may be identified by applying automatic content recognition techniques to the media recording. Such automatic content recognition is known per se. An advantage of using automatic content recognition may be that additional information may not need to be obtained from the recording location (such as playout information from a media device that plays out the visual content on the screen) to identify the visual content. In fact, no additional information may be needed from such a media device. Note that automatic content recognition may still involve information exchange with other entities such as a content recognition database.
In an embodiment, the automatic content recognition technique may include determining at least one of: an audio watermark, a video watermark, or a fingerprint of the content data. For example, when using video watermarking, the automatic content recognition technique may be applied only to the area of the screen as shown in the camera recording, for example using the coordinates of the screen. Any suitable automatic content recognition technique may be used, as known per se from the field of automatic content recognition, including those based on watermarking and/or fingerprinting. Note that the content recognition may take into account additional or other information besides the visual data. For example, the visual content may be associated with audio content that may be identifiable by means of an audio watermark embedded in the audio content.
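As a rough, non-normative sketch of fingerprint-based identification, the following Python code computes a simple difference-hash ("dHash") fingerprint of the recorded screen region and matches it against fingerprints of candidate original frames; the hash size, distance threshold and index structure are illustrative assumptions, and deployed ACR systems use far more robust audio/video fingerprints or watermarks.

    import cv2
    import numpy as np

    def dhash(frame_bgr, size=8):
        gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
        small = cv2.resize(gray, (size + 1, size))
        diff = small[:, 1:] > small[:, :-1]      # sign of the horizontal gradient
        return diff.flatten()                    # 64-bit boolean fingerprint

    def hamming(a, b):
        return int(np.count_nonzero(a != b))

    def identify(screen_region, reference_index, max_distance=10):
        """reference_index: {content_id: fingerprint} for known original frames."""
        fp = dhash(screen_region)
        content_id, ref_fp = min(reference_index.items(),
                                 key=lambda kv: hamming(fp, kv[1]))
        return content_id if hamming(fp, ref_fp) <= max_distance else None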
In an embodiment, the visual content displayed on the screen may represent a playout by a media device, and identifying the visual content displayed on the screen may include obtaining playout information indicative of the visual content from the media device. The visual content displayed on the screen may represent a playout by a media device, such as a connected media player. Thus, the visual content may be identified by means of the media device. In particular, playout information generated by the media device and indicative of the visual content may be used. For example, the playout information may identify the media stream, including the resource location at which the media stream is available. As another example, the playout information may identify a program title.
In an embodiment, obtaining the playout information may comprise:
-querying a media device via a network for the playout information; or
-the media device sending said playout information via the network.
With the popularity of connected media devices, it has become possible to obtain playout information from such media devices via a (local) network. For example, a media device may broadcast or otherwise transmit its current activity, e.g., using multicast DNS, DLNA, DIAL, or other media protocols. The media device may also be queried for playout information, for example using the same or similar protocols.
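For illustration, the following Python sketch sends an SSDP M-SEARCH query (the discovery mechanism underlying UPnP and DIAL) on the local network and collects the raw responses; retrieving the actual playout information would then follow over the device-specific protocol. The search target and timeout are ordinary SSDP defaults, not values taken from this description.

    import socket

    def ssdp_search(search_target="ssdp:all", timeout=2.0):
        msg = "\r\n".join([
            "M-SEARCH * HTTP/1.1",
            "HOST: 239.255.255.250:1900",
            'MAN: "ssdp:discover"',
            "MX: 2",
            f"ST: {search_target}",
            "", ""]).encode("ascii")

        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 2)
        sock.settimeout(timeout)
        sock.sendto(msg, ("239.255.255.250", 1900))

        responses = []
        try:
            while True:
                data, addr = sock.recvfrom(65507)
                responses.append((addr, data.decode("utf-8", errors="replace")))
        except socket.timeout:
            pass
        finally:
            sock.close()
        return responses  # each response describes a device, e.g. a media renderer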
In an embodiment, the replacement of the visual content in the camera recording of the scene may include adjusting one or more visual properties of the original version of the visual content. The original version of the visual content may have a different appearance than the visual content in the camera recording of the scene, and may typically not match the appearance of the entire camera recording. Thus, one or more visual properties of the original version of the visual content may be adjusted at or before its insertion into the camera recording. This may provide a more pleasing, natural experience for the viewer of the media recording.
In an embodiment, the one or more visual properties may include one or more of the following: contrast, brightness, white balance, dynamic range, frame rate, spatial resolution, geometry, focus, 3D angle, 3D depth. The geometry of the visual content in the camera recording of the scene may be non-rectangular, e.g., due to camera distortion, misalignment of the camera with respect to the screen (e.g., the screen not being recorded head-on), etc. Thus, the geometry of the original version of the visual content may be adjusted at or before its insertion into the camera recording. Similarly, other visual properties may be adjusted to better match the appearance of the overall camera recording. If the camera recording is a three-dimensional (3D) recording, 3D parameters such as 3D angle or 3D depth may also be adjusted.
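A minimal sketch of one such adjustment, assuming geometry is handled separately by a perspective warp, is to match the per-channel mean and standard deviation of the original content to those of the screen region as recorded (or to statistics of the whole camera frame), which roughly aligns brightness, contrast and white balance; the function below is illustrative only.

    import numpy as np

    def match_color_statistics(original, recorded_screen_region):
        # Shift and scale each colour channel of the original content so that
        # its mean/std match those observed in the camera recording.
        src = original.astype(np.float32)
        ref = recorded_screen_region.astype(np.float32)
        adjusted = np.empty_like(src)
        for c in range(3):
            s_mean, s_std = src[..., c].mean(), src[..., c].std() + 1e-6
            r_mean, r_std = ref[..., c].mean(), ref[..., c].std() + 1e-6
            adjusted[..., c] = (src[..., c] - s_mean) / s_std * r_std + r_mean
        return np.clip(adjusted, 0, 255).astype(np.uint8)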
In an embodiment, the media recording may be obtained by a sender device for sending to a receiver device, the replacing of the visual content in the camera recording of the scene may be performed by the receiver device, and the method may further comprise:
-a sender device retrieving and subsequently sending an original version of the visual content to a receiver device; or
-the sender device sending metadata to the receiver device indicating a resource location from which the original version of the visual content is accessible, and the receiver device retrieving the original version of the visual content from the resource location based on the metadata.
The method may also be performed not by a single device but using several devices, such as those of a sender/receiver system, where a media recording may be obtained by a sender device for sending to a receiver device, and where the receiver device then replaces the visual content in the camera recording of the scene with the original version of the visual content. An example of such a system is a video conferencing system. In this particular example, each video conference client may act both as a sender device for sending a locally recorded media stream and as a receiver device for receiving remotely recorded media stream(s). However, there may also be a unidirectional transmission of the media recording from the sender device to the receiver device. In general, there are several ways in which the receiver device may retrieve the original version of the visual content from a resource location. For example, the sender device may retrieve and then send the original version of the visual content to the receiver device, or may send metadata to the receiver device indicating a resource location from which the original version of the visual content may be accessed. In general, the receiver device may be a playout device for playing out the enhanced media recording. However, the receiver device may also be an intermediate device that further sends the enhanced media recording to one or more playout devices.
In an embodiment, a transmitter device may include:
-a first input interface for accessing the media recording; and
-an analysis subsystem for analyzing the camera recording to determine coordinates of the screen in the camera recording.
In an embodiment, a receiver device may comprise:
-a second input interface for accessing an original version of the visual content; and
-a replacement subsystem for replacing, in the camera recording and using the coordinates of the screen, the visual content displayed on the screen with the original version of the visual content, thereby obtaining an enhanced media recording.
In an embodiment, the method may further comprise the sender device including the coordinates of the screen in the camera recording in the metadata. Thus, the receiver device may no longer need to determine the coordinates of the screen in the camera recording, as such coordinates may be determined and made available by the sender device. Metadata to that effect may be provided.
In an embodiment, the receiver device may comprise, in addition to the second input interface and the replacement subsystem:
-a first input interface for accessing the media recording; and
-an analysis subsystem for analyzing the camera recording to determine coordinates of the screen in the camera recording.
Thus, the receiver device may perform all of the claimed operations. For example, the receiver device may use automatic content recognition techniques to identify the visual content to be replaced, retrieve an original version of the visual content, and insert the original version into the camera recording.
Those skilled in the art will appreciate that two or more of the above-described embodiments, implementations and/or aspects of the invention can be combined in any manner that is deemed useful.
Modifications and variations of the method and/or of the computer program product, which correspond to the described modifications and variations of the system, can be carried out by a person skilled in the art on the basis of the present description.
Drawings
These and other aspects of the invention are apparent from and will be elucidated with reference to the embodiments described hereinafter. In the drawings:
FIG. 1A illustrates a recording device in the form of a video camera that records a scene including a person and a screen displaying visual content;
FIG. 1B shows the resulting camera recording, where the visual content as displayed on the screen is shown to be of sub-optimal quality;
FIG. 2 illustrates a method of enhancing a media recording, wherein visual content displayed on a screen is replaced with an original version of the visual content, thereby obtaining an enhanced media recording;
FIG. 3 shows a computer program product comprising instructions for causing a processor system to perform the method;
FIG. 4 illustrates a system for enhancing a media recording, wherein visual content displayed on a screen is replaced with an original version of the visual content, thereby obtaining an enhanced media recording;
FIG. 5 illustrates a recording device making available a media recording of a scene including a screen displaying visual content and a transmitter device using the media recording to generate metadata indicating a location of a resource including an original version of the visual content;
FIG. 6 shows a receiver device receiving metadata from a sender device, wherein the metadata is used to access an original version of visual content to replace content displayed on a screen in a media recording;
FIG. 7 illustrates a system for enhancing media recording in which a media device playing out visual content provides visual content directly to the system;
FIG. 8A illustrates an example of a system actively polling a network for the presence of a media device in the network; and
FIG. 8B illustrates an example of a media device multicasting its presence to the system via a notification message;
it should be noted that items which have the same reference numbers in different figures, have the same structural features and the same functions, or are the same signals. Where the function and/or structure of such an item has been explained, it is not necessary to repeat its explanation in the detailed description.
List of reference numerals
The following list of reference numerals is provided for ease of explanation of the figures and should not be construed as limiting the claims.
010 screen for displaying visual content
012 media device
015 human being
020 recording apparatus
022 field of view of a recording device
030 media recording
030X media stream of the media recording
032 camera recording of the scene
034 visual content as displayed on the screen, as recorded by the camera
040 enhanced media recording
042 enhanced camera recording of the scene
050 communication to the replacement subsystem
052 metadata
060 original version of the visual content
060X media stream of the original version of the visual content
062 adjusted version of visual content
064 resource location information
100 system for enhancing media recording
110 first input interface
120 analysis subsystem
130 second input interface
140 replacement subsystem
142 renderer of the replacement subsystem
144 scene compositor of the replacement subsystem
200 method for enhancing a media recording
210 accessing the media recording
220 analyzing the camera recording
230 accessing an original version of the visual content
240 replacing the visual content displayed on the screen
250 computer readable medium
260 computer program stored as non-transitory data
300 transmitter device comprising an analysis subsystem
400 receiver device comprising a replacement subsystem.
Detailed Description
The following embodiments of the systems and methods relate to replacing visual content shown on a screen in a camera recording with an originally recorded or generated version. Thus, a (much) improved quality of the visual content in the camera recording may be obtained. A general explanation is provided with reference to fig. 1-4, while fig. 5-7 illustrate specific embodiments. None of these examples should be construed as representing limitations of the present invention.
Fig. 1A shows a recording device 020 in the form of a camera recording a scene comprising a person 015 and a screen 010 displaying visual content. In this example and in the following examples, the screen 010 is shown by way of example as the screen of a television 010 and is therefore indicated as "TV" in the figures. However, this is not limiting, as the screen 010 may take any suitable form, as also indicated in the following paragraphs. The field of view 022 of the camera 020 is schematically indicated. Fig. 1B shows the resulting camera recording 032. It can be seen that the person and the television are shown in the camera recording 032. However, as also symbolically indicated by the pattern covering the screen 010, the visual content 034 as displayed on the screen has sub-optimal quality in the camera recording 032. Possible reasons for this have been set out in the background and introductory sections. One particular reason is the "digital-to-light-to-digital" conversion step, since the visual content 034 is shown in the camera recording 032 after having been converted from the digital domain to the optical domain by the television 010 and then back into the digital domain by the camera 020 recording the scene.
Fig. 2 shows a method 200 of enhancing a media recording, wherein visual content displayed on a screen is replaced with an original version of the visual content, thereby obtaining an enhanced media recording. The method 200 includes accessing a media recording in an operation 210 entitled "accessing the media recording", the media recording comprising a camera recording of a scene, the scene comprising a screen displaying visual content. The method 200 also includes analyzing the camera recording to determine coordinates of the screen in the camera recording in an operation 220 entitled "analyzing the camera recording". The method 200 also includes accessing the original version of the visual content in an operation 230 entitled "accessing an original version of the visual content". The method 200 also includes replacing, in the camera recording and using the coordinates of the screen, the visual content displayed on the screen with the original version of the visual content in an operation 240 entitled "replacing the visual content displayed on the screen", thereby obtaining an enhanced media recording. Note that while fig. 2 shows the operations 210-240 being performed sequentially, these operations may be performed in any suitable order, e.g., sequentially, simultaneously, or a combination thereof, subject, where applicable, to a particular order being necessitated, e.g., by input/output relationships.
It will be appreciated that the method according to the invention may be implemented in the form of a computer program comprising instructions for causing a processor system to perform the method. The method can also be implemented in dedicated hardware or as a combination of the above.
The computer program may be stored on a computer readable medium in a non-transitory manner. The non-transitory storage may include providing a series of machine-readable physical marks and/or a series of elements having different electrical, e.g. magnetic or optical, properties or values. Fig. 3 shows a computer program product comprising a computer readable medium 250 and a computer program 260 stored thereon. Examples of computer program products include memory devices, optical storage devices, integrated circuits, servers, online software, and so forth.
Fig. 4 shows a system 100 for enhancing a camera recording, wherein visual content displayed on a screen is replaced with an original version of the visual content, thereby obtaining an enhanced camera recording. The operations of system 100 may correspond to performance of method 200 of fig. 2, and vice versa.
Note that the camera recording may be part of the overall media recording, which may include additional components such as, for example, subtitle overlays, additional audio tracks, various metadata, and so forth. However, the media recording may also consist of camera recording only. Thus, the two terms may be used interchangeably where appropriate. It is also noted that the camera recording may be a video, but may equally comprise or consist of one or more still images.
The system 100 is shown to include a first input interface 110 for accessing a media recording 030. The first input interface 110 may take any suitable form, such as a network interface to a local or wide area network, a storage interface to an internal or external data store, and so forth. The media recording 030 may be pre-recorded, but may also be a real-time "live" stream. As also shown in fig. 4, the first input interface 110 may optionally include a decoder for decoding the media stream 030X of the media recording 030, thereby making the media recording 030, or portions thereof, available in an uncompressed or generally other format. For example, the decoder may make available one or more video frames of camera recording 032.
The system 100 is also shown to include an analysis subsystem 120 for analyzing camera recordings. Such analysis may involve determining the coordinates of the screen in the camera recording. However, as will be set forth in the following paragraphs, the analysis subsystem 120 may also have other (e.g., additional) functionality. The coordinates may be determined by image analysis techniques, as is known per se from the field of image analysis. Examples of such techniques are described in the following paragraphs with reference to screen tracking.
The system 100 is also shown to include a second input interface 130 for accessing an original version of the visual content. Like the first input interface 110, the second input interface 130 may be of any suitable type, such as a network interface to a local or wide area network, a storage interface to an internal or external data store, and so forth. The original version 060 may be pre-recorded, but may also be a real-time "live" stream. As also shown in fig. 4, the second input interface 130 may optionally comprise a decoder for decoding a media stream 060X of the original version 060 of the visual content, thereby making said original version 060, or parts thereof, available in an uncompressed format or in general another format. For example, if the coordinates of the screen are made available to the decoder, the decoder may make available one or more image frames, or a portion of the image frame(s), of the original version 060. If the original version is obtained in a form that does not require the use of a decoder, the second input interface 130 may make the image frame(s) available directly.
The system 100 is also shown to include a replacement subsystem 140 for replacing, in the camera recording 032 and using the coordinates of the screen, the visual content displayed on the screen with the original version 060 of the visual content, thereby obtaining an enhanced camera recording 042, and thus an enhanced media recording 040. To this end, the replacement subsystem is shown receiving the original version 060 of the visual content from the second input interface 130 and the media recording 030 from the first input interface 110. However, as will be shown with reference to figs. 5-7, the replacement subsystem may also receive the media recording 030 from a different source. The analysis subsystem 120 is also shown communicating data 050 to the replacement subsystem 140, which data 050 may include the coordinates of the screen as determined by the analysis subsystem 120.
General aspects
In general, embodiments of systems and methods may include:
- detecting a screen that is completely, partially or potentially present in the camera recording, e.g. by analyzing the camera recording or via other mechanisms;
- identifying whether, and if so which, visual content the detected screen displays;
- resolving the original version of the visual content, e.g. by determining a suitable resource location comprising the original version of the visual content;
- processing the original version of the visual content to spatially (e.g. geometrically) and/or temporally register it with the camera recording;
- tracking the screen in the camera recording, e.g. by detecting its coordinates, and storing the tracking data in associated metadata, so as to enable the visual content in the camera recording to be replaced with the original version; and
- replacing the visual content in the camera recording with the original version of the visual content, using the generated metadata.
Functions relating to the analysis of the camera recording may be performed by the analysis subsystem, and the other functions by the replacement subsystem. For example, the analysis subsystem may detect a media device that is deemed to render the visual content on the screen. Note that in some cases the screen may include a media device, or vice versa, such as in the case of a television with integrated media player functionality. However, in other cases, the media device may be connected to the screen directly or indirectly. Examples of media devices include, but are not limited to, televisions, monitors, projectors, media players and recorders, set-top boxes, smart phones, cameras, PCs, laptops, tablet devices, smart watches, smart glasses, professional video equipment, and the like.
Detecting media devices
Detecting a media device playing visual content may include one or more of:
image analysis techniques can be used to detect the media device in the camera recording itself. The image analysis technique may be performed locally by the analysis subsystem or remotely by the analysis subsystem forwarding the camera recording to a remote image analysis component. An example of such a remote image analysis component is http:// idtv. Suitable image analysis techniques are known per se from the field of image analysis and computer vision, which are consulted at 2015, 4.15 days, for example by Richard Szelisk, 2010, at http:// szelisk.org/Book/drafts/Szeliskibook-20100903 _ draft.pdf "Computer Vision:Algorithms and Applications"is described in.
A media device may announce its activity on the local network, e.g. using multicast DNS, DLNA, DIAL or other media protocols. By way of example, such an announcement may include "playing channel 1; URL = ……".
The analysis subsystem may query the media device for its presence and activity, e.g. via a local network.
The user may manually configure the presence and/or activity of the media device, e.g. via a graphical user interface.
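The image-analysis option in the first bullet could, for example, look for bright, roughly quadrilateral regions in the camera frame, as in the hedged Python/OpenCV (4.x) sketch below; the thresholds and the brightness assumption are illustrative, and practical detectors are considerably more robust (temporal consistency, device models, machine learning).

    import cv2
    import numpy as np

    def find_screen_candidates(camera_frame, min_area=5000):
        gray = cv2.cvtColor(camera_frame, cv2.COLOR_BGR2GRAY)
        blurred = cv2.GaussianBlur(gray, (5, 5), 0)
        # Screens are often brighter than their surroundings (see the dynamic-range
        # discussion above), so a simple intensity threshold is used here.
        _, thresh = cv2.threshold(blurred, 200, 255, cv2.THRESH_BINARY)
        contours, _ = cv2.findContours(thresh, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        candidates = []
        for contour in contours:
            if cv2.contourArea(contour) < min_area:
                continue
            approx = cv2.approxPolyDP(contour,
                                      0.02 * cv2.arcLength(contour, True), True)
            if len(approx) == 4:            # four corners -> screen-like region
                candidates.append(approx.reshape(4, 2))
        return candidates                   # corner coordinates per candidate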
Identifying visual content
Identifying visual content played out by a media device may include one or more of:
the media device may signal which media is being played out, e.g. by signaling a TV channel identifier ("BBC 1"), or may be queried for this information.
The media device may provide additional information about the media source, such as a URL to the source of the media ("http:// webserver/bbc1. mpd").
The visual content may be identified by an analysis subsystem which identifies content data of a camera recording associated with the visual content displayed on the screen and subsequently applies automatic content recognition techniques to the content data to identify the visual content. The automatic content recognition technique may include determining one or more of the following: audio watermarking, video watermarking, or fingerprinting of content data. This may require an index to such content with an appropriate type of identifier.
The user may provide the media source manually, for example by providing a link to a media device presenting the source of the visual content being played out.
Note that visual content may, for example, be described as metadata using the television domain name system (TV-DNS) (http://www.w3.org/TR/TVWeb-URI-Requirements, http://tools.ietf.org/html/rfc2838), and may therefore be announced, signaled or stored in the form of such metadata.
In the case where the camera recording is a video recording rather than, for example, a still image, the analysis subsystem may track the screen in the video recording, or may track the media device in the video recording, for example if the screen is included in the media device. Herein, the term tracking may refer to identifying one or more coordinates of the screen over time (e.g., in different image frames). Such tracking may enable spatially accurate replacement of the visual content shown on the screen. That is, the camera and the screen may move relative to each other over time, such that the screen is located at different image coordinates. To track the screen, image and/or object tracking techniques may be used, as are well known in the art and widely available. For example, the CDVS standard ISO/IEC FDIS 15938-13 (the latest released version at the time of this invention) provides a way to extract visual features (keypoints and their coordinates) from images and compress them into a compact bitstream. The tracking data may be stored with the recording as associated metadata. The metadata may also contain device motion information, timing information (e.g., for synchronization purposes) and occlusion information. Annotations relating to video may be expressed using the MPEG-7 standard ISO/IEC 15938-3, which allows spatio-temporal annotations. For example, the standard allows the coordinates of a region (e.g., an object) to be expressed over multiple frames (i.e., from time t1 to time t2 of the video), which can be used to track the screen in the video recording.
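As a simplified illustration of such tracking, the sketch below follows four detected screen corners from frame to frame with pyramidal Lucas-Kanade optical flow, yielding per-frame coordinates that could then be serialized into the timed metadata described above; this is merely one possible technique, not the CDVS- or MPEG-7-based approach mentioned in the text.

    import cv2
    import numpy as np

    def track_screen(frames, initial_corners):
        """frames: list of BGR frames; initial_corners: 4x2 float array."""
        prev_gray = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
        pts = np.float32(initial_corners).reshape(-1, 1, 2)
        coordinates_per_frame = [pts.reshape(4, 2).tolist()]

        for frame in frames[1:]:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            next_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None)
            if status is None or not status.all():
                break                          # a corner was lost, e.g. occlusion
            coordinates_per_frame.append(next_pts.reshape(4, 2).tolist())
            prev_gray, pts = gray, next_pts
        return coordinates_per_frame           # e.g. serialized into metadata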
Accessing an original version of visual content
Accessing an original version of visual content may involve the media device itself providing the original version of the visual content, for example by streaming it as a media stream in the form of an MPEG-DASH stream. Alternatively or additionally, a resource location that includes the original version may be identified. For example, metadata made available to the replacement subsystem may contain a brief identification of the TV channel being shown on the screen in the camera recording, such as the identifier "BBC1". The replacement subsystem may then identify and access the channel "BBC1", e.g., via an Internet Protocol Television (IPTV) service, from which the media stream of the visual content may be accessed.
Replacing visual content
After obtaining access to the original version of the visual content, the visual content displayed on the screen may be replaced with the original version of the visual content, thereby obtaining an enhanced media recording. Such replacement may, but need not, be performed in real-time and in a synchronized manner such that the visual content in the enhanced media recording is synchronized, at least to some extent, with the visual content previously shown in the media recording. The synchronization aspect will be further elucidated with reference to "temporal registration".
Replacing the visual content displayed on the screen with the original version of the visual content may be performed in a variety of ways. For example, the replacement subsystem may overlay or otherwise insert an original version of the visual content into the camera recording. Note that such replacement may not need to be pixel accurate nor need it completely replace the visual content displayed on the screen. For example, the original version of the visual content may be alpha blended into the camera recording, wherein a residual (e.g., a 1-alpha weighted residual) of the visual content of the camera recording is thus retained in the camera recording.
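A small sketch of the alpha-blending variant is given below, with warped_original and mask as produced by a perspective warp of the original version onto the screen area (the names and the alpha value are assumptions):

    import numpy as np

    def blend_screen_region(camera_frame, warped_original, mask, alpha=0.9):
        # Blend the original version into the screen region with weight alpha,
        # retaining a (1 - alpha) residual of the recorded screen content.
        out = camera_frame.astype(np.float32)
        warped = warped_original.astype(np.float32)
        sel = mask == 255
        out[sel] = alpha * warped[sel] + (1.0 - alpha) * out[sel]
        return out.astype(np.uint8)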
Note that if the visual content is obtained from the playout of a particular version of the visual content (e.g., a particular media stream), the substitution is not limited to substitution by the particular version being played out, but may involve a different version. For example, the replacement may be by a processed version that has been downsampled or has a lower bit rate. Such a processed version may not affect or may even enhance the perceived quality, as will be further elucidated with reference to the "video conferencing aspect".
The replacement may be performed at various stages. For example, the replacement may have been performed in the recording device itself, such that the encoded version of the media recording contains the original version. Another way is for the receiver device to access both the media recording and the original version of the visual content and insert the original version into the media recording. This aspect is further elucidated in the following paragraphs. The replacement may also be performed during play-out of the media recording. Thus, enhanced media recordings may not be stored separately, but may be generated "on the fly".
System partitioning
It will be appreciated that the analysis subsystem and the replacement subsystem may be part of a single device. However, the two subsystems may also be part of different devices, or may be implemented in a distributed manner. A non-limiting example is a sender/receiver system, where at the sender side a media recording may be obtained by a sender device for sending to a receiver device, and where at the receiver side the receiver device then replaces the visual content in the camera recording of the scene with the original version of the visual content. Herein, the sender device may comprise the first input interface and the analysis subsystem, and the receiver device may comprise the second input interface and the replacement subsystem. A non-limiting example of such a system is a video conferencing system.
Fig. 5 shows an example of the transmitter side of such a system. Here, a scene is shown comprising a person 015 and a screen 010 displaying visual content. In the example of fig. 5, the screen 010 is the screen of a television which receives and displays the broadcast visual content 060. A recording device 020 is shown recording the scene. As in fig. 1, the field of view 022 of the recording device 020 is schematically indicated in fig. 5. The recording device 020 is shown making the resulting media recording 030 available to the transmitter device 300 and, as will be shown with further reference to fig. 6, to the receiver device 400. Such making available may take any suitable form, including direct forms such as streaming the media recording, and indirect forms in which the media recording is stored, processed, etc. in between.
In general, the transmitter device, the screen and the recording device may be co-located, for example in the same room, in the same building or in the same outdoor area. However, this is not a requirement, as the sender device 300 may be located at the sender side (e.g. at a "sending" location), while the screen may be located, and recorded by a recording device, elsewhere (e.g. at a third, "recording" location).
Fig. 5 further illustrates the television 010 making resource location information 064 available to the transmitter device 300. Such resource location information 064 may enable access to the original version 060 of the visual content being played out, and may take any suitable form, as discussed throughout this specification. For example, the television 010 may announce that it is playing visual content via a network message that includes a URL referencing a manifest file. The manifest file may be a Media Presentation Description (MPD) file of MPEG-DASH, which provides various information about the media stream; an example of such a URL is "http://example.com/description-of-resource.mpd". As another example, the television may advertise a communication channel endpoint, such as a WebSocket (RFC 6455, The WebSocket Protocol) endpoint, via which the television may deliver the MPD directly.
Transmitter device 300, and in particular its analysis subsystem, may analyze the camera recording included in or represented by media recording 030 to determine the coordinates of the screen in the camera recording. For this purpose, the tracking techniques described earlier may be used. The sender device 300 may then format and make these coordinates available as metadata 052. Specific examples of such metadata will be given in the following paragraphs. As part of the metadata 052, the transmitter device 300 may include resource location information 064.
Fig. 6 shows an example of the receiver side. Herein, a receiver device 400 is schematically shown comprising an input interface 130 for receiving the media recording 030 and a replacement subsystem divided into a renderer 142 and a scene compositor 144. The renderer 142 is shown receiving the metadata 052 generated by the sender device 300 and accessing the original version 060 of the visual content based on, for example, resource location information included in the metadata 052. Based on the coordinates of the screen as obtained from the metadata 052, the renderer 142 may then adjust one or more visual properties of the original version 060 of the visual content (such as its geometry) to better fit the visual content displayed on the screen in the media recording 030. Various other aspects of the original version 060 may also be adjusted, including but not limited to contrast, brightness, white balance, dynamic range, frame rate, spatial resolution, focus, 3D angle and 3D depth. To match the visual properties to those of the media recording 030, the renderer 142 may receive information about these properties, for example from the analysis subsystem of the sender device, or may itself access and analyze the media recording 030 within the receiver device 400 (not explicitly shown in fig. 6). After adjusting the original version 060 of the visual content, thereby obtaining an adjusted version 062 thereof, the scene compositor 144 may then replace the visual content displayed on the screen in the media recording 030 with the adjusted original version 062 of the visual content, thereby obtaining an enhanced media recording 040.
Fig. 7 shows another example of a system for enhancing a media recording, wherein the visual content displayed on the screen is replaced with an original version of the visual content, thereby obtaining an enhanced media recording. Here, the analysis subsystem 120 and the replacement subsystem 140 are shown, while the various input interfaces shown earlier in fig. 4 are omitted for simplicity. The two subsystems may be part of a single device, or, as shown earlier with reference to figs. 5 and 6, may be part of different devices, or may be implemented in a distributed manner. In this example, a media device 012 is shown playing out visual content 060. Although not explicitly shown in fig. 7, the media device 012 may include a screen or may be connected to a screen, which is then recorded by the recording device 020. As opposed to the media device of fig. 5 (i.e., the television 010), the media device 012 of fig. 7 is shown providing the original version 060 of the visual content directly to the replacement subsystem 140, rather than (only) resource location information. For example, the media device 012 may stream the original version 060 after announcing its playout to the replacement subsystem 140, or after the replacement subsystem 140 discovers the playout of the media device 012. In contrast to figs. 5 and 6, the replacement subsystem 140 may thus obtain the original version 060 of the visual content directly from the media device 012 responsible for playing out the visual content on the screen.
Discovery
Figs. 8A and 8B relate to different discovery mechanisms that may be used to discover the media content played out by a media device, and thereby the visual content shown on a screen in a camera recording. Fig. 8A shows an example of the system actively polling the network for the presence of a media device in the network, while fig. 8B shows an example of a media device multicasting its presence to the system via a notification message.
Actively polling the network may be based on various protocols. One example is the UPnP protocol. Here, M-SEARCH is used to first discover devices in the local network, either directly or through a UPnP server. An example of a discovery message is shown below; this is a generic discovery message for discovering all UPnP devices. Instead of searching for all devices with ssdp:all, discovery messages may also be sent for specific device types (e.g., for media renderers). The display device (e.g., a television) in UPnP will typically be a media renderer.
The M-SEARCH is multicast on the local network specifying the content to be looked up, in this case all devices. In fig. 8A, this is schematically indicated by an arrow entitled "1. M-SRCH" pointing from the system 100 to the media device 012.
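A representative generic M-SEARCH request of this kind is shown below (the header values follow the SSDP specification and are given only as an illustrative example):

    M-SEARCH * HTTP/1.1
    HOST: 239.255.255.250:1900
    MAN: "ssdp:discover"
    MX: 2
    ST: ssdp:all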
The response may be a 200 OK message containing information about the responding device, in this case indicating that the media device 012 is a media renderer.
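A representative 200 OK response from a media renderer may look as follows (the addresses and identifiers are illustrative):

    HTTP/1.1 200 OK
    CACHE-CONTROL: max-age=1800
    EXT:
    LOCATION: http://192.168.1.10:8080/description.xml
    SERVER: Linux/3.x UPnP/1.0 ExampleRenderer/1.0
    ST: urn:schemas-upnp-org:device:MediaRenderer:1
    USN: uuid:2fac1234-31f8-11b4-a222-08002b34c003::urn:schemas-upnp-org:device:MediaRenderer:1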
Alternatively or additionally, as shown in fig. 8B, the media device 012 may also occasionally multicast its presence, which may be detected by the system 100. An example of such an advertisement message is shown below. This message is similar in content to the 200 OK message sent in response to an M-SEARCH, and is indicated in fig. 8B by the arrow entitled "1. NTFY" pointing from the media device 012 to the system 100.
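A representative NOTIFY advertisement may look as follows (again with illustrative addresses and identifiers):

    NOTIFY * HTTP/1.1
    HOST: 239.255.255.250:1900
    CACHE-CONTROL: max-age=1800
    LOCATION: http://192.168.1.10:8080/description.xml
    NT: urn:schemas-upnp-org:device:MediaRenderer:1
    NTS: ssdp:alive
    SERVER: Linux/3.x UPnP/1.0 ExampleRenderer/1.0
    USN: uuid:2fac1234-31f8-11b4-a222-08002b34c003::urn:schemas-upnp-org:device:MediaRenderer:1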
Note that the examples of fig. 8A and 8B are within the context of UPnP, while there are various discovery protocols that may each be used instead.
Signaling screen coordinates
With further reference to the analysis subsystem that detects the coordinates of the screen in the camera recording, these coordinates may be signaled to other parties, such as the replacement subsystem. The signaling may involve the analysis subsystem formatting the coordinates and making them available in the form of metadata. Such metadata may be generated by encoding the detected screen in X and Y coordinates. However, even if the screen is generally rectangular, the screen may be recorded at an angle. In this case, the coordinates may represent all four corners of the screen. Furthermore, information about the visual content may be detected and signaled to the other party. The following is an example of such metadata in XML.
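A representative example of such metadata is shown below; the element and attribute names are illustrative only, while the content identifier and manifest URL reuse the examples given elsewhere in this description:

    <ScreenMetadata frame="1245" timestamp="00:00:41.500">
      <Screen shape="rectangle">
        <Corner x="312" y="188"/> <!-- top-left -->
        <Corner x="628" y="196"/> <!-- top-right -->
        <Corner x="624" y="372"/> <!-- bottom-right -->
        <Corner x="308" y="360"/> <!-- bottom-left -->
      </Screen>
      <Content id="BBC1" resource="http://example.com/description-of-resource.mpd"/>
    </ScreenMetadata>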
Note that the above-described XML-based metadata is shown as indicating the coordinates of a rectangular screen. For other types of screens, more or less metadata may need to be supplied. For example, a smart watch may have a circular display that may appear elliptical when captured from a certain angle. In this case, the coordinates for the center and the parameters describing the circle or ellipse can be detected and signaled. For curved screens, the top and bottom of the screen may not be straight. Thus, in addition to the coordinates of the corners, parameters describing the curvature can be detected and signaled. For holographic projection or light field display, 3D coordinates may be used to describe the area in which the 3D image is displayed. The screen may also be partially occluded in the camera recording or only partially shown in the field of view of the recording device. Thus, the coordinates may also describe polygons representing unobstructed visible portions of the screen.
Note that, in order to format the coordinates and make them available in the form of metadata, the ISO/IEC standard 23001-10, entitled "Carriage of timed metadata metrics of media in ISO base media file format", may be used. While at the time of writing this standard only covers timed metadata relating to the MPEG Green standard (see ISO/IEC 23001-11) and visual quality metrics (such as PSNR), MPEG has initiated a process of amending 23001-10 so that 2D coordinates can also be carried.
Temporal registration
When replacing the visual content displayed on the screen with the original version of the visual content, the replacement may use the detected coordinates of the screen as the position at which to insert the original version. However, such replacement may also have a temporal aspect, as the video changes over time. The insertion of the original version may thus be synchronized with the displayed visual content in the camera recording, so that after the replacement exactly the same content is shown as before. This may involve identifying a playout point in the camera recording, identifying that same playout point in the original version, and using it during replacement. For this purpose, any known technique from media synchronization may be used, including buffering and looking forward in the video. Note that in some cases, such as where a presenter interacts with visual content shown on a screen, it may be desirable to synchronize the original version with the camera recording to a relatively high degree, e.g., with residual differences on the order of tens or hundreds of milliseconds. However, in many cases the exact timing is less important, and the insertion of the original version may be slightly shifted in time compared to the displayed visual content in the recording. By way of example, the screen may show a TV channel, such as the channel "NPO 1". If the TV channel is accessed for replacement, the currently available playout may be used. This may differ in playout timing from the displayed visual content in the camera recording, as the playout of TV channels may vary between locations, depending on the TV provider, the distribution technology used, transcoding during distribution, etc. Such differences are typically on the order of several seconds and may be as large as one minute. Thus, the enhanced version of the media recording may differ slightly in the timing of the visual content shown on the screen in the scene.
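As a minimal sketch of the temporal registration described above, assuming that a playout point has already been recovered from the camera recording (e.g., via a watermark or fingerprint match) and that the original version is available as a list of timestamped frames, the matching frame may be selected as follows; the function and variable names are illustrative only:

    from bisect import bisect_left

    def select_matching_frame(original_timestamps, detected_playout_time, offset=0.0):
        """Return the index of the frame in the original version whose timestamp is
        closest to the playout point detected in the camera recording.

        original_timestamps: sorted frame timestamps (seconds) of the original version
        detected_playout_time: playout point (seconds) identified in the camera recording
        offset: optional known timing offset between the two playouts (seconds)
        """
        target = detected_playout_time + offset
        i = bisect_left(original_timestamps, target)
        if i == 0:
            return 0
        if i == len(original_timestamps):
            return len(original_timestamps) - 1
        # pick the neighbouring frame closest to the target playout point
        before, after = original_timestamps[i - 1], original_timestamps[i]
        return i if (after - target) < (target - before) else i - 1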
Adjustment of visual properties
With further reference to the adjustment of one or more visual properties of the original version of the visual content 060, as described earlier with reference to fig. 6, the original version of the visual content may need to be adapted before it is inserted into the media recording. This may involve an analysis of the properties of the entire scene (e.g., by histogram analysis) and an adjustment of the original version of the visual content in order to register its visual properties with those of the recorded scene. Various image analysis and image processing techniques may be used, such as those described in "Computer Vision: Algorithms and Applications" by Richard Szeliski, 2010, as consulted on 15 April 2015 at http://szeliski.org/Book/drafts/SzeliskiBook_20100903_draft.pdf, for example in chapters 3.1 (point operators) and 3.6 (geometric transformations). Alternatively, if the original version of the visual content already has the desired visual properties, it may be used directly to replace the visual content shown on the screen.
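As a minimal sketch of such an adjustment, assuming the recorded scene and the original version are available as single-channel 8-bit arrays, a simple histogram-matching step could register the luminance distribution of the original version with that of the recorded scene; this is one possible point operator among many, not the specific technique prescribed by the method:

    import numpy as np

    def match_histogram(source, reference):
        """Map the pixel values of 'source' so that its histogram approximates
        that of 'reference'. Both are uint8 arrays with one channel."""
        src_values, src_counts = np.unique(source.ravel(), return_counts=True)
        ref_values, ref_counts = np.unique(reference.ravel(), return_counts=True)
        # cumulative distribution functions of both images
        src_cdf = np.cumsum(src_counts).astype(np.float64) / source.size
        ref_cdf = np.cumsum(ref_counts).astype(np.float64) / reference.size
        # for each source value, find the reference value with the closest CDF
        mapped = np.interp(src_cdf, ref_cdf, ref_values)
        lookup = np.zeros(256, dtype=np.uint8)
        lookup[src_values] = np.round(mapped).astype(np.uint8)
        return lookup[source]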
Efficiently encoding media recordings
The visual content shown on the screen in the media recording will be replaced with the original version of the visual content. Thus, when the media recording is encoded before the replacement takes place, e.g., for transmission or storage, the encoding can be optimized for higher coding efficiency. Possible actions are described below; the first two may also be combined.
The first action is a pre-processing of the media recording, which may involve making the area representing the displayed visual content easy to encode, so that the area occupies fewer bits in the encoded bitstream. One possible way to do this is to replace all pixel values in this area of the captured video frames by the same pixel value (e.g., "zero", i.e., black). Such a uniform area can be encoded efficiently by an encoder that utilizes intra prediction or block matching mechanisms.
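A minimal sketch of this pre-processing step, assuming OpenCV and NumPy are available and the screen has been detected as a four-point polygon, could look as follows; the names are illustrative:

    import numpy as np
    import cv2

    def blank_screen_region(frame, screen_corners):
        """Fill the quadrilateral covering the detected screen with black so that
        the region compresses to very few bits before replacement takes place.

        frame: HxWx3 uint8 BGR image
        screen_corners: list of four (x, y) corner coordinates of the screen
        """
        polygon = np.array(screen_corners, dtype=np.int32).reshape((-1, 1, 2))
        cv2.fillPoly(frame, [polygon], color=(0, 0, 0))
        return frame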
The second action is so-called region-of-non-interest coding. Independent of the video coding standard, many encoders provide the possibility to define regions in a video frame to which more or less quality (i.e., more or fewer bits) should be allocated. Within this context, it may be beneficial to assign a lower quality to the area representing the displayed visual content. Typically, the quality of a region is adjusted via the quantization parameter (QP): the higher the QP, the lower the quality of the encoded stream. By locally applying a higher QP to the region, a "non-interest" encoding of the visual content displayed on the screen can be achieved for that region.
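As a sketch of how such region-of-non-interest coding might be driven, a per-block QP offset map can be built from the detected screen region and handed to an encoder that accepts per-block quantizer offsets; the exact encoder interface differs per implementation, and the block size of 16 is an assumption, so only the encoder-agnostic map construction is shown:

    import numpy as np

    def qp_offset_map(frame_width, frame_height, screen_bbox, qp_offset=10, block=16):
        """Build a per-block QP offset map: blocks overlapping the screen region get a
        positive offset (coarser quantization, fewer bits), all others get 0.

        screen_bbox: (x_min, y_min, x_max, y_max) bounding box of the detected screen
        """
        cols = (frame_width + block - 1) // block
        rows = (frame_height + block - 1) // block
        offsets = np.zeros((rows, cols), dtype=np.int8)
        x_min, y_min, x_max, y_max = screen_bbox
        c0, c1 = x_min // block, (x_max + block - 1) // block
        r0, r1 = y_min // block, (y_max + block - 1) // block
        offsets[r0:r1, c0:c1] = qp_offset
        return offsets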
A third action may constitute an alternative to the second action and may require a modified encoder. That is, one may consider not encoding content that is not needed at all. In this case, the coordinates of the area to be discarded (i.e., the area representing the displayed visual content) may be used directly by the encoder to ignore that part of the video stream when encoding it. Effectively, the output bitstream may then contain frames with "holes". Such discarding of regions may involve the use of High Efficiency Video Coding (HEVC) tiles. For example, assuming that there is only one screen shown in the camera recording, the recording device may define a tile grid for an HEVC encoder in such a way that the tiles representing the screen can be discarded during the encoding process. The tile grid may be dynamically adjusted based on the location of the screen. Alternatively, the tile grid may be static, and only tiles containing exclusively pixels from the visual content displayed on the screen may be discarded.
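A minimal sketch of the static-grid variant, assuming a uniform tile grid and an axis-aligned bounding box for the detected screen, could identify which tiles lie entirely inside the screen region and may therefore be discarded; tile handling in an actual HEVC encoder is encoder-specific, and only the computation of the tile indices is shown:

    def discardable_tiles(frame_width, frame_height, tiles_x, tiles_y, screen_bbox):
        """Return (column, row) indices of tiles that lie entirely inside the
        detected screen region and therefore contain no scene pixels to keep."""
        x_min, y_min, x_max, y_max = screen_bbox
        tile_w = frame_width / tiles_x
        tile_h = frame_height / tiles_y
        tiles = []
        for row in range(tiles_y):
            for col in range(tiles_x):
                tx0, ty0 = col * tile_w, row * tile_h
                tx1, ty1 = tx0 + tile_w, ty0 + tile_h
                if tx0 >= x_min and ty0 >= y_min and tx1 <= x_max and ty1 <= y_max:
                    tiles.append((col, row))
        return tiles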
Aspects of video conferencing
Note that in a video conferencing scenario, it may not be necessary to use the same stream for user B as is seen by user A; if the screen presenting the recording to user B is small or provided at a low resolution, it may be sufficient to retrieve a low-bitrate version of the visual content to be displayed in the field of view of user B. Here and in the following, a reference to user A is understood as a reference to his/her sender device, and a reference to user B as a reference to his/her receiver device. Example: user A watches a full HD TV channel (1920x1080 pixels) on his/her large-screen TV, which involves a bit rate of 10 Mbit/s. User B only sees a scaled-down version of user A's TV in his/her recording field of view, so a lower-resolution version (SD) may be sufficient to obtain an acceptable result. Note that this may also apply generally to visual content played out from a media stream, as it may not be necessary to retrieve the same media stream in order to replace the visual content shown on the screen in the camera recording. Rather, a different (e.g., lower) bit rate version may be retrieved. A higher quality can nevertheless still be obtained, for example by avoiding the digital-to-light-to-digital conversion step of recording the screen. With further reference to the video conferencing scenario, user A and user B may access the same media stream, with the media stream being efficiently distributed between them (e.g., via multicast or peer-to-peer (P2P) distribution). The system may also detect or resolve that the resource which user A is viewing is also available to user B, but via a different route. Example: user A views the TV channel "NPO 1" via a subscription with TV provider A; the system may then detect that user B can access the media stream of said TV channel via a subscription with IPTV provider B, such that no media stream needs to be transmitted from user A to user B.
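As a sketch of how the receiver side might pick a suitable quality, the on-screen size of the detected screen region can be compared against the resolutions of the available representations (e.g., as advertised in a streaming manifest); the representation list below is a hypothetical example:

    def pick_representation(screen_width_px, representations):
        """Pick the lowest-bitrate representation whose width still covers the
        number of pixels the screen region occupies in the receiver's view.

        representations: list of dicts with 'width' (pixels) and 'bitrate' (bit/s)
        """
        suitable = [r for r in representations if r["width"] >= screen_width_px]
        if suitable:
            return min(suitable, key=lambda r: r["bitrate"])
        # no representation is wide enough; fall back to the widest available
        return max(representations, key=lambda r: r["width"])

    # Hypothetical example: the screen occupies about 640 horizontal pixels in user B's view
    reps = [{"width": 720, "bitrate": 1_500_000},
            {"width": 1280, "bitrate": 4_000_000},
            {"width": 1920, "bitrate": 10_000_000}]
    print(pick_representation(640, reps))  # -> the 720-wide, 1.5 Mbit/s representation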
Other general aspects
Note that if the camera recording shows the screen of a PC, tablet, smartphone or other type of computing device, the screen capture functionality of the computing device may be used as a media source for the original version of the visual content, since the screen capture(s) may be accessed and used to replace the visual content displayed on the screen in the camera recording.
Note that the analysis subsystem and/or the replacement subsystem may be embodied as or in a single device or apparatus, such as a recording device or another user device. The apparatus or device may comprise one or more microprocessors executing appropriate software. The software may have been downloaded and/or stored in a corresponding memory, e.g. a volatile memory such as a RAM or a non-volatile memory such as a flash memory. Alternatively, the analysis subsystem and/or the replacement subsystem may be implemented in the form of programmable logic in a device or apparatus, for example as a Field Programmable Gate Array (FPGA). Generally, each functional unit of the system may be implemented in the form of a circuit. It is noted that the analysis subsystem and/or the replacement subsystem may also be implemented in a distributed manner, e.g. involving different devices or apparatuses. For example, the analysis subsystem and/or the replacement subsystem may be implemented as software-based functions performed by entities within the media distribution network, such as a server.
In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. Use of the verb "comprise" and its conjugations does not exclude the presence of elements or steps other than those stated in a claim. The article "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the device claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.

Claims (15)

1. A method of enhancing media recording, comprising:
-accessing the media recording, the media recording comprising a camera recording of a scene, the scene comprising a screen displaying visual content;
-analyzing the camera recording to determine coordinates of the screen in the camera recording;
-accessing an original version of the visual content; and
-replacing, in the camera recording and using the coordinates of the screen, the visual content displayed on the screen with the original version of the visual content, thereby obtaining an enhanced media recording.
2. The method of claim 1, wherein accessing the original version of the visual content comprises:
-identifying visual content displayed on the screen;
-based on the displayed visual content having been identified, identifying a resource location comprising an original version of the visual content; and
-accessing an original version of the visual content from the resource location.
3. The method of claim 2, wherein identifying visual content displayed on the screen comprises:
-identifying content data of the camera recording associated with visual content displayed on the screen;
-applying an automatic content recognition technique to the content data to identify the visual content.
4. The method of claim 3, wherein the automatic content recognition technique comprises determining at least one of: an audio watermark, a video watermark, or a fingerprint of the content data.
5. The method of claim 2, wherein the visual content displayed on the screen represents a playout by a media device, and wherein identifying the visual content displayed on the screen comprises obtaining playout information indicative of the visual content from the media device.
6. The method of claim 5, wherein obtaining the playout information comprises:
-querying a media device via a network for the playout information; or
-the media device sending said playout information via the network.
7. The method of any of the above claims, wherein the replacement of the visual content in the camera recording of the scene comprises adjusting one or more visual properties of the original version of the visual content.
8. The method of claim 7, wherein the one or more visual properties comprise one or more of: contrast, brightness, white balance, dynamic range, frame rate, spatial resolution, geometry, focus, 3D angle, 3D depth.
9. The method of any of the above claims 1-6, wherein the media recording is obtained by a sender device for sending to a receiver device, wherein the replacing of the visual content in the camera recording of the scene is performed by the receiver device, and wherein the method further comprises:
-a sender device retrieving and subsequently sending an original version of the visual content to a receiver device; or
-the sender device sending metadata to the receiver device indicating a resource location from which the original version of the visual content is accessible, and the receiver device retrieving the original version of the visual content from the resource location based on the metadata.
10. The method of claim 9, further comprising a sender device including coordinates of the screen in the camera record in the metadata.
11. A computer-readable medium having instructions stored thereon that, when executed, cause a processor system to perform the method of any of claims 1-10.
12. A system for enhancing media recording, comprising:
-a first input interface for accessing the media recording, the media recording comprising a camera recording of a scene, the scene comprising a screen displaying visual content;
-an analysis subsystem for analyzing the camera recording to determine coordinates of the screen in the camera recording;
-a second input interface for accessing an original version of the visual content; and
-a replacement subsystem for replacing, in the camera recording and using the coordinates of the screen, the visual content displayed on the screen with the original version of the visual content, thereby obtaining an enhanced media recording.
13. The system of claim 12, comprising a transmitter device and a receiver device, the transmitter device comprising:
-the first input interface;
-the analysis subsystem;
and the receiver device comprises:
-the second input interface; and
-said replacement subsystem.
14. The system of claim 13, wherein:
-the transmitter device is configured for retrieving the original version of the visual content and subsequently transmitting the original version of the visual content to the receiver device; or
-the sender device is configured for sending metadata to the receiver device indicating a resource location from which the original version of the visual content is accessible, and the receiver device is configured for retrieving the original version from the resource location based on the metadata.
15. A transmitter device or a receiver device according to claim 13 or 14.
CN201680023726.4A 2015-04-24 2016-04-22 Method and system for enhancing media recording Active CN107534797B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
EP15165075.1 2015-04-24
EP15165075 2015-04-24
PCT/EP2016/059031 WO2016170123A1 (en) 2015-04-24 2016-04-22 Enhancing a media recording comprising a camera recording

Publications (2)

Publication Number Publication Date
CN107534797A CN107534797A (en) 2018-01-02
CN107534797B true CN107534797B (en) 2020-08-21

Family

ID=53058980

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201680023726.4A Active CN107534797B (en) 2015-04-24 2016-04-22 Method and system for enhancing media recording

Country Status (4)

Country Link
US (1) US20180091860A1 (en)
EP (1) EP3286922B1 (en)
CN (1) CN107534797B (en)
WO (1) WO2016170123A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020054747A1 (en) * 2018-09-13 2020-03-19 富士フイルム株式会社 Printer-equipped camera
CN109327727B (en) * 2018-11-20 2020-11-27 网宿科技股份有限公司 Live stream processing method in WebRTC and stream pushing client
US10917679B2 (en) 2019-04-05 2021-02-09 International Business Machines Corporation Video recording of a display device
BE1027296B1 (en) 2019-06-07 2021-02-05 Stereyo ACOUSTIC STUDIO SCREEN
CN110662113B (en) * 2019-09-25 2021-06-11 腾讯音乐娱乐科技(深圳)有限公司 Video playing method and device and computer readable storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102568352A (en) * 2012-02-17 2012-07-11 广东威创视讯科技股份有限公司 Projection display system and projection display method
CN103052961A (en) * 2010-08-05 2013-04-17 高通股份有限公司 Identifying visual media content captured by camera-enabled mobile device

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5574511A (en) * 1995-10-18 1996-11-12 Polaroid Corporation Background replacement for an image
US6339842B1 (en) * 1998-06-10 2002-01-15 Dennis Sunga Fernandez Digital television with subscriber conference overlay
US9065979B2 (en) * 2005-07-01 2015-06-23 The Invention Science Fund I, Llc Promotional placement in media works
US20090204639A1 (en) * 2008-02-11 2009-08-13 Microsoft Corporation Selective content replacement for media players
US20140178029A1 (en) * 2012-12-26 2014-06-26 Ali Fazal Raheman Novel Augmented Reality Kiosks
CN104469127B (en) * 2013-09-22 2019-10-18 南京中兴软件有限责任公司 Image pickup method and device
CN104486534B (en) * 2014-12-16 2018-05-15 西安诺瓦电子科技有限公司 Moire fringes detect suppressing method and device

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103052961A (en) * 2010-08-05 2013-04-17 高通股份有限公司 Identifying visual media content captured by camera-enabled mobile device
CN102568352A (en) * 2012-02-17 2012-07-11 广东威创视讯科技股份有限公司 Projection display system and projection display method

Also Published As

Publication number Publication date
CN107534797A (en) 2018-01-02
US20180091860A1 (en) 2018-03-29
EP3286922B1 (en) 2020-06-10
WO2016170123A1 (en) 2016-10-27
EP3286922A1 (en) 2018-02-28

Similar Documents

Publication Publication Date Title
US10582201B2 (en) Most-interested region in an image
US10565463B2 (en) Advanced signaling of a most-interested region in an image
US20190104326A1 (en) Content source description for immersive media data
JP5866359B2 (en) Signaling attributes about network streamed video data
JP5964972B2 (en) Stream multimedia data from multiple sources
CN107534797B (en) Method and system for enhancing media recording
US11405699B2 (en) Using GLTF2 extensions to support video and audio data
KR20100085188A (en) A three dimensional video communication terminal, system and method
KR102247404B1 (en) Enhanced high-level signaling for fisheye virtual reality video
US9357274B2 (en) System and method for storing multi-source multimedia presentations
KR101841313B1 (en) Methods for processing multimedia flows and corresponding devices
WO2016199607A1 (en) Information processing device and information processing method
TW201720170A (en) Methods and systems for client interpretation and presentation of zoom-coded content
US20200014740A1 (en) Tile stream selection for mobile bandwith optimization
US20230146498A1 (en) A Method, An Apparatus and a Computer Program Product for Video Encoding and Video Decoding
US10264241B2 (en) Complimentary video content
Macq et al. Application Scenarios and Deployment Domains

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant