EP3008910A1

EP3008910A1 - Data processing device

Info

Publication number: EP3008910A1
Application number: EP14749863.8A
Authority: EP
Inventors: Sébastien GILLES
Original assignee: Viddiga
Current assignee: Viddiga
Priority date: 2013-06-12
Filing date: 2014-06-10
Publication date: 2016-04-20
Also published as: US20160156993A1; WO2014199059A1; FR3007235B1; FR3007235A1

Abstract

Device for analyzing streaming audio-video data, characterized in that it comprises a selector (20) designed to determine input data relating to an audio stream or to a video stream in the streaming audio-video data, a converter (22) designed to produce image data at a frequency chosen on the basis of the input data, an encoder (24) designed to produce compressed data on the basis of the image data, and a projector (26) designed to produce imprint data on the basis of the compressed data, the converter (22) being designed to produce the image data in the form of an image of fixed dimension, the encoder (24) being designed to work successively on each image described by the image data, and the projector (26) being designed to produce the imprint data as a stream on the basis of the weight of the compressed data produced successively.

Description

Data processing device

The invention relates to the field of data processing. In many environments, holders of rights in media, whether audio or video, want to be able to detect the broadcast of the media in which they hold rights. For this, two major families of data processing exist: fingerprinting and watermarking.

The best-known examples of the use of these technologies concern the search for content broadcast illegally on networks, or the detection of protected content on video sharing platforms in order to offer the rightful owner the option of having the content removed, or of sharing with the platform the revenue derived from the advertising monetization of the views of that content. But this is only a small part of the needs.

Indeed, many economic models for valuing licensees' rights rely on remuneration based on the number of broadcasts by lawful networks, such as radio or television channels. In the particular case of advertising, these contracts provide for the broadcast of media a certain number of times and in certain time slots in return for remuneration.

However, for various reasons, the programs of radio and television channels are constantly disrupted, the schedule planned by the advertising agency is hardly ever respected, and arbitrations are carried out by the radio and television channels to meet their commitments.

Nevertheless, short of hiring people whose only job is to follow all the radio and television channels involved in a given advertising campaign for a given company, it is not possible to check whether the contracts are actually respected. In addition, these persons would be employed either by the radio or television channel, or by a company that has purchased advertising space. They would not be considered impartial.

Third parties have thus filled the void that exists in the relationships between advertisers and radio or television channels, and they are known as trusted third parties. However, here again, these third parties must be trusted, and their services are very expensive.

Historically, therefore, there is a need for a tool to make the relationship between advertisers and radio or television stations more objective.

This need can hardly be fulfilled through watermarking: the watermark must be applied when the media concerned is produced, which is expensive and difficult to remedy afterwards. In addition, the cost of detecting the watermark is very high, detection requires intensive computation that consumes considerable resources in a mobile environment, and known watermarking techniques can be irreversibly degraded when the radio or television channel retouches its signal for the program.

As for fingerprinting methods, they tend either to fail to maintain a satisfactory level of detection quality when scaling (that is, their ability to identify content drops significantly when the amount of data to be identified increases significantly), or to offer insufficient performance, with a detection cost too high for real-time operation.

Beyond the problem described above, there is a need to allow radio or television channels to know their programming and/or advertising in real time, in a completely reliable way, in order to promote the media whose use is increasing exponentially and which are known as the "second screen". Indeed, many radio or television channels allow their listeners or viewers to use their tablet or smartphone with an application that the channel provides in order to enrich their experience during a given program. Here again, exact and instantaneous knowledge of the programming grid actually broadcast by the radio or television channel is an asset not available to date, but which would make it possible, for example, to broadcast targeted advertisements on the second screen, advertisements which are well known to be worth ten to one hundred times more than conventional banners. Moreover, it is often desirable for these applications to authenticate the channel or content watched by a viewer, for example to reserve the use of the service for users actually watching a given channel or content. The problem becomes even more acute if we consider mobile application publishers offering "cross-cutting" applications covering a set of channels, and no longer a single channel in particular.

For all these reasons, there is a need to provide a data processing device that is effective to enable instantaneous and accurate detection of a real broadcast program of a radio or television channel.

The invention improves the situation. For this purpose, the invention proposes a device for processing streaming audio-video data, comprising a selector arranged to determine input data relating to an audio stream or a video stream in the streaming audio-video data, a converter arranged to produce image data at a frequency selected from the input data, an encoder arranged to produce compressed data from the image data, and a projector arranged to produce fingerprint data from the compressed data, the converter being arranged to produce the image data in the form of a fixed size image, the encoder being arranged to work successively on each image described by the image data, and the projector being arranged to produce the fingerprint data as a stream from the weight of the compressed data produced successively.

In other aspects, the device may also have the following characteristics:

- the converter is arranged to segment input data relating to an audio stream into successive windows of samples, and to convert the input data of each window into image data by converting the amplitude of each sample into a gray-scale value, the converter being further arranged to produce image data of a given window in the form of an image in which successive pixels of a given line correspond to successive samples of the input data, each of which has a corresponding gray-scale value, and in which the lines of the image are identical to each other,

- the windows have a duration of 0.25 s, and are separated from each other by a number of samples making it possible to obtain image data at the chosen frequency,

- the converter is arranged to select images in input data relating to a video stream according to the chosen frequency, and to produce the image data by converting these images to a chosen dimension,

- the chosen dimension is 120 * 160,

- the encoder includes a lossy image compressor,

- the encoder works by block processing and quantization,

- the encoder comprises a compressor of the JPEG family, or a compressor of the WebP type,

- the projector is arranged to produce the fingerprint data by projecting onto a given range the weight of the compressed data successively produced, according to a chosen projection law,

- the range comprises integers between 0 and 255, and the projection law is linear.

Other features and advantages of the invention will become more apparent on reading the following description, drawn from examples given for illustrative and non-limiting purposes and from the drawings, in which:

FIG. 1 represents an exemplary implementation environment of a device according to the invention,

FIG. 2 represents a device according to the invention,

FIG. 3 represents an example of a fingerprint produced using a first encoding algorithm,

FIG. 4 represents an example of a fingerprint produced using a second encoding algorithm.

The drawings and the description below contain, for the most part, elements of a certain character. They can therefore not only serve to better understand the present invention, but also contribute to its definition, where appropriate.

FIG. 1 represents an environment for implementing a device according to the invention.

In this environment, a licensee transmits unmarked content from a content server 10. The transmitted content is received by users by various media consumption devices, such as a computer 12, a tablet 14 or a radio 16.

These media consumption devices are arranged to implement the device according to the invention, and to contact a fingerprint server 18 in order to identify in real time the content received by a consumer device and to return to the latter a content identifier and/or other additional information, such as targeted advertising.

It should be understood that the invention has a very broad application, in that:

- the holder can transmit audio content (for example a digital radio, terrestrial radio, Internet radio, or any other source of audio content) as well as video content (for example a television channel, or a provider of VOD or Internet content such as Youtube or Dailymotion (registered trademarks)), these contents thus being generally qualified as audio-video, i.e. audio, video, or a combination of both,

- the consumer devices may comprise any device capable of implementing the device described in FIG. 2, whether (in addition to the devices already mentioned as examples) a smartphone, a connected television, a connected television box, a server dedicated to the analysis of content, or any other suitable device,

- the content server can be connected to third-party servers for the provision of additional information on the identified content, or be a black box which carries out both the identification of content and the determination of further information.

As discussed in the introduction, a solution that is effective in both cost and performance for the type of environment shown in FIG. 1 has long been sought. The invention solves this problem with a device that produces a fingerprint that is robust, lightweight, and inexpensive to compute.

The Applicant has found that known watermarking or fingerprinting solutions seek to qualify contents individually, as if they were autonomous entities, regardless of their transmission environment. As a result, the resulting watermarks and fingerprints are often strongly correlated to the content itself, and in fact represent a kind of simplification of the original content, ultimately quite close to the original. Assuming that content is mainly transmitted and consumed as a stream in the applications concerned, the Applicant has sought to make the generated fingerprint abstract, while correlating it strongly to the information carried by the content, without making it a "miniature" version of the original content.

This work has resulted in the device shown schematically in Figure 2, which will now be described. The device according to the invention comprises a selector 20, a converter 22, an encoder 24 and a projector 26.
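Purely by way of illustration, the chain formed by the selector 20, the converter 22, the encoder 24 and the projector 26 can be sketched as a composition of four interchangeable stages. The class name `Device`, its field names and the method `run` are hypothetical and are not taken from the description; the sketch only assumes that the encoder returns bytes whose length is the weight of the compressed data.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass
class Device:
    """Hypothetical container for the four stages; names are illustrative only."""
    selector: Callable[[object], Iterable]      # stream -> input data (audio or video)
    converter: Callable[[Iterable], Iterable]   # input data -> successive fixed-size images
    encoder: Callable[[object], bytes]          # one image -> compressed data
    projector: Callable[[int], int]             # weight of compressed data -> fingerprint value

    def run(self, stream) -> Iterator[int]:
        """Yield one fingerprint value per image, so the output is itself a stream."""
        for image in self.converter(self.selector(stream)):
            yield self.projector(len(self.encoder(image)))
```

Because `run` is a generator, each fingerprint value becomes available as soon as the corresponding image has been encoded, which is consistent with generating the fingerprint as a stream.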

The function of the selector 20 is to demultiplex the original stream, i.e. to receive streamed audio-video data, and to extract the audio or video track to form an input data stream. The input data stream contains exclusively audio data or exclusively video data. Thus, if the streamed audio-video data received relates to an audio stream, then the selector 20 produces input data designating the amplitude of the successive samples of this audio stream. If the streamed audio-video data received relates to a video stream, then the selector 20 produces, by demultiplexing, on the one hand input data corresponding to the audio stream of the video, and on the other hand input data corresponding to the image stream of the video. Alternatively, the selector 20 may omit producing the input data corresponding to the image stream of the video.

The selector 20 calls the converter 22 with the input data ent_dat, and the converter 22 outputs image data im_dat. This step is fundamental, and will be explained in more detail later. The converter 22 is arranged to produce the image data differently depending on whether the input data relates to an audio stream or a video stream.

The converter 22 is arranged to produce successive images of fixed size from the input data. In the case of input data relating to an audio stream, the converter 22 thus receives an input data stream, and divides this input stream into successive windows. Each window contains a number of samples depending on the length of the window and the sampling frequency of the audio stream corresponding to the input data. Each window will have image data defining an output image. For each window, the converter 22 converts the amplitude of the successive samples into gray level values in order to define a row of pixels whose length corresponds to the number of samples in the window. Then, the pixel line is repeated a number of times chosen to form the image corresponding to the window.

In the example described here, the pixel line is copied 8 times, so that the size of the images produced is L * 8, where L is the number of audio samples in each window. Starting from an audio stream sampled at 44.1 kHz, with a window of 0.25 s, and for a fingerprint frequency of 25 Hz, we obtain:

- windows each containing 11,025 samples,

- the successive windows being shifted by 1764 samples relative to one another,

- images of dimension 11025 * 8.

When the audio stream of the input data has another sampling frequency, for example 48 kHz, the input data can be resampled to 44.1 kHz, or the converter 22 can produce pixels whose gray-level value takes this resampling into account, for example by extrapolation. When the audio stream contains multiple channels, sampling may be based on one of the channels only, or on an average of the channels.
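The audio conversion described above can be sketched as follows, by way of non-limiting illustration. The function names are hypothetical. The sketch assumes a single channel of samples already expressed as unsigned 16-bit integers (0 to 65535) and a linear mapping to 256 gray levels; images are represented as plain lists of pixel rows rather than in any particular image format.

```python
def window_layout(sample_rate_hz, window_s, fingerprint_hz):
    """Return (samples per window, shift between successive windows)."""
    window_len = round(sample_rate_hz * window_s)
    hop = round(sample_rate_hz / fingerprint_hz)
    return window_len, hop


def to_gray(sample_u16):
    """Assumed linear projection of an unsigned 16-bit amplitude onto 0..255."""
    return sample_u16 * 255 // 65535


def audio_to_images(samples_u16, window_len, hop, rows=8):
    """Yield one image per window: `rows` identical rows of gray-level pixels."""
    for start in range(0, len(samples_u16) - window_len + 1, hop):
        row = [to_gray(s) for s in samples_u16[start:start + window_len]]
        yield [list(row) for _ in range(rows)]  # the same pixel line, copied `rows` times


print(window_layout(44_100, 0.25, 25))  # (11025, 1764)
```

With the values of the example (44.1 kHz, 0.25 s windows, 25 Hz), successive windows overlap heavily, since each window of 11025 samples starts only 1764 samples after the previous one.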

The calculation of the gray-level value for each pixel depends on the quantization of the audio stream of the input data. In the example described here, the converter 22 produces images coded in 256 gray levels. Thus, if the input data represents a stream quantized on 16 bits, the amplitude of each sample is projected from [0; 65535] to [0; 255]. In the example described here, the projection is linear. However, the projection can also be Gaussian, or any other suitable projection.

In the case where the input data relates to a video stream, the converter 22 is arranged to produce successive images of fixed size. As a reminder, a video stream relies on two main components: a container (whose role is to carry elementary packets of information) and a codec (whose role is to encode and decode elementary packets). Whatever the type of container and video codec used by a stream, the elementary decompression of this stream gives rise to a series of temporally ordered images of fixed size (for example 1920x1080 for a TV signal in HD format). Nevertheless, a re-encoding of this stream for a mobile terminal (for example 720x576 pixels for a TV signal in SD format) will give rise to images of a different definition. In addition, other broadcast parameters influence the final size of the elementary image of a stream, such as the addition of horizontal black bars to transform a 16:9 signal into a 4:3 signal.

In order to eliminate the dependence of the subsequent processing steps on the size of the original image, the image is resized to a fixed size, regardless of the input stream. This situation is fairly standard, and it is therefore a matter of reducing an image whose dimensions are given by the video stream to a chosen format, 120 * 160 in the example described here. In the case where the images of the video stream of the input data have an aspect ratio different from that of 120 * 160, the converter 22 can operate:

- by cutting selected portions of each image in order to obtain the same aspect ratio as the images produced by the converter 22 (that is to say 3/4), or

- by extrapolating selected portions of each image in order to obtain the same aspect ratio as the images produced by the converter 22 (that is to say 3/4), or

- by producing images whose aspect ratio corresponds to that of the images of the input data, that is to say 120 * (K * 160), where K is an aspect compensation factor.

As in the case where the input data relates to an audio stream, the aim is to produce a fingerprint stream at 25 Hz. The converter 22 is arranged to select an image every 1/25th of a second in the input data. In the case where the video stream of the input data has a rate other than 25 frames per second, for example 30 frames per second, the converter 22 can carry out an extrapolation from the images surrounding each time marker at 25 Hz.

At the output, the converter 22 transmits the image data corresponding to each successive image derived from the input data to the encoder 24. The function of the encoder 24 is to produce compressed data comp, which constitutes a compressed version of the image data. In the example described here, the encoder 24 is the standard, free JPEG encoder developed and distributed by the Independent JPEG Group. Alternatively, the encoder 24 could also be the open-source WebP encoder developed by Google. The encoder 24 has the particularity of performing a lossy encoding operating by block processing and quantization. Other image encoding algorithms with similar characteristics may be considered.
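By way of non-limiting illustration, two of the video-side choices of the converter 22 described above (locating the images surrounding each 25 Hz time marker, and cutting an image to a fixed aspect ratio) can be sketched as follows. The function names are hypothetical. Two assumptions are made that the description does not settle: the "extrapolation" is read here as a blend between the two surrounding images, and the 120 * 160 format is treated as a landscape image 160 pixels wide and 120 pixels high.

```python
from fractions import Fraction


def surrounding_frames(tick, src_fps, target_hz=25):
    """For the tick-th time marker at `target_hz`, return (i0, i1, w):
    the source images just before and after the marker, and the blend weight of i1."""
    pos = Fraction(tick, target_hz) * src_fps   # marker position, in source-image units
    i0 = pos.numerator // pos.denominator
    w = pos - i0
    i1 = i0 if w == 0 else i0 + 1               # marker falls exactly on an image
    return i0, i1, w


def center_crop_box(width, height, ratio_w=4, ratio_h=3):
    """Largest centered box (left, top, right, bottom) with ratio ratio_w:ratio_h."""
    if width * ratio_h > height * ratio_w:      # too wide: trim left and right
        new_w = height * ratio_w // ratio_h
        left = (width - new_w) // 2
        return left, 0, left + new_w, height
    new_h = width * ratio_h // ratio_w          # too tall (or exact): trim top and bottom
    top = (height - new_h) // 2
    return 0, top, width, top + new_h


print(surrounding_frames(1, 30))      # (1, 2, Fraction(1, 5))
print(center_crop_box(1920, 1080))    # (240, 0, 1680, 1080)
```

For a 30 frames per second source, the marker at 1/25 s falls one fifth of the way between source images 1 and 2, while every fifth marker coincides exactly with a source image.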

At the output, the compressed data is transmitted to the projector 26. The projector 26 generates the fingerprint data stream prnt_dat by taking the weight (that is, the data size) of the compressed data generated successively by the encoder 24, and projecting it onto the interval [0; 255]. In the example described here, the projection is linear. However, the projection can also be Gaussian, or any other suitable projection.

Figures 3 and 4 show examples of fingerprints produced with a JPEG encoder for Figure 3, and with a WebP encoder for Figure 4. Surprisingly, these fingerprints are almost superimposable.

The use of the encoder 24 makes the fingerprint data robust to the transmission noise of the stream defining the input data, and produces compressed data whose weight is an intrinsic measure of the quantity of information (in Shannon's sense) carried by the image data. Thus the fingerprint data is abstract with respect to the input data, while being strongly related to it. In addition, although fingerprint data taken in isolation is not always discriminating, the fact that it is generated as a stream makes the fingerprint generation process particularly robust, repeatable and discriminating. Thus, the fingerprint stream is invariant with respect to the transformations or losses that can affect a video or audio signal during its transmission and reproduction (noise, re-encoding, resizing, changes in color, contrast or brightness), and has the descriptive power to uniquely identify any excerpt from that stream. Finally, the generation process is very inexpensive in computing time, which makes it possible to generate a robust fingerprint in real time.
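The linear projection of the weight of the compressed data can be sketched as follows, purely by way of illustration. The description does not give the bounds of the projection, so the parameters `w_min` and `w_max` are assumptions, as is the clamping of out-of-range weights. To keep the sketch self-contained, the standard-library `zlib` compressor is used as a stand-in: it is lossless and is not the JPEG or WebP encoder of the description, and serves only to show that the compressed weight grows with the quantity of information in the image.

```python
import random
import zlib


def project_linear(weight, w_min, w_max):
    """Linearly project a compressed size in bytes onto the integers 0..255."""
    clamped = max(w_min, min(weight, w_max))   # assumed handling of out-of-range weights
    return (clamped - w_min) * 255 // (w_max - w_min)


# Stand-in encoder: zlib is lossless, NOT the lossy block-based encoder 24.
rng = random.Random(0)
flat = bytes(1000)                                        # uniform image: little information
noisy = bytes(rng.getrandbits(8) for _ in range(1000))    # busy image: much information

values = [project_linear(len(zlib.compress(img)), w_min=0, w_max=1100)
          for img in (flat, noisy)]
print(values)  # the uniform image yields the smaller fingerprint value
```

In a real chain the bounds would presumably be chosen from the range of weights that the encoder actually produces for the fixed image size.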

The conversion of an input data stream relative to an audio or video stream indifferently into successive image data may seem surprising. This is a major discovery by the Applicant.

Indeed, as has been seen, the Applicant oriented its research toward fingerprint generation that takes into account the fact that content is transmitted as a stream. In doing so, it discovered that it is advantageous to produce the fingerprint as a stream as well. Continuing its research, the Applicant identified that the elementary elements of the stream (the images for a video stream, and the sample windows for an audio stream) represent instantaneous static / spatial information. This discovery, in turn, led to ruling out video or audio encoders for generating fingerprints, since such encoders intrinsically correlate the elements of the stream in order to take advantage of the redundancies between its successive elementary elements.

Thus, the Applicant became interested in image compression algorithms such as JPEG, which reduce noise while preserving only the "useful" amount of information, which is reflected in the variable weight of each image. This led to the conversion / encoding / weight projection structure, which it applied to video streams. Continuing its research, the Applicant also discovered that this advantage is obtained just as well with an audio stream as with a video stream, and that the audio or video nature of the stream for which the fingerprint is generated is less important than the fact that this stream carries information of a sequential and instantaneous nature.

This results in a very lightweight fingerprint generation process, both in terms of the weight of the generated fingerprints and in terms of the cost of generating them.

In the above, it has been considered that the streaming audio-video data is of a digital nature. In a variant, the device according to the invention may comprise a stage for analog acquisition and for digital conversion according to the recommended formats described above.

Similarly, the examples described here recommend an input audio data stream at 44.1 kHz, with windows of 0.25 s, for a fingerprint data stream at 25 Hz, and an input video data stream at 25 frames per second, with an aspect ratio of 3/4. These particular elements may vary depending on the desired applications.

Finally, in addition to the provision of an automated trusted third party service, as well as additional information and/or targeted advertising, the device of the invention can also be used to detect the presence of illegal content on content-sharing platforms, by detection at the input before any sharing, which offers great security for the hosting of content.

Claims

1. Device for analyzing streaming audio-video data, characterized in that it comprises a selector (20) arranged to determine input data relating to an audio stream or a video stream in the streaming audio-video data, a converter (22) arranged to produce image data at a frequency selected from the input data, an encoder (24) arranged to produce compressed data from the image data, and a projector (26) arranged to produce fingerprint data from the compressed data, the converter (22) being arranged to produce the image data in the form of a fixed size image, the encoder (24) being arranged to work successively on each image described by the image data, and the projector (26) being arranged to produce the fingerprint data as a stream from the weight of the successively produced compressed data.

2. Device according to claim 1, wherein the converter (22) is arranged to segment input data relating to an audio stream into successive windows of samples, and to convert the input data of each window to image data by converting the amplitude of each sample to a gray-scale value, the converter (22) being further arranged to produce image data of a given window in the form of an image in which successive pixels of a given line correspond to successive samples of the input data which each have a corresponding gray-scale value, and in which the lines of the image are identical to each other.

3. Device according to claim 2, wherein the windows have a duration of 0.25 s, and are separated from each other by a number of samples making it possible to obtain image data at the chosen frequency.

4. Device according to claim 1, wherein the converter (22) is arranged to select images in input data relating to a video stream according to the chosen frequency, and to produce the image data by converting those images to a chosen dimension.

5. Device according to claim 4, wherein the selected dimension is 120 * 160.

6. Device according to one of the preceding claims, wherein the encoder (24) comprises a lossy image compressor.

7. Device according to claim 6, wherein the encoder (24) operates by block processing and quantization.

8. Device according to claim 7, wherein the encoder (24) comprises a compressor of the JPEG family, or a compressor of the WebP type.

9. Device according to one of the preceding claims, wherein the projector (26) is arranged to produce the fingerprint data by projecting on a given range the weight of the compressed data produced successively according to a chosen projection law.

10. Device according to claim 9, wherein the range comprises integers between 0 and 255, and wherein the projection law is linear.