ES2382747B1

ES2382747B1 - MULTIMODAL INTERACTION ON DIGITAL TELEVISION APPLICATIONS

Info

Publication number: ES2382747B1
Application number: ES200930385A
Authority: ES
Inventors: Jose Luis Gomez Soto; Susana Mielgo Fernandez
Original assignee: Telefonica SA
Current assignee: Telefonica SA
Priority date: 2009-06-30
Filing date: 2009-06-30
Publication date: 2013-05-08
Anticipated expiration: 2029-06-30
Also published as: AR077281A1; ES2382747A1; UY32729A; WO2011000749A1

Abstract

The invention proposes a method of multimodal interaction for digital television applications in which the multimodal application resides on a web server and is downloaded by a browser (110) residing in the television decoder (100) itself. By using a distributed system of components that communicate through communication protocols, all multimodal interaction analysis can be performed in real time. The system allows the user to interact with the application by means of the remote control or by voice.

Description

Multimodal interaction on digital television applications.

Field of the invention

The present invention applies to the digital television sector, and more specifically to the field of human-machine interaction on terminals, such as digital television decoders or mobile phones, that are capable of running interactive applications displayed on a television set.

Background of the invention

A multimodal system must allow different input methods or mechanisms (keyboard, voice, images, etc.) to be used simultaneously, collecting the information from each of them as needed. For example, the user might sometimes say something by means of a voice command, at other times select a name from a list using the keyboard, or even select a menu or a region of the screen by pointing with a finger. The multimodal interface engine must be able to detect the interaction method that the user has freely chosen, discarding incongruous information received through the other methods.

With regard to user interface design, interfaces have traditionally been based on the desktop metaphor, developed decades ago in the Xerox laboratories, which attempts to transfer all the objects and tasks normally found in a real office to the world of computers. For example, both real and electronic files can be stored, the traditional typewriter has its equivalent in the word processor, the blank sheet of paper corresponds to the blank document in the word processor, and so on. In this way, the mental model that the user has when performing these traditional tasks carries over to the computer with few changes; that is, the aim is to achieve the greatest possible familiarity between objects and actions. This desktop metaphor has been implemented through the WIMP paradigm (Windows, Icons, Menus and Pointers), whose elements underpin the vast majority of current graphical interfaces.

However, this paradigm is clearly inappropriate in an interactive digital TV environment, for several reasons. The first is related to the very nature of the tasks that a user performs in an interactive application, which are more relaxed and closer to an entertainment or social setting and are therefore very different from those of a real office. The second is that the device with which the user interacts (the remote control) is very different in functionality and accessibility from the keyboard and mouse, which imposes many restrictions when performing tasks in a digital TV environment (for example, entering text with the remote control to perform a simple search can become a laborious task). For many years, ever since its introduction, the remote control has been the device par excellence in the TV environment, and it has made it possible to control a wide variety of devices and their associated functions. However, the task models used in the interactive services currently deployed commercially, on any distribution technology and development environment, often make its use inefficient and lead to serious usability problems, which in turn cause users to lose motivation and interest in exploring the services. (Usability is defined as the efficiency and satisfaction with which a product allows specific users, such as television viewers, to achieve specific goals, such as purchasing a football match, in a specific context of use, such as the living room of a home.)

If we also take into account that many people have accessibility problems when using a traditional remote control, we can conclude that the traditional mechanism of interaction with television has clearly become outdated and has been overtaken by the new interactive services that run on digital television decoders.

Tasks such as entering text with the remote control when searching an EPG (Electronic Programming Guide), or sending a message through an interactive TV application, can become laborious enough that the user eventually loses interest in using them. Such data is usually entered through a virtual keyboard that appears on the screen and may resemble either a mobile phone keypad or an ANSI keyboard. In either case the process is slow, not everyone is used to operating the remote control as if it were a mobile phone keypad, and errors are not uncommon (the remote control works by infrared, so ambient light, objects located between the user and the receiver, and similar factors can prevent a key press from resulting in a character being entered). Almost all usability studies and tests carried out on interactive applications identify this process as costly for the user.

It should also be noted that TV has a much more social character: the user is usually in a much more relaxed setting, seated 3 to 4 metres from the TV set, and far less focused than when working at a computer. Clearly, many of the tasks that are performed on a computer through a traditional graphical interface either cannot be performed or must be performed in a very different way.

For all of the above reasons, the desktop metaphor has necessarily been abandoned in the development of digital TV applications.

Moreover, because of all the restrictions mentioned above, interactive digital TV applications run in a single window presented at any one time (rather than in several, as in PC graphical interfaces). The different multimedia objects that make up the scene (text, graphics, video, etc.) are arranged in this window and synchronized along a timeline, producing a set of scenes that describe the different actions or steps the user must complete to achieve a goal. For example, to purchase a film from an interactive video-on-demand system, the user must first enter that section, search for the content according to some criterion, enter the data, select the content, enter a purchase PIN, and so on. The different objects appear in the scene in a synchronized manner as the user interacts with them.

Although this concept, based on the description of scenes and the simultaneous presentation of objects (video, text, graphics), may seem simple, it is costly in terms of graphics processing. The cost is especially pronounced on devices such as TV decoders, where business models impose significant cost-driven restrictions on the electronic components that make up the device. Nevertheless, technologies and mechanisms that support this paradigm at the presentation layer already exist in the industry.

If we try to carry the concept of object synchronization in the presentation layer over to the synchronization of the different interaction mechanisms (remote control, voice commands, etc.), we find that hardly any architectures supporting such synchronization have been developed for digital TV environments. For example, returning to the interactive application for purchasing a video-on-demand film, we might want to search for the content by giving the system a voice command but then, for privacy reasons, enter the purchase PIN with the traditional remote control. Managing the different interaction mechanisms simultaneously is complex from a semantic point of view (for example, when contradictory orders are given at the same time through the voice and graphical interfaces) and costly in computational resources. In addition, multimodal interactive applications will run on devices with little processing capacity (100 MHz CPUs and 32-64 MB of RAM, far below the performance of any home PC with a 1 GHz CPU and 1 GB of RAM), such as digital television decoders, on which real-time voice processing is impossible. It can therefore be concluded that there are significant technical difficulties in offering multimodal interfaces in digital TV environments.

Several patents deal with multimodal interfaces, although none of them applies specifically to the field of digital television.

Patent US5265014 focuses on resolving the ambiguities that arise when natural language is used as an interaction mechanism. US4829423 is similar, although it concentrates on solving the problems that arise when natural language is used in a multimodal environment. Patent US6345111 relates to an image analysis mechanism by which the system recognizes which object the user is looking at, so that gaze serves as an input mechanism for selecting elements. US5577165 describes in general terms how a dialogue between a machine and a person is carried out, taking into account the keywords detected during the recognition process and the different states through which the system passes during the dialogue.

Attempting to implement a complete multimodal interface architecture on lightweight devices, such as digital TV decoders or mobile phones, runs into performance problems. These arise mainly from real-time voice processing and from embedding a multimodal interpreter (for example SALT or VoiceXML) in the device itself, which is also costly in terms of processing and memory. New architectures are therefore needed that keep response times as short as possible for any of the interaction mechanisms (visual or voice commands), as is already the case on more powerful devices such as PCs, where all voice processing (synthesis and recognition) is performed on the machine itself without external processing. For example, performing real-time voice recognition with acceptable response times (under 5 seconds), so that the user's attention is not lost, would require the CPU and RAM of a current mid-range PC (512 MB RAM, 1 GHz CPU). This is far beyond the processing capabilities of digital television decoders (32 MB RAM, 100 MHz CPU), even high-end ones (128 MB RAM, 200 MHz CPU).

In summary, the technical problem is that the limited processing capacity of digital television decoders prevents the development on them of true multimodal applications that use voice interaction simultaneously with visual interaction.

Object of the invention

The object of the present invention is to provide a method, platform or system that allows different interaction mechanisms to coexist simultaneously and in a synchronized manner in a digital television environment, thus producing what are known as multimodal interfaces. To this end, the invention proposes a method of multimodal interaction for interactive digital TV applications, in which the television is provided with a networked decoder incorporating an associated browser, and in which the method mainly comprises the following steps:

a. Connection of the browser to a network server and download of a multimodal application and its descriptive tags, which are generated in response to an interaction event produced by a user during the human-machine dialogue.

b. Sending, by the browser, of the tags that characterize the multimodal application to an interpreter residing on a network server.

c. Interpretation of the tags by the interpreter, which orders the execution of the actions corresponding to the tags.

d. Repetition of steps a to c until the user exits the application.
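The steps above can be sketched as a simple loop. The following Python sketch is illustrative only: the names `MultimodalPage`, `WebServer`, `Interpreter` and `run_session`, as well as the `"exit"` event, are hypothetical and do not appear in the patent, and the network server and the interpreter are replaced by in-memory stand-ins.

```python
from dataclasses import dataclass


@dataclass
class MultimodalPage:
    """A downloaded application page: visual markup plus its descriptive tags."""
    html: str
    tags: list


class WebServer:
    """In-memory stand-in for the network server hosting the application."""

    def __init__(self, pages):
        self.pages = pages  # maps an interaction event to the page it produces

    def download(self, event):
        # Step a: the page is generated in response to an interaction event.
        return self.pages.get(event)


class Interpreter:
    """In-memory stand-in for the network-side tag interpreter."""

    def __init__(self):
        self.actions = []

    def interpret(self, tags):
        # Step c: each tag results in an ordered action.
        for tag in tags:
            self.actions.append(f"execute:{tag}")


def run_session(server, interpreter, events):
    """Repeat steps a to c for each user event until the user exits (step d)."""
    rendered = []
    for event in events:
        if event == "exit":                # hypothetical exit event
            break
        page = server.download(event)      # step a
        if page is None:
            continue                       # unknown event: nothing to show
        rendered.append(page.html)         # the browser shows the visual part
        interpreter.interpret(page.tags)   # step b (send tags) and step c
    return rendered
```

In this sketch the browser keeps only the visual rendering, while everything related to the tags is handed to the interpreter, mirroring the division of work between decoder and network described in the method.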

In step a, the events may be graphical events, voice events or both. It is advantageous to associate with the browser an external module whose function is to transfer the descriptive tags of the voice dialogue to the interpreter of those tags by means of an IP protocol. The interpreter preferably coordinates and controls all voice events and communicates, using the MRCP protocol, with one or more servers that provide voice resources. Also preferably, it analyses the structure of the multimodal application and sends the corresponding commands to the MRCP-compliant voice server. Optionally, it communicates with the external module and passes it the data needed for the module to establish a SIP session with the MRCP voice server. The decoder can send voice data to and receive voice data from the MRCP server using the RTP protocol. The external module preferably establishes communication with an RTP client, thereby obtaining the status of the communication between the decoder and the MRCP voice server. The decoder, in turn, has an application capable of collecting data from any external device that captures audio and of sending it over an IP connection to the voice servers. This application can compress the audio data into a format compatible with an MRCP server and send it over RTP to the voice server. Preferably, the decoder also has an application capable of collecting the audio data arriving on the RTP channel, decompressing it into a format the decoder can play, and sending it to the electronic device within the decoder that is responsible for generating audio.
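As a rough illustration of the audio path described above, the following Python sketch builds a single RTP packet of the kind a decoder-side application might send to the voice server. The fixed 12-byte header layout follows the RTP specification (RFC 3550). The choice of G.711 A-law (payload type 8), the 20 ms frame of 160 bytes, the function name `build_rtp_packet` and the SSRC value are assumptions made for the example, since the text does not name a codec or packetization.

```python
import struct


def build_rtp_packet(payload, seq, timestamp, ssrc, payload_type=8, marker=False):
    """Prefix `payload` with a fixed 12-byte RTP header (RFC 3550).

    payload_type=8 is the static type for G.711 A-law; this codec is an
    assumption for the example and is not specified in the source text.
    """
    # First byte: version 2, no padding, no extension, zero CSRC entries.
    first = 2 << 6
    # Second byte: marker bit followed by the 7-bit payload type.
    second = (int(marker) << 7) | (payload_type & 0x7F)
    header = struct.pack(
        "!BBHII",                    # network byte order, 12 bytes in total
        first,
        second,
        seq & 0xFFFF,                # 16-bit sequence number
        timestamp & 0xFFFFFFFF,      # 32-bit media timestamp
        ssrc & 0xFFFFFFFF,           # 32-bit synchronization source identifier
    )
    return header + payload


# One 20 ms frame of 8 kHz A-law audio is 160 bytes (placeholder content here).
frame = bytes(160)
packet = build_rtp_packet(frame, seq=1, timestamp=160, ssrc=0x1234)
```

In a real deployment the packets would be sent over UDP to the address and port negotiated in the SIP session; that negotiation is outside the scope of this sketch.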

Communication between the browser in the decoder and the external module takes place through an application programming interface.

Optionally, the multimodal applications executed in the browser are preprocessed, separating the multimodal logic from the service logic, before being shown to the user.

The use of multimodal interfaces in the interactive TV environment is not intended to replace the remote control (visual interaction) but to complement and improve it according to the needs and wishes of the user.

Brief description of the figures

To aid understanding of the present description, and in accordance with a preferred practical embodiment of the invention, a figure is attached that is illustrative and non-limiting and that describes the architecture of the system (Figure 1).

Detailed description of the invention

The system of the invention is capable of performing all multimodal interaction analysis in real time, using a distributed system of components that communicate through the protocols shown in Figure 1. The power of the system lies in this distributed component architecture, which delegates CPU-intensive processes to external servers and machines.

The system of the invention must have:

• A decoder with an integrated web browser and a return channel that gives it access to an external server. The browser must allow the use and execution of a scripting language, for example JavaScript.

• This decoder is connected to the television in order to display the graphical part of the multimodal application and play the voice messages.

• The browser must allow the development and installation of plugins (modules external to the browser) that give the browser, and therefore the decoder, the specific functionality needed to interpret and execute multimodal applications.

• One or more external servers where the voice resources reside (speech synthesis and speech recognition).

• A server capable of interpreting the tags that provide multimodality (for example, SALT/VoiceXML). The system has an internal state machine that is updated throughout the human-machine dialogue. The SALT/VoiceXML tags (or similar) are distributed throughout the application's web page and provide features such as speech recognition or speech synthesis. Their location depends on the design of the web page itself; together with the traditional HTML or JS tags they make up what we call a multimodal application. For example, there may be a tag <prompt> "Text" whose end result is synthesized audio of the message that follows the tag. Similarly, there are other tags, <listen> and <reco>, that allow the user's commands to be recorded for later recognition. In any case, the syntax of the tags depends on the language or specification being used (SALT/VoiceXML or others).

• A web server where the multimodal application resides.

• Any external device, such as a microphone connected to the decoder, a mobile phone, a Bluetooth hands-free device, etc., that can capture the voice and send it in digital format to the decoder.
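As an illustration of how speech tags scattered through a page might be located, the following is a minimal sketch that uses Python's standard `html.parser`. The tag names are the ones given as examples in the description; the page content, the class name `SpeechTagCollector` and the function `extract_speech_tags` are hypothetical, and a real SALT or VoiceXML interpreter would follow the element set and namespaces of the specification in use.

```python
from html.parser import HTMLParser

# Tag names taken from the examples in the description. A real SALT or
# VoiceXML document would use that specification's own elements.
SPEECH_TAGS = {"prompt", "listen", "reco"}


class SpeechTagCollector(HTMLParser):
    """Collects speech tags, in document order, from a multimodal page."""

    def __init__(self):
        super().__init__()
        self.found = []     # list of (tag_name, text) pairs
        self._open = None   # speech tag currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in SPEECH_TAGS:
            self._open = tag
            self._text = []

    def handle_data(self, data):
        if self._open is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == self._open:
            self.found.append((tag, "".join(self._text).strip()))
            self._open = None


def extract_speech_tags(page: str):
    """Return the (tag, text) pairs an interpreter would have to act on."""
    collector = SpeechTagCollector()
    collector.feed(page)
    collector.close()
    return collector.found


# Hypothetical page mixing ordinary HTML with speech tags.
PAGE = """
<html><body>
  <h1>Video store</h1>
  <prompt>Which film would you like to rent?</prompt>
  <input type="text" id="title">
  <listen>title</listen>
</body></html>
"""

if __name__ == "__main__":
    for tag, text in extract_speech_tags(PAGE):
        print(tag, "->", text)
```

The ordinary HTML elements are ignored here, which mirrors the idea that the browser renders them while the speech tags are handled elsewhere.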

The system allows the user to interact with the application using either the remote control or the voice. The two methods are complementary, and the user decides which one to use in each case. This interaction between the user and the application is called the human-machine dialogue.

The system also synchronizes the events generated by the user through any of the possible modes (text/voice) with what is presented to the user, resolving any inconsistencies that may arise during the interaction. This synchronization is carried out by an external module that runs in the browser and prevents unwanted actions or actions that could have adverse effects on the system.

In summary, the system performs the following steps:

1. The multimodal application resides on a web server. In a first step, the user selects an application (content purchase, electronic banking, etc.) offered by the Digital TV service provider. Because it is a multimodal application, the service provider informs the user of this fact and asks the user to first connect the microphone, mobile phone or Bluetooth headset to the decoder (the details of this step are in any case outside the scope of the invention). Once the device is connected, the decoder downloads the application, via HTTP or another mechanism, to the web browser resident in the decoder; a plugin is launched and the multimodal application runs in the browser. At this point the plugin is already linked to the browser.

2. To allow real-time processing, the browser plugin sends the web page containing the tags that provide multimodality to an external interpreter (SALT/VoiceXML, for example).

3. Immediately, the browser processes the web page and displays it on the television.

4. Simultaneously, the multimodal interpreter (SALT/VoiceXML) recognizes and processes the multimodal tags and tells the voice servers which actions to perform at each moment (for example, at a given point in the human-machine dialogue with the interactive application, the multimodal tag indicates that a certain message shown on screen must be played as audio).

5. The voice server sends the audio data to the decoder, where it is processed and adapted so that it can be played by the television's audio device and heard by the user.

6. At this point the state of the dialogue has advanced one step, and the browser plugin informs the SALT interpreter that it can process the next multimodal tag. This process repeats until the user leaves the application.
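The tag-by-tag loop in steps 2 to 6 can be sketched as a small state machine. This is a simplified, single-process illustration under stated assumptions, not the patented implementation: `FakeSpeechServer`, `Interpreter` and `run_dialogue` are hypothetical names, the network transport is omitted, and the voice server is replaced by a stub that only records what it was asked to do.

```python
class FakeSpeechServer:
    """Stand-in for the external voice resources (synthesis and recognition)."""

    def __init__(self):
        self.log = []

    def synthesize(self, text):
        self.log.append(("tts", text))
        return ("audio", text)          # placeholder for real audio data

    def recognize(self, grammar):
        self.log.append(("asr", grammar))
        return ("awaiting", grammar)    # placeholder for a recognition request


class Interpreter:
    """Holds the dialogue state: which multimodal tag comes next."""

    def __init__(self, tags, speech_server):
        self.tags = list(tags)
        self.position = 0
        self.speech = speech_server

    def has_next(self):
        return self.position < len(self.tags)

    def process_next(self):
        # Steps 4 and 5: act on the current tag through the voice server.
        name, value = self.tags[self.position]
        if name == "prompt":
            return self.speech.synthesize(value)
        if name == "listen":
            return self.speech.recognize(value)
        raise ValueError(f"unsupported tag: {name}")

    def advance(self):
        # Step 6: the plugin reports that the dialogue has moved on one step.
        self.position += 1


def run_dialogue(tags, speech_server, user_exited=lambda: False):
    """Drive the loop until the tags run out or the user leaves."""
    interpreter = Interpreter(tags, speech_server)   # step 2: tags handed over
    results = []
    while interpreter.has_next() and not user_exited():
        results.append(interpreter.process_next())
        interpreter.advance()
    return results


if __name__ == "__main__":
    tags = [("prompt", "Welcome"), ("listen", "menu")]
    print(run_dialogue(tags, FakeSpeechServer()))
```

Keeping the position inside the interpreter reflects the description's point that the interpreter owns the state machine, while the plugin only tells it when to move forward.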

The multimodal application consists of an HTML document made up of two main frames (a frame is an HTML element that corresponds to one part of a web page): a frame containing the application, where the web content itself resides (the application frame, which is what is shown in the graphical interface and what the user ultimately sees on the terminal), and another frame for the runtime creation (instantiation) of the external module or plugin (the external module frame). The frames are used to separate the content of the application. The application frame contains the elements with which the user can interact both graphically and by voice. The external module or plugin is an application that works with the application frame to give it the specific functionality of the multimodal logic, that is, the functionality that maintains a human-machine dialogue through voice commands, complementing the traditional graphical interface and the remote control, for the whole time the application is running.

The structure of a multimodal application resides in an XML-type document formatted to comply with the specifications defined in SALT (Speech Application Language Tags) or VoiceXML (http://www.w3.org/Voice/).

The external frame is hidden from the user, since it contains no graphical interface. It is responsible only for grouping the application-specific SALT/VoiceXML tags mentioned in the previous paragraph and for instantiating the external module. This set of SALT/VoiceXML tags determines the multimodal logic, that is, the interactions allowed in the application. The tags configure speech synthesis and playback, as well as the speech recognizer and the set of events that can be triggered through the voice interface. For example, during the human-machine dialogue with the application, the user might see on the television an edit field inviting the entry of text with the remote control; in parallel, the application contains a SALT/VoiceXML tag indicating that the system is at that moment waiting for the user to give a voice command. The user can then choose either to enter the data with the remote control or to give the equivalent voice command.

To keep what is shown on the screen (in this case the television) consistent with the interactions or events performed by voice, the two frames communicate with each other through an application programming interface (API) provided via the architecture of the browser running in the decoder. This API defines the set of functions and procedures for communication between the two frames, giving a level of abstraction and separation between them. The API is defined in JavaScript, since it is compatible with the browser integrated in the decoder.

This frame structure makes it possible to separate the service logic from the multimodal logic. The multimodal logic (provided by the hidden frame in which the external module or plugin runs) is associated with managing the human-machine interaction from either the graphical or the voice interface; that is, it manages the events coming from the voice or graphical interface and launches the appropriate actions in response. It is also responsible for resolving the problems posed by simultaneous interactions on both interfaces. The service logic (provided by the application frame) is associated with achieving the goal the user actually has when using the application, such as buying a film in a pay-per-view Digital TV service. The frame structure also allows the external module or plugin to keep the multimodal service active while the user navigates between the different pages and contents of the main application. As a result, while a new graphical interface is loading, the system can still convert text to synthetic speech (TTS, Text To Speech) or convert speech into a format the system can understand (SR, Speech Recognition). This improves the user experience by avoiding waiting, since the service that provides multimodality is always running in the background.

The plugin that provides multimodality requires communication with the user to be event-based; that is, some action must occur, in response to which some process is triggered. For example, at some point in the dialogue the user can press an "Accept" button with the remote control or speak the voice command "Accept" for recognition; in both cases the same code is executed, with no distinction as to which interaction mechanism delivered the event. If the event arrives through the graphical interface, the plugin informs the SALT/VoiceXML interpreter that the button was pressed, so that its state machine stays synchronized with the execution of the application, and at the same time the code associated with the key press is executed. If the user gives a voice command and it is correctly recognized, the SALT/VoiceXML interpreter informs the plugin of this by means of a command, and the plugin generates the corresponding event, which causes the code associated with that button press to be executed.
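The idea that both input paths end in the same handler can be sketched as follows. The sketch is in Python for brevity, although the description states that the event handlers themselves are written in JavaScript; `MultimodalPlugin` and its method names are hypothetical, and the interpreter is reduced to a callback that receives the name of the event.

```python
class MultimodalPlugin:
    """Routes remote-control and voice events to one shared handler table."""

    def __init__(self, notify_interpreter):
        self._handlers = {}
        # Callback used to keep the interpreter's state machine in sync
        # when an event arrives through the graphical interface.
        self._notify_interpreter = notify_interpreter

    def on(self, event_name, handler):
        """Register the code to run for an event, whatever its origin."""
        self._handlers[event_name] = handler

    def remote_key_pressed(self, event_name):
        # Graphical path: tell the interpreter first, then run the handler.
        self._notify_interpreter(event_name)
        return self._dispatch(event_name)

    def voice_command_recognized(self, event_name):
        # Voice path: the interpreter produced this event, so it already
        # knows about it; only the handler needs to run.
        return self._dispatch(event_name)

    def _dispatch(self, event_name):
        return self._handlers[event_name]()


# Hypothetical wiring: the interpreter is represented by a plain list.
interpreter_log = []
plugin = MultimodalPlugin(interpreter_log.append)
plugin.on("accept", lambda: "purchase confirmed")

if __name__ == "__main__":
    print(plugin.remote_key_pressed("accept"))
    print(plugin.voice_command_recognized("accept"))
    print(interpreter_log)
```

Only the graphical path notifies the interpreter, which matches the asymmetry in the paragraph above: a recognized voice command originates from the interpreter side, whereas a key press does not.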

To manage all the events produced by both the user and the system, the JavaScript programming language is preferably used. It allows event handlers to be programmed, and these capture the actions produced in the system.

The browser used in Digital TV decoders interprets the JavaScript code embedded in web pages. In addition, the multimodal logic itself (the allowed interactions), determined by the SALT/XML tags, is coded in JavaScript. For these reasons, the multimodal external module requires JavaScript in order to communicate with the browser.

The multimodal external module also needs access to the decoder's audio hardware resources in order to control basic actions such as play, stop, etc.

Figure 1 shows the high-level architecture of an example system capable of carrying out the method of the invention:

• Decoder [100]:

  • Browser [110] with the possibility of including an external module [120].

  • Customized software that allows the incorporation of modules enabling an IP connection with other systems and any other communication protocol.

  • SIP (Session Initiation Protocol) client [140]: responsible for creating a session between the decoder [100] and the MRCP (Media Resource Control Protocol) server [460].

  • RTP (Real-time Transport Protocol) client [170].

  • Audio recording [190] and audio playback [180].

  • IP network connection [500].

• External servers:

  • MRCP server [350] and [460].

  • SALT/VoiceXML interpreter [300].

  • Voice resources [400] (TTS, Text To Speech; SR, Speech Recognition).

Communication between the client (decoder) [100] and the resources of the MRCP server [460] takes place over SIP [610] (Session Initiation Protocol), which allows multimedia sessions to be established through an exchange of messages between the parties that want to communicate. The decoder [100] implements a SIP module [140] that creates a session [610] by sending requests to the MRCP server. In this message it also sends the characteristics of the session it wants to establish, such as the supported audio encoders/decoders, the addresses and ports on which it expects to receive audio, transmission rates, etc., which are needed to carry out speech synthesis and recognition. All these actions are coordinated by the SALT interpreter [300].
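Session characteristics of this kind (codecs, address, receiving port) are commonly carried in an SDP body inside the SIP request. The sketch below builds only a minimal audio offer; the patent does not specify the message contents, so the function name `build_sdp_offer`, the address and the port are illustrative assumptions, and a real MRCP session setup would need additional SDP lines beyond this audio description.

```python
def build_sdp_offer(ip: str, rtp_port: int, session_id: int = 0) -> str:
    """Build a minimal SDP audio offer advertising G.711 mu-law and A-law.

    Payload types 0 (PCMU) and 8 (PCMA) are the static RTP/AVP assignments
    for 8 kHz G.711 audio. Everything else here is illustrative.
    """
    payloads = {0: "PCMU/8000", 8: "PCMA/8000"}
    lines = [
        "v=0",                                          # protocol version
        f"o=- {session_id} {session_id} IN IP4 {ip}",   # origin
        "s=-",                                          # session name (unused)
        f"c=IN IP4 {ip}",                               # where to send media
        "t=0 0",                                        # unbounded session time
        # Media line: audio, receiving port, profile, offered payload types.
        f"m=audio {rtp_port} RTP/AVP " + " ".join(str(pt) for pt in payloads),
    ]
    lines += [f"a=rtpmap:{pt} {name}" for pt, name in payloads.items()]
    return "\r\n".join(lines) + "\r\n"


if __name__ == "__main__":
    # 192.0.2.10 is a documentation-only address; 4000 is an arbitrary port.
    print(build_sdp_offer("192.0.2.10", 4000))
```

The order of payload types on the `m=` line expresses preference, so listing PCMU first signals that mu-law is preferred when the server supports both.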

On the decoder side [100], an RTP (Real-time Transport Protocol) module [170] may be necessary, since the decoder may not support streaming playback over RTP. In that case a player [180] is needed that can send to the system speaker the raw voice data collected and stored in real time in the buffer of the RTP client [170].

Sending the audio [620] over the RTP channel to the RTP element [411] relies on an application [190] that records and collects the raw voice data from the audio input device (for example, a microphone) and then creates a buffer with that data. Depending on the capabilities of the voice servers and the formats to be supported, different voice compressors/decompressors may be needed, for example PCMU to PCM (PCMU: Pulse Code Modulation, mu-law).
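To make the RTP channel concrete, the following sketch packs and unpacks the fixed 12-byte RTP header around a buffer of encoded audio. It covers only the basic header (no padding, extension or CSRC entries); the function names and the example values are hypothetical, and socket handling, timing and jitter buffering are left out.

```python
import struct

RTP_VERSION = 2
PCMU_PAYLOAD_TYPE = 0   # static payload type for G.711 mu-law
HEADER_FORMAT = "!BBHII"  # flags, marker+type, sequence, timestamp, SSRC
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 12 bytes


def build_rtp_packet(payload: bytes, seq: int, timestamp: int, ssrc: int,
                     payload_type: int = PCMU_PAYLOAD_TYPE,
                     marker: bool = False) -> bytes:
    """Prefix an audio payload with a basic RTP header."""
    byte0 = RTP_VERSION << 6  # version 2, no padding, no extension, CC = 0
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    header = struct.pack(HEADER_FORMAT, byte0, byte1,
                         seq & 0xFFFF,
                         timestamp & 0xFFFFFFFF,
                         ssrc & 0xFFFFFFFF)
    return header + payload


def parse_rtp_packet(packet: bytes) -> dict:
    """Split a basic RTP packet back into header fields and payload."""
    byte0, byte1, seq, timestamp, ssrc = struct.unpack(
        HEADER_FORMAT, packet[:HEADER_SIZE])
    return {
        "version": byte0 >> 6,
        "marker": bool(byte1 >> 7),
        "payload_type": byte1 & 0x7F,
        "seq": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
        "payload": packet[HEADER_SIZE:],
    }


if __name__ == "__main__":
    # 160 bytes of mu-law audio is 20 ms at 8 kHz; the values are arbitrary.
    packet = build_rtp_packet(b"\xff" * 160, seq=1, timestamp=160, ssrc=0x1234)
    print(len(packet), parse_rtp_packet(packet)["payload_type"])
```

A recorder along the lines of [190] would increment `seq` by one per packet and `timestamp` by the number of samples sent, which lets the receiving side detect loss and reorder packets in its buffer.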

This compressor/decompressor application [180][190] implements the client/server paradigm, the object of the transaction being the voice data carried over RTP [620]. The application consists of two main modules:

• RTP player [180]: takes the sound arriving from the RTP channel in PCMU (or PCMA) format, decompresses and converts it to PCM, and finally plays it.

• RTP recorder [190]: reads data from the audio input device, compresses/converts it from PCM to PCMU (or PCMA), and finally sends it through the RTP channel [620].
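The PCM to PCMU conversion performed by these two modules can be illustrated with a pure-Python version of the widely used G.711 mu-law companding routine. This is a sketch for 16-bit signed samples, written out by hand so that it does not depend on any audio library; the function names are hypothetical, and a decoder would more likely use an optimized native implementation.

```python
BIAS = 0x84    # 132, added before finding the segment
CLIP = 32635   # largest magnitude that still fits after adding the bias


def linear_to_ulaw(sample: int) -> int:
    """Compress one 16-bit signed PCM sample to one mu-law byte."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), CLIP) + BIAS
    # Find the segment (exponent): position of the highest set bit, 7..14.
    exponent, mask = 7, 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    # mu-law bytes are transmitted with all bits inverted.
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def ulaw_to_linear(byte: int) -> int:
    """Expand one mu-law byte back to a 16-bit signed PCM sample."""
    byte = ~byte & 0xFF
    sign = byte & 0x80
    exponent = (byte >> 4) & 0x07
    mantissa = byte & 0x0F
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if sign else magnitude


def encode_pcm(samples) -> bytes:
    """Recorder side: PCM samples to a PCMU payload."""
    return bytes(linear_to_ulaw(s) for s in samples)


def decode_pcmu(payload: bytes) -> list:
    """Player side: PCMU payload back to PCM samples."""
    return [ulaw_to_linear(b) for b in payload]


if __name__ == "__main__":
    original = [0, 1000, -1000, 32767, -32768]
    restored = decode_pcmu(encode_pcm(original))
    print(list(zip(original, restored)))
```

The conversion is lossy: each 16-bit sample becomes one byte, so the restored values are close to, but generally not equal to, the originals, with finer resolution for quiet samples than for loud ones.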

All these components are in turn coordinated by the SALT interpreter [300]. The most common actions performed with them are PLAY, STOP, PAUSE and REC.

The invention can be applied to almost all interactive services that run on Digital TV decoders meeting the characteristics described above. Examples of such interactive services are:

• Control of and navigation through EPGs (Electronic Program Guides), ESGs (Electronic Service Guides) or UEGs (Unified Electronic Guides).

• Control of and navigation through VoD (Video on Demand) and CoD (Content on Demand) functionality.

• Control of and navigation through electronic banking services (home banking), electronic buying and selling, access to product catalogs, etc.

• Control of and navigation through the functionality offered by electronic messaging applications, and Internet browsing through browsers.

Claims

1. Method of multimodal interaction on interactive Digital TV applications, where the television is provided with a network decoder (100) that incorporates an associated browser (110), characterized by the following steps:

a. Connecting the browser to a network server and downloading a multimodal application and its descriptive tags, which are generated in response to an interaction event produced by a user during the human-machine dialogue.

b. Sending, by the browser, the tags that characterize the multimodal application to an interpreter (300) that resides on a network server.

c. Interpreting the tags by the interpreter, which orders the execution of the actions corresponding to the tags.

d. Repeating steps a-c until the user exits the application.

2. Method according to claim 1, characterized in that in step a. the events are graphical and/or voice events.

3. 3.: Método según la reivindicación 2 caracterizado porque se asocia un módulo externo (120) al navegador (110) con la función de transferir las etiquetas descriptivas del diálogo de voz al intérprete de dichas etiquetas (300) mediante un protocolo IP. Method according to claim 2 characterized in that an external module (120) is associated with the browser (110) with the function of transferring the descriptive tags of the voice dialogue to the interpreter of said tags (300) by means of an IP protocol.

4. Four.: Método según la reivindicación 3 caracterizado porque el intérprete (300) de las etiquetas descriptivas del diálogo de voz coordina y controla todos los eventos de voz. Method according to claim 3 characterized in that the interpreter (300) of the descriptive labels of the voice dialogue coordinates and controls all the voice events.

5. 5.: Método según la reivindicación 4, caracterizado porque el intérprete (300) de las etiquetas descriptivas del diálogo de voz se comunica con uno o varios servidores que proporcionan recursos de voz mediante el protocolo MRCP. Method according to claim 4, characterized in that the interpreter (300) of the descriptive labels of the voice dialogue communicates with one or more servers that provide voice resources by means of the MRCP protocol.

6. Method according to claim 5, characterized in that the interpreter (300) of the descriptive tags of the voice dialogue analyzes the structure of the multimodal application and sends the corresponding commands to the voice server that complies with the MRCP protocol.

7. Method according to claim 6, characterized in that the interpreter (300) of the descriptive tags of the voice dialogue communicates with the external module (120) associated with the browser of the decoder and transfers to it the data it needs to establish a session via SIP with the MRCP voice server (460).

8. Method according to claim 7, characterized in that the decoder receives voice data from, and sends voice data to, the MRCP server (460) by means of the RTP protocol.

9. Method according to claim 8, characterized in that the external module (120) associated with the browser establishes communication with an RTP client (170), thereby obtaining the status of the communication between the decoder and the MRCP voice server (460).

10. Method according to any of claims 5-9, characterized in that the decoder (100) has an application (190) capable of collecting the data coming from any external device that collects audio data and is capable of sending it via an IP connection to the voice servers.

11. Method according to claim 10, characterized in that said application is capable of compressing said audio data into a format compatible with an MRCP server and sending them via the RTP protocol to the voice server (400).

12. Method according to claim 11, characterized in that the decoder has an application (180) capable of collecting the audio data coming from the RTP channel, decompressing them into a format playable by the decoder and sending them to an electronic device within the decoder that is responsible for audio generation.

13. Method according to any of claims 3-12, characterized in that the communication between the browser (110) in the decoder and the external module (120) takes place through an application programming interface.

14. Method according to any of the preceding claims, characterized in that the multimodal applications executed in the browser (110) are preprocessed, separating the multimodal logic from the service logic before being shown to the user.

15. System capable of carrying out any of the methods of claims 1 to 14.

16. Use of the system of claim 15 in a pay-per-view digital television service.
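The loop formed by steps a to d of claim 1 can be sketched as follows. This is a minimal, non-limiting illustration in Python. The names `Page`, `WebServer`, `Interpreter` and `run_session`, the in-memory stand-ins for the network server and for the interpreter (300), and the `"exit"` event are all assumptions made for the example; none of them comes from the specification, and a direct function call replaces the network transfer of step b.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class Page:
    """Hypothetical downloaded page: visible content plus its descriptive tags."""
    content: str
    tags: List[str]


class WebServer:
    """In-memory stand-in (assumption) for the network server hosting the application."""

    def __init__(self, pages: Dict[str, Page]) -> None:
        self._pages = pages

    def download(self, event: str) -> Page:
        # Step a: the page and its tags are returned in response to a user event.
        return self._pages[event]


class Interpreter:
    """In-memory stand-in (assumption) for the tag interpreter (300)."""

    def __init__(self, actions: Dict[str, Callable[[], str]]) -> None:
        self._actions = actions

    def interpret(self, tags: Iterable[str]) -> List[str]:
        # Step c: run the action corresponding to each tag; unknown tags are skipped.
        return [self._actions[tag]() for tag in tags if tag in self._actions]


def run_session(server: WebServer, interpreter: Interpreter,
                events: Iterable[str]) -> List[str]:
    """Repeat steps a-c for each user event until the hypothetical "exit" event."""
    executed: List[str] = []
    for event in events:
        if event == "exit":  # step d: the user leaves the application
            break
        page = server.download(event)  # step a
        # Steps b and c: the tags are handed to the interpreter, which runs the actions.
        executed += interpreter.interpret(page.tags)
    return executed


if __name__ == "__main__":
    server = WebServer({
        "start": Page("main menu", ["prompt", "unknown-tag"]),
        "select": Page("programme detail", ["recognize"]),
    })
    interpreter = Interpreter({
        "prompt": lambda: "play prompt",
        "recognize": lambda: "start recognition",
    })
    # Events after "exit" are never processed.
    print(run_session(server, interpreter, ["start", "select", "exit", "start"]))
    # prints ['play prompt', 'start recognition']
```

In a deployment along the lines of claims 3 to 9, the direct call to `interpret` would be replaced by a transfer over IP, and the actions would issue MRCP commands to a voice server; the sketch deliberately leaves those parts out.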

SPANISH PATENT AND TRADEMARK OFFICE

Application no.: 200930385

SPAIN

Filing date of the application: 30.06.2009

Priority Date:

STATE OF THE ART REPORT

51 Int. Cl.: See Additional Sheet

RELEVANT DOCUMENTS

Category: Documents cited; Claims affected

X: US 2008255850 A1 (CROSS ET AL.) 16/10/2008, the whole document; claims affected: 1-16

X: EP 1455282 A1 (CIT ALCATEL) 08/09/2004, the whole document; claims affected: 1-4, 15, 16

Category of the documents cited: X: of particular relevance; Y: of particular relevance when combined with one or more other documents of the same category; A: reflects the state of the art; O: refers to a non-written disclosure; P: published between the priority date and the filing date of the application; E: earlier document, but published after the filing date of the application.

This report has been prepared: • for all claims • for claims no.:

Date of completion of the report: 14.05.2012. Examiner: J. Botella Maldonado

CLASSIFICATION OF THE SUBJECT MATTER OF THE APPLICATION: G06F 3/16 (2006.01); G06F 17/30 (2006.01); G10L 15/00 (2006.01)

Minimum documentation searched (classification system followed by classification symbols): G06F, G10L

Electronic databases consulted during the search (name of the database and, if possible, search terms used): INVENES, EPODOC, WPI, NPL, XPESP, XPAIP, XPI3E, INSPEC.

WRITTEN OPINION

Application number: 200930385

Date of Written Opinion: 05.05.2012

Statement

Novelty (Art. 6.1 LP 11/1986): Claims marked YES: none listed. Claims marked NO: 1-16.

Inventive step (Art. 8.1 LP 11/1986): Claims marked YES: none listed. Claims marked NO: 1-16.

The application is considered to meet the requirement of industrial applicability. This requirement was assessed during the formal and technical examination phase of the application (Article 31.2 of Law 11/1986).

Basis of the Opinion.-

This opinion has been made on the basis of the patent application as published.

1. Documents considered.-

The prior art documents taken into consideration in preparing this opinion are listed below.

Document: Publication or identification number; Publication date

D01: US 2008255850 A1 (CROSS et al.); 16.10.2008

D02: EP 1455282 A1 (CIT ALCATEL); 08.09.2004

2. Reasoned statement pursuant to Articles 29.6 and 29.7 of the Regulations implementing Law 11/1986, of 20 March, on Patents, regarding novelty and inventive step; citations and explanations in support of this statement

Document D01 discloses a system and method for the interaction of a user with a multimodal application operating in a multimodal browser on a multimodal device that supports multiple modes of interaction, including voice and other modes. The system (paragraph [0064]) incorporates network servers that provide users, through the applicable protocols (HTTP, HDTP, WAP or similar), with XHTML+Voice documents that are interpreted either on the multimodal device by means of an XMLX voice interpreter or on a remote voice server. In one of the embodiments (paragraphs [0078] to [0083]), the multimodal device and a voice server are connected by a VoIP protocol. The device presents a voice interface to the user that digitizes and encodes the outgoing voice and, through the browser and a voice module, delivers the data in RTP format; the process is reversed for incoming data. An example of a multimodal device that includes a video bus and a video adapter is presented in paragraph [0090] of the document.

Document D02 discloses operating software for a multimodal device with voice interaction through a browser, which allows interaction with applications that use XHTML and VoiceXML.

We consider that the subject matter of the invention set forth in claims 1 to 16 derives directly and unambiguously from document D01. Claims 1 to 4, 15 and 16 are also disclosed in document D02. Therefore, claims 1 to 16 are not new in view of the known state of the art, nor do they involve an inventive step.
