ES2382747A1

ES2382747A1 - Multimodal interaction on digital television applications

Info

Publication number: ES2382747A1
Application number: ES200930385A
Authority: ES
Inventors: Jose Luis Gomez Soto; Susana Mielgo Fernandez
Original assignee: Telefonica SA
Current assignee: Telefonica SA
Priority date: 2009-06-30
Filing date: 2009-06-30
Publication date: 2012-06-13
Anticipated expiration: 2029-06-30
Also published as: AR077281A1; UY32729A; ES2382747B1; WO2011000749A1

Abstract

The invention relates to a method of multimodal interaction on digital television applications, wherein the multimodal application resides on a web server and is downloaded by a browser (110) residing in the television decoder (100) itself. All the multimodal interaction analysis processes can be performed in real time using a distributed system of components connected through communication protocols. The system allows the user to interact with the application by means of the remote control or voice.

Description

Multimodal interaction on digital television applications.

Field of the Invention

The present invention applies to the digital television sector, more specifically to the field of human-machine interaction on terminals such as digital television decoders or mobile phones capable of running interactive applications that are displayed on a television set.

Background of the invention

A multimodal system must support different input methods or mechanisms simultaneously (keyboard, voice, images, etc.), collecting information from each of them as needed. For example, the user might sometimes issue a voice command, at other times select a name from a list using the keyboard, or even select a menu or a part of the screen by pointing with a finger. The multimodal interface engine must be able to detect the interaction method that the user has freely chosen, discarding incongruent information received through the other methods.

User interface design has traditionally been based on the desktop metaphor, developed decades ago at the Xerox laboratories, which tries to carry all the objects and tasks of a real office over to the world of computers. For example, both real and electronic files can be stored, the traditional typewriter has its equivalent in the word processor, the blank sheet of paper corresponds to the blank document in the word processor, and so on. In this way the mental model that the user has when performing these traditional tasks changes little when they are moved to the computer; that is, the aim is the greatest possible familiarity between objects and actions. The desktop metaphor has been implemented through the WIMP paradigm (Windows, Icons, Menus and Pointers), whose elements underpin the vast majority of current graphical interfaces.

However, this paradigm is clearly inappropriate in an interactive digital TV environment, for several reasons. The first is related to the nature of the tasks that a user performs on an interactive application (more relaxed and closer to an entertainment or social setting), which makes them very different from those of a real office. Second, the device with which the user interacts (the remote control) differs greatly in functionality and accessibility from the keyboard and mouse, which imposes many restrictions on performing tasks in a digital TV environment (for example, entering text with the remote control to run a simple search can become a laborious task). For many years, since its introduction, the remote control has been the device par excellence in the TV environment, and through it it has been possible to control a wide variety of devices and their associated functions. However, the task models used in the interactive services currently deployed commercially, across all distribution technologies and development environments, often make their use inefficient and cause serious usability problems, which leads users to lose motivation and interest in exploring them. (Usability is defined as the efficiency and satisfaction with which a product allows specific users, for example television viewers, to achieve specific goals, for example the purchase of a football match, in a specific context of use, for example the living room of a home.)

If we also consider that many people have accessibility problems when using a traditional remote control, we can conclude that the traditional mechanism for interacting with television has clearly become outdated and has been overtaken by the new interactive services that run on digital television decoders.

Tasks such as entering text with the remote control when searching an EPG (Electronic Programming Guide), or sending a message through an interactive TV application, can become laborious enough that the user eventually loses interest in using them. This data is usually entered with a virtual keyboard that appears on screen and that resembles either a mobile phone keypad or an ANSI keyboard. In either case the process is slow, not everyone is used to operating the remote control as if it were a mobile phone keypad, and errors are not infrequent (the remote control works by infrared, so ambient light, objects located between the user and the receiver, and similar factors can prevent a key press from registering as a character). Almost all usability studies and tests carried out on interactive applications identify this process as costly for the user.

It should also be noted that TV has a much more social character: the user is normally in a much more relaxed setting, seated 3-4 metres from the TV, and with far less concentration than working with a computer demands. Clearly, many of the tasks performed on a computer through a traditional graphical interface either cannot be performed or must be performed in a very different way. For all these reasons, the desktop metaphor has necessarily been abandoned in the development of digital TV applications.

Moreover, because of all the restrictions noted above, interactive applications on digital TV run in a single window presented at any one time (rather than several, as in PC graphical interfaces, for example). The multimedia objects that make up the scene (text, graphics, video, etc.) are arranged in this window and synchronized along a timeline, producing a set of scenes that describe the actions or steps the user must complete to reach the goal. For example, to buy a film from an interactive video-on-demand system, the user must first enter that section, search for the content by some criterion, enter the data, select the content, enter a purchase PIN, and so on. The different objects appear in the scene in a synchronized way as the user interacts with them.

Although this concept, based on scene description and the simultaneous presentation of objects (video, text, graphics), may seem simple, it is costly in terms of graphics processing. The cost is especially pronounced on devices such as TV decoders, where business models impose, for cost reasons, significant restrictions on the electronic components that make up the device. Nevertheless, technologies and mechanisms that support this paradigm at the presentation layer already exist in the industry.

If we try to carry the concept of object synchronization in the presentation layer over to the synchronization of the different interaction mechanisms (remote control, voice commands, etc.), we find that hardly any architectures supporting such synchronization have been developed for digital TV environments. For example, returning to the interactive application for buying a video-on-demand film, we might want to search for the content with a voice command but then, for privacy reasons, enter the purchase PIN with the traditional remote control. Managing the different interaction mechanisms simultaneously is complex from a semantic point of view (for example, when contradictory commands are given at the same time through a voice interface and a graphical interface) and costly in computational resources. In addition, multimodal interactive applications will run on devices with little processing capacity (100 MHz CPUs and 32-64 MB of RAM, far below any home PC with a 1 GHz CPU and 1 GB of RAM), such as digital television decoders, on which real-time voice processing is impossible. It can therefore be concluded that there are significant technical difficulties in offering multimodal interfaces in digital TV environments.

Several patents deal with multimodal interfaces, although none of them addresses the field of digital television specifically.

Patent US5265014 focuses on how to resolve the ambiguities that arise when natural language is used as an interaction mechanism. US4829423 is similar, although it concentrates more on how to solve the problems that arise when natural language is used in a multimodal environment. Patent US6345111 relates to an image-analysis mechanism by which the system can recognize which object the user is looking at, so that gaze serves as an input mechanism for selecting elements. US5577165 describes in general terms how the dialogue between a machine and a person is conducted, taking into account the keywords detected during recognition and the different states through which the system passes during the dialogue.

Implementing a complete architecture for multimodal interfaces on lightweight devices, such as digital TV decoders or mobile phones, raises performance problems. These mainly concern real-time voice processing and the inclusion of a multimodal interpreter (for example, SALT or VoiceXML) in the device itself, which is also costly in processing and memory. New architectures are therefore needed that keep response times as low as possible for any interaction mechanism (visual or voice commands), as already happens on more powerful devices such as PCs, where all voice processing (synthesis and recognition) is performed on the machine itself without external processing. For example, the CPU and RAM needed for real-time voice recognition with acceptable response times (under 5 seconds), so that the user does not lose attention, would correspond to a current mid-range PC (512 MB RAM, 1 GHz CPU). That is far beyond the processing capabilities of digital television decoders (32 MB RAM, 100 MHz CPU), even high-end ones (128 MB RAM, 200 MHz CPU).

In summary, the technical problem is that the limited processing capacity of digital television decoders prevents the development on them of genuine multimodal applications that use voice interaction simultaneously with visual interaction.

Object of the invention

The object of the present invention is to create a method, platform or system that allows different interaction mechanisms to coexist simultaneously and in a synchronized way in a digital television environment, producing what are known as multimodal interfaces. To this end, the invention proposes a method of multimodal interaction on interactive digital TV applications, where the television is provided with a networked decoder that incorporates an associated browser, and where the method mainly comprises the following steps:

a. The browser connects to a network server and downloads a multimodal application together with its descriptive tags, which are generated in response to an interaction event produced by a user during the human-machine dialogue.

b. The browser sends the tags that characterize the multimodal application to an interpreter residing on a network server.

c. The interpreter interprets the tags and orders the execution of the actions that correspond to them.

d. Steps a-c are repeated until the user exits the application.
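As an illustration only, steps a-d can be sketched as a simple loop. Everything named here (`fetch_application`, `RemoteInterpreter`, `run_session`, the event strings and the page table) is hypothetical and stands in for components the text describes only in general terms; no real network access is performed.

```python
from dataclasses import dataclass, field


@dataclass
class RemoteInterpreter:
    """Hypothetical stand-in for the network-side tag interpreter (step c)."""
    executed: list = field(default_factory=list)

    def interpret(self, tags):
        # Order one action per tag; here an action is just a recorded string.
        for tag in tags:
            self.executed.append(f"execute:{tag}")


def fetch_application(event):
    """Hypothetical stand-in for step a: the server answers an interaction
    event with the page markup and its descriptive tags."""
    pages = {
        "start": ("<html>menu</html>", ["prompt", "listen"]),
        "voice:search": ("<html>results</html>", ["prompt"]),
    }
    return pages.get(event, ("<html>menu</html>", []))


def run_session(events, interpreter):
    """Repeat steps a-c for each event until the user exits (step d)."""
    rendered = []
    for event in events:
        if event == "exit":
            break
        markup, tags = fetch_application(event)  # step a: download app and tags
        rendered.append(markup)                  # browser shows the graphical part
        interpreter.interpret(tags)              # steps b and c: send and interpret
    return rendered


interp = RemoteInterpreter()
pages = run_session(["start", "voice:search", "exit", "start"], interp)
print(len(pages), interp.executed)
```

In this sketch the event after `"exit"` is never processed, which mirrors the termination condition of step d.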


In step a, the events may be graphical and/or voice events. It is advantageous to associate with the browser an external module whose function is to transfer the descriptive tags of the voice dialogue to the tag interpreter over an IP protocol. This interpreter preferably coordinates and controls all voice events and communicates with one or more servers that provide voice resources through the MRCP protocol. Also preferably, it analyzes the structure of the multimodal application and sends the corresponding commands to the MRCP-compliant voice server. Optionally, it communicates with the external module and transfers to it the data needed for the module to establish a SIP session with the MRCP voice server. The decoder can send voice data to, and receive it from, the MRCP server through the RTP protocol. The external module preferably establishes communication with an RTP client, thereby obtaining the status of the communication between the decoder and the MRCP voice server. The decoder in turn has an application that collects data from any external device that captures audio and can send it over an IP connection to the voice servers. This application can compress the audio data into a format compatible with an MRCP server and send it through the RTP protocol to the voice server. Preferably, the decoder has an application that collects the audio data arriving on the RTP channel, decompresses it into a format the decoder can play, and sends it to the electronic device in the decoder responsible for audio generation.
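To make the RTP leg of this exchange more concrete, the following minimal sketch wraps a chunk of compressed audio in the 12-byte fixed RTP header defined in RFC 3550. The function names, the 160-byte payload and the choice of payload type 0 (PCMU in the standard audio/video profile) are assumptions made for illustration; the text does not state which codec or packet size the decoder uses.

```python
import struct

RTP_VERSION = 2


def build_rtp_packet(payload: bytes, seq: int, timestamp: int, ssrc: int,
                     payload_type: int = 0) -> bytes:
    """Prefix an audio payload with a fixed RTP header (no CSRC list,
    no padding, no extension, marker bit cleared)."""
    first_byte = RTP_VERSION << 6        # version in the top two bits
    second_byte = payload_type & 0x7F    # marker bit 0, 7-bit payload type
    header = struct.pack(
        "!BBHII",                        # network byte order, 12 bytes total
        first_byte,
        second_byte,
        seq & 0xFFFF,
        timestamp & 0xFFFFFFFF,
        ssrc & 0xFFFFFFFF,
    )
    return header + payload


def parse_rtp_header(packet: bytes) -> dict:
    """Decode the fixed header fields, as the receiving side would."""
    first_byte, second_byte, seq, timestamp, ssrc = struct.unpack(
        "!BBHII", packet[:12])
    return {
        "version": first_byte >> 6,
        "payload_type": second_byte & 0x7F,
        "seq": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
    }


# Assumed example: 160 bytes of already-compressed audio (20 ms at 8 kHz).
packet = build_rtp_packet(b"\x00" * 160, seq=1, timestamp=160, ssrc=0x1234)
fields = parse_rtp_header(packet)
```

In a real deployment the packet would then be sent over UDP to the address negotiated in the SIP session; that step is omitted here.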

Communication between the browser in the decoder and the external module takes place through an application programming interface.

Optionally, the multimodal applications executed in the browser are preprocessed, separating the multimodal logic from the service logic before they are shown to the user.
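One possible reading of this preprocessing step is sketched below: voice-related tags are extracted from the page so that they can be forwarded to the interpreter, while the remaining markup is left for the browser to render. The function `split_page`, the regular expression and the sample page are hypothetical, and real SALT or VoiceXML markup would need a proper parser rather than a regular expression.

```python
import re

# Tag names taken from the examples in this description; the set is assumed.
VOICE_TAGS = ("prompt", "listen", "reco")
_VOICE_RE = re.compile(
    r"<(%s)\b[^>]*>(.*?)</\1>" % "|".join(VOICE_TAGS),
    re.DOTALL | re.IGNORECASE,
)


def split_page(page: str):
    """Return (visual_markup, voice_tags) for a multimodal page.

    voice_tags is a list of (tag_name, inner_text) pairs in document order;
    visual_markup is the page with those elements removed.
    """
    voice = [(m.group(1).lower(), m.group(2).strip())
             for m in _VOICE_RE.finditer(page)]
    visual = _VOICE_RE.sub("", page)
    return visual, voice


page = ("<html><body><h1>Video on demand</h1>"
        "<prompt>Say a title</prompt><listen>title</listen>"
        "</body></html>")
visual, voice = split_page(page)
```

Keeping the two parts separate means the decoder only has to render ordinary markup, which fits the stated goal of moving the costly work off the device.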

The use of multimodal interfaces in the interactive TV environment does not seek to replace the remote control (visual interaction), but to complement and improve it according to the needs and wishes of the user.

Brief description of the figures

To aid understanding of the present description, and in accordance with a preferred practical embodiment of the invention, a figure is attached that is illustrative and non-limiting and that describes the system architecture (figure 1).

Detailed description of the invention

The system of the invention can perform all multimodal interaction analysis processes in real time, using a distributed system of components connected through the communication protocols described in figure 1. The power of the system lies in this distributed component architecture, which delegates CPU-intensive processes to external servers and machines.

The system of the invention must have:

• A decoder with an integrated web browser and a return channel that gives it access to an external server. The browser must allow the use and execution of a scripting language, for example JavaScript.

\bullet?: Este descodificador se conectará al televisor para permitir la visualización de la parte gráfica de la aplicación multimodal y reproducir los mensajes de voz.This decoder will connect to the TV to allow viewing of the graphic part of The multimodal application and play the voice messages.

\bullet?: El navegador debe permitir el desarrollo e instalación de plugins (o módulos externos al navegador) que proporcionen al navegador, y por lo tanto al descodificador, la funcionalidad específica para la interpretación y ejecución de aplicaciones multimodales.The browser must allow the development and installation of plugins (or modules external to browser) to provide the browser, and therefore the decoder, the specific functionality for interpretation and Multimodal application execution.

       \global\parskip0.930000\baselineskip\ global \ parskip0.930000 \ baselineskip

\bullet?: Uno o varios servidores externos donde residan los recursos de voz (síntesis y reconocimiento del habla).One or more external servers where voice resources reside (synthesis and recognition of speaks).

\bullet?: Un servidor capaz de interpretar las etiquetas que proporcionan la multimodalidad (como por ejemplo, SALT/ VoiceXML). El sistema posee un máquina interna de estados de forma que ésta se va actualizando durante todo el diálogo hombre-máquina. Las etiquetas SALT/VoiceXML (u otras similares) se encuentran distribuidas a lo largo de la página web de la aplicación, proporcionando características como el reconocimiento o la síntesis de voz. La ubicación de las mismas depende del propio diseño de la página web y junto a las tradicionales etiquetas de HTML o JS conformarían lo que llamamos aplicación multimodal. Así por ejemplo puede existir una etiqueta <prompt> "Texto" que lo que produce al final es un audio sintetizado del mensaje que aparece a continuación de la etiqueta. De forma similar, existen otras etiquetas <listen>, <reco> que permiten grabar los comandos del usuario para su posterior reconocimiento. En cualquier caso, la sintaxis de las etiquetas depende del propio lenguaje o especificación que se esté utilizando (SALT/VoiceXML u otros).A server capable of interpreting the labels that provide multimodality (such as SALT / VoiceXML). The system has an internal machine of states so that it is updated throughout the dialogue man-machine SALT / VoiceXML tags (or others similar) are distributed throughout the website of the application, providing features such as recognition or voice synthesis. The location of the same depends on the own Website design and next to the traditional labels of HTML or JS would make up what we call multimodal application. So for example there may be a <prompt> "Text" tag that what it produces in the end is an audio synthesized from the message that It appears after the label. Similarly, there are other <listen>, <reco> tags that allow you to record the User commands for later recognition. In any case, the syntax of the tags depends on the language itself or specification being used (SALT / VoiceXML or others).

\bullet?: Servidor web donde reside la aplicación multimodal.Web server where the multimodal application.

\bullet?: Cualquier dispositivo externo, como por ejemplo un micrófono conectado al descodificador, un teléfono móvil, un dispositivo manos libres bluetooth, etc. que permita recoger la voz y enviarla en formato digital al descodificador.Any external device, such as a microphone connected to the decoder, a mobile phone, a bluetooth hands-free device, etc. that allow to pick up the voice and send it in digital format to decoder
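
As a rough illustration of how voice tags can sit alongside ordinary HTML in a page, the sketch below scans a page string for <prompt> and <listen> elements. The page content, the `extractVoiceTags` helper and the regular expression are illustrative assumptions and are not taken from the patent; real SALT or VoiceXML documents are namespaced XML and would need a proper parser.

```javascript
// Hypothetical page mixing HTML with voice tags (invented for illustration).
const page = `
<html><body>
  <h1>Movie store</h1>
  <prompt>Welcome. Say the title you want to buy.</prompt>
  <listen id="title"></listen>
  <button id="accept">Accept</button>
</body></html>`;

// Collect voice tags in document order. A regular expression is enough for
// this toy input; it is not a general XML parser.
function extractVoiceTags(html, names = ["prompt", "listen", "reco"]) {
  const pattern = new RegExp(
    `<(${names.join("|")})\\b([^>]*)>([\\s\\S]*?)</\\1>`,
    "g"
  );
  const tags = [];
  for (const match of html.matchAll(pattern)) {
    tags.push({
      name: match[1],
      attributes: match[2].trim(),
      text: match[3].trim(),
    });
  }
  return tags;
}

const voiceTags = extractVoiceTags(page);
console.log(voiceTags);
```

A list like `voiceTags` is the kind of information an external interpreter would need in order to know which prompts to synthesize and where to listen for speech.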

The system allows the user to interact with the application using the remote control or the voice. The two methods are complementary, so the user can decide which to use in each case. This interaction between the user and the application is called the man-machine dialogue.

The system also synchronizes the events generated by the user through any of the possible modes (text/voice) with what is presented to the user, resolving any inconsistencies that may arise during the interaction. This synchronization is carried out by an external module that runs in the browser and prevents unwanted actions or actions that could have adverse effects on the system.

In summary, the system performs the following steps:

1. The multimodal application resides on a web server. In a first step, the user selects the application (content purchase, electronic banking, etc.) offered by their Digital TV service provider. Because the application is multimodal, the service provider informs the user of this fact and asks them to first connect the microphone, mobile phone or Bluetooth headset to the decoder (the details of this step are in any case outside the scope of the invention). Once the device is connected, the decoder downloads the application, via HTTP or another mechanism, to the web browser resident in the decoder; a plugin is started and the multimodal application runs in the browser. At this point the plugin is already linked to the browser.

2. To enable real-time processing, the browser plugin sends the web page containing the tags that provide multimodality to an external interpreter (SALT/VoiceXML, for example).

3. The browser immediately processes the web page and displays it on the television.

4. At the same time, the multimodal interpreter (SALT/VoiceXML) recognizes and processes the multimodal tags and tells the voice servers which actions to perform at each moment (for example, at a given point in the man-machine dialogue with the interactive application, a multimodal tag indicates that a particular on-screen message must be played as audio).

5. The voice server sends the audio data to the decoder, where they are processed and adapted so that they can be played by the television's audio device and heard by the user.

6. At this point the dialogue state has advanced one step, and the browser plugin informs the SALT interpreter that it can process the next multimodal tag. This process repeats until the user leaves the application.
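
The steps above can be sketched as a small in-memory simulation. Everything here is hypothetical: the `Interpreter` and `VoiceServer` classes stand in for the remote components, there is no network traffic, and the loop ends when the tags run out rather than when the user leaves the application.

```javascript
// In-memory stand-ins for the remote components; all names are hypothetical.
class VoiceServer {
  // Step 5: decide what the decoder should receive for a given tag.
  handle(tag) {
    if (tag.name === "prompt") return { kind: "audio", text: tag.text };
    return { kind: "listen" };
  }
}

class Interpreter {
  constructor(voiceServer) {
    this.voiceServer = voiceServer;
    this.tags = [];
    this.state = 0; // index of the next tag to process (dialogue state)
  }
  // Step 2: the plugin sends the tagged page.
  loadPage(tags) {
    this.tags = tags;
    this.state = 0;
  }
  // Step 4: process the current tag and instruct the voice server.
  processNext() {
    if (this.state >= this.tags.length) return null;
    return this.voiceServer.handle(this.tags[this.state]);
  }
  // Step 6: the plugin reports that the current step is complete.
  advance() {
    this.state += 1;
  }
}

function runDialog(tags) {
  const interpreter = new Interpreter(new VoiceServer());
  const log = [];
  interpreter.loadPage(tags);
  log.push("page rendered"); // step 3: the browser shows the page
  let action;
  while ((action = interpreter.processNext()) !== null) {
    log.push(
      action.kind === "audio" ? `play: ${action.text}` : "record user speech"
    );
    interpreter.advance();
  }
  return log;
}

const dialogLog = runDialog([
  { name: "prompt", text: "Choose a movie" },
  { name: "listen", text: "" },
]);
console.log(dialogLog);
```

Keeping the dialogue state as a single index mirrors the idea that the interpreter only moves to the next tag once the plugin acknowledges the previous one.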

The multimodal application consists of an HTML document made up of two main frames (a frame is an HTML element that corresponds to one part of a web page): a frame containing the application, where the web content itself resides (the application frame, which is what the graphical interface shows and what the user ultimately sees on the terminal), and another frame for the run-time creation (instantiation) of the external module or plugin (the external module frame). The frames are used to separate the content of the application. The application frame contains the elements with which the user can interact both graphically and by voice. The external module or plugin is an application that works with the application frame to provide the specific functionality of the multimodal logic, that is, the logic that makes it possible to maintain a man-machine dialogue through voice commands, complementing the traditional graphical interface and remote control, for the entire time the application is running.

The structure of a multimodal application resides in an XML-type document formatted to comply with the specifications defined in SALT (Speech Application Language Tags) or VoiceXML (http://www.w3.org/Voice/).

The external frame is hidden from the user, since it has no graphical interface; its only job is to group the SALT/VoiceXML tags mentioned in the previous paragraph, which are specific to each application, and to instantiate the external module. This set of SALT/VoiceXML tags determines the multimodal logic, that is, the interactions allowed in the application. The tags make it possible to configure speech synthesis and playback, the speech recognizer, and the set of events that can be triggered through the voice interface. For example, during the man-machine dialogue with the application, the user might see on the television an edit field inviting them to enter text with the remote control; in parallel, the application contains a SALT/VoiceXML tag indicating that the system is at that moment waiting for the user to give a voice command. The user can then choose either to enter the data with the remote control or to give the equivalent voice command.

To keep what is shown on the screen (in this case the television) consistent with the interactions or events performed by voice, the two frames communicate with each other through an application programming interface (API) provided over the architecture of the browser running on the decoder. This API defines the set of functions and procedures for communication between the two frames, providing a level of abstraction and separation between them. The API is defined in JavaScript, since that is compatible with the browser integrated in the decoder.
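
The patent does not list the functions of this API, so the following is only a minimal sketch of what a JavaScript interface between the two frames might look like. The `createFrameApi` function, the frame names and the message shapes are all assumptions; plain objects stand in for the frames so the example runs outside a browser.

```javascript
// Hypothetical API object shared by both frames. In a browser it might be
// exposed on the parent window; here plain objects stand in for the frames.
function createFrameApi() {
  const handlers = { application: [], module: [] };
  return {
    // Register a callback that receives messages addressed to `frame`.
    on(frame, callback) {
      handlers[frame].push(callback);
    },
    // Deliver a message to every callback registered for `frame`.
    send(frame, message) {
      handlers[frame].forEach((callback) => callback(message));
    },
  };
}

const api = createFrameApi();
const received = [];

// Hidden module frame: learns which field is waiting for input.
api.on("module", (message) =>
  received.push(`module got ${message.type}:${message.field}`)
);
// Application frame: receives recognized text to display.
api.on("application", (message) =>
  received.push(`application got ${message.type}:${message.value}`)
);

api.send("module", { type: "focus", field: "title" });
api.send("application", { type: "recognized", value: "Casablanca" });
console.log(received);
```

Because each frame only sees messages, neither needs to know how the other is implemented, which is the separation the paragraph above describes.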

This frame structure makes it possible to separate the service logic from the multimodal logic. The multimodal logic (provided by the hidden frame in which the external module or plugin runs) is associated with managing the man-machine interaction from either the graphical or the voice interface; that is, it manages the events coming from the voice or graphical interface and launches the appropriate actions in response. It is also responsible for resolving the problems raised by simultaneous interactions on both interfaces. The service logic (provided by the application frame) is associated with actually achieving the goal the user has in using the application, such as buying a film in a pay-per-view Digital TV service. The frame structure also allows the external module or plugin to keep the service that provides multimodality active while the user navigates between the different pages and contents of the main application. As a result, text-to-speech (TTS) conversions or speech recognition (SR), the conversion of voice into a format the system can understand, can be carried out while a new graphical interface is loading. This improves the user experience by avoiding any waiting, since the service that provides multimodality is always running in the background.
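
The text does not say how simultaneous interactions are resolved. One possible policy, shown here purely as an assumption, is "first event wins": once an event for a field has been accepted, an event from the other modality arriving within a short window is discarded. The `createArbiter` helper, the event fields and the 500 ms window are all invented for this sketch.

```javascript
// Hypothetical arbitration policy: the first event for a field wins, and an
// event from the other modality arriving within `windowMs` is discarded.
function createArbiter(windowMs = 500) {
  const lastAccepted = new Map(); // field -> { modality, time }
  return function accept(event) {
    const previous = lastAccepted.get(event.field);
    const conflicting =
      previous !== undefined &&
      previous.modality !== event.modality &&
      event.time - previous.time < windowMs;
    if (conflicting) return false;
    lastAccepted.set(event.field, {
      modality: event.modality,
      time: event.time,
    });
    return true;
  };
}

const accept = createArbiter(500);
const decisions = [
  accept({ field: "title", modality: "voice", time: 0 }),
  accept({ field: "title", modality: "remote", time: 200 }), // inside window
  accept({ field: "title", modality: "remote", time: 900 }), // outside window
];
console.log(decisions);
```

Other policies (for example, preferring the remote control, or asking the user to confirm) would fit the same interface.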

The plugin that provides multimodality requires communication with the user to be event-based, that is, some action must occur in response to which some processing takes place. For example, at some point in the dialogue the user can either press an accept button with the remote control or send the voice command "Accept" for recognition; in both cases the same code is executed, regardless of the interaction mechanism through which the event arrived. If the event arrives through the graphical interface, the plugin informs the SALT/VoiceXML interpreter that the button has been pressed, so that its state machine stays synchronized with the execution of the application, and at the same time the code associated with the key press is executed. If the user sends a voice command and it is recognized correctly, the SALT/VoiceXML interpreter notifies the plugin with a command, and the plugin generates the corresponding event, which causes the code associated with that event to be executed.
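
A minimal sketch of this "same code for both modalities" idea follows. The action table, the function names and the `interpreterNotifications` array (a stand-in for messages sent to the remote interpreter) are assumptions made for illustration.

```javascript
// One entry per logical action; both modalities end up running the same code.
const actions = new Map([["accept", () => "purchase confirmed"]]);

// Stand-in for notifications sent to the remote interpreter (hypothetical).
const interpreterNotifications = [];

function runAction(name) {
  const action = actions.get(name);
  return action ? action() : null; // unknown actions are ignored
}

// Graphical path: the plugin tells the interpreter about the key press so
// its state machine stays in sync, then runs the associated code.
function onRemoteKey(name) {
  interpreterNotifications.push(name);
  return runAction(name);
}

// Voice path: the interpreter has already recognized the command, so the
// plugin only raises the equivalent event.
function onVoiceRecognized(utterance) {
  return runAction(utterance.trim().toLowerCase());
}

const viaRemote = onRemoteKey("accept");
const viaVoice = onVoiceRecognized("Accept");
console.log(viaRemote === viaVoice, interpreterNotifications);
```

Only the graphical path needs to notify the interpreter, because on the voice path the interpreter is the component that reported the event in the first place.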

To manage all the events produced by both the user and the system, the JavaScript programming language is preferably used; it allows event handlers to be programmed that capture the actions produced in the system.

The browser used in Digital TV decoders interprets the JavaScript code embedded in web pages. In addition, the multimodal logic itself (the allowed interactions), determined by the SALT/XML tags, is coded in JavaScript. For these reasons, the multimodal external module requires JavaScript in order to communicate with the browser.

The multimodal external module also needs access to the decoder's audio hardware resources in order to control basic actions such as play, stop, etc.

Figure 1 shows the high-level architecture of an example system capable of carrying out the method of the invention:

• Decoder [100]:

• Browser [110] able to include an external module [120].

• Custom software able to incorporate modules that enable an IP connection with other systems, as well as any other communication protocol.

• SIP (Session Initiation Protocol) client [140]: responsible for creating a session between the decoder [100] and the MRCP (Media Resource Control Protocol) server [460].

• RTP (Real-time Transport Protocol) client [170].

• Audio recording [190] and audio playback [180].

• IP network connection [500].

• External servers:

• MRCP server [350] and [460].

• SALT/VoiceXML interpreter [300].

• Voice resources [400] (TTS or Text To Speech, SR or Speech Recognition).

Communication between the client (decoder) [100] and the resources of the MRCP server [460] takes place over SIP (Session Initiation Protocol) [610], which allows multimedia sessions to be established through an exchange of messages between the parties that wish to communicate. The decoder [100] implements a SIP module [140] that creates a session [610] by sending requests to the MRCP server. In the same message it also sends the characteristics of the session it wants to establish, such as the supported audio encoders/decoders, the addresses and ports on which it expects to receive the audio, transmission rates, etc., which are needed to carry out the speech synthesis and recognition processes. All these actions are coordinated by the SALT interpreter [300].
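
The patent does not name the format used to describe the session. SIP requests commonly carry an SDP body for this purpose, so the sketch below builds a minimal SDP description of one audio stream as an assumption. The `buildAudioOffer` function, the example address 192.0.2.10 and port 49170 are illustrative; a real MRCP session would need further lines (for instance for the control channel) that are omitted here.

```javascript
// Static RTP payload types for the two G.711 variants (RFC 3551).
const PAYLOAD_TYPES = { PCMU: 0, PCMA: 8 };

// Build a minimal SDP body describing one audio stream: where the decoder
// expects to receive audio and which codecs it supports.
function buildAudioOffer({ address, port, codecs }) {
  const types = codecs.map((codec) => PAYLOAD_TYPES[codec]);
  const lines = [
    "v=0",
    `o=- 0 0 IN IP4 ${address}`,
    "s=-",
    `c=IN IP4 ${address}`,
    "t=0 0",
    `m=audio ${port} RTP/AVP ${types.join(" ")}`,
    ...codecs.map(
      (codec) => `a=rtpmap:${PAYLOAD_TYPES[codec]} ${codec}/8000`
    ),
  ];
  return lines.join("\r\n") + "\r\n";
}

const offer = buildAudioOffer({
  address: "192.0.2.10", // documentation-only address
  port: 49170,
  codecs: ["PCMU", "PCMA"],
});
console.log(offer);
```

The `m=` line carries the port and codec list, and the `c=` line carries the address, which together cover the "codecs, addresses, ports" mentioned in the paragraph above.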

On the decoder side [100], an RTP (Real-time Transport Protocol) module [170] may be necessary, since the decoder may not support streaming playback over RTP. In that case a player [180] is needed that can send to the system loudspeaker the raw voice data collected and stored in real time in the buffer of the RTP client [170].
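
To make the role of the RTP client buffer more concrete, the sketch below parses the fixed 12-byte RTP header, stores the payload in a buffer and later hands the accumulated bytes to a player. The function names and the sample packet are assumptions; header extensions, packet reordering and sequence-number wraparound are deliberately ignored.

```javascript
// Parse the fixed 12-byte RTP header (RFC 3550) and expose the payload.
function parseRtpPacket(bytes) {
  if (bytes.length < 12) throw new Error("packet shorter than RTP header");
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const csrcCount = bytes[0] & 0x0f;
  const headerLength = 12 + 4 * csrcCount; // header extensions not handled
  return {
    version: bytes[0] >> 6,
    payloadType: bytes[1] & 0x7f,
    sequenceNumber: view.getUint16(2), // network byte order (big-endian)
    timestamp: view.getUint32(4),
    payload: bytes.subarray(headerLength),
  };
}

// Concatenate buffered payloads into one block for the audio player and
// empty the buffer.
function drain(buffer) {
  const total = buffer.reduce((sum, chunk) => sum + chunk.length, 0);
  const out = new Uint8Array(total);
  let offset = 0;
  for (const chunk of buffer) {
    out.set(chunk, offset);
    offset += chunk.length;
  }
  buffer.length = 0;
  return out;
}

// Hand-made packet: version 2, payload type 0 (PCMU), sequence number 1,
// timestamp 160, followed by two payload bytes.
const packet = Uint8Array.of(
  0x80, 0x00, 0x00, 0x01,
  0x00, 0x00, 0x00, 0xa0,
  0x12, 0x34, 0x56, 0x78,
  0xff, 0x7f
);
const parsed = parseRtpPacket(packet);
const clientBuffer = [parsed.payload];
const audioBlock = drain(clientBuffer);
console.log(parsed.sequenceNumber, audioBlock.length);
```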

Sending the audio [620] over the RTP channel to the RTP element [411] relies on an application [190] capable of recording and collecting raw voice data from the audio input device (for example, a microphone) and then creating a buffer with those data. Depending on the capabilities of the voice servers and the formats to be supported, different voice compressors/decompressors may be needed, for example PCMU-PCM (Pulse Code Modulation mu-law).

This compressor/decompressor application [180][190] implements the client/server paradigm, the object of the transaction being the voice data carried over RTP [620]. The application consists of two main modules:

• RTP player [180]: decompresses and converts the sound arriving from the RTP channel from PCMU (or PCMA) format to PCM, and finally plays it.

• RTP recorder [190]: reads data from the audio input device, compresses/converts it from PCM to PCMU (or PCMA), and finally sends it over the RTP channel [620].
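
The PCM/PCMU conversion performed by these two modules can be sketched with the widely used table-free G.711 mu-law routines. This is a generic illustration rather than the patent's own implementation; the function names are chosen here, and PCMA (A-law) is not covered.

```javascript
const BIAS = 0x84; // 132, added before the segment search
const CLIP = 32635; // largest magnitude that still fits after adding BIAS

// Recorder side: compress one signed 16-bit PCM sample to a mu-law byte.
function linearToMuLaw(sample) {
  const sign = sample < 0 ? 0x80 : 0x00;
  const magnitude = Math.min(Math.abs(sample), CLIP) + BIAS;
  let exponent = 7;
  for (let mask = 0x4000; exponent > 0 && (magnitude & mask) === 0; mask >>= 1) {
    exponent -= 1;
  }
  const mantissa = (magnitude >> (exponent + 3)) & 0x0f;
  return ~(sign | (exponent << 4) | mantissa) & 0xff;
}

// Player side: expand one mu-law byte back to a signed 16-bit PCM sample.
function muLawToLinear(byte) {
  const value = ~byte & 0xff;
  const sign = value & 0x80;
  const exponent = (value >> 4) & 0x07;
  const mantissa = value & 0x0f;
  const magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS;
  return sign ? -magnitude : magnitude;
}

// Frame helpers operating on typed arrays.
function encodeFrame(pcm) {
  return Uint8Array.from(pcm, (sample) => linearToMuLaw(sample));
}
function decodeFrame(bytes) {
  return Int16Array.from(bytes, (byte) => muLawToLinear(byte));
}

const pcm = Int16Array.of(0, 1000, -1000, 32767);
const encoded = encodeFrame(pcm);
const decoded = decodeFrame(encoded);
console.log(encoded, decoded);
```

The round trip is lossy: each sample is halved in size, and the decoded values are close to, but not always equal to, the originals.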

All these components are in turn coordinated by the SALT interpreter [300]. The most common actions performed with them are PLAY, STOP, PAUSE and REC.

The invention can be applied to almost all interactive services that run on Digital TV decoders meeting the characteristics mentioned above. Examples of interactive services include:
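
One way to picture how the PLAY, STOP, PAUSE and REC actions could drive the audio components is a small state machine. The states and the transition table below are assumptions made for this sketch; the patent only names the four actions.

```javascript
// Hypothetical transition table: state -> command -> next state.
const TRANSITIONS = {
  idle: { PLAY: "playing", REC: "recording" },
  playing: { PAUSE: "paused", STOP: "idle" },
  paused: { PLAY: "playing", STOP: "idle" },
  recording: { STOP: "idle" },
};

function createAudioController() {
  let state = "idle";
  return {
    get state() {
      return state;
    },
    // Apply a command; commands that are not valid in the current state
    // leave the state unchanged.
    handle(command) {
      const next = TRANSITIONS[state][command];
      if (next !== undefined) state = next;
      return state;
    },
  };
}

const controller = createAudioController();
const trace = ["PLAY", "PAUSE", "REC", "PLAY", "STOP", "REC"].map((command) =>
  controller.handle(command)
);
console.log(trace);
```

In this sketch a REC command received while playback is paused is simply ignored; a real system might instead queue it or report an error to the interpreter.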

• Control of and navigation through EPGs (Electronic Program Guides), ESGs (Electronic Service Guides) or UEGs (Unified Electronic Guides).

• Control of and navigation through VoD (Video on Demand) and CoD (Content on Demand) functionality.

• Control of and navigation through electronic banking (home banking) services, electronic buying and selling, access to product catalogs, etc.

• Control of and navigation through the functionality offered by electronic messaging applications, and Internet browsing using browsers.

Claims

1. Multimodal interaction method on interactive applications of Digital TV, where the television is provided with a network decoder (100) that incorporates an associated browser (110), characterized by the following steps:

b. Sending, by the browser, of the tags that characterize the multimodal application to an interpreter (300) residing on a network server.

2. Method according to claim 1, characterized in that in step a. the events are graphical and/or voice events.

3. Method according to claim 2 characterized in that an external module (120) is associated with the browser (110) with the function of transferring the descriptive tags of the voice dialogue to the interpreter of said tags (300) by means of an IP protocol.

4. Method according to claim 3, characterized in that the interpreter (300) of the descriptive tags of the voice dialogue coordinates and controls all the voice events.

5. Method according to claim 4, characterized in that the interpreter (300) of the descriptive tags of the voice dialogue communicates with one or more servers that provide voice resources by means of the MRCP protocol.

6. Method according to claim 5, characterized in that the interpreter (300) of the descriptive tags of the voice dialogue analyzes the structure of the multimodal application and sends the corresponding commands to the voice server that complies with the MRCP protocol.

7. Method according to claim 6, characterized in that the interpreter (300) of the descriptive tags of the voice dialogue communicates with the external module (120) associated with the decoder's browser and transfers the data it needs to establish a session via SIP with the MRCP voice server (460).

8. Method according to claim 7, characterized in that the decoder receives and sends the voice data to the MRCP server (460) by means of the RTP protocol.

9. Method according to claim 8, characterized in that the external module (120) associated with the browser establishes a communication with an RTP client (170), thereby obtaining the communication status between the decoder and the MRCP voice server (460).

10. Method according to any of claims 5-9, characterized in that the decoder (100) has an application (190) capable of collecting data from any external device that captures audio data and of sending it via an IP connection to the voice servers.

11. Method according to claim 10, characterized in that said application is capable of compressing said audio data to a format compatible with an MRCP server and sending it through the RTP protocol to the voice server (400).

12. Method according to claim 11, characterized in that the decoder has an application (180) capable of collecting audio data from the RTP channel, decompressing them to a format reproducible by the decoder and sending them to an existing electronic device in charge of audio generation.

13. Method according to any of claims 3-12 characterized in that the communication between the browser (110) existing in the decoder and the external module (120) is carried out through an application programming interface.

14. Method according to any of the preceding claims, characterized in that the multimodal applications executed in the browser (110) are preprocessed, separating the multimodal logic from the service logic before being shown to the user.

15. System capable of carrying out any of the methods of claims 1 to 14.

16. Use of the system of claim 15 in a pay-per-view digital television service.