WO2004080072A1

WO2004080072A1 - System for the dynamic sub-titling of television and radio broadcasts

Info

Publication number: WO2004080072A1
Application number: PCT/FR2004/000175
Authority: WO
Inventors: Ghislain Moncomble
Original assignee: France Telecom
Priority date: 2003-02-04
Filing date: 2004-01-27
Publication date: 2004-09-16
Also published as: FR2850821B1; FR2850821A1

Abstract

The invention relates to a system for automatic sub-titling of an audio signal in real time. Display parameters (PAF) fixed by the user of the equipment (EQm) are stored. A linguistic converter (CL) converts the audio signal (SAV) into a sub-title signal (ST), the audio signal being temporarily buffered during the conversion. A sub-title generator (GS) combines the audio signal which has been temporarily buffered and the sub-title signal to give a sub-titled audio signal (SAVST) suitable for the equipment (EQm), with sub-titles formatted according to the display parameters (PAF).

Description

Dynamic captioning system for television and radio signals

The present invention relates to a system for dynamically captioning television and radio signals.

The adaptation of television programs to deaf and hard of hearing people or people of foreign languages has already been known for several years but is not sufficient. Currently, the volume of hours subtitled by all French television channels represents a proportion of approximately 12% of the total hours of broadcast programs. Even if the television channels offer many more hours of subtitling than the quota imposed in their specifications, they do not meet demand and the 12% subtitled remain far below neighboring countries like Germany or Switzerland.

The main problem encountered is the cost of subtitling. Currently, the average cost of subtitling is around 25 euros per minute, or 1500 euros per hour. The additional cost of subtitling is directly attributable to the channels and represents up to 2% of the budget of a television program.

Traditional subtitling imposes a so-called detection phase during which an operator watches the program, transcribes the speech into text, and marks time codes at the start and end of each subtitling area. Then an editing phase produces a copy of the initial video signal (video master) with subtitles correctly positioned according to the time marks.

There is a lack of captioning of live programs among closed captioned television programs, due to technical difficulties in making this captioning very quickly. Indeed the technique described above is not applicable in real time due to the numerous manipulations. A stenotyping technique with real-time computer transcription was then implemented. The temporal marks of the image to which the speech relates are memorized in correspondence with the stenographic signs entered. The text transcribed by the computer is thus indexed to the image as soon as it is captured, and not during the detection phase, an extremely long and tedious phase. Shorthand virtually eliminates the editing phase, since the subtitles are already indexed to the time stamps. The transcription into text of the shorthand signs between two time marks takes approximately 3 seconds. All time stamps are offset by approximately 2 seconds so that the subtitles are optimally synchronized. One of the strengths of shorthand is the production of live subtitles using an overlay module which broadcasts them in real time.

However, captioning by live shorthand requires a very fast typing speed, at a rate of more than 220 words per minute, and a very high typing quality. In addition, the cost of transcription charged to the television channel remains high. In parallel, a teletext decoder generally incorporated in televisions appeared in order to activate subtitling remotely with better readability by printing clear subtitles on a black strip, a position of the subtitle varying according to the speaker, different colors for voices ("off") external to the image and for descriptions of the soundscape, setting the text to the rhythm of the images, etc. Subtitling is carried directly in the television signal on at least two raster lines provided for this purpose.

With the cultural mix accentuated by Europe, the French-speaking population who do not speak the spoken language needs written support. Linguistic subtitling, not including classic subtitling for films in original version, is not possible in any language due to the limited number of users. The cost of subtitling would be prohibitive compared to the number of users. Virtual subtitling responds to this problem, but within a very specific framework, that of films projected on screen. The virtual subtitling presented to the public is based on a copy of a subtitled film by means of a system generating subtitles by microcomputer and projecting them with a video projector synchronized with the film projector. This system avoids burning the copy and offers a reduction in cost, better flexibility for a change of subtitle corresponding for example to a change of language, and great freedom in the position of the subtitle, on, below or above the image. But this system remains confined to this precise framework. These techniques are based either on a preparation of the program before it is broadcast, or on an intervention during the broadcasting of the program but always with the help of rapid and costly human action.

US Patent 5,815,196 discloses a method and a device for continuously producing subtitles, translated into a target language, for an input audio/video signal during a videophone communication, but without the user being able to interact directly with the subtitles produced.

None of these techniques offers the viewer or the user any real control over subtitling.

The objective of the present invention is to subtitle an audio signal automatically, in particular from television or radio, in real time, while offering the user customization of the subtitling.

To achieve this objective, a system for dynamically subtitling an audio signal continuously received by receiving equipment, comprising means for converting the received audio signal into a subtitling signal including subtitles, and means for combining the audio signal and the subtitling signal, is characterized in that it comprises

- means for memorizing display parameters determined beforehand by a user of the equipment, and a buffer means for temporarily storing the received audio signal as an audio signal delayed by the conversion time in the means for converting,

- and in that the combining means combines the delayed audio signal and the subtitling signal into a subtitled audio signal with subtitles formatted according to the display parameters, in order to apply the subtitled audio signal with formatted subtitles to the equipment.

Another advantage of the invention is to allow the user to personalize the subtitling in real time, since the subtitles are formatted according to the display parameters during dynamic subtitling. When the audio signal already includes subtitling, the system may include means for detecting a subtitling signal in the audio signal so that the means for combining formats the subtitles of the detected subtitling signal according to the display parameters.
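The arrangement characterized above can be pictured with a minimal sketch. All names here (`DelayBuffer`, `DisplayParams`, `combine`) are hypothetical and are not taken from the patent; the sketch only illustrates delaying the received signal by the conversion time before merging it with subtitles formatted according to the user's display parameters.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class DisplayParams:
    """Hypothetical stand-in for the display parameters (PAF)."""
    position: str = "bottom"
    uppercase: bool = False


class DelayBuffer:
    """Delays chunks of the received signal by a fixed number of chunks,
    standing in for the conversion time of the converting means."""

    def __init__(self, delay_chunks: int):
        self.delay_chunks = delay_chunks
        self._queue = deque()

    def push(self, chunk):
        """Store a chunk; return the delayed chunk once the delay has elapsed, else None."""
        self._queue.append(chunk)
        if len(self._queue) > self.delay_chunks:
            return self._queue.popleft()
        return None


def combine(delayed_chunk, subtitle: str, params: DisplayParams) -> dict:
    """Pair a delayed chunk with a subtitle formatted per the display parameters."""
    text = subtitle.upper() if params.uppercase else subtitle
    return {"audio": delayed_chunk, "subtitle": text, "position": params.position}
```

With a delay of two chunks, the first chunk comes back out of the buffer when the third is pushed, at which point its subtitle is assumed to be ready for combination.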

The invention also offers the user the possibility of displaying the subtitling generated by the means for converting, or detected in the audio signal, in a language chosen by the user. In this case, the means for storing memorizes an identifier defining a language determined beforehand by the user of the equipment. The system then preferably comprises means for determining an identifier of a language of the detected subtitling signal, means for comparing the stored language identifier with the language identifier of the subtitling signal, and at least one means for translating the subtitles of the subtitling signal into subtitles of the language determined beforehand when the language identifiers are different, in order to apply the subtitles of the determined language, in the form of the subtitling signal, to the means for combining.

According to a preferred embodiment of the invention, the means for converting may comprise means for filtering the continuous audio signal into a voice signal and a noisy signal, means for analyzing the voice signal in order to produce voice parameters, voice recognition means converting the voice signal into a text signal, means for segmenting the voice signal into periodic temporal text segments, means for determining a context of each text segment as a function of averages of the voice parameters over the duration of the text segment and as a function of the text segment, so that the contexts are involved in the conversion of the voice signal into the text signal performed by the voice recognition means, and means for aggregating the text segments into a subtitling signal. The system may also include means for determining a language of the current segment of the voice signal so that the means for converting dynamically determines the subtitling signal according to the determined language.

According to another embodiment, the system of the invention can also be used to subtitle an audio video signal. In this embodiment, the system may include means for extracting the audio signal from an audio video signal which is received by the system and the equipment and which is applied to the converting means and the buffer means in place of the audio signal. Other characteristics and advantages of the present invention will appear more clearly on reading the following description of several preferred embodiments of the invention with reference to the corresponding appended drawings in which:

- Figure 1 is a schematic block diagram of a subtitling system according to a first embodiment of the invention, in the environment of a user terminal installation comprising several items of receiving equipment, and of several subtitling servers;

- Figure 2 is an algorithm of steps executed by the subtitling system according to the first embodiment for subtitling an audio video signal; and

- Figure 3 is a schematic block diagram of a preferred embodiment of a language converter included in the subtitling system according to the invention.

In the following, the term "channel" denotes both a transmission channel for broadcasting a sound broadcasting program or a television program, and the program company broadcasting said program. The term "program" designates a succession of sound or television broadcasts, also called magazines, broadcast by a specific channel.

With reference to FIG. 1, the subtitling system according to a first embodiment of the invention essentially comprises a user terminal installation IT and a subtitling server STT, or more generally several subtitling servers. The user terminal installation IT includes M items of receiving equipment EQ1, ... EQm, ... EQM with 1 ≤ m ≤ M. For example, one item of equipment EQ1 is a sound broadcasting receiver fitted with a display and able to selectively receive broadcasts from several sound broadcasting stations. Another item of equipment EQm is a personal computer (PC), for example connected to a packet network of the Internet type, or connected to a cable network for distributing television and/or sound broadcasting programs. A last item of equipment EQM is a television receiver which is, for example, provided with means for receiving television signals of predetermined television programs and equipped with one or more decoders for receiving programs transmitted via a satellite and/or via a cable distribution network.

The equipment EQ1 to EQM is controlled via a distributed bus BU by a central processing unit UCit in the installation IT. As a variant, all or part of the bus BU can be replaced by a short-range radio link of the Bluetooth type or according to the 802.11b standard. The central unit UCit essentially comprises a microcontroller connected to various peripherals such as a buffer memory Mit, a subtitling generator GS, a communication interface IC and optionally a keyboard and a screen. The central unit, the buffer memory, the subtitling generator and the communication interface are physically included in a housing independent of the equipment. Alternatively, the central unit UCit with its peripherals is integrated into the computer, the sound broadcasting receiver or the television receiver EQm. The central unit UCit constitutes a basic module which can serve various home automation equipment such as that illustrated in FIG. 1 as well as one or more mobile telephones and radiotelephones, an alarm center, etc.

The communication interface IC is adapted to a telecommunications link LT connected to an access network RA of the installation IT. The link LT and the network RA can conventionally be a telephone line and the switched telephone network PSTN, itself connected to a high speed packet transmission network RP of the Internet type. According to other variants, the telecommunications link LT is an xDSL line (Digital Subscriber Line) or an ISDN line (Integrated Services Digital Network) connected to the corresponding access network. The link LT can also coincide with one of the links serving one of the items of equipment EQm through one of the distribution networks RD defined below.

According to another variant, the terminal installation IT can be organized around a DVB-MHP platform (Digital Video Broadcasting - Multimedia Home Platform) for which the telecommunications link LT is asymmetrical, with a low-speed return path to the access network RA.

Figure 1 also schematically shows the telecommunications system surrounding the user terminal installation IT. In particular, the references RD and TR designate respectively one or more distribution networks for scheduled sound and television broadcasting programs and one or more head ends broadcasting programs and managed by various television and sound broadcasting program companies. The set of distribution networks RD includes in particular analog and/or digital broadcasting networks for broadcasting programs capable of being received by the radio receiver EQ1, and terrestrial, cable, wireless (radio) and satellite networks, in analog and digital modes, for broadcasting television programs, and possibly sound broadcasting, capable of being received by the television receiver EQM. The set of distribution networks RD also includes the Internet, through which the computer EQm is capable of receiving radio and/or television programs broadcast by certain program companies.

Each closed captioning server STT is connected to the program distribution network RD and to the terminal installation of the user IT via the packet network RP and the access network RA. According to another variant, the functionalities of the closed captioning server STT are located in a headend TR, or more generally, the server STT is connected to the broadcast distribution networks RD. In this case, subtitling is carried out at least in part before broadcasting.

The scheduled programs, except the live ones, are subtitled slightly in advance, at least a few minutes before their broadcast, which leaves almost no time lag. Indeed, as explained below, the processing of an audio video signal by the subtitling system takes a certain time, which generates a relatively small delay between the signal SAV entering the system and the subtitled signal SAVST leaving the system. When subtitling starts during the display of a continuous audio video signal, the delay due to subtitling is filled by the continuous audio video signal, which is then duplicated, but with subtitles, once subtitling starts, or by a message of the "subtitling in progress" type, or by any other predetermined audio/video sequence.

The STT server comprises a central processing unit UCs and a set of peripherals including at least one database BD, a linguistic converter CL described in detail below and a video analyzer AV. Many variants of the hardware distribution of the components of the user terminal installation IT and of the subtitling server STT can be deduced from the embodiment of the invention illustrated in FIG. 1.

According to a first architecture variant called "thin client / heavy server", the buffer memory Mit and the generator GS are included in the STT server in order to simplify the user's installation, and part of the processing carried out by the central processing unit UCit is then executed in the central unit UCs of the STT server.

According to a second architecture variant called "heavy client / thin server", the linguistic converter CL, the video analyzer AV and the database BD are installed in the user installation IT, and the processing which was carried out by the central unit UCs is then executed in the central unit UCit. Other variants, intermediate between the thin client / heavy server architecture and the heavy client / thin server architecture, such as that of the preferred embodiment presented in FIG. 1, are conceivable.

According to another embodiment, all of the processing described below is executed upstream of the broadcasting of the programs, in a head end TR. In this case, the user's terminal installation is essentially reduced to the equipment EQ1 to EQM.

The term "subtitling parameters" means activation parameters PAC, display parameters PAF and a language identifier IL. The activation parameters characterize an activation period of the subtitling system according to the invention as a function of start and end dates and times and/or of the type of program. The activation parameters PAC refer, among other things, to the program schedules of a channel. The display parameters PAF characterize the display of the subtitles on the display included in the user's receiving equipment, such as positioning, font type, colors allocated to the different speakers, display as continuously scrolling text or as static sentences, etc. The language identifier IL defines a subtitle language.
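The three groups of parameters can be sketched as plain records. The field names, the default values and the rule combining the time window with the program type are assumptions made for illustration; the description only lists the kinds of information each group carries.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class ActivationParams:
    """Activation parameters (PAC): a time window and optional program types."""
    start: datetime
    end: datetime
    program_types: set = field(default_factory=set)  # empty set: any program type

    def is_active(self, now: datetime, program_type: str) -> bool:
        # Assumption: both the time window and the program type must match.
        in_window = self.start <= now <= self.end
        type_ok = not self.program_types or program_type in self.program_types
        return in_window and type_ok


@dataclass
class DisplayParams:
    """Display parameters (PAF); the fields are illustrative examples."""
    position: str = "bottom"
    font: str = "sans-serif"
    speaker_colors: dict = field(default_factory=dict)
    scrolling: bool = False


@dataclass
class SubtitlingParams:
    """The full set stored per user: PAC, PAF and the language identifier IL."""
    pac: ActivationParams
    paf: DisplayParams
    il: str  # language identifier, e.g. "fr" (format assumed)
```

A record of this kind would be read for a given user identifier before subtitling starts, and `is_active` consulted to decide whether the system should run at a given moment.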

In another embodiment of the invention, a preference program is used to configure the user's subtitling preferences and store them in the database BD, in order to establish and store the parameters PAC, PAF and IL and to modify them if desired. The preference program is executed by the STT server via the packet network RP, or directly by the central unit UCit of the terminal installation IT when the database BD is included in the installation IT.

For example, the preference program presents a complete list of the user's equipment EQ1 to EQM via a display in the installation IT so that the user selects the equipment for which he wishes to modify the subtitling parameters, when the identifiers of several of the user's items of equipment were registered during his subscription. Default subtitling parameters can be proposed to the user, or the current parameters if the user has already selected or modified them. A first page invites the user to enter activation parameters PAC, programmable by the user according to dates and times or directly according to programs chosen from a program schedule. Each time the user validates an entry page, the entered parameter values are sent to the STT server for storage in the database BD, or stored directly in the database BD of the terminal installation for the "heavy client / thin server" architecture. The same is true for the display parameters PAF and the language identifier IL. If the terminal installation IT does not have human-machine interface means such as a mouse or keyboard, the parameters corresponding to the user's preferences are selected by default. If the subtitling of the invention is carried out in a head end TR and the terminal installation IT is essentially reduced to the equipment EQ1 to EQM, the parameters are modified by the user via any other means, for example by a telephone or radiotelephone terminal, or by an operator when subscribing to the subtitling service according to the invention.

FIG. 2 shows an algorithm of steps E1 to E11 executed by the subtitling system according to the first embodiment to subtitle an audio video signal transmitted by the distribution network RD to one item EQm of the receiving equipment of the installation IT.

In step E1, the user U of the installation IT powers it up and selects an item of equipment EQm in order to globally activate the subtitling system of the invention. For example, a predetermined key press on a remote control of the selected equipment EQm, when this equipment contains the central unit UCit, or setting a button on the housing containing the central unit UCit to the on position, powers up the unit UCit. The unit then reads from memory and automatically transmits an identifier IU of the user U and an identifier IEQm of the equipment EQm selected by the user U to the server STT. Powering up the central unit UCit empties the buffer memory Mit.

The server STT identifies the user U who has subscribed to the subtitling service by comparing the received identifier IU with the identifiers of the subscribed users in the database BD, in step E2. In a variant, the STT server asks the user to enter, in the installation IT, the identifier IU and a password assigned to him when subscribing to the service, in order to transmit the identifier and the password to the STT server for verification. Then in step E2, the central unit UCs reads the subtitling parameters PAC, PAF and IL from the database BD in correspondence with the user identifier IU in order to analyze them according to the following steps, with a view to producing the subtitles in the selected equipment EQm for the selected channel. The activation parameters PAC are taken into account by the central unit UCs, so that the generator GS and the converter CL, or more generally the system, are only active during the activation period determined by the parameters PAC.

After identifying the user in step E2, the central unit UCs in the STT server invites the user to select a channel in the equipment EQm, which then transmits an identifier ICH of the selected channel to the STT server via the unit UCit, in step E3.

As a variant, the equipment EQm and the channel of the audio video signal to be subtitled have been preselected by the user U, in particular when subscribing to the subtitling service, and the identifiers IEQm and ICH have been registered in correspondence with the identifier IU of the user U in the database BD. In this variant, the equipment EQm is simply powered up awaiting subtitling.

In the next step E4, the audio video signal SAV of the selected channel received by the selected equipment is temporarily stored in the buffer memory Mit as a delayed audio video signal SAVR. Like any audio video signal, the signal SAV includes periodic time marks such as frame alignment words, packet synchronization words, video frame or line synchronization signals, etc. These time marks are counted modulo a predetermined number and stored in the buffer memory Mit in response to the selection of the identifier ICH of the channel by the user. The unit UCit then transmits a determined synchronization time mark to the server STT so that the latter begins subtitling the selected channel for the user U in response to the synchronization time mark. The storage duration of the signal SAV depends on the time taken by the system to subtitle the signal SAV, including the routing time of the messages exchanged between the terminal installation IT of the user U and the subtitling server STT.

In parallel, the central unit UCs of the server STT selects the channel designated by the received identifier ICH from among all the channels available at the server, in step E5.
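Step E4 can be sketched as a buffer that tags each stored chunk with a time mark counted modulo a fixed number. The class and method names are hypothetical, and the sketch assumes the modulus is larger than the number of chunks held at any time, so that a mark identifies a single buffered chunk.

```python
from collections import deque


class TimeMarkedBuffer:
    """Buffer in the spirit of the memory Mit: chunks stored with modulo time marks."""

    def __init__(self, modulo: int):
        self.modulo = modulo
        self._count = 0
        self._items = deque()

    def store(self, chunk) -> int:
        """Store a chunk and return its time mark, counted modulo `modulo`."""
        mark = self._count
        self._items.append((mark, chunk))
        self._count = (self._count + 1) % self.modulo
        return mark

    def read(self, mark: int):
        """Discard older chunks and return the chunk carrying `mark`."""
        while self._items:
            stored_mark, chunk = self._items.popleft()
            if stored_mark == mark:
                return chunk
        raise KeyError(mark)
```

The mark returned by the first `store` call plays the role of the synchronization time mark sent to the server; subtitles coming back with a given mark are matched to the buffered chunk by calling `read`.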

Alternatively, the central unit checks whether the audio video signal SAV identified by the channel identifier ICH is already being subtitled by the STT server and whether the subtitling parameters of the current subtitling match the parameters PAC and IL selected by the user. When the parameters match, the subtitling continues in step E8; otherwise the processing of the signal SAV continues in step E6.

In step E6, the central unit UCs triggers the processing of the signal SAV of the selected channel in response to the synchronization time mark received with the parameters IU, IEQm and ICH. From the synchronization time mark onward, the following time marks in the signal SAV are detected and included in the signal by the central unit UCs. The central unit UCs processes the signal SAV so that the video analyzer AV detects any subtitling in the signal SAV. When the signal SAV already has subtitling, the video analyzer AV extracts the subtitles ST from the signal SAV and a language determination unit 8 (FIG. 3) of the linguistic converter CL determines the identifier IL of the language of the subtitling in step E61. The central unit UCs compares it with the language identifier IL determined beforehand by the user and read from the database BD, in step E62. If the language identifiers are identical, the STT server continues the process with the subsequent step E8.

For example, if the subtitling is not separated from the signal SAV or if it is not automatically recoverable, as for an MPEG4 audio video signal with descriptive marking via the SMIL language (Synchronized Multimedia Integration Language), the video analyzer AV detects subtitling by optical character recognition (OCR). The time required for image analysis by this shape recognition is not penalizing, for the following reasons. Since subtitles are very often positioned in a lower portion of an image, the area to analyze is considerably limited. To be visible to the user, the subtitles are in large type, generally with good contrast against the image. They are therefore simple to recognize, which limits the processing power required for optical character recognition and therefore its duration. A minimum perception time means that the subtitling changes on average approximately every five seconds, and no more often than approximately every three seconds. The video analyzer AV thus analyzes only a lower portion (the lower fifth) of the images, once per minimum period of three seconds.
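The saving described above comes from two restrictions: only the lower fifth of each image is examined, and only one image per three-second period. A minimal sketch of both follows, with frames modeled as lists of pixel rows and with hypothetical function names; the character recognition itself is out of scope here.

```python
def lower_fifth(frame):
    """Return the bottom fifth of a frame given as a list of pixel rows."""
    height = len(frame)
    start = height - max(1, height // 5)
    return frame[start:]


def frames_to_analyze(frame_rate: float, duration_s: float, period_s: float = 3.0):
    """Indices of the frames sampled once per `period_s` seconds."""
    step = max(1, round(frame_rate * period_s))
    total = int(frame_rate * duration_s)
    return list(range(0, total, step))
```

At 25 images per second, for instance, this selects one image in 75, and a fifth of each selected image, so the recognizer sees well under one percent of the pixel data.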

Otherwise, in step E62, when the language identifier of the subtitling in the signal SAV is not identical to the language identifier IL determined by the user, a translation module 41 (FIG. 3) included in the linguistic converter CL translates the subtitles extracted from the signal SAV into subtitles in the language designated by the user's language identifier IL, in step E63, which is followed by step E8.
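The decision made in steps E62 and E63 can be sketched as follows. The function names and the toy dictionary are hypothetical; `toy_translate` merely stands in for the translation module 41 and is not a real translation method.

```python
def subtitles_for_user(subtitles, detected_lang, user_lang, translate):
    """Return subtitles in the user's language (steps E62 and E63)."""
    if detected_lang == user_lang:
        # Identical identifiers: the extracted subtitles are used as they are.
        return list(subtitles)
    # Different identifiers: each subtitle goes through the translation module.
    return [translate(text, detected_lang, user_lang) for text in subtitles]


# Toy word table standing in for translation module 41 (illustrative only).
TOY_DICTIONARY = {("en", "fr"): {"hello": "bonjour", "goodbye": "au revoir"}}


def toy_translate(text, source, target):
    """Look the text up in the toy table; return it unchanged when unknown."""
    return TOY_DICTIONARY.get((source, target), {}).get(text.lower(), text)
```

Passing the translator in as a parameter keeps the comparison logic independent of whichever translation engine is actually used.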

Returning to step E6, when the video analyzer AV does not detect any subtitle in the signal SAV, the linguistic converter CL dynamically determines the subtitling ST of the signal SAV as a function of the audio signal SA it contains and of the language used in this audio signal, and translates the subtitling into the language defined by the user's language identifier IL, as described in more detail below with reference to FIG. 3.

The subtitling signal ST, comprising the subtitles deduced from the corresponding signal SAV, and the display parameters PAF, as well as the time marks previously detected in the signal SAV and delayed by the subtitling operation, are sent continuously by the STT server to the terminal installation IT as the signal SAV is progressively processed, in step E8.

All the processing steps up to step E8 have caused a delay necessary for the execution of the processing in the STT server.

In step E9, the subtitling generator GS in the terminal installation IT synchronizes, as a function of the time marks, and combines the subtitling signal ST received by the installation IT with the delayed audio video signal SAVR of the selected channel ICH read from the buffer memory Mit, that is to say the subtitles with the dialogue in the audio signal of the signal SAV, in order to produce a subtitled audio video signal SAVST.
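The synchronization of step E9 amounts to matching each delayed frame with the subtitle whose time marks enclose it. The sketch below uses hypothetical names and assumes, for simplicity, that time marks increase monotonically within the window considered (no modulo wrap-around).

```python
def synchronize(delayed_frames, subtitles):
    """Attach to each delayed frame the subtitle covering its time mark.

    delayed_frames: list of (time_mark, frame) read from the buffer.
    subtitles: list of (start_mark, end_mark, text), both marks inclusive.
    Returns a list of (time_mark, frame, text or None).
    """
    combined = []
    for mark, frame in delayed_frames:
        text = next((t for start, end, t in subtitles if start <= mark <= end), None)
        combined.append((mark, frame, text))
    return combined
```

Frames that fall outside every subtitle interval are passed through with no text, which corresponds to stretches of the program without dialogue.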

The subtitling generator uses known techniques for speech alignment and for detecting camera shot changes in the signal SAVR. If a subtitle is present when the shot changes, the user tends to look at the image and then come back to the text. The user then loses his place in the current subtitle and resumes reading at the beginning of the same subtitle, at the risk of not reading it in full. The generator GS ensures that no subtitle is disturbed by a shot change.

Then in step E10, the subtitling generator GS dynamically generates the subtitled audio video signal SAVST according to the display parameters PAF read from the database BD and received by the central unit UCit of the terminal installation in step E8. The display parameters PAF are transmitted by the STT server so that the generator GS receives any modifications to these parameters as quickly as possible and adapts the subtitling accordingly during system operation. The subtitled audio video signal SAVST, with the subtitles embedded in the images of the initial signal SAV, is displayed in step E11 by the display of the selected receiving equipment EQm of the user U, with a delay relative to the initially received signal SAV. The combination of the signals SAVR and ST in the generator GS, as well as in particular the conversion in the converter CL, ends at the expiration of the activation period determined by the activation parameters PAC and monitored by the central unit UCs.

If the signal SAV already includes subtitling (step E6, yes), the subtitling generator GS affixes the new texts, deduced from a translation and/or from formatting according to the display parameters PAF, in place of the existing subtitles. In the other cases, the subtitling is positioned in the lower part of the images. The generator GS determines a display duration for each subtitle as a function of the length of the subtitle to be displayed and of an average reading time. This display duration is at least equal to approximately three seconds and can extend significantly in one direction or the other with respect to the recognized sentences.
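The display duration rule can be written as a one-line formula: the reading time for the text, bounded below by about three seconds. The reading speed of 15 characters per second used below is an assumed illustrative value, not a figure from the description.

```python
def display_duration(text: str, chars_per_second: float = 15.0, minimum_s: float = 3.0) -> float:
    """Display time for a subtitle: reading time, never below the minimum.

    `chars_per_second` is an assumed average reading speed.
    """
    return max(minimum_s, len(text) / chars_per_second)
```

Short subtitles are therefore held for the three-second floor, while longer ones stay on screen in proportion to their length.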

The closed captioning server STT includes a linguistic converter CL, the operation of which is described below with reference to FIG. 3.

The linguistic converter according to the invention comprises an audio extractor 1, an audio filter 2, a voice analyzer 3, a voice recognition module 4, a translation module 41, a segmentation unit 51, a segment context determination unit 5, a contextual database 45, a general context determination unit 6, an audio comparator 7, an audio database 71, and a language determination unit 8.

In the following, the term "context" designates a list of key words or expressions and their equivalents. Each key word or expression characterizes a context that can be addressed in any multimedia document. Certain contexts are combinations of contexts or, in the case of current or regional contexts, combinations of contexts specified by a proper name, such as for example: Brittany Weather, Afghanistan War, etc.

A continuous audio signal SA of indefinite duration is extracted from the audio video signal SAV in the audio extractor 1, adapted to the standard relating to the signal SAV, and is applied to the audio filter 2. It will be assumed that the audio signal SA received by the server STT is digital; otherwise, the audio signal received is analog and is converted by an analog-digital converter included in the audio filter 2.

The unit 12 further comprises a buffer memory continuously storing the audio signal SA for a duration greater than a predetermined duration DS of segments of the audio signal. In practice, the capacity of the buffer memory is such that it records at most a portion of the audio signal SA having a duration approximately at least ten times greater than the duration DS of the segments. The unit 12 segments the audio signal SA into periodic temporal segments ..., S_n, ... as the audio signal is received. The predetermined duration DS of the audio signal segments depends on the desired trade-off between the quality of the conversion and the processing time of the segments of the signal SA by the converter CL. A minimum duration of 15 seconds is typically sufficient for the converter to ensure minimum quality. In another preferred embodiment of the invention, the segmentation is not based on a temporal characteristic but depends on a syntactic element such as a word, a group of words or a sentence. A syntactic element is for example defined by a sound level above a predetermined threshold and framed by intervals of the audio signal having a sound level below the predetermined threshold and considered as silences.
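The silence-based definition of a syntactic element can be illustrated with a simple threshold scan. This is a minimal sketch with a hypothetical function name; it works on a sequence of per-frame sound levels and, as a simplifying assumption, also accepts a run that touches the start or end of the sequence even though it is not framed by silence on both sides.

```python
def split_on_silence(levels, threshold):
    """Return (start, end) index pairs, end exclusive, of runs above the threshold.

    levels: sequence of sound levels, one per analysis frame.
    Frames at or below `threshold` are treated as silence.
    """
    segments = []
    start = None
    for i, level in enumerate(levels):
        if level > threshold and start is None:
            start = i  # a sounding run begins
        elif level <= threshold and start is not None:
            segments.append((start, i))  # silence closes the run
            start = None
    if start is not None:
        segments.append((start, len(levels)))  # run still open at the end
    return segments
```

Each returned pair delimits one candidate element (a word, a group of words or a sentence, depending on how long a silence the threshold and frame size are tuned to detect).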

The filter 2 filters the audio signal SA by spectral subtraction or adaptive filtering in order to separate it into a signal comprising only voice, called the "voice signal" SV, and a signal comprising background noises, called the "noisy signal" SB. The filter 2 is for example based on a linear predictive analysis LPC (Linear Predictive Coding) and isolates different acoustic components in an audio signal such as voice, vocal noise and pure music.

The voice signal SV is then processed in parallel by the voice analyzer 3 and the voice recognition module 4.

The voice analyzer 3 analyzes the voice signal SV in order to continuously determine a list of parameters PVS _n characterizing the voice signal SV, called the "list of voice parameters". The list of voice parameters is not fixed but includes, among other things, acoustic and particularly prosodic parameters such as the vibration frequency, intensity, speech rate and timbre, and also other parameters such as the relative age of the speaker.

In addition to voice analysis, the voice signal SV is submitted to the voice recognition module 4.

When the language of the voice signal SV is considered to be unknown, the language determination unit 8 is inserted between the filter 2 and the voice recognition module 4. The unit 8 dynamically determines the language of the voice signal SV if it is not previously known. For multi-language information for example, the language of the voice signal is thus recognized continuously. If the language of the audio signal is predetermined and taken as the default language, then the language determination unit 8 is not necessary. The voice recognition module 4 transforms the voice signal SV into a text signal ST, called the subtitling signal. Several voice recognition modules can be used to optimize processing.

In a variant, the voice recognition module 4 considers the results of a context study carried out beforehand in order to refine the recognition and the transcription of the voice signal SV. The context is translated into syntactic elements, that is to say key words and expressions, with high probabilities of being included in a portion of the voice signal. For example, the context of a relatively periodic or frequent advertising or news spot in an audio signal emitted by a sound broadcasting station is predicted by knowing the detailed program of this station, or by deducing it from previous advertising or news spots. Various contexts in the form of key words and expressions, as defined above, constitute contexts pre-stored and managed in a contextual database 45 linked to the module 4 and to the units 5 and 6. The contexts in the base 45 are also completed and refined by automatic consultation of external databases according to the contexts recently detected. The contexts are thus gradually improved during the processing of the audio signal SA to facilitate speech recognition in the voice recognition module 4. The module 4 can rely on Natural Language Understanding (NLU) software.

The segmentation unit 51 segments the text signal ST into periodic time text segments ..., S _n , ... as the voice signal SV is received in a buffer memory and in synchronism with the time markers in the service signal. Indeed, the segmentation unit 51 further comprises a buffer memory continuously storing the voice signal SV for a duration greater than a predetermined duration DS of segments of the voice signal SV. In practice, the capacity of the buffer memory is such that it stores at most a portion of the voice signal SV whose duration is approximately at least ten times the duration DS of the segments.

The predetermined duration DS of the text signal segments depends on the ratio between the quality of the conversion and the processing time of the signal SA desired for the converter CL. A minimum duration of 15 seconds is typically sufficient for the system to ensure minimum quality.

In another preferred embodiment of the invention, the segmentation is not based on a temporal characteristic but depends on a syntactic element such as a word, or a group of words or a sentence.
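The periodic segmentation with a bounded buffer memory can be sketched as follows. This is only an illustration under stated assumptions: the class name `PeriodicSegmenter`, the sample rate and the use of a plain list of samples are hypothetical; only the duration DS of 15 seconds and the buffer capacity of about ten times DS come from the description.

```python
from collections import deque
from typing import Deque, Iterable, List


class PeriodicSegmenter:
    """Cut an incoming sample stream into segments of fixed duration DS."""

    def __init__(self, ds_seconds: float, sample_rate: int, buffer_factor: int = 10):
        self.segment_len = int(ds_seconds * sample_rate)
        # Bounded history of about buffer_factor times DS; older samples are dropped.
        self.buffer: Deque[float] = deque(maxlen=self.segment_len * buffer_factor)
        self.pending: List[float] = []  # samples of the segment being filled

    def push(self, samples: Iterable[float]) -> List[List[float]]:
        """Store new samples and return every segment completed by them."""
        completed: List[List[float]] = []
        for sample in samples:
            self.buffer.append(sample)
            self.pending.append(sample)
            if len(self.pending) == self.segment_len:
                completed.append(self.pending)
                self.pending = []
        return completed


# Hypothetical usage: DS = 15 s at an artificially low rate of 2 samples per second.
segmenter = PeriodicSegmenter(ds_seconds=15, sample_rate=2)
segments = segmenter.push(list(range(70)))
print(len(segments), len(segmenter.pending))
```

The bounded `deque` plays the role of the buffer memory, so that a portion longer than one segment remains available, for instance when a portion is processed several times.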

The unit 5 determines one or more contexts CS _n of the current text segment S _n as a function of the average PVS _n of each voice parameter PVS over the current text segment and as a function of the content of the current text segment S _n . In a preferred variant, contexts established and stored previously are also used to determine the context in the unit 5 and contribute to increasing the relevance of new segment contexts, which will in turn participate in determining the contexts of the following segments. In another variant, a general context is determined initially, before any indexing for subtitling of the audio signal SA, as a function of parameters external to the system and linked inter alia to the source of the audio video signal SAV. When the audio signal SA to be processed is that received by a radio or television receiver, program schedules or information thereon, as well as any information capable of informing the context of the voice signal SV, enrich the contextual database 45. When the context of the immediately preceding segment is not determined, the unit 5 bases this general context on the context of a determined number of segments preceding the current segment S _n .

The general context determination unit 6 compares the context CS _n of the current text segment S _n to the context CS _n-1 of the preceding text segment S _n-1 in order to determine time limits of a current general context CGk. The unit 6 determines an upper time bound of the general context which coincides with the upper time bound of the current segment S _n when the contexts CS _n , CS _n-1 of the current segment and of the segment preceding it are similar, and which is kept coincident with the upper time bound of the preceding segment S _n-1 when the context CS _n of the current segment is not similar to the context CS _n-1 of the preceding segment.

The general context CGk, compared to a text segment context, remains unchanged during one or more consecutive text segments whose contexts jointly define the general context. The set of consecutive text segments defining the general context CGk is delimited by time limits coinciding respectively with the lower bound, also called the anterior bound, of the first text segment processed in the set and the upper bound BSk, also called the posterior bound, of the last text segment processed in the set.
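The determination of the time limits of general contexts by comparison of consecutive segment contexts can be sketched as follows. The description does not specify how two contexts are judged "similar"; the keyword-overlap (Jaccard) measure, the 0.5 threshold and all names below are assumptions made only for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass
class Segment:
    start: float                # lower (anterior) time bound, in seconds
    end: float                  # upper (posterior) time bound, in seconds
    context: FrozenSet[str]     # key words characterizing the segment context


@dataclass
class GeneralContext:
    start: float
    end: float
    keywords: FrozenSet[str]


def similar(a: FrozenSet[str], b: FrozenSet[str], min_overlap: float = 0.5) -> bool:
    """Assumed similarity test: Jaccard overlap of the two keyword sets."""
    union = a | b
    return bool(union) and len(a & b) / len(union) >= min_overlap


def group_general_contexts(segments: List[Segment]) -> List[GeneralContext]:
    groups: List[GeneralContext] = []
    previous = None
    for seg in segments:
        if previous is not None and similar(seg.context, previous.context):
            # Similar contexts: the upper bound moves to that of the current segment.
            groups[-1].end = seg.end
            groups[-1].keywords = groups[-1].keywords | seg.context
        else:
            # Dissimilar contexts: the previous upper bound is kept, a new context opens.
            groups.append(GeneralContext(seg.start, seg.end, seg.context))
        previous = seg
    return groups


segments = [
    Segment(0, 15, frozenset({"weather", "brittany"})),
    Segment(15, 30, frozenset({"weather", "brittany", "rain"})),
    Segment(30, 45, frozenset({"war", "afghanistan"})),
]
for general in group_general_contexts(segments):
    print(general.start, general.end, sorted(general.keywords))
```

With these hypothetical segments, the first two are merged into one general context spanning 0 to 30 seconds, and the third opens a new general context.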

For the purpose of optimizing the conversion of the audio signal SA, periodic portions of the voice signal SV having a duration greater than and proportional to the duration DS of the periodic text segments S _n of the audio signal SA are each processed several times by the functional means 3 to 6. For example, passing a portion of the voice signal SV two to K times through means 2 to 6 refines the relevance of the contexts of this portion. The number K of processing cycles of an audio signal portion, as shown diagrammatically at 36 in FIG. 3, depends on the time constraints, on the quality of each processing in means 2 to 6 and on the capacity of the buffer memory in the segmentation unit 51. The faster the linguistic converter CL must process the audio video signal SAV, the smaller the number K.

Also for the purpose of optimizing the linguistic converter, the unit 5 determines several contexts of the current text segment S _n in order to further segment the text signal ST into different general contexts in the unit 6. Thus, intervals of different general contexts, which do not a priori have coinciding lower and upper time limits, are juxtaposed over common voice segments, which increases the accuracy of the general information about the audio signal.

As shown in FIG. 3, the linguistic converter CL also includes the audio comparator 7 in relation to an audio database 71 in which pieces of audio data such as music, songs, advertising jingles, news flashes and sound effects are stored. More generally, the database 71 has previously recorded any piece of audio data, preferably qualified by audio parameters PASp and contexts CAp whose time limits are offset relative to a fixed reference point of the audio data, such as the beginning of a song or a jingle. The database 71 thus contains pieces of typed audio data which interrupt the continuous audio signal SA with respect to a general context during a "context jump", such as an advertising spot, that is to say a short insert having a context different from that of a relatively long subject or theme in the signal SA.

The audio comparator 7 comprises a buffer memory and a segmentation unit. The comparator compares samples of the audio signal SA with samples of audio pieces contained in the audio database 71. The substantially identical samples allow the comparator to determine portions of the audio signal SA corresponding to complete pieces or parts of audio pieces contained in the base 71. The parameters PASp and the context CAp of the identified portion of the audio signal SA are applied to the unit 5 over the duration of the determined portion, replacing the averages PVS _n of the voice parameters over the current segment and the content of the text segment S _n . The text segments S _n are thus qualified respectively by audio parameters PASp and audio contexts CAp read in the database 71.
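The comparison of samples carried out by the audio comparator 7 can be illustrated by a deliberately naive sketch. It assumes that pieces and signal are short lists of amplitude values compared sample by sample within a tolerance; the names `find_piece` and `identify`, the database layout and the tolerance are hypothetical, and a practical comparator would more likely rely on a robust audio fingerprint.

```python
from typing import Dict, List, Optional, Tuple


def find_piece(signal: List[float], piece: List[float], tolerance: float = 0.05) -> Optional[int]:
    """Return the first offset at which `piece` matches `signal` within `tolerance`."""
    if not piece:
        return None
    for offset in range(len(signal) - len(piece) + 1):
        if all(abs(signal[offset + i] - value) <= tolerance for i, value in enumerate(piece)):
            return offset
    return None


def identify(signal: List[float], database: Dict[str, dict]) -> List[Tuple[str, int, int, str]]:
    """List (piece name, start, end, context) for every stored piece found in the signal."""
    hits = []
    for name, entry in database.items():
        offset = find_piece(signal, entry["samples"])
        if offset is not None:
            hits.append((name, offset, offset + len(entry["samples"]), entry["context"]))
    return hits


# Hypothetical content standing in for the audio database 71.
database = {
    "jingle": {"samples": [0.5, 0.9, 0.5], "context": "advertising"},
    "theme": {"samples": [0.3, 0.3, 0.3], "context": "news"},
}
signal = [0.0, 0.1, 0.52, 0.88, 0.5, 0.1, 0.0]
print(identify(signal, database))
```

The start and end offsets returned here correspond to the portion of the signal over which the stored parameters and context would replace those derived from the voice analysis.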

The audio comparator 7 also participates in improving the quality of context determination since the parameters PASp and the contexts CAp associated with the audio data and contained in the audio database 71 are determined both manually, and therefore very precisely, and automatically.

In order to improve the determination of contexts, the noisy signal SB comprising the residual non-vocal part of the current segment SA produced by the filter 2 is applied by the filter 2 to the audio comparator 7, in order to attempt to qualify the noisy signal SB by parameters PAS and contexts CA coming from the audio database 71 and thus to improve the context determination in the unit 5 and to enrich the contextual base 45 with new contexts.

In order to rapidly constitute audio data in the base 71, the machines hosting the management means managing the audio database 71 can be shared.

In another variant, the management means is associated with the audio comparator 7.

As a variant, the linguistic converter CL does not have an audio comparator 7 or an audio database 71.

In the case of subtitling of an audio signal emitted by a sound broadcasting station or the like, the audio extractor 1 can also be omitted.

The linguistic converter CL comprises at least one translation module 41. The module 41 is activated when the unit 8 finds that the language designated by the language identifier IL, read in correspondence with the user identifier IU in the database BD, is different from the language of the signal SV determined by the unit 8. The translation module 41 translates the text signal ST into a translated text signal STR in said designated language, which is applied to the segmentation unit 51. Preferably, the voice recognition module 4 and the translation module 41 use a common context analysis in order to improve the result of these two modules. In another embodiment, the linguistic converter CL does not include a translation module.
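The activation condition of the translation module 41 can be sketched as follows. The dictionary standing in for the database BD, the user identifier, the language codes and the toy `translate` callable are all hypothetical; only the rule "translate when the language identifier IL differs from the language determined for the signal SV" comes from the description.

```python
from typing import Callable, Dict

# Hypothetical stand-in for the database BD: user identifier IU -> language identifier IL.
BD: Dict[str, str] = {"user-42": "en"}


def route_text(text: str, user_id: str, signal_language: str,
               translate: Callable[[str, str, str], str]) -> str:
    """Return the text signal to apply to the segmentation unit.

    The translation callable is invoked only when the language designated by IL
    differs from the language determined for the voice signal.
    """
    designated_language = BD[user_id]
    if designated_language != signal_language:
        return translate(text, signal_language, designated_language)
    return text


def toy_translate(text: str, source: str, target: str) -> str:
    """Toy word-for-word lookup used only to make the sketch runnable."""
    lexicon = {("fr", "en"): {"bonjour": "hello", "monde": "world"}}
    words = lexicon.get((source, target), {})
    return " ".join(words.get(word, word) for word in text.split())


print(route_text("bonjour monde", "user-42", "fr", toy_translate))
print(route_text("hello world", "user-42", "en", toy_translate))
```

Passing the translator as a parameter mirrors the fact that the module 41 is optional: an embodiment without translation simply never reaches the call.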

Text segments S _n of the audio video signal SAV, translated where applicable, are thus continuously applied to the central unit UCs at the output of the converter CL. Preferably, the units 5 and 6 aggregate the text segments S _n into a subtitling signal ST. As a variant, however, the text segments S _n are sent directly to the terminal installation IT via the networks RP and RA and are aggregated in the subtitling generator GS.

The captioning service offered by the system of the invention may be subject to billing according to the captioned channel, its frequency of listening, and the parameters selected by the user, such as those requiring a translation of the subtitling into a language other than that of the original audio signal.

The subtitling system is also applicable to any installation receiving an audio signal SA and having a means of displaying the subtitles ST and a means of listening to the audio signal. For example, the installation comprises at least one radio receiver, or else a telephone or radiotelephone terminal, in particular for subtitling the speech signal, as an audio signal, of the distant interlocutor during a telephone conversation. According to other embodiments, the subtitling system is applicable to the field of audio conferencing or videoconferencing, and more generally to a conference, in order to subtitle the audio signal of a speaker during the conference.

All of these embodiments are particularly useful for the hearing impaired attending a conference.

Claims

1 - System for dynamically captioning an audio signal (SAV) continuously received by a receiving equipment (EQm), comprising means (CL) for converting the received audio signal (SAV) into a subtitling signal (ST) including subtitles, and means for combining the audio signal and the subtitling signal, characterized in that it comprises: - means (BD) for storing display parameters (PAF) determined beforehand by a user of the equipment (EQm), and a buffer means (Mit) for temporarily storing the received audio signal (SAV) as an audio signal (SAVR) delayed by the duration of the conversion in the means for converting,

- and in that the combining means (GS) combines the delayed audio signal (SAVR) and the subtitling signal (ST) into a subtitled audio signal (SAVST) with subtitles formatted according to the display parameters (PAF) in order to apply the audio signal subtitled with formatted subtitles to the equipment (EQm).

2 - System according to claim 1, comprising means (AV) for detecting a subtitling signal in the audio signal (SAV) so that the combining means (GS) formats subtitles of the subtitling signal detected based on display settings (PAF).

3 - System according to claim 2, characterized in that the means for storing (BD) stores an identifier (IL) defining a language determined beforehand by the user of the equipment (EQm), and in that the system comprises means (8) for determining an identifier of a language of the detected subtitling signal, means (UCs) for comparing the stored language identifier with the language identifier of the subtitling signal, and at least one means (41) for translating the subtitles of the subtitling signal (ST) into subtitles of the predetermined language when the language identifiers are different, in order to apply the subtitles of the determined language in the form of the subtitling signal (ST) to the combining means (GS).

4 - System according to any one of claims 1 to 3, wherein the means for converting (CL) comprises means (2) for filtering the continuous audio signal into a voice signal (SV) and a noisy signal (SB), means (3) for analyzing the voice signal (SV) to produce voice parameters (PVS), voice recognition means (4) converting the voice signal (SV) into a text signal (ST), means (51) for segmenting the voice signal (SV) into periodic time text segments (S _n ), means (5, 6) for determining a context (CS _n ) of each text segment as a function of averages (PVS _n ) of the voice parameters over the duration of the text segment and as a function of the text segment (S _n ) so that the contexts are involved in the conversion of the voice signal (SV) into the text signal (ST) executed by the voice recognition means (4), and means (5, 6) for aggregating the text segments (S _n ) into a subtitling signal (ST).

5 - System according to claim 4, comprising means (8) for determining a language of the voice signal (SV) so that the means for converting (CL) dynamically determines the subtitle signal (ST) according to the determined language.

6 - System according to claim 4 or 5, characterized in that the means for storing (BD) stores an identifier (IL) defining a language determined beforehand by the user of the equipment (EQm), and in that the system includes at least one means (41) for translating the text signal (ST) into a translated text signal (STR) according to the language designated by the language identifier (IL), the translated text signal (STR) being applied to the means for segmenting (51).

7 - System according to claim 6, wherein the voice recognition means (4) and the means for translating (41) use a common context analysis.

8 - System according to any one of claims 1 to 7, comprising means (BD) for memorizing activation parameters (PAC) determined by the user as a function of a duration of activation of the system, so that the means for converting (CL) converts and the combining means (GS) combines only during the activation period.

9 - System according to any one of claims 1 to 8, comprising means (UCit) for selecting a reception channel so that the received audio signal (SAV) to be converted corresponds to the selected reception channel.

10 - System according to any one of claims 1 to 9, comprising means (1) for extracting the audio signal (SA) from an audio video signal (SAV) which is received by the system and the equipment (EQm) and which is applied to the converting means (CL) and to the buffer means (Mit) in place of the audio signal (SAV).

11 - System according to any one of claims 1 to 10, wherein the buffer means (Mit) and the combining means (GS) are included in a terminal installation (IT) of the user connected at least to the receiving equipment (EQm), and the means for memorizing (BD) and the means for converting (CL) are included in a server (STT).

12 - System according to any one of claims 1 to 10, included in a terminal installation (IT) of the user connected at least to the receiving equipment (EQm).

13 - System according to any one of claims 1 to 10, included in a server means (STT; TR) for transmitting the subtitled audio signal (SAVST) at least to the receiving equipment (EQm).