WO2018077987A1

WO2018077987A1 - Method of processing audio data from a vocal exchange, corresponding system and computer program

Info

Publication number: WO2018077987A1
Application number: PCT/EP2017/077373
Authority: WO
Inventors: Xavier Priem; Yvan RIDE; Ahmed GABAL; Xingjun Wang
Original assignee: Voxpass
Priority date: 2016-10-28
Filing date: 2017-10-25
Publication date: 2018-05-03
Also published as: FR3058253B1; FR3058253A1

Abstract

The invention relates to a method and a system for processing audio data, the audio data coming from a vocal exchange between at least two speakers. The method comprises a processing phase (P20) for processing audio data which includes a speaker identification step (P202) comprising a step of dividing said audio data, according to at least one pause model, delivering a set of segmented audio data. According to the invention, the method comprises, for at least one current audio data item of the set of segmented audio data, recognition of at least one uttered word and the identification of the speaker having uttered the word on the basis of reference vocal identification data and current vocal identification data coming from the segmented audio data.

Description

A method of processing audio data from a voice exchange, system and corresponding computer program.

1. Domain

The present technique relates to the processing of users' voice data. It relates more particularly to the voice identification of users who wish to access a set of services by means of their voiceprints. More specifically, a technique is presented for accessing an online service that includes multi-voice separation and segmentation.

2. Prior Art

Growing human interaction with the digital world will require greater use of voice recognition, extending to the relationship between human beings and objects. The best-known modes of interaction are keyboards, mice, screens and touch surfaces. Today the voice is used only as a simple service control tool (interactive voice response, IVR, in which voice-driven menus have replaced switchboard operators). Voice control is also entering automotive use, for non-vital functions such as selecting the destination of the navigation system or controlling the volume of the car radio. Thus, the massive introduction of digital intelligence into everyday objects, from cars to connected refrigerators to Smart City-type services, creates a crucial need for interfacing with the environment that is efficient and acceptable to any type of audience.

It is not yet possible to converse in a complex way with an object or an automaton, as many science-fiction authors have imagined. Progress has been made, but both the recognition rate and the complexity of the conversation still need to be improved. The digital world will only really be accessible to everyone when everyone can interact with it naturally and intuitively. To do this, it is necessary on the one hand to improve the voice recognition rate and on the other hand to improve the ability of a system using it to render a correct transcription of the voice exchange. The purpose of the present document is not to describe new recognition algorithms but rather to focus on the recognition system and its operation.

The problem of audiovisual conference processing has been investigated for many years, in particular computerized conference processing involving local participants and remote participants. In particular, systems have been described for the spatialization of audiovisual conferences, including tracking of the participants, and in particular tracking of the participants who take the floor (speakers) as opposed to spectator participants, each participant being able to play the role of speaker and spectator in turn.

Moreover, the problem of speech identification and recognition has also been investigated for many years, essentially along two axes: on the one hand the recognition and transcription of speech into text, and on the other hand the voice identification of a speaker. Speech recognition and transcription techniques fall essentially into two categories. On the one hand, "universal" techniques are not aimed at a particular user and make it possible to recognize simple orders or instructions (these are IVR systems or voice assistants such as SIRI™ or Google Talk™). The techniques implemented are not particularly aimed at recognizing a specific user, but are adapted to the recognition of any type of user, without requiring any particular learning. Conversely, "specific" techniques, implemented for example in desktop recognition software, are based on prior learning of the voice, the diction and more generally the manner of speaking of a user, in order to transcribe as faithfully as possible the texts dictated by the user.

Each of the techniques described above is associated with a given and independent problem. There is no multiplex solution that can meet other needs. There is therefore a need to provide a flexible recognition solution that can adapt according to a given context, and more particularly according to a conference context.

3. Summary of the invention

The proposed technique does not have these disadvantages of the prior art. More particularly, the proposed technique relates to a method for processing audio data, said audio data being derived from a voice exchange between at least two speakers, said method comprising an audio data processing phase comprising a speaker identification step which comprises a step of dividing said audio data, according to at least one pause model, delivering a set of segmented audio data. Said method comprises, for at least one current audio data item of said set of segmented audio data:

at least one voice recognition step of said current audio data item, implemented via at least one voice recognition engine, delivering at least one recognized word or an indicator of absence of recognition;

when at least one voice recognition engine delivers at least one recognized word:

a step of searching, in a database, for at least one reference word corresponding to said at least one recognized word, delivering at least one reference voice identification data item;

a step of obtaining, from said current audio data item, a current voice identification data item;

a step of calculating at least one correspondence between said current voice identification data item and said at least one reference voice identification data item, delivering at least one match score; and

when a match score exceeds a predetermined threshold, a step of assigning said current audio data item to one of said at least two speakers.

According to a particular characteristic, the step of obtaining, from said current audio data item, a current voice identification data item comprises:

a step of applying, to said current audio data item, which takes the form of a one-dimensional audio signal x(t), a short-time Fourier transform (STFT), delivering a matrix X, called the energy matrix, of size m × n, with m frequency channels and n time frames, and whose frequency scale is between 0 and 1 kHz;

a step of normalizing the values of said matrix X, delivering a normalized matrix which constitutes the current voice identification data item.

According to a particular characteristic, the application of said short-time Fourier transform to said one-dimensional audio signal x(t) is carried out with a time step of 40 ms.

According to a particular characteristic, the values of the elements of the normalized matrix are between 0 and 100.

According to a particular characteristic, said at least one reference voice identification data item is also in the form of a normalized energy matrix, previously recorded within said database.

According to a particular characteristic, the step of calculating at least one correspondence, delivering the match score, comprises a step of determining a Pearson correlation value of the current voice identification data item and said at least one voice reference identification data.

According to a particular characteristic, the method comprises, prior to said processing phase, a phase of obtaining, for each of said at least two speakers, an individual voiceprint comprising said reference voice identification data.

According to a particular characteristic, said processing phase further comprises the following steps:

recording said verbal exchange in the form of at least one main audio stream;

when no audio stream exists for the current speaker, creating an audio stream specific to the current speaker;

time stamping said at least one main audio stream and said speaker-specific audio stream, using a datum representative of said current speaker.

According to another aspect, the present technique also relates to a system designed for implementing the method described above.

According to a preferred implementation, the various steps of the methods according to the proposed technique are implemented by one or more software or computer programs, comprising software instructions intended to be executed by a data processor of a relay module according to the proposed technique and designed to control the execution of the different steps of the methods.

Accordingly, the proposed technique is also directed to a program that can be executed by a computer or a data processor, which program includes instructions for controlling the execution of the steps of a method as mentioned above.

This program can use any programming language, and be in the form of source code, object code, or intermediate code between source code and object code, such as a partially compiled form, or in any other desirable form.

The proposed technique is also aimed at a data carrier readable by a data processor, and including instructions of a program as mentioned above.

The information carrier may be any entity or device capable of storing the program. For example, the carrier may comprise storage means, such as a ROM, for example a CD-ROM or a microelectronic circuit ROM, or a magnetic recording medium, for example a floppy disk or a hard disk.

On the other hand, the information medium may be a transmissible medium such as an electrical or optical signal, which may be conveyed via an electrical or optical cable, by radio or by other means. The program according to the proposed technique can be downloaded in particular on an Internet type network.

Alternatively, the information carrier may be an integrated circuit in which the program is incorporated, the circuit being adapted to execute or to be used in the execution of the method in question.

According to one embodiment, the proposed technique is implemented by means of software and/or hardware components. In this context, the term "module" may correspond in this document either to a software component, to a hardware component, or to a set of hardware and software components.

A software component corresponds to one or more computer programs, one or more subroutines of a program, or more generally to any element of a program or software capable of implementing a function or a set of functions, as described below for the module concerned. Such a software component is executed by a data processor of a physical entity (terminal, server, gateway, router, etc.) and is capable of accessing the hardware resources of this physical entity (memories, recording media, communication buses, input/output electronic cards, user interfaces, etc.).

In the same way, a hardware component corresponds to any element of a hardware assembly able to implement a function or a set of functions, as described below for the module concerned. It may be a hardware component that is programmable or has an integrated processor for executing software, for example an integrated circuit, a smart card, a memory card, an electronic card for executing firmware, etc.

Each component of the previously described system naturally implements its own software modules.

The different embodiments mentioned above as well as the various characteristics that constitute them, are combinable with each other for the implementation of the proposed technique.

4. Figures

Other characteristics and advantages of the invention will appear more clearly on reading the following description of a preferred embodiment, given as a simple illustrative and nonlimiting example, and the appended drawings, among which:

Figure 1 shows the architecture of a system according to the present technique;

Figure 2 shows the different phases of processing an audio stream;

Figure 3 shows the identification of one of two possible speakers;

Figure 4 shows an example of an energy matrix.

5. Description

5.1. General Elements.

As explained above, the general principle of the invention is to implement several processing and voice recognition engines within a speech processing system that is accessible online and can meet general needs. The system according to the present technique is implemented, for example, using a distributed (cloud-type) processing architecture in which access to computing resources is provided on demand. The system comprises a set of speech recognition engines, configured to perform on the one hand voice identification tasks, for identifying a speaker, and on the other hand voice recognition tasks (both features are described later). The system also comprises a speaker processing engine. This processing engine comprises on the one hand a registered speaker processing component and on the other hand an unregistered speaker processing component. It also comprises a set of communication interfaces (of "API" type) with the speech recognition engines. The registered speaker processing component allows the recording and updating of voice data (and in particular of voice signatures) of the speakers of the system. The unregistered speaker processing component enables the automatic recording and processing of voice data from unknown speakers. Figure 1 gives an overall view of the architecture of the system.

The processing system (SystTt) includes a speaker processing engine (MTrtUsrs) that includes at least one connection interface (IC) to at least one media processing server (SrvTTMed). The media processing server (SrvTTMed) is in charge of processing (recording, multiplexing, equalization) the audio signals that it receives. The media processing server (SrvTTMed) is connected, directly or indirectly via a communication network (WNTWK), to one or more VoIP servers (VOIP) comprising appropriate IP telephony software. The VoIP server is itself connected, via an optional gateway (GTWY), to sound capture devices (microphone, "octopus" type conference-room speakerphone, smartphone, computer) that are used, for example, during conference calls.

The speaker processing engine (MTrtUsrs) also includes access and processing interfaces (APIs) to at least one voice recognition engine (ASR #1, ASR #2, etc.), each of which has access to a voice data processing database (DB1, DB2, etc.). The speaker processing engine (MTrtUsrs) also has access to its own speaker database (DBU), which contains the profiles of the speakers (registered or unregistered) who have access to the system, and to which the registered and unregistered speaker processing components (CTUe and CTUne) have access.

The speaker processing engine (MTrtUsrs) also includes an access interface to a speaker registration server (SRVEU), this server comprising components for implementing a speaker registration service. Such a service allows speakers who so wish to register in the system, as described below.

Within the framework of the present technique, two complementary tasks are distinguished, which are implemented by the system and more particularly by the speaker processing engine (MTrtUsrs) coupled to the speech recognition engines (ASR #1, ASR #2, etc.). The first is speaker identification, whose purpose is to distinguish between the speakers for whom a voice stream is processed. Voiceprint identification assumes that certain physical characteristics of the vocal organs, which influence the quality of speech sound, are not exactly identical from one person to another. These characteristics are the size of the vocal cavities, throat, nose and mouth, and the shape of the articulating muscles of the tongue, jaw, lips and soft palate. The voice recognition engine(s) are used to generate (at least on the fly) at least one voiceprint of a speaker. This voiceprint is then used by the system during different tasks, especially when separating several speakers in the same conversation.

The second is voice recognition, whose purpose is to recognize the words spoken by the speakers, for example to produce a written transcription of their speech. To perform such recognition, the techniques used are conventional, with the exception of a selection algorithm adapted for the purposes of the present technique. As presented hereinafter, this algorithm makes it possible to use not only the recognition results from the recognition engine or engines, but also other parameters.

The speaker identification database (DBU) includes information relating to the various speakers recorded in the system; more particularly, this database includes at least one unique voiceprint for each speaker, which identifies that speaker among all registered speakers. This unique voiceprint is generated either at the end of a speaker's (voluntary) enrollment phase with the system, or when the system is used by an unregistered speaker (for example at a conference with several speakers, some or all of whom are not registered), or both. The purpose of this database is notably to allow the speakers to be differentiated in order to improve the speech recognition of each of them. The database is also used for processing, grouping and classifying the various audio streams attributed to a speaker. Indeed, it is not enough to transcribe the words spoken by the speakers as text: these words, exchanges and dialogues are also preserved so that they can be reused later, for example to perform a second speech recognition, which could be of better quality, or for statistical analysis and comparison purposes.

Thus, the unique voiceprint that is recorded in the system for a given speaker can be obtained in at least two different ways.

One way to obtain a unique voiceprint is to register manually with the system. This manual registration includes, for example, the creation of a speaker (user) account in the system via a web service. The creation of this account is accompanied by the provision, by the speaker, of a set of voice data. This set of voice data can be obtained directly, by reading one or more predetermined sentences at the time of registration. To keep registration simple and quick, the number of sentences to be read is deliberately limited (a dozen sentences at most). Another possibility, when registering manually, is to upload a sound file that the speaker has pre-recorded by reading a certain number of sentences (as in the live online registration). This type of registration makes it possible, on the one hand, as indicated above, to generate a unique voiceprint for the speaker, which is subsequently used to identify the speaker; on the other hand, it also makes it possible to train one or more recognition systems to recognize the voice of the speaker.

In a specific embodiment, the online service has a plurality of speech recognition engines (from different vendors). The voice stream (live or pre-recorded) that is provided to the system during speaker registration is automatically transmitted to these speech recognition engines, each of which independently learns the speaker's voice from the sentences read by the speaker. This embodiment, in which several speech recognition engines are used, is particularly suitable insofar as it makes it possible to benefit from several different recognition technologies. The modules used are commercial modules, each providing an open interface (API) for exchanging data with the system.

5.2. Identification and voice recognition

A key element of the system of the invention resides in the ability of the system to separate the speakers and to recognize the words and sentences pronounced by these speakers. The identification of the speaker is implemented via one or more voiceprints of this speaker, which have been prerecorded (in a prior phase). Voice prints are obtained by known techniques and do not require further description.

Speech recognition is implemented via one or more voice recognition engines that deliver, from an input stream, recognition results which the speaker processing engine (MTrtUsrs) accesses. Thus, for example, for the recognition of a word (or phrase) uttered by a speaker, each recognition module is called via its interface (API). Each provides, as output, the recognized word (or phrase), possibly together with a probability of success of the recognition. The system then uses these outputs, together with other parameters, to determine and select the most likely result and to obtain a truth score; when this score exceeds a determined threshold, the result is retained as the transcription.

The selection parameters are for example:

the number of occurrences of a word or term during the conversation or exchange: if a term has already been spoken by the speaker, the probability of occurrence of that term is increased for the rest of the conversation;

the number of occurrences of a word or term during the conversation or exchange: if a term has already been spoken by another speaker, the probability of occurrence of that term is increased for the rest of the conversation;

the number of occurrences of a word or term during an archived conversation or exchange: if a term has already been spoken by the speaker in the past, the probability of occurrence of that term is increased for the rest of the conversation;

the number of recognition modules that deliver the same result (when multiple modules are used).

In other words, rather than relying solely on the results provided by the speech recognition engine(s) used by the system, the speaker processing engine also uses past data, whether from the current conversation or exchange or from previous conversations, exchanges or recordings that have been stored and archived in the speaker database.

In connection with FIG. 2, a voice data processing method implemented by the present system is presented. Such a method comprises:

a phase (P10) of obtaining, for each speaker, an individual voiceprint; the manner of obtaining this voiceprint is described above and below in implementation examples;

a processing phase (P20) of at least one audio stream derived from a verbal exchange between the speakers, comprising at least one iteration of the following steps:

recording (P201) said verbal exchange in the form of at least one main audio stream;

identifying (P202), using the individual voiceprints, a current speaker among the plurality of speakers;

when no audio stream exists for the current speaker (P203), creating (P204) an audio stream specific to the current speaker;

time stamping (P205) said at least one main audio stream and said speaker-specific audio stream, using a datum representative of said current speaker; this datum may advantageously be a hash of the voiceprint of the current speaker.

This method makes it possible, for a given exchange or conversation, to record separately the streams from the different speakers. These streams are time-stamped so that the exchanges can be segmented and each speaking turn attributed to an individual speaker. One embodiment of the speaker identification is presented below.
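The bookkeeping of the processing phase (P20) might be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the described system: the class and method names are hypothetical, the identification step (P202) is assumed to have already produced the speaker's voiceprint as bytes, and SHA-256 is an assumed choice since the text only mentions "a hash".

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ExchangeRecorder:
    """Hypothetical bookkeeping for phase P20 (all names are illustrative)."""
    main_marks: list = field(default_factory=list)       # time stamps placed in the main stream
    speaker_streams: dict = field(default_factory=dict)  # speaker tag -> list of segments

    @staticmethod
    def speaker_tag(voiceprint: bytes) -> str:
        # Datum representative of the speaker: a hash of the voiceprint
        # (SHA-256 is an assumption, the text does not name a hash function).
        return hashlib.sha256(voiceprint).hexdigest()

    def add_segment(self, voiceprint: bytes, start_s: float, end_s: float,
                    samples: bytes) -> str:
        tag = self.speaker_tag(voiceprint)
        # P203/P204: create the speaker-specific stream if none exists yet.
        stream = self.speaker_streams.setdefault(tag, [])
        # P205: time-stamp both the main stream and the speaker-specific stream.
        self.main_marks.append((start_s, end_s, tag))
        stream.append((start_s, end_s, samples))
        return tag
```

Using a hash rather than the voiceprint itself keeps the markers short and avoids copying biometric data into every time stamp.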

Using the audio streams, a recognition phase is implemented. Depending on the embodiment, it is carried out in real time or deferred. This recognition phase comprises at least one iteration of the following steps:

transmitting, to at least one voice recognition engine, at least a portion of a speaker-specific audio stream (that of a current speaker);

obtaining, from said at least one voice recognition engine, at least one recognition result, taking the form of at least one recognized word;

determining a truth score of said at least one recognized word based on at least one previous recognition result;

when said truth score exceeds a determined threshold, adding said recognized word to a recognition data structure.
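The text does not give a formula for the truth score, so the following is only one possible sketch. The weighting, the `boost` value, the default threshold and all function names are assumptions made for illustration; the inputs correspond to the selection parameters listed earlier (engine agreement, optional engine probabilities, and word histories for the speaker, the other speakers and archived exchanges).

```python
from collections import Counter


def truth_score(word, engine_results, speaker_hist, others_hist, archive_hist,
                boost=0.1):
    """Hypothetical truth score in [0, 1]; the weighting is an assumption.

    engine_results: list of (word, probability) pairs, one per engine.
    The three histories are Counters of previously recognized words.
    """
    votes = [p for w, p in engine_results if w == word]
    if not votes:
        return 0.0
    # Share of engines agreeing on the word, weighted by their mean probability.
    score = (len(votes) / len(engine_results)) * (sum(votes) / len(votes))
    # Each history in which the word already occurs raises the score.
    for history in (speaker_hist, others_hist, archive_hist):
        if history[word] > 0:
            score += boost
    return min(score, 1.0)


def add_if_trusted(engine_results, speaker_hist, others_hist, archive_hist,
                   transcript, threshold=0.5):
    """Append the best-scoring candidate to the transcript if it passes the threshold."""
    candidates = sorted({w for w, _ in engine_results})
    scored = [(truth_score(w, engine_results, speaker_hist, others_hist,
                           archive_hist), w) for w in candidates]
    if not scored:
        return None
    best_score, best_word = max(scored)
    if best_score > threshold:
        transcript.append(best_word)
        speaker_hist[best_word] += 1  # feeds the next iteration
        return best_word
    return None
```

A rejected word is simply not added here; a real system could instead keep it for a later, deferred recognition pass.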

The determination of the truth score depends on the embodiment, and in particular on the presence (or absence) of several recognition engines, on the presence (or absence) of a recognition probability provided by the recognition engine(s), and on the availability of previous transcription results.

5.3. Voice recognition of a speaker

Hereinafter, the voice identification of a speaker, implemented by the speaker processing engine (MTrtUsrs), is described. Voice identification of a speaker is implemented by means of a plurality of energy matrices. These energy matrices are stored in the speaker identification database (DBU), either by being directly associated with a user who is registered within the system, or by being associated with an unregistered (temporary) user, who is for example identified during a meeting or a verbal exchange. As explained above, when the system is in operation, a distinction is made between registered users and unregistered users (unknown to the system). Energy matrices are created on the one hand to register the user and on the other hand to recognize the user when he or she speaks.

Speaker registration

The principle of user identification consists in creating, from the speech stream of a speaker, one or more energy matrices associated with this speaker. Suppose, by way of illustration, that we have several samples or audio files containing the word "la", each pronounced by the same user (for example a user who is registering). These files or samples are in the time domain. For the analysis (and classification) of these signals, a short-time Fourier transform (STFT) is used. The advantage of the STFT is that it represents frequency-domain information as a function of time.

The choice of the time step of the STFT is delicate. If the time step is too small, the number of samples is insufficient and the results of the Fourier transform are therefore unreliable. If the time step is too large, some information is lost in the time domain. For this reason, a time step of 40 ms is chosen.

Using the STFT, the energy matrix is calculated as shown in FIG. 3. In this illustrative representation, the word "la" appears ten times, that is to say the user has pronounced the word "la" ten times with more or less different intonation. For each of these matrices, the abscissa represents time, in seconds, and the ordinate represents frequency, in hertz (Hz). The time step is 40 ms and the frequency step is 50 Hz. Since most of the energy is distributed at frequencies between 0 and 1000 Hz, the frequency band retained is 0 to 1000 Hz.

After the STFT analysis, an energy matrix is obtained for each word. The energy matrix chosen has 24 rows, corresponding to frequency bands, and 8 columns, corresponding to time. So that these matrices can be compared, they are normalized: the maximum value is set to 100.
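As an illustration, an energy matrix of this kind could be computed with NumPy roughly as follows. This is a sketch under assumptions rather than the patented computation: the function name, the Hann window, the use of non-overlapping 40 ms frames and the squared-magnitude definition of energy are all choices made here. The number of rows depends on the sampling rate and frame length, so the sketch does not attempt to reproduce the 24 × 8 layout.

```python
import numpy as np


def energy_matrix(x, fs, step_s=0.040, f_max=1000.0):
    """Return a normalized energy matrix (rows: frequency, columns: time).

    Assumptions: non-overlapping frames of `step_s` seconds, Hann window,
    energy = squared magnitude of a short-time Fourier transform.
    """
    x = np.asarray(x, dtype=float)
    frame = int(round(step_s * fs))           # samples per 40 ms frame
    n_frames = len(x) // frame
    if n_frames == 0:
        raise ValueError("signal shorter than one frame")
    frames = x[:n_frames * frame].reshape(n_frames, frame)
    spectrum = np.fft.rfft(frames * np.hanning(frame), axis=1)
    freqs = np.arange(frame // 2 + 1) * fs / frame   # bin centre frequencies in Hz
    keep = freqs <= f_max                      # keep only the 0 to 1000 Hz band
    energy = (np.abs(spectrum[:, keep]) ** 2).T
    peak = energy.max()
    if peak > 0:
        energy = energy * 100.0 / peak         # normalize so that the maximum is 100
    return energy
```

For example, a 0.32 s recording sampled at 8 kHz yields 8 time columns, one per 40 ms frame.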

In the example of Figure 3, there are ten energy matrices. Each matrix has 24 rows for the frequency steps and 8 columns for the time steps. These matrices are recorded in the speaker identification database (DBU), in association with the given user (whether registered or not). Each matrix can be stored with a relatively small amount of data: one byte suffices for each value of the matrix (value normalized to 100), that is, at most 24 × 8 = 192 bytes per matrix (without optimization), to which a compression algorithm can be applied.
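The storage arithmetic can be checked with a short sketch. The function names are hypothetical, rounding to the nearest integer is an assumed quantization, and zlib stands in for the unspecified compression algorithm.

```python
import zlib

import numpy as np


def pack_matrix(matrix):
    """Quantize a matrix normalized to 0..100 to one byte per value."""
    q = np.clip(np.rint(matrix), 0, 100).astype(np.uint8)
    return q.tobytes()


def unpack_matrix(blob, shape=(24, 8)):
    """Rebuild the quantized matrix from its byte representation."""
    return np.frombuffer(blob, dtype=np.uint8).reshape(shape)


def pack_compressed(matrix):
    """Optional compression step (zlib is an assumed choice)."""
    return zlib.compress(pack_matrix(matrix))
```

A 24 × 8 matrix packed this way occupies exactly 192 bytes before compression, and the rounding error per value is at most 0.5.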

In a particular embodiment of the present technique, the voiceprint of a speaker is constructed from a plurality of energy matrices, pertaining to one or more words uttered by the speaker. Elements other than the energy matrices can also be integrated and/or used to generate a speaker's voiceprint.

Speaker Recognition

It is assumed that there is a plurality of energy matrices associated with a user in the speaker identification database (DBU), and that this user is currently speaking and being recorded by the system. To recognize this user, the system relies on the matrices resulting from the words pronounced by the user. In a first phase, it is assumed that a certain number of words have been recognized by the system and that an audio sample is available for these recognized words. As an illustration, assume that one of the recognized words is the word "la". The system already has energy matrices resulting from the pronunciation of this word "la" (as previously explained).

In a second phase, the system therefore searches for the previously generated matrices corresponding to the word "la". This search is performed for all "potential" users (that is, a list of possible users, given the circumstances of the recording): if, for example, the recording concerns four users registered in the system, the search for the matrices is performed only on these four users in order to shorten the processing time.

On the basis of the energy matrices obtained after the search, the identification of the user is implemented using the Pearson correlation coefficient to compare these energy matrices. This coefficient is calculated as represented in the following equation:

$$ r = \frac{N \sum_{i=1}^{N} x_i y_i - \sum_{i=1}^{N} x_i \sum_{i=1}^{N} y_i}{\sqrt{\left(N \sum_{i=1}^{N} x_i^2 - \left(\sum_{i=1}^{N} x_i\right)^2\right)\left(N \sum_{i=1}^{N} y_i^2 - \left(\sum_{i=1}^{N} y_i\right)^2\right)}} $$

in which:

- X = {x_1, ..., x_N} is a matrix present in the database, against which the comparison is made;

- Y = {y_1, ..., y_N} is the matrix to compare, the current matrix;

- N is the number of values in the matrices.

The value r lies between -1 and 1. If |r| < 0.4, the two matrices are weakly correlated. When 0.4 ≤ |r| < 0.7, the two matrices are moderately correlated. When 0.7 < |r| ≤ 1, the two matrices are strongly correlated: this last case is the one considered in the context of the present technique.

Thus, by using the value of the Pearson correlation coefficient, with a predetermined threshold equal to 0.7 (or close to 0.7), the similarity of two matrices is decided. The speaker is identified from the correlation calculation that yields the highest value among the results above the predetermined threshold.
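A minimal sketch of this comparison, assuming NumPy and energy matrices of identical shape, is given below. The function names and the dictionary layout of the references are hypothetical. The sketch compares the signed value of r with the threshold, which is an assumption: the text states its ranges in terms of |r|.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation of two matrices, treated as flat lists of N values."""
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    n = x.size
    num = n * np.sum(x * y) - np.sum(x) * np.sum(y)
    den = np.sqrt((n * np.sum(x * x) - np.sum(x) ** 2)
                  * (n * np.sum(y * y) - np.sum(y) ** 2))
    return float(num / den) if den > 0 else 0.0


def identify(current, references, threshold=0.7):
    """Return (speaker, r) for the highest correlation above the threshold.

    references: dict mapping a speaker to the list of reference matrices
    stored for the recognized word. Returns (None, None) if nothing passes.
    """
    best = (None, None)
    for speaker, matrices in references.items():
        for ref in matrices:
            r = pearson(current, ref)
            if r > threshold and (best[1] is None or r > best[1]):
                best = (speaker, r)
    return best
```

Restricting `references` to the potential speakers of the current exchange keeps the number of correlations small.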

In relation to FIG. 4, the method used to carry out a speaker identification is presented. The method implemented comprises a step of dividing (not shown) the audio data, according to at least one pause model, delivering a set of segmented audio data. The method is implemented on the basis of the segmented data and includes, for at least one current audio data item of said set of segmented audio data:

at least one voice recognition step (10) of said current audio data, implemented via at least one voice recognition engine, delivering at least one recognized word or an indicator of absence of recognition; when at least one speech recognition engine delivers at least one recognized word

(20):

a search step (201), within a data frame, of at least one reference word corresponding to said at least one recognized word, delivering at least one reference voice identification data item;

a step of obtaining (202), from said current audio data, a current voice identification data item;

a calculation step (203) of at least one correspondence between said current voice identification data and said at least one reference voice identification data, delivering at least one match score; and when a match score exceeds a predetermined threshold (205), a step of assigning (206) said current audio data to one of said at least two interlocutors.
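The sequence of steps above can be sketched as follows. This is a minimal sketch under stated assumptions: the function `assign_segment`, the callables `recognize`, `extract` and `score`, and the layout of `database` (recognized word -> speaker -> reference data) are hypothetical names chosen for illustration, not elements disclosed in the description.

```python
from typing import Callable, Dict, Optional, Sequence

Vector = Sequence[float]


def assign_segment(
    segment: Vector,
    recognize: Callable[[Vector], Optional[str]],
    extract: Callable[[Vector], Vector],
    score: Callable[[Vector, Vector], float],
    database: Dict[str, Dict[str, Vector]],
    threshold: float = 0.7,
) -> Optional[str]:
    """Return the speaker a segment is assigned to, or None if no match."""
    word = recognize(segment)            # step 10: voice recognition
    if word is None:                     # indicator of absence of recognition
        return None
    references = database.get(word, {})  # step 201: search for the reference word
    current = extract(segment)           # step 202: current identification data
    best_speaker, best_score = None, threshold
    for speaker, reference in references.items():
        match = score(current, reference)  # step 203: match score
        if match > best_score:             # step 205: threshold test
            best_speaker, best_score = speaker, match
    return best_speaker                  # step 206: assignment (None if no match)
```

Passing the recognition engine, the extraction of the identification data and the scoring function as parameters keeps the sketch independent of any particular engine or correlation measure.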

Thus, this method makes it possible to distinguish between the various speakers who take part in the voice exchange. As indicated above, this identification method can advantageously be coupled with other identification methods (especially sound spatialization), in particular to produce more convincing results and/or to obtain these results more quickly.

5.4. Description of a use case

In this use case, an implementation of the system is described for recognizing and transcribing the minutes of a conference call in which several speakers participate, some of them (for example two speakers) being located in the same room around an "octopus" type conference device, and a third participating remotely via a telephone handset, a computer, etc. The conference is set up and processed at least indirectly by the system. For the system, a conference begins with a phase of recording a voice print of each of the participants in the conference. This phase comprises several steps, among which:

a step of transmitting, to each speaker participating in the conference, a request to state his or her surname and/or first name; the request takes, for example, the form of a message such as "You are joining the conference, please say your surname and first name":

when the speaker participates in the conference by telephone, the voice print recording phase is implemented over this telephone line;

when the speaker participates in the conference from a computer, the recording phase is implemented via the computer, either locally on the computer, or on the system by transmitting the audio stream resulting from the utterance;

when the speaker participates from a conference room, the method implemented is as follows:

the system requests the number of participants around the table with a phrase such as "Please indicate how many participants are around the table";

once the number of participants has been obtained (either by typing on a keyboard or by spoken utterance), the system asks each participant to state his or her surname and/or first name; in addition, the system implements a location algorithm that uses this utterance to obtain a spatial location of the speaker;

for each speaker, there follows a step of calculating a voice print on the basis of the preceding enunciation; several cases are possible:

either the speaker is unknown to the system: the calculated voice print is then used during the conference;

or the speaker is known to the system (because he or she has previously registered): in this case the stated surname and/or first name (and the recognition that is made of it) is used to look up the prerecorded voice print of the speaker; either of these two voice prints can then be used for speaker recognition.

The conference continues, for the system, with automatic recording and recognition of the conference participants. Again, there are several scenarios:

when the speaker participates via telephone, the audio stream from the microphone of this telephone is recorded continuously and can be decoded (possibly also continuously); the speaker's voice print is used to assign that stream to the speaker in question;

when the speaker participates via a computer (or a tablet, or an IP connection of a smartphone), the process is the same as for a telephone connection;

when the speaker participates around a conference table, several processes are implemented concomitantly:

a. recognition of the speaker's voice print using the prerecorded voice print: this recognition also takes into account the spatialization of the sound picked up by the microphone or microphones;

b. a recording of the audio stream from the conference (all speakers combined);

c. a recording of the conference audio stream for each speaker:

for example, if two speakers are around the table for the conference, two separate audio streams are recorded and the audio portions are distributed over the two streams on the basis of the recognition of each speaker's voice print.
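The per-speaker recording described above can be illustrated by a short sketch. It assumes that each audio portion has already been attributed to a speaker and carries its start and end times in seconds; the function name `split_by_speaker` and the tuple layout are hypothetical and serve only as an illustration.

```python
from collections import defaultdict


def split_by_speaker(segments):
    """Distribute identified audio portions over one stream per speaker.

    `segments` is an iterable of (start_s, end_s, speaker, samples) tuples.
    Each output stream keeps the time marks, so the moments at which each
    speaker has spoken remain known and the streams can be resynchronized.
    """
    streams = defaultdict(list)
    for start, end, speaker, samples in sorted(segments, key=lambda s: s[0]):
        streams[speaker].append((start, end, samples))
    return dict(streams)
```

For two speakers around the table, the result holds two time-stamped streams, in addition to the combined recording of the conference.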

From a general point of view, the system of the invention intervenes in two stages:

it recognizes speakers in real time during the conference: the identification is performed in real time on the basis of previously calculated voice prints; this involves creating the audio stream of each speaker and time-marking the streams to determine the moments at which the different speakers have spoken; a temporal synchronization of the audio streams is then implemented, which makes it possible, in particular, to assign a set of picked-up sounds to a given user;

it recognizes the words uttered by each speaker: this recognition phase can be implemented either in real time, for example to provide a real-time transcription of the conference (for example for people with hearing loss), or in deferred time; the advantage of real time is that it provides participants with an immediate record of the verbal exchanges; the disadvantage is that it requires significant computing resources and that the risk of errors is higher; the advantage of deferred time is that it requires fewer computing resources because the streams can be processed one after the other.

From a general point of view, the system of the invention performs either direct processing of the conference or indirect processing thereof. The distinction between direct and indirect processing of the conference is as follows:

in direct processing, the system is in charge of the multiplexing of the conference, that is to say of setting up any telephone bridges with the various speakers (when there is a conference call), in addition to the recording and processing of the audio streams;

in indirect processing, the system is in charge of the recording and processing of the audio streams, including the identification of the speakers and the processing of their speech: it records any telephone streams and the streams from the microphone or microphones of the room in which the conference is held.

Claims

A method of processing audio data, said audio data being derived from a voice exchange between at least two speakers, said method comprising a processing phase (P20) of the audio data comprising a speaker identification step (P202) which comprises a step of splitting said audio data, according to at least one pause model, delivering a set of segmented audio data, said method being characterized in that it comprises, for at least one current audio data item of said set of segmented audio data:

at least one voice recognition step (10) of said current audio data, implemented via at least one voice recognition engine, delivering at least one recognized word or an indicator of absence of recognition;

when at least one voice recognition engine delivers at least one recognized word (20):

a step of searching (201), within a database, for at least one reference word corresponding to said at least one recognized word, delivering at least one reference voice identification data item;

A method according to claim 1, characterized in that the step of obtaining (202), from said current audio data, a current voice identification data item comprises:

a step of applying, to said current audio data, which takes the form of a one-dimensional audio signal x(t), a short-term Fourier transform (STFT), delivering a matrix X, called the energy matrix, of size m × n, with m frequency channels and n time frames, whose frequency scale is between 0 and 1 kHz;

a step of normalizing the values of said matrix X, delivering a standardized matrix which constitutes the current voice identification data.

A method according to claim 2, characterized in that the application of said short-term Fourier transform to said one-dimensional audio signal x(t) is performed at a time step of 40 ms.

Method according to claim 2, characterized in that the values of the elements of the standardized matrix are between 0 and 100.

Method according to claim 2, characterized in that said at least one reference voice identification data is also in the form of a standardized energy matrix, previously recorded within said database.

Method according to claim 1, characterized in that the step of calculating (203) at least one correspondence, delivering the match score, comprises a step of determining a Pearson correlation value between said current voice identification data and said at least one reference voice identification data.

A method according to claim 1, characterized in that it comprises, prior to said processing phase (P20), a phase (P10) of obtaining, for each of said at least two speakers, an individual voice print comprising said reference voice identification data.

Method according to claim 1, characterized in that said processing phase further comprises the following steps:

time stamping (P205), in said at least one main audio stream and in said own audio stream, using data representative of said current speaker.

System (SystTt) for processing audio data, said audio data being derived from a voice exchange between at least two speakers, said system comprising a speaker processing engine (MTrtUsrs), which comprises speaker identification means comprising means for dividing said audio data, according to at least one pause model, delivering a set of segmented audio data, said system being characterized in that it comprises, for at least one current audio data item of said set of segmented audio data:

voice recognition means of said current audio data, implemented via at least one voice recognition engine, delivering at least one recognized word or an indicator of absence of recognition;

search means, within a database, of at least one reference word corresponding to said at least one recognized word, delivering at least one reference voice identification data item;

means for obtaining, from said current audio data, a current voice identification data item;

means for calculating at least one correspondence between said current voice identification data and said at least one reference voice identification data, delivering at least one match score; and

means (206) for assigning said current audio data to one of said at least two speakers, implemented when a match score exceeds a predetermined threshold.

Computer program product downloadable from a communication network and/or stored on a computer-readable medium and/or executable by a microprocessor, characterized in that it comprises program code instructions for executing a processing method according to any one of claims 1 to 8 when it is executed on a processor.