EP4290514A1 - Masking the speaker's voice - Google Patents
Masking the speaker's voice
- Publication number
- EP4290514A1 (application EP23176415.0A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- audio
- segment
- alteration
- signal
- voice
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/003—Changing voice quality, e.g. pitch or formants
- G10L21/007—Changing voice quality, e.g. pitch or formants characterised by the process used
- G10L21/013—Adapting to target pitch
Definitions
- the present invention relates to masking the voice of a speaker, in particular to protect the identity of the speaker by restricting the possibility of identifying him by analysis of an original recording of his voice.
- there exists voice recognition software, or more particularly speaker recognition software, which identifies speakers based on their voice signature.
- voice recognition software is sometimes used by the police to identify people making telephone or other threats, or to trace the origin of anonymous calls.
- certain software of this type makes it possible to identify a speaker with a level of reliability such that it can lead to a person being convicted by the courts.
- malicious uses of such software can have far less legitimate consequences for individuals, sometimes creating irremediable harm, as in the case of harm to private life. This is why audio-phonic processing techniques can be used to protect speakers whose voice recordings are likely to be broadcast or intercepted on communication networks.
- Speech recognition is used to recognize what is being said. It therefore transforms speech into text, which is why it is also known as speech-to-text conversion.
- the voice is masked, making it possible to meet the requirement for protection of the speaker(s), since the process can easily be implemented in the very first equipment involved in the audio acquisition and processing chain.
- the process makes it possible to have a final rendering which remains intelligible, that is to say which is neither a "Mickey Mouse"™ voice nor a "Darth Vader"™ voice, because the two alterations applied to each audio segment produce changes in frequency content in opposite directions to each other.
- the frequency alterations are limited, one always being ascending while the other is always descending. Therefore, the software or device adapted to implement the solution cannot itself do the opposite operation.
- two components of the spectrum of the audio signal are simultaneously altered compared to the original recording of the speaker's voice.
- This is a first element favorable to the irreversibility of the method, because a malicious third party wishing to return to the speaker's original voice will have to act on these two characteristics of the voice in combination, which complicates the task compared to masking by pitch shift alone.
- the variability of alterations is not stationary. It varies over time. Thus there can be several variations over a second of processing.
- the proposed implementation modes provide masking of the voice which is irreversible in audio, that is to say by inverse audio processing.
- the voice masked by the proposed method is non-analyzable by known speaker recognition techniques, and does not expose the speaker to commercial practices violating his privacy using voice recognition techniques, given that the masked voice of the same speaker is never masked in the same way twice.
- the division of the audio signal into a series of successive audio segments of determined duration can be carried out by temporal windowing independent of the content of the audio signal.
- the division of the audio signal can be configured so that the duration of an audio segment is equal to a fraction of a second, so that successive changes in the parameters varying the first and second alterations occur several times per second.
- the alteration of the pitch of the audio signal corresponds to a variation in the fundamental frequency of the audio signal of any of the following values: ±6.25%, ±12.5%, ±25%, ±50% and ±100%.
- the first alteration and the second alteration are dependent on each other, fluctuating jointly so as to satisfy a determined criterion relative to their respective effects on the frequency content of the timbre of the audio segment and on the frequency content of the height of the audio segment, respectively.
- this criterion may consist of maintaining a minimal gap between the respective effects of the two alterations, and thus avoiding temporarily returning to the original voice.
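As an illustration only, the joint fluctuation subject to a minimal-gap criterion can be sketched as follows. The gap threshold, percentage ranges, and uniform draws are assumptions for the sketch, not values taken from the patent:

```python
import random

# Hypothetical sketch of the "minimal gap" criterion: the ascending
# alteration (in percent, always positive) and the descending alteration
# (in percent, always negative) fluctuate jointly, but their combined
# effect is kept away from zero, so the pair never momentarily
# approximates the original (unaltered) voice.
MIN_GAP = 10.0  # assumed minimal gap between the two effects, in percent

def draw_alteration_pair(rng):
    """Draw one (ascending, descending) pair satisfying the gap criterion."""
    while True:
        up = rng.uniform(5.0, 50.0)     # ascending alteration, always > 0
        down = -rng.uniform(5.0, 50.0)  # descending alteration, always < 0
        if abs(up + down) >= MIN_GAP:   # net effect stays away from zero
            return up, down

rng = random.Random(0)
pairs = [draw_alteration_pair(rng) for _ in range(100)]
```

The rejection loop is just one way to enforce the criterion; the patent only requires that some determined criterion on the two effects be maintained.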
- a second aspect of the invention relates to a computer program comprising instructions which, when the computer program is loaded into the memory of a computer and is executed by a processor of this computer, are adapted to implement all the steps of the method according to the first aspect of the invention above.
- the computer program for implementing the method can be recorded in a non-transitory manner on a tangible recording medium, readable by a computer.
- the computer program for implementing the method can advantageously be sold as a plugin, capable of being integrated into "host" software, for example audio-phonic or audio-visual production and/or processing software such as Pro Tools™, Media Composer™, kan Pro™ or Audition™, among others.
- This choice is particularly suited to the audiovisual world. In fact, this eliminates the need to transfer the original audio signal (unmasked, therefore in the clear), to a remote server or another computer. It is therefore the user's computer which alone holds the source file of the original voice, that is to say before execution of the masking process. This greatly reduces the risk of malicious interception of the original audio signal.
- the method can be implemented by audio-phonic processing software which can very well be executed on independent hardware equipment with standard processing capacity, for example a general-purpose computer, because its processing takes place in real time. It requires neither artificial intelligence, nor any voice data bank, nor any learning process, unlike a number of prior-art solutions, in particular some of those presented in the introduction.
- the computer program for implementing the method can advantageously be integrated, either ab initio or through a software update, into the internal software embedded in equipment dedicated to the production and/or processing of audio-phonic or audio-visual content (called "media" in the jargon of those skilled in the art), such as an audio and/or video mixing and/or editing console, for example.
- Such equipment is intended more for producers, mixers, and other media post-production professionals.
- a third aspect of the invention relates to an audio-phonic or audio-visual processing device, comprising means for implementing the method.
- This device can be produced in the form, for example, of a general-purpose computer capable of executing the computer program according to the second aspect above.
- a fourth and final aspect of the invention relates to an audio-phonic or audio-visual processing device such as an editing and/or mixing console making it possible to produce media (namely audio, audiovisual content, or multimedia) corresponding to or incorporating a speech signal from a speaker, in particular from a speaker to be protected, the apparatus comprising a device according to the third aspect.
- the human voice is the set of sounds produced by air from the lungs passing over the vocal folds of the larynx of a human being.
- the pitch and resonance of the sounds made depend on the shape and size of not only their vocal cords, but also the rest of the person's body.
- Vocal cord size is one source of the difference between male and female voices, but it is not the only one.
- the trachea, mouth and pharynx, for example, define a cavity in which the sound waves emitted by the vocal cords resonate. Additionally, genetic factors cause differences in vocal cord size among people of the same sex.
- the method makes it possible to mask a speaker's voice in order to protect their identity and/or their privacy.
- the protection of the identity and/or privacy of the speaker is achieved by an intentional alteration not only of the pitch but also of the timbre of the speaker's voice.
- This alteration is carried out by digital signal processing techniques, on the basis of processing algorithms implemented by computer.
- a complex sound of fixed pitch can be analyzed into a series of elementary vibrations, called natural harmonics, whose frequency is a multiple of that of the reference frequency, or fundamental frequency.
- the fundamental frequency f (from which the frequencies j·f of the harmonics arise) characterizes the perceived pitch of a note, for example an "A".
- the distribution of the intensities of the different harmonics according to their rank j, characterized by their envelope, defines the timbre. The same applies to a speech signal as to musical notes, speech being only a succession of sounds produced by the vocal apparatus of a human being.
- the timbre of a musical instrument or a voice designates the set of sound characteristics which allow an observer to identify by ear the sound produced, independently of the pitch and intensity of this sound.
- the timbre makes it possible, for example, to distinguish the sound of a saxophone from that of a trumpet playing the same note with the same intensity, these two instruments having their own resonances which distinguish their sounds when listening: the sound of a saxophone contains more energy on the relatively lower-frequency harmonics, giving a relatively "duller" timbre, while the sound of a trumpet has more energy on the relatively higher-frequency harmonics, giving a "brighter" sound, although both have the same fundamental frequency.
- we designate by vocal register the set of frequencies emitted with an identical resonance, that is to say the part of the vocal range in which a singer, for example, emits sounds of different pitches with an almost identical timbre.
- the flowchart of figure 1 schematically illustrates the main steps of the process of masking a speaker's voice.
- the method can be implemented in an audio-phonic system 20 as represented very schematically in the figure 2 .
- This system may include hardware means 201 and software means 202 allowing this implementation.
- this signal can belong to an audiovisual program (mixing sound and images), such as a video of the interview of a witness wishing and/or having to remain anonymous, filmed for example using a "hidden camera" or accompanied by blurring of the image of the witness to be protected.
- the speech signal can correspond to all or part of the soundtrack of a video, and generally of any audio-phonic, radio, audiovisual or multimedia program.
- the audio-phonic system 20 is for example audiovisual mixing equipment, used to edit video sequences in order to produce an audiovisual program from various video sequences and their respective “soundtracks”.
- the hardware means 201 of the audio-phonic system 20 comprise at least one calculator, such as a microprocessor associated with random access memory (RAM), means for reading and recording digital data on digital recording media (mass memory such as an internal hard disk), and data interfaces for exchanging data with external devices.
- a calculator such as a microprocessor associated with random access memory (or RAM)
- mass memory such as an internal hard disk
- data interfaces for exchanging data with external devices.
- an audio signal acquisition device 31, such as a microphone.
- the system 20 can communicate in reading and/or writing with other external data carriers, in order to read thereon the data of an audio signal to be processed and/or to record thereon the data of the audio signal after processing.
- the system 20 may include means of communication such as a modem or an Ethernet, 4G, 5G network card, etc., or even a Wi-Fi or Bluetooth® communication interface.
- the software means 202 of the audio-phonic system 20 comprise a computer program which, when loaded into the RAM and executed by the processor of the audio-phonic system 20, is adapted to execute the steps of the method of masking the voice of a speaker.
- In step 11, the sound of the speaker's voice is captured via the microphone 31 of the system 20, either for immediate processing in the system 20 or for deferred processing.
- By immediate processing, we mean processing carried out during the acquisition of the audio signal, without an intermediate step of fixing this audio signal on any permanent recording medium.
- the data from the original audio signal then only passes through the RAM (non-permanent memory) of the system 20.
- deferred processing means processing which is carried out from a recording, made within or under the control of the audio-phonic system 20, of the speaker's speech signal acquired via the microphone 31.
- This recording is fixed on a mass data storage medium, for example a hard disk internal to the system 20. It can also be a peripheral hard disk, that is to say external, coupled to this system. It may also be another peripheral data storage device with permanent memory capable of storing the audio data of the speech signal permanently, such as a USB key, a memory card (Flash type or other) or an optical or magnetic recording medium (audio CD, CD-Rom, DVD, Blu-Ray disc, etc.).
- the mass data storage medium can also be a data server with which the audio-phonic system 20 can communicate to upload the data of the audio signal so that it is stored there, and later download it for subsequent processing.
- This server can be local, that is to say part of a local area network (LAN) to which the audio-phonic system 20 also belongs.
- the data server can also be a remote server, such as a data server in the Cloud which is accessible via the open Internet.
- the speech signal corresponding to the speaker's speech sequence may have been acquired via other equipment, distinct from the audio-phonic system 20 which implements the method of masking the speaker's voice.
- an audio data file encoding the speaker's voice may have been recorded on a removable data medium, which can then, in step 11, be coupled to the audio-phonic system 20 for reading the audio data.
- This audio data file may also have been uploaded to a data server in the Cloud, which the audio-phonic system 20 can also access in order to download the audio data of the audio signal to be processed.
- step 11 of the method then consists solely, for the audio-phonic system 20, of accessing the audio data of the speaker's speech signal.
- step 11 of the method comprises a (temporal) division of the original speech signal into a series of successive audio segments of determined duration, which is constant from one segment to another in the series of segments thus produced.
- the division of the audio signal into a series of successive audio segments of the same determined duration is carried out by temporal windowing which is independent of the content of the audio signal, and which can be done "on the fly”.
- the windowing is independent of both the frequency content, that is to say the distribution of energy in the frequency spectrum of the audio signal, and the informational or linguistic content, that is to say the semantics and/or the grammatical structure of the speech contained in this audio signal, in the language spoken by the speaker.
- the method is therefore very simple to implement, since no physical or linguistic analysis of the signal is necessary to generate signal segments to be processed.
- a temporal windowing operation makes it possible to process a signal whose length is deliberately limited to a duration D, since any calculation can only be done on a finite number of values.
- this relies on an observation window function, also called a weighting window, denoted h(t).
- the simplest window, although not necessarily the most used or preferred, is the rectangular window (or door function) of size m, defined as follows: h(t) = 1 if t ∈ [0, m[, and h(t) = 0 otherwise.
- the duration D of an audio segment s_k(t) is equal to a fraction of a second, for example between 10 milliseconds (ms) and 100 ms (in other words, D ∈ [10 ms, 100 ms]).
- An audio segment then has a duration less than that of a word of the language spoken by the speaker, whatever the language in which he or she speaks. This duration is a fortiori less than the duration of a sentence or even a portion of a sentence in this language.
- the duration of an audio segment s_k(t) is then, at most, of the order of the duration of a phoneme, that is to say the duration of the smallest speech unit (vowel or consonant).
- An audio segment s_k(t) therefore does not carry, in itself, any informational content with regard to spoken language, because its duration is far too short for that. This gives the masking process the advantage of simplicity, along with good robustness against the risk of reversion.
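The content-independent, fixed-duration windowing described above can be sketched as follows. The sampling rate and the 50 ms segment duration are assumed for illustration (the patent only requires D between 10 ms and 100 ms):

```python
import numpy as np

# Minimal sketch: split an audio signal into successive segments of
# constant duration D by temporal windowing that is independent of the
# signal's content. With D = 50 ms, a segment is well below the duration
# of a spoken word, so it carries no linguistic content by itself.
SAMPLE_RATE = 16_000                 # assumed sampling rate in Hz
D = 0.050                            # segment duration in seconds
SEGMENT_LEN = int(SAMPLE_RATE * D)   # samples per segment (here 800)

def split_into_segments(signal: np.ndarray) -> list:
    """Rectangular windowing: h(t) = 1 on [0, m[, 0 elsewhere."""
    n = len(signal) // SEGMENT_LEN
    return [signal[k * SEGMENT_LEN:(k + 1) * SEGMENT_LEN] for k in range(n)]

one_second = np.zeros(SAMPLE_RATE)   # one second of (silent) audio
segments = split_into_segments(one_second)
```

With these assumed values, one second of audio yields 20 segments, so the per-segment masking parameters can change 20 times per second.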
- Step 11 also includes the formation of a series of pairs of audio segments, each pair comprising a primary (original) copy and a duplicate of an audio segment of the series of audio segments above.
- these pairs can more particularly be defined in the frequency domain, after a Fourier transform (FT) applied to the segments s_k(t) of the audio signal in the time domain.
- the series of primaries and the series of duplicates of the audio segments of the speech signal undergo processing, for each primary and each duplicate of a pair, to extract on the one hand the envelope of the harmonics characterizing the timbre of the audio segment, and on the other hand the signal characterizing the pitch of the audio segment.
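The patent does not prescribe a particular extraction technique; as a hedged illustration, a classical source-filter approach uses cepstral smoothing to split a segment's log spectrum into a smooth envelope (timbre) and a fine structure carrying the harmonics (pitch). The lifter length `n_ceps` is an assumption of this sketch:

```python
import numpy as np

# Hedged sketch of a source–filter separation: a low-quefrency cepstral
# lifter keeps the smooth spectral envelope (timbre); what remains of the
# log spectrum is the fine harmonic structure related to the pitch.
def split_timbre_pitch(segment: np.ndarray, n_ceps: int = 30):
    spectrum = np.fft.rfft(segment)
    log_mag = np.log(np.abs(spectrum) + 1e-12)
    cepstrum = np.fft.irfft(log_mag)
    lifter = np.zeros_like(cepstrum)
    lifter[:n_ceps] = 1.0
    lifter[-n_ceps + 1:] = 1.0                     # keep low quefrencies
    envelope = np.fft.rfft(cepstrum * lifter).real  # timbre component
    fine_structure = log_mag - envelope             # pitch / harmonics
    return envelope, fine_structure

rng = np.random.default_rng(0)
seg = rng.standard_normal(800)          # one 50 ms segment at 16 kHz (assumed)
env, fine = split_timbre_pitch(seg)
```

By construction the two components add back to the segment's log-magnitude spectrum, so each can be altered independently before recombination.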
- the series of timbres and the series of pitches are designated in the following by the letters A and B, respectively (or vice versa).
- the signals characterizing the pitch and the timbre, extracted from the primary and the duplicate, undergo parallel processing, essentially independent of each other. These processing operations are illustrated by steps 12a and 13a of the left branch and by steps 12b and 13b of the right branch, respectively, of the algorithm illustrated schematically by the flowchart of figure 1.
- Step 12a is a first, ascending alteration (denoted MOD A in the following and in the drawings), applied to each element of series A of the audio segments.
- This ascending alteration is not identical from one element to another in the A series. On the contrary, it evolves as a function of at least a first masking parameter.
- this first ascending alteration always has the effect of raising a determined part of the frequency content of the primary of the audio segment to which it is applied. By this we mean that all or part of the frequencies of the primary of the segment considered are shifted towards high frequencies, compared to the corresponding audio segment of the original speech signal. Applying the first alteration generates an altered timbre (here shifted upwards) of the audio segment.
- Step 12b is a second, descending alteration (denoted MOD B in the following and in the drawings), applied to each element of series B of audio segments.
- this descending alteration MOD B is not identical from one element to another in series B. This means that it evolves, depending on at least one second masking parameter.
- this downward alteration always has the effect of lowering a determined part of the frequency content of the element of the audio segment to which it is applied. By this we mean that all or part of the frequencies of the audio segment considered are shifted towards low frequencies, compared to the corresponding audio segment of the original speech signal. Applying the second alteration generates an altered pitch (here towards the bottom) of the audio segment.
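One hedged way to picture a purely ascending or purely descending alteration is a scaling of the frequency axis of a magnitude spectrum, with a factor that is always above 1 for the ascending case and always below 1 for the descending case. The factors and bin counts below are illustrative only, not values from the patent:

```python
import numpy as np

# Illustrative sketch: the masking parameter is a frequency-axis scaling
# factor. MOD A always uses a factor > 1 (content moves to higher bins),
# MOD B always uses a factor < 1 (content moves to lower bins), so
# neither alteration can ever move frequencies in the other direction.
def shift_spectrum(magnitudes: np.ndarray, factor: float) -> np.ndarray:
    """Resample a magnitude spectrum on a scaled frequency axis."""
    bins = np.arange(len(magnitudes))
    return np.interp(bins / factor, bins, magnitudes, left=0.0, right=0.0)

spec = np.zeros(100)
spec[40] = 1.0                      # a single spectral peak at bin 40
up = shift_spectrum(spec, 1.25)     # ascending: peak moves to bin 50
down = shift_spectrum(spec, 0.5)    # descending: peak moves to bin 20
```

Since the factor is constrained to one side of 1, no choice of parameter value within the tool can reproduce the opposite movement, which is the irreversibility argument made above.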
- It is then advantageous for each of the alterations MOD A and MOD B to be restricted from the point of view of the evolution of the frequency content of the elements of the audio segment to which it is applied.
- these alterations of the frequency spectrum are each only ascending or only descending, without any inflection in the direction of movement of the frequencies concerned in the spectrum considered. In fact, this prevents the audio-phonic system 20 from being able to be used itself by malicious people to whom it has been provided or made available, or who could have access to it by any other means, in order to reverse the alteration of the audio signal.
- Such a reversion could consist of applying to the masked audio signal (which the malicious third party would have copied or intercepted in any way) alterations with judiciously chosen masking parameters, in order to return to the original speech signal, that is to say to the audio signal corresponding to the natural voice of the speaker. But thanks to the modes of implementation described above, such a maneuver is not possible with the audio-phonic system 20 according to the invention itself. Indeed, no change in the values of the masking parameters of the ascending alteration MOD A and the descending alteration MOD B that the malicious third party could attempt can have the effect of reversing the unidirectional movements of the pitch (height) and of the timbre, respectively, of the original speech signal.
- the audio system 20 does not offer the possibility of reversing the alteration that it produces. This does not prevent a malicious third party from attempting this fraud by other means, but at least the system used to mask the audio signal containing the natural voice of a speaker cannot be diverted from its function, in effect "reversed", in order to remove the protection of the speaker that it provides.
- the method then comprises a step 14 of combining the timbre of the audio segment, altered by the MOD A alteration and obtained in step 12a, on the one hand, and the pitch of the audio segment, altered by the MOD B alteration and obtained in step 12b, on the other hand, to form a single resulting altered audio segment.
- By "combining" we mean here an operation having, from a physical point of view, the effect of recombining the respective altered spectra, that is to say of merging the respective frequency contents of the altered timbre and of the altered pitch of said audio segment, possibly with averaging and/or smoothing.
- this can be obtained by multiplication ("×" symbol) or by convolution ("∗" symbol), either in the time domain or in the frequency domain, after transformation of the audio signal(s) from the time domain into the frequency domain by a Fourier transform.
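Under a source-filter assumption, the merging by multiplication in the frequency domain can be sketched as follows; the envelope shape and the harmonic spacing are invented for illustration:

```python
import numpy as np

# Sketch of the recombination step: the altered timbre (a spectral
# envelope) and the altered pitch (a harmonic excitation spectrum) are
# merged by multiplication in the frequency domain, which is equivalent
# to a convolution in the time domain.
altered_envelope = np.linspace(1.0, 0.1, 201)   # altered timbre (assumed shape)
excitation = np.zeros(201)
excitation[::20] = 1.0                          # altered harmonics (assumed spacing)
merged_spectrum = altered_envelope * excitation # single resulting altered spectrum
altered_segment = np.fft.irfft(merged_spectrum) # back to the time domain
```

The product keeps energy only at the harmonic positions, weighted by the envelope, which is exactly the "merging of the respective frequency contents" described above.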
- steps 12a and 12b on the one hand, and steps 13a and 13b on the other hand, can be carried out in the reverse order to that presented in figure 1.
- steps 13a and 13b can be executed after (as shown) or before steps 12a and 12b.
- the alteration of the height of the audio signal can thus correspond to an "oriented" variation, namely a rise or a fall, of the fundamental frequency of the audio signal, which can take any of the following determined values: ±6.25%, ±12.5%, ±25%, ±50% and ±100%.
- These example values correspond approximately to variations of a semitone, tone, third, fifth, or octave, respectively, in pitch (i.e. of the fundamental frequency, or “pitch”) of the original speech signal.
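The correspondence between the quoted percentages and equal-tempered musical intervals follows from the relation f′ = f · 2^(n/12) for a shift of n semitones, i.e. a relative variation of (2^(n/12) − 1) × 100 percent:

```python
# Relative pitch variation produced by a shift of n equal-tempered
# semitones: (2**(n/12) - 1) * 100 percent of the fundamental frequency.
intervals = {
    "semitone": 1,     # ≈ +5.9 %,  quoted as ±6.25 %
    "tone": 2,         # ≈ +12.2 %, quoted as ±12.5 %
    "major third": 4,  # ≈ +26.0 %, quoted as ±25 %
    "fifth": 7,        # ≈ +49.8 %, quoted as ±50 %
    "octave": 12,      # exactly +100 %
}
variation_pct = {name: (2 ** (n / 12) - 1) * 100 for name, n in intervals.items()}
```

This confirms that the patent's round values are approximations of the exact equal-tempered ratios, as the text states.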
- The sequential repetition of step 14 for the successive pairs of primaries and duplicates of the audio segments generated in step 11 generates a series of altered audio segments.
- the method finally comprises, in step 15, the recomposition of the masked audio signal from the series of altered audio segments obtained by the repetition of the previous steps, 12a-12b, 13a-13b and 14.
- This recomposition is carried out by superposition-addition, in the time domain, of the successive elements of the series of altered audio segments produced in step 14, as they are transformed.
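The superposition-addition recomposition can be sketched as an overlap-add loop; the 50% overlap and the Hann weighting are assumptions of this sketch, since the text only specifies addition of the successive altered segments in the time domain:

```python
import numpy as np

# Minimal overlap-add sketch (assumed 50 % overlap, Hann window): each
# altered segment is weighted and summed at its time position to
# recompose a continuous masked audio signal.
SEG = 400     # samples per altered segment (assumed)
HOP = 200     # hop size: 50 % overlap (assumed)
window = np.hanning(SEG)

def overlap_add(segments):
    out = np.zeros(HOP * (len(segments) - 1) + SEG)
    for k, seg in enumerate(segments):
        out[k * HOP:k * HOP + SEG] += window * seg
    return out

segs = [np.ones(SEG) for _ in range(5)]   # five dummy altered segments
masked = overlap_add(segs)
```

The windowed overlap smooths the junctions between successively (and differently) altered segments, avoiding audible discontinuities at segment boundaries.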
- the frequency content is doubly altered, compared to the spectrum of the considered segment of the original speech signal. This results from the accumulation of the respective effects of the MOD A and MOD B functions.
- the successive changes of the first masking parameter and of the second masking parameter, which occur at each occurrence of steps 13a and 13b respectively, induce random variations of said first and second parameters from one pair to the next in the series of pairs of audio segments generated in step 11.
- Since the MOD A and MOD B alterations relate to different components of the spectrum of the considered segment of the original speech signal, since in addition they use distinct masking parameters, and since finally these masking parameters evolve randomly and independently of each other, the masking effect obtained is very difficult, if not impossible, to reverse.
- the variations of the first and second masking parameters are themselves fluctuating randomly, from one pair of segments to another in the series of pairs of audio segments.
- the variations, denoted VAR A and VAR B in steps 13a and 13b, of the parameters of the modifications denoted MOD A and MOD B introduced in steps 12a and 12b fluctuate as a function of time.
- This fluctuation occurs from one segment to another of the original speech signal. In figure 1, it is therefore symbolized by an operation denoted VAR A+B in step 14.
- Figures 3A and 3B illustrate a mode of implementation of the descending alteration and of the ascending alteration, respectively, which can be applied to the timbre and to the height (or pitch) of an audio signal segment, in step 12a and in step 12b, respectively, of the method illustrated by the flowchart of figure 1.
- The ascending alteration MOD A is applied to the pitch of the voice, symbolized in figure 3A by a tuning fork.
- A tuning fork is known as an object whose acoustic resonance produces a sound at a pure frequency, like the fundamental frequency (or pitch) of a human voice.
- The descending alteration MOD B is applied to the timbre of the voice, symbolized in figure 3B by the envelope of the frequency spectrum of an audio signal.
- The example represented by figures 3A and 3B is not limiting.
- Conversely, the ascending alteration MOD A could be applied to all or part of the harmonic envelope (timbre), while the descending alteration MOD B would be applied to the fundamental frequency (pitch).
- The two alterations MOD A and MOD B each displace certain frequencies (namely, in the example considered here, the pitch for one and the envelope of the harmonics for the other) in opposite directions in the frequency spectrum (i.e. an ascending direction towards the treble for one, and a descending direction towards the bass for the other).
- These effects operating in two opposite directions allow good protection while preserving a certain intelligibility of the audio signal.
- The "masculinizing" effect of a frequency shift towards the bass, which results from the descending alteration MOD B, is partly counterbalanced by the "feminizing" effect of a frequency shift towards the treble, which results from the ascending alteration MOD A. This avoids generating a masked signal close to the voice of "Darth Vader"™ or close to the voice of "Mickey Mouse"™.
- The audio file obtained after implementing the process of figure 1 may be transmitted by electronic mail, posted online on social networks or on a website, broadcast over the airwaves, or distributed on any recording medium.
- the process makes it possible to mask the voice a posteriori, on a recording of the speaker's voice, as can easily be done with audio editing software.
- the method does not allow audio or video calls to be made with a masked voice.
- When the process is implemented on the audio-phonic or audio-visual platform with which the speaker's voice is acquired, the original voice does not circulate on any computer network, which avoids the risk of a malicious third party intercepting data corresponding to the unmasked voice.
- The computer program which implements the masking method, by carrying out the corresponding digital processing calculations, can be included in host software, for example the operating software of an audio-phonic processing environment, such as an audio-mixing or audio-visual editing console.
- FIG. 4A is a frequency diagram of a recorded audio sequence, showing the distribution of energy as a function of time (on the abscissa) and frequency (on the ordinate).
- FIG. 4B is a frequency diagram of the audio sequence of figure 4A after the implementation of a voice masking process according to the prior art, by simple pitch shift. One sometimes speaks of a "pitched" signal to designate a signal having undergone such a shift. Comparing these two frequency diagrams clearly shows a very strong similarity between the harmonics of the original signal and those of the pitched signal.
- Such a segment is denoted s k ( τ ) at the top of figure 6.
- In step 61, the segment s k ( τ ) undergoes a Fourier transform, for example a short-term Fourier transform (STFT, from "Short-Term Fourier Transform"; TFCT in French), in order to move into the time-frequency domain.
- In step 62, the segment S k ( t , f ) is decomposed into a modulus term denoted X k ( t , f ) and a phase term denoted Q k ( t , f ):
- S k ( t , f ) = X k ( t , f ) · Q k ( t , f ), with:
- X k ( t , f ) = | S k ( t , f ) |
- Q k ( t , f ) = exp( i · Arg S k ( t , f )), where Arg denotes the argument of a complex number.
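Steps 61 and 62 (short-term Fourier transform, then decomposition into modulus and phase terms) can be sketched as follows with `scipy.signal.stft`; the sampling rate, test tone, and window length are illustrative assumptions, not values from the patent.

```python
import numpy as np
from scipy.signal import stft

fs = 16000
tau = np.arange(fs) / fs
s_k = np.sin(2 * np.pi * 220 * tau)        # stand-in for an audio segment s_k(tau)

# Step 61: short-term Fourier transform into the time-frequency domain.
f, t, S_k = stft(s_k, fs=fs, nperseg=512)  # S_k(t, f), complex-valued

# Step 62: decompose S_k into a modulus term X_k and a unit-modulus phase term Q_k.
X_k = np.abs(S_k)                          # X_k(t, f) = |S_k(t, f)|
Q_k = np.exp(1j * np.angle(S_k))           # Q_k(t, f) = exp(i * Arg S_k(t, f))

# The decomposition is exact: S_k = X_k * Q_k.
assert np.allclose(S_k, X_k * Q_k)
```

The two terms can then be processed separately, as the following steps of the method do with the timbre and pitch components.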
- the timbre component A k ( t , f ) can be obtained by the cepstrum method.
- The cepstrum is a dual, temporal form of the logarithmic spectrum: the log-spectrum in the frequency domain becomes the cepstrum in the time (quefrency) domain.
- The fundamental frequency can be calculated from the cepstral signal by determining the index of the main peak of the cepstrum, and the envelope of the spectrum, which corresponds to the timbre component A k ( t , f ), is obtained by windowing (liftering) the cepstrum.
- The height (or pitch) component B k ( t , f ), for its part, can then be obtained by dividing the signal X k ( t , f ) point by point by the value of the timbre component A k ( t , f ).
- In other words, by "subtracting" the timbre component from the modulus term X k ( t , f ) (which is carried out by a division calculation in the time-frequency space), one obtains the component B k ( t , f ) characterizing the height (or pitch), or more generally what is called the fine structure of the power spectral density (PSD).
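The cepstrum-based separation described in the last few items can be sketched as follows for one magnitude-spectrum column. The liftering cutoff `n_lifter` is an illustrative assumption; a full implementation would also locate the main cepstral peak to estimate the fundamental frequency, as mentioned above.

```python
import numpy as np

def split_timbre_pitch(X_col, n_lifter=30):
    """Split a magnitude-spectrum column X_k(., f) into an envelope A
    (timbre) and a fine structure B (pitch) via the cepstrum method.
    n_lifter is the cutoff quefrency of the low-pass lifter (assumed value)."""
    eps = 1e-10
    log_mag = np.log(X_col + eps)
    cep = np.fft.irfft(log_mag)          # real cepstrum (quefrency domain)

    # Low-quefrency window: keeps the slowly varying spectral envelope.
    lifter = np.zeros_like(cep)
    lifter[:n_lifter] = 1.0
    lifter[-n_lifter + 1:] = 1.0         # symmetric part of the real cepstrum
    A = np.exp(np.fft.rfft(cep * lifter).real)   # timbre component A_k

    B = X_col / (A + eps)                # fine structure B_k = X_k / A_k
    return A, B

# Illustrative column: magnitude spectrum of a random 1024-sample frame.
X_col = np.abs(np.fft.rfft(np.random.default_rng(0).standard_normal(1024)))
A, B = split_timbre_pitch(X_col)
assert np.allclose(A * B, X_col, rtol=1e-6)  # the split is multiplicative
```

The division by `A` is exactly the "subtraction in the log domain" described above: multiplying the two components back together recovers the original modulus term.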
- the frequency alteration functions ⁇ A ( f ) and ⁇ B ( f ) correspond to the alterations MOD A and MOD B , respectively, which were presented above with reference to the figure 1 .
- The time variation functions ν A ( t ) and ν B ( t ) correspond to the variations VAR A and VAR B, respectively, which were presented above with reference to figure 1.
- Step 64a comprises the application, to the signal A k ( t , f ) which corresponds to the timbre component on the frequency scale f, of the temporal variation function ν A ( t ), to generate an intermediate signal, denoted A′ k ( t , f ), from the timbre component A k ( t , f ).
- The function ν A ( t ) is a linear function. Preferably, and as already mentioned above, it fluctuates over time in a random manner, varying from one original audio signal segment to another in the series of segments S k ( t , f ) which are processed in sequence. In other words, it changes as a function of the index k, according to a random process whose refresh is governed by a parameter ρ, so that the alteration of the timbre is not stationary.
- the function ⁇ B ( t ) is a linear function.
- it fluctuates over time in a random manner, varying from one original audio signal segment to another in the series of segments S k ( t , f ) which are processed in sequence.
- It changes as a function of the index k, according to a random process whose refresh is governed by the parameter ρ, so that the alteration of the height is not stationary.
- the temporal variation function ⁇ A ( t ) can vary according to a random walk within a determined amplitude range [ ⁇ A min , ⁇ A max ] and with a temporal refresh rate corresponding to the parameter ⁇ mentioned above, where ⁇ A min , ⁇ A max and ⁇ are first masking parameters, associated with the temporal variation function ⁇ A ( t ) .
- The temporal variation function ν B ( t ) can, for example, vary according to a random walk within an amplitude range [ ν B min , ν B max ] and with a temporal refresh rate corresponding to the aforementioned parameter ρ, where ν B min , ν B max and ρ are second masking parameters, associated with the temporal variation function ν B ( t ).
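The bounded random walk described in the last two items can be sketched as follows; the bounds, step size and refresh rate ρ used here are illustrative assumptions, not values prescribed by the patent.

```python
import random

def random_walk_parameter(v_min, v_max, rho, duration_s, step=0.02, seed=0):
    """Generate successive values of a masking parameter performing a
    bounded random walk in [v_min, v_max], refreshed rho times per second.
    step (maximum increment per refresh) is an assumed tuning value."""
    rng = random.Random(seed)
    n_refresh = int(rho * duration_s)
    v = (v_min + v_max) / 2.0            # start in the middle of the range
    values = []
    for _ in range(n_refresh):
        v += rng.uniform(-step, step)    # random increment at each refresh
        v = min(max(v, v_min), v_max)    # clamp to the allowed amplitude range
        values.append(v)
    return values

# Example: a parameter fluctuating within [0.9, 1.1], refreshed rho = 5 times
# per second over 10 seconds of signal.
vals = random_walk_parameter(0.9, 1.1, rho=5, duration_s=10)
assert len(vals) == 50
assert all(0.9 <= v <= 1.1 for v in vals)
```

Running two such walks with different seeds for ν A ( t ) and ν B ( t ) keeps their fluctuations uncorrelated, as the method requires.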
- the fluctuations of the two temporal variation functions ⁇ A ( t ) and ⁇ B ( t ) are preferably independent of each other, in order to reinforce the irreversibility of the alterations. In other words, the temporal variation functions ⁇ A ( t ) and ⁇ B ( t ) are uncorrelated.
- The parameter ρ is the parameter of the fluctuation denoted VAR A+B in figure 1.
- This parameter defines, for example, the number of random variations per second of the alterations of the spectrum of an audio segment. If ρ were equal to zero, the variations VAR A and VAR B would be stationary, so that the results of the alterations MOD A and MOD B would be fixed, which is not the case in practice.
- ⁇ has a value between 1 and 10. This value being homogeneous at a frequency, we can say that ⁇ is between 1 and 10 Hz. This value is lower than the frequency of the temporal division of the original speech signal into audio segments (by windowing) , which is more of the order of 100 Hz.
- In steps 65a and 65b, the frequency alteration functions ω A ( f ) and ω B ( f ) are applied to the timbre component A k ( t , f ) and to the pitch component B k ( t , f ), respectively, to generate a timbre component of the masked audio segment, denoted A″ k ( t , f ), and a pitch component of the masked audio segment, denoted B″ k ( t , f ).
- These frequency alteration functions ω A ( f ) and ω B ( f ) correspond to the alterations denoted MOD A and MOD B in figure 1.
- The alteration functions ω A ( f ) and ω B ( f ) are monotonic, that is to say that the deformation they introduce on the frequency axis is either ascending, with the effect of raising a determined part of the frequency content of the audio segment s k ( τ ), or descending, with the effect of lowering a determined part of that frequency content. Furthermore, they are constrained in opposite directions, in the sense that if one is monotonically ascending, the other is monotonically descending, and vice versa. This prevents the software which implements the masking process from itself being used to attempt to reverse the masking of the speaker's voice, as already explained above with reference to steps 12a and 12b of figure 1.
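A monotonic frequency alteration can be sketched as a resampling of the magnitude spectrum along the frequency axis. The linear warp f → α·f used below is only an illustrative choice (the patent does not restrict the functions to linear ones); α > 1 gives an ascending alteration, α < 1 a descending one.

```python
import numpy as np

def warp_spectrum(mag, alpha):
    """Apply a monotonic frequency warp f -> alpha * f to a magnitude
    spectrum by resampling it along the frequency axis (bin indices).
    alpha > 1 raises the frequency content; alpha < 1 lowers it."""
    n = len(mag)
    f = np.arange(n)
    # Read the value at f / alpha: content originally at bin f0 lands at alpha * f0.
    return np.interp(f / alpha, f, mag, left=0.0, right=0.0)

mag = np.zeros(256)
mag[40] = 1.0                        # a single spectral peak at bin 40

up = warp_spectrum(mag, 1.5)         # ascending alteration (e.g. omega_A)
down = warp_spectrum(mag, 0.5)       # descending alteration (e.g. omega_B)

assert np.argmax(up) == 60           # 40 * 1.5 -> bin 60 (towards the treble)
assert np.argmax(down) == 20         # 40 * 0.5 -> bin 20 (towards the bass)
```

Because `np.interp` requires monotonically increasing sample points, the warp is monotonic by construction, matching the constraint stated above; applying one ascending and one descending warp to two different spectral components reproduces the opposite-direction behaviour of ω A and ω B.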
- The following two steps make it possible to preserve the temporality of the original by resynthesizing the masked audio signal for each index k.
- Step 67 includes the reconstruction of each modified audio segment, denoted X″ k ( t , f ), in the time-frequency domain, by recombining the new envelope A″ k ( t , f ) and the new fine structure B″ k ( t , f ) of the frequency spectrum of the audio segment considered.
- The term "new" used here in reference to the envelope and the fine structure means the envelope and the fine structure after masking, that is to say after application of the frequency alteration functions ω A ( f ) and ω B ( f ), which correspond to the alterations MOD A and MOD B respectively, and of the temporal variation functions ν A ( t ) and ν B ( t ), respectively.
- Step 68 includes the recomposition of each masked audio segment denoted S" k ( t , f ), in the time-frequency domain.
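Steps 67 and 68 amount to two pointwise products in the time-frequency domain. A minimal sketch with illustrative random arrays (the shapes and values are assumptions, not data from the patent):

```python
import numpy as np

rng = np.random.default_rng(1)
shape = (257, 40)                    # (frequency bins, time frames), illustrative

A2 = np.abs(rng.standard_normal(shape))              # masked envelope A''_k(t, f)
B2 = np.abs(rng.standard_normal(shape))              # masked fine structure B''_k(t, f)
Q2 = np.exp(1j * rng.uniform(-np.pi, np.pi, shape))  # corrected phase Q''_k(t, f)

# Step 67: reconstruct the modified modulus X''_k = A''_k * B''_k.
X2 = A2 * B2
# Step 68: recompose the masked segment S''_k = X''_k * Q''_k.
S2 = X2 * Q2

assert np.allclose(np.abs(S2), X2)   # the modulus is carried by X''_k
assert np.allclose(np.abs(Q2), 1.0)  # the phase term has unit modulus
```

This mirrors the modulus/phase decomposition of step 62 run in reverse: the recomposed S″ k ( t , f ) has X″ k as modulus and Q″ k as phase.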
- The corrected phase component Q″ k ( t , f ) of the masked audio segment S″ k ( t , f ) is obtained, in the example shown in figure 6, in step 66, from the phase term Q k ( t , f ) of the audio segment considered S k ( t , f ), which phase term was generated in step 62.
- The function of step 66 is to provide a correction of the phase term Q k ( t , f ) of the audio segment S k ( t , f ), as a function of the random variations ν B ( t ) and of the alteration function ω B ( f ) which were applied to the pitch term B k ( t , f ).
- phase correction is known per se and is generally implemented in any signal transformation processing as soon as the power spectral density of a signal is modified.
- It is generated in step 66 only as a function of the modifications made to the pitch component B″ k ( t , f ) of the power spectral density of the masked audio segment S″ k ( t , f ) with respect to the pitch component B k ( t , f ) of the power spectral density of the original audio segment S k ( t , f ).
- It is the modifications made to the height (pitch) that call for a phase adjustment of the frequency components of the spectrum.
- Step 66 could also take into account the modifications made to the timbre component A″ k ( t , f ) of the power spectral density of the masked audio segment S″ k ( t , f ) with respect to the timbre component A k ( t , f ) of the power spectral density of the original audio segment S k ( t , f ).
- This is not shown in the flowchart of figure 6, so as not to overload it and harm its readability, but the person skilled in the art will understand, on the basis of his usual knowledge and in view of the indications provided here, how this can be implemented in practice.
- Step 69 consists of generating the masked signal s″ k ( τ ) in the time domain from the signal S″ k ( t , f ) in the time-frequency domain. For example, this can be obtained by an OLA ("OverLap-and-Add") method applied to the successive inverse Fourier transforms of S″ k ( t , f ).
- The OLA method, also called the superposition-and-addition (overlap-add) method, is based on the linearity of convolution: its principle consists in decomposing the linear convolution product into a sum of linear convolution products.
- Other methods can be considered by those skilled in the art to carry out this inverse Fourier transform, in order to generate s″ k ( τ ) in the time domain from S″ k ( t , f ) in the time-frequency domain.
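The inverse-transform-plus-overlap-add resynthesis described above is what `scipy.signal.istft` performs: it applies an inverse Fourier transform frame by frame and recombines the frames by overlap-add. The round-trip sketch below uses an illustrative test tone and default window settings, standing in for the masked time-frequency signal.

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
x = np.sin(2 * np.pi * 330 * np.arange(fs) / fs)  # stand-in time-domain signal

# Forward STFT, then inverse STFT: istft applies the inverse Fourier
# transforms frame by frame and recombines them by overlap-add (OLA).
f, t, S = stft(x, fs=fs, nperseg=512)
_, x_rec = istft(S, fs=fs, nperseg=512)

# With a COLA-compliant window (the default Hann window at 50 % overlap),
# the round trip reconstructs the original signal.
assert np.allclose(x, x_rec[:len(x)], atol=1e-8)
```

In the masking method itself, the array fed to the inverse transform would be the masked segment S″ k ( t , f ) rather than an unmodified STFT, so exact reconstruction applies to the masked content, not the original voice.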
- the method which has been presented in the preceding description can be implemented by a computer program, for example as a plugin which can be integrated into audio-phonic or audio-visual processing software.
- The reference 60 collectively designates the parameters for masking the voice of a speaker, namely ν A min , ν A max , ν B min , ν B max , ρ, ω A and ω B , which can be adjusted by a user via a suitable man-machine interface of the device on which the speaker's voice masking software is executed.
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
FR2205507A FR3136581B1 (fr) | 2022-06-08 | 2022-06-08 | Masquage de la voix d’un locuteur |
Publications (1)
Publication Number | Publication Date |
---|---|
EP4290514A1 (de) | 2023-12-13 |
Family
ID=84053089
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP23176415.0A Pending EP4290514A1 (de) | 2022-06-08 | 2023-05-31 | Maskierung der sprechersprache |
Country Status (3)
Country | Link |
---|---|
US (1) | US20230410825A1 (de) |
EP (1) | EP4290514A1 (de) |
FR (1) | FR3136581B1 (de) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10141008B1 (en) * | 2016-01-19 | 2018-11-27 | Interviewing.io, Inc. | Real-time voice masking in a computer network |
Also Published As
Publication number | Publication date |
---|---|
FR3136581A1 (fr) | 2023-12-15 |
FR3136581B1 (fr) | 2024-05-31 |
US20230410825A1 (en) | 2023-12-21 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PUAI | Public reference made under article 153(3) EPC to a published international application that has entered the European phase | Free format text: ORIGINAL CODE: 0009012 |
| STAA | Information on the status of an EP patent application or granted EP patent | Free format text: STATUS: THE APPLICATION HAS BEEN PUBLISHED |
| AK | Designated contracting states | Kind code of ref document: A1; Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC ME MK MT NL NO PL PT RO RS SE SI SK SM TR |
| STAA | Information on the status of an EP patent application or granted EP patent | Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE |
| 17P | Request for examination filed | Effective date: 20240516 |
| RBV | Designated contracting states (corrected) | Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC ME MK MT NL NO PL PT RO RS SE SI SK SM TR |
| STAA | Information on the status of an EP patent application or granted EP patent | Free format text: STATUS: EXAMINATION IS IN PROGRESS |