WO2021152566A1 - System and method for shielding speaker voice print in audio signals - Google Patents
System and method for shielding speaker voice print in audio signals
- Publication number
- WO2021152566A1 (PCT/IB2021/050794)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- audio signal
- processor
- audio
- speech feature
- speech
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/003—Changing voice quality, e.g. pitch or formants
Definitions
- the present technology generally relates to the field of audio processing and, more particularly, to a system and method for shielding speaker voice print in audio signals.
- Audio signals are electronic representations of sound waves that are audible to humans.
- the sources of sound waves may include humans, animals, machines, and the like. Audio signals, especially those related to human voice, are processed to cater to a wide variety of applications. For example, audio signals are processed in telecommunication networks for facilitating communication between remote users. In another illustrative example, audio signals are processed for generating high fidelity musical reproductions.
- the processing of audio signals in some applications, may involve converting audio signals corresponding to human speech into a textual form.
- Such Speech-To-Text (STT) processing of audio signals involves generating textual transcripts from audio input. In some scenarios, the textual transcripts are used to train acoustic and language models for building Automatic Speech Recognition (ASR) engines.
- the ASR engines are used in a wide range of applications, such as for example, in smart electronic devices like smart phones, home assistants, and the like, for interpreting human voice commands and performing desired actions.
- the ASR engines are also deployed in customer service centers to enable automated agents, such as interactive voice response (IVR) systems and chat bots, to understand customer queries and to provide desired assistance to the customers.
- a large volume of recorded conversations is subjected to STT conversion.
- the audio signals corresponding to human speech are manually transcribed to generate textual transcripts.
- the textual transcripts are then used to train and test acoustic and language models.
- the manual transcription of recorded conversations presents a privacy issue, which is not completely addressed by conventional solutions. For example, an identity of a speaker may be recognized from a recorded conversation and sensitive information related to the individual (i.e. the speaker) may be compromised.
- sensitive information such as an individual’s personal details (for example, name, address, email, phone number, etc.) and financial information (such as credit card details or bank information) may be hidden or removed for protecting the identity of the person.
- sensitive information in a recorded conversation is concealed for protecting the identity of the person.
- concealing is ineffective as, even though humans may not be able to recognize the concealed information, audio processing tools can easily interpret the information.
- existing audio tools can also reverse the concealing of information to generate the original audio. As such, conventional solutions fail to protect the identity of the speaker and an individual’s privacy may be compromised.
- a computer-implemented method for shielding speaker voice prints in audio signals receives, by a processor, an audio signal corresponding to a voice input of a speaker.
- the audio signal includes a voice print of the speaker.
- the method generates, by the processor, a plurality of audio frames from the audio signal.
- the method extracts, by the processor, a first set of speech feature coefficients in relation to each audio frame from among the plurality of audio frames.
- the method generates, by the processor, a second set of speech feature coefficients by randomizing at least one speech feature coefficient from among the first set of speech feature coefficients extracted in relation to each audio frame. Randomizing the at least one speech feature coefficient is configured to irreversibly randomize the voice print of the speaker.
- the method generates, by the processor, a set of extracted speech features based on the second set of speech feature coefficients in relation to each audio frame.
- the method generates, by the processor, a modified audio signal based on the set of extracted speech features.
- the modified audio signal is configured to serve as a voice print randomized representation of the audio signal.
- a system for shielding speaker voice prints in audio signals includes a processor and a memory.
- the memory stores machine executable instructions, that when executed by the processor, cause the system to receive an audio signal corresponding to a voice input of a speaker.
- the audio signal includes a voice print of the speaker.
- the system generates a plurality of audio frames from the audio signal.
- the system extracts a first set of speech feature coefficients in relation to each audio frame from among the plurality of audio frames.
- the system generates a second set of speech feature coefficients by randomizing at least one speech feature coefficient from among the first set of speech feature coefficients extracted in relation to each audio frame. Randomizing the at least one speech feature coefficient is configured to irreversibly randomize the voice print of the speaker.
- the system generates a set of extracted speech features based on the second set of speech feature coefficients in relation to each audio frame.
- the system generates a modified audio signal based on the set of extracted speech features.
- the modified audio signal is configured to serve as a voice print randomized representation of the audio signal.
- a computer-implemented method for shielding speaker voice prints in audio signals receives, by a processor, an audio signal corresponding to a voice input of a customer of an enterprise.
- the voice input is provided by the customer during a voice interaction with an agent of the enterprise.
- the audio signal includes a voice print of the customer.
- the method generates, by the processor, a plurality of audio frames from the audio signal.
- the method extracts, by the processor, a first set of speech feature coefficients in relation to each audio frame from among the plurality of audio frames.
- the method generates, by the processor, a second set of speech feature coefficients by randomizing at least one speech feature coefficient from among the first set of speech feature coefficients extracted in relation to each audio frame.
- Randomizing the at least one speech feature coefficient is configured to irreversibly randomize the voice print of the customer.
- the method generates, by the processor, a set of extracted speech features based on the second set of speech feature coefficients in relation to each audio frame.
- the method generates, by the processor, a modified audio signal based on the set of extracted speech features.
- the modified audio signal is configured to serve as a voice print randomized representation of the audio signal.
- the method generates, by the processor, a textual transcript based on the modified audio signal.
- the textual transcript is used, at least in part, to train a machine learning model for building an Automatic Speech Recognition (ASR) engine.
- FIG. 1 shows an example representation of a user engaged in a voice interaction with an agent of an enterprise, in accordance with an embodiment of the invention.
- FIG. 2 is a block diagram of a system for shielding speaker voice print in audio signals, in accordance with an embodiment of the invention.
- FIG. 3 is a block diagram of a pre-processing module for illustrating pre-processing of an audio signal, in accordance with an embodiment of the invention.
- FIG. 4 is a block diagram for illustrating generation of a set of extracted speech features, in accordance with an embodiment of the invention.
- FIG. 5 is a block diagram of an audio regeneration module for illustrating processing of a set of extracted speech features to generate a modified audio signal, in accordance with an embodiment of the invention.
- FIG. 6 shows a representation for illustrating an example use of a modified audio signal, in accordance with an embodiment of the invention.
- FIG. 7 shows a representation for illustrating another example use of a modified audio signal in building an Automatic Speech Recognition (ASR) engine, in accordance with an embodiment of the invention.
- FIG. 8 shows a flow diagram of a method for shielding speaker voice print in audio signals, in accordance with an embodiment of the invention.
- FIG. 9 shows a flow diagram of a method for shielding speaker voice print in audio signals, in accordance with another embodiment of the invention.
- an audio signal which includes the voice print of the speaker, is pre-processed to generate audio frames.
- the pre-processing of the audio signal includes digitizing the audio signal and subjecting the audio signal to pre-emphasis to generate a pre-emphasized audio signal.
- the pre-emphasized audio signal is segmented using a window function to generate the plurality of audio frames.
- a first set of speech feature coefficients is extracted from each audio frame.
- the extraction may involve applying a Discrete Fourier Transform (DFT) to each audio frame to generate a spectral representation.
- the spectral representation is filtered using a bank of Mel filters to generate a Mel-scale power spectrum.
- a log function is applied to the Mel-scale power spectrum to determine log energies in each audio frame.
- a Discrete Cosine Transform (DCT) is applied to the log energies in each audio frame to generate Mel-spectrum coefficients.
- the Mel-spectrum coefficients extracted in relation to each audio frame are referred to as first set of speech feature coefficients.
- a second set of speech feature coefficients is generated by randomizing at least one speech feature coefficient from among the first set of speech feature coefficients extracted in relation to each audio frame. Randomizing the at least one speech feature coefficient is configured to irreversibly randomize the voice print of the speaker. Though the voice print of the speaker is randomized, there is no associated loss of data or information included therein. As such, only the information capable of revealing the identity of the speaker is irreversibly concealed.
- At least one of a first order derivative and a second order derivative of the second set of speech feature coefficients are determined.
- Vector representations of the second set of speech feature coefficients, the first order derivative and the second order derivative of the second set of speech feature coefficients of the plurality of frames configure a set of extracted speech features.
- a modified audio signal in the time domain is then generated based on the set of extracted features.
- an inverse discrete cosine transform is performed on the set of extracted speech features for determining a power spectrogram, and the power spectrogram is converted to a linear frequency power spectrogram using a Non-Negative Least Squares (NNLS) solver.
- the linear frequency power spectrogram is transformed to the time domain for constructing the modified audio signal based, at least in part, on the inverse short-time Fourier transform.
- a phase of the modified audio signal is calculated from the linear frequency power spectrogram using phase reconstruction.
- the modified audio signal is configured to serve as a voice print randomized representation of the audio signal.
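- As a hedged illustration of the pipeline summarized above, the sketch below chains MFCC extraction, keyless randomization of selected coefficients, and audio regeneration using librosa's analysis/synthesis helpers. The helper librosa.feature.inverse.mfcc_to_audio, the 13 coefficients, the perturbed coefficient indices and the offset range are assumptions made for the example, not values fixed by this description.

```python
# Minimal end-to-end sketch (assumptions: librosa available, 16 kHz audio,
# 13 MFCCs, coefficients 1 and 3 perturbed with a keyless random offset).
import numpy as np
import librosa

def shield_voice_print(path, sr=16000, n_mfcc=13):
    y, _ = librosa.load(path, sr=sr)                         # digitized audio signal
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # first set of coefficients

    # Randomize the voice print: add a pseudo-random offset (no key is kept,
    # so the change is irreversible) to selected coefficients of every frame.
    rng = np.random.default_rng()
    for idx in (1, 3):
        mfcc[idx] += rng.uniform(0.1, 0.5, size=mfcc.shape[1])

    # Regenerate a time-domain, voice-print-randomized ("modified") audio signal
    # via inverse DCT, NNLS Mel inversion and Griffin-Lim phase reconstruction.
    return librosa.feature.inverse.mfcc_to_audio(mfcc, sr=sr)
```

- A call such as shield_voice_print('call_recording.wav') (a hypothetical file name) would yield audio that remains intelligible to a trained human listener while carrying an irreversibly randomized voice print.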
- the speaker corresponds to a customer of an enterprise and the voice input corresponds to utterances provided by the customer during a voice interaction with an agent of the enterprise.
- the modified audio signal corresponding to a customer’s voice interaction with the enterprise agent, is provided to a manual transcriber for generating a textual transcript of the voice interaction.
- the textual transcript is provided as part of a plurality of textual transcripts for training a machine learning model.
- the machine learning model is selected to be one of an acoustic model and language model, which in at least some embodiments, is used in training to build, at least in part, an Automatic Speech Recognition (ASR) engine.
- the representation depicts a user 102 engaged in a voice interaction with a customer support representative 104.
- the customer support representative 104 may be employed with a customer support center (not shown in ) associated with an enterprise selling products, services and/or information to customers, such as the user 102.
- the customer support representative 104 is hereinafter referred to as an agent 104.
- the customer support center may include several human voice agents such as the agent 104.
- the customer support center may also include a plurality of human chat agents, a plurality of automated voice agents (for example, Interactive Voice Response or IVR systems) and a plurality of automated chat agents (for example, chatbots).
- several customers such as the user 102 may call the customer support center to seek assistance from the customer support personnel deployed at the customer support center.
- the voice interaction between the user 102 and the agent 104 may be facilitated over a communication network 106.
- the communication network 106 may be embodied as a wired network, a wireless network or a combination of wired and wireless networks.
- Examples of a wired network may include, but is not limited to, an Ethernet, a Local Area Network (LAN), and the like.
- Examples of a wireless network may include a cellular network, a wireless LAN, and the like.
- An example of a combination of wired and wireless networks may include, but is not limited to, the Internet.
- the user 102 may initiate an interaction with the agent 104 to seek assistance from the agent 104.
- the agent 104 may seek the user’s permission for recording the conversation and using the recorded conversation for training and testing purposes.
- the voice interaction between the user 102 and the agent 104 may be recorded and stored in a database 108. It is understood that a plurality of such recorded conversations may be stored in the database 108.
- the audio signals corresponding to human speech in the recorded conversations of the database 108 are manually transcribed to generate textual transcripts.
- the textual transcripts are then used to train and test acoustic and language models.
- the manual transcription of recorded conversations presents a privacy issue as an identity of a speaker, such as the user 102 in this case, may be recognized from the recorded conversation and sensitive information related to the user 102 may be compromised.
- sensitive information such as an individual’s personal details (for example, name, address, email, phone number, etc.) and financial information (such as credit card details or bank information) are hidden or removed for protecting the identity of the person.
- the removal of information may result in a loss of audio and the generated transcript from a clipped audio may not accurately represent the original conversation.
- sensitive information in a recorded conversation is concealed for protecting the identity of the person.
- concealing is ineffective as, even though humans may not be able to recognize the concealed information, audio processing tools can easily interpret the information.
- existing audio tools can also reverse the concealing of information to generate the original audio. As such, conventional solutions fail to protect the identity of the speaker and an individual’s privacy may be compromised.
- to address these and other drawbacks of conventional solutions, various embodiments of the present invention provide a system, such as the system 150.
- the system 150 is configured to shield speaker voice print in audio signals.
- the system 150 is explained in further detail with reference to FIG. 2.
- the term ‘audio signal’ as used herein refers to an electronic representation of sound waves corresponding to human speech.
- for example, an electronic representation (e.g. an analog representation) of a voice input of the user 102 may configure an audio signal corresponding to the speech input provided by the user 102.
- more generally, an electronic representation of any audio content may configure audio signals for the purposes of this description.
- voice print refers to measurable characteristics (such as a biomarker) in the human voice that are unique to the speaker and may help in identification of the speaker.
- shielding speaker voice print refers to concealing the speaker’s voice print in a manner that makes it almost impossible to identify the speaker (i.e. the user 102) from the audio signals, in effect, making the identity of the speaker anonymous.
- the system 150 may be implemented in a server accessible over a communication network, such as the communication network 106 shown in .
- the server may be communicably coupled over the Internet with other remote entities connected to the communication network, such as for example, the database 108 (shown in ), electronic devices of agents deployed at the customer support center, user devices, and the like.
- the system 150 includes at least one processor, such as a processor 152, and a memory 154. It is noted that although the system 150 is depicted to include only one processor, the system 150 may include a greater number of processors.
- the memory 154 is capable of storing machine executable instructions, referred to herein as platform instructions 155.
- the processor 152 is capable of executing the platform instructions 155.
- the processor 152 may be embodied as a multi-core processor, a single core processor, or a combination of one or more multi-core processors and one or more single core processors.
- the processor 152 may be embodied as one or more of various processing devices, such as a coprocessor, a microprocessor, a controller, a digital signal processor (DSP), a processing circuitry with or without an accompanying DSP, or various other processing devices including integrated circuits such as, for example, an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a microcontroller unit (MCU), a hardware accelerator, a special-purpose computer chip, or the like.
- the processor 152 may be configured to execute hard-coded functionality.
- the processor 152 is embodied as an executor of software instructions, wherein the instructions may specifically configure the processor 152 to perform the algorithms and/or operations described herein when the instructions are executed.
- the processor 152 includes a pre-processing module 156, a speech feature extraction module 158, a voice print randomization module 160 and an audio regeneration module 162.
- the modules of the processor 152 may be implemented as software modules, hardware modules, firmware modules or as a combination thereof.
- the memory 154 may be embodied as one or more volatile memory devices, one or more non-volatile memory devices, and/or a combination of one or more volatile memory devices and non-volatile memory devices.
- the memory 154 may be embodied as semiconductor memories (such as mask ROM, PROM (programmable ROM), EPROM (erasable PROM), flash memory, RAM (random access memory), etc.), magnetic storage devices (such as hard disk drives, floppy disks, magnetic tapes, etc.), optical magnetic storage devices (e.g., magneto-optical disks), CD-ROM (compact disc read only memory), CD-R (compact disc recordable), CD-R/W (compact disc rewritable), DVD (Digital Versatile Disc) and BD (BLU-RAY® Disc).
- the memory 154 stores instructions for generating audio frames from audio signals, for generating the first set of speech feature coefficients, for randomizing one or more feature coefficients from among the first set of speech feature coefficients to generate the second set of speech feature coefficients, and for generating modified audio signals from the extracted set of speech features.
- the instructions stored in the memory 154 are used by the modules of the processor 152 to shield speaker voice print in audio signals as will be explained in further detail later.
- the system 150 also includes an input/output module 164 (hereinafter referred to as an ‘I/O module 164’) and at least one communication module such as a communication module 166.
- the I/O module 164 may include mechanisms configured to receive inputs from and provide outputs to the user of the system 150. To that effect, the I/O module 164 may include at least one input interface and/or at least one output interface. Examples of the input interface may include, but are not limited to, a keyboard, a mouse, a joystick, a keypad, a touch screen, soft keys, a microphone, and the like.
- Examples of the output interface may include, but are not limited to, a display such as a light emitting diode display, a thin-film transistor (TFT) display, a liquid crystal display, an active-matrix organic light-emitting diode (AMOLED) display, a microphone, a speaker, a ringer, a vibrator, and the like.
- the processor 152 may include I/O circuitry configured to control at least some functions of one or more elements of the I/O module 164, such as, for example, a speaker, a microphone, a display, and/or the like.
- the processor 152 and/or the I/O circuitry may be configured to control one or more functions of the one or more elements of the I/O module 164 through computer program instructions, for example, software and/or firmware, stored on a memory, for example, the memory 154, and/or the like, accessible to the processor 152.
- the communication module 166 may include communication circuitry such as for example, a transceiver circuitry including antenna and other communication media interfaces to connect to a wired and/or wireless communication network.
- the communication circuitry may, in at least some example embodiments, enable reception/transmission of signals (such as audio signals) from remote network entities, such as the database 108 (shown in ) or a server at a customer support center configured to maintain real-time information related to interactions between customers and agents.
- the communication module 166 is configured to receive audio signals corresponding to recorded conversations.
- the communication module 166 may receive audio signals corresponding to stored conversations, i.e. conversations conducted between customers and agents of the customer support center.
- the communication module 166 may be configured to forward the audio signals to the processor 152.
- the modules of the processor 152 in conjunction with the instructions stored in the memory 154 may be configured to process the audio signals and generate modified audio signals, i.e. audio signals with shielded speaker voice prints.
- various components of the system 150 are configured to communicate with each other via or through a centralized circuit system 168.
- the centralized circuit system 168 may be various devices configured to, among other things, provide or enable communication between the components of the system 150.
- the centralized circuit system 168 may be a central printed circuit board (PCB) such as a motherboard, a main board, a system board, or a logic board, and may also include other printed circuit assemblies (PCAs).
- for ease of description, the processing is explained herein with reference to a single audio signal, i.e. an electronic representation of a single voice input or a single utterance of a user. It is understood that other voice inputs of the user within a single conversation may similarly be processed to generate respective modified audio signals. Furthermore, audio signals corresponding to several conversations may be processed similarly to generate a plurality of modified audio signals to facilitate training of acoustic and language models, as will be explained in further detail later.
- the pre-processing module 156 in conjunction with the instructions in the memory 154 is configured to perform at least one pre-processing operation on an audio signal received from the database 108 for shielding speaker voice print in the audio signal.
- the audio signal is suitably adjusted or modified prior to actual processing of the audio signal. Pre-processing of the audio signal is explained next with reference to .
- FIG. 3 is a block diagram 300 of the pre-processing module 156 for illustrating pre-processing of an audio signal 250, in accordance with an embodiment of the invention.
- an audio signal may be received by the system 150 from the database 108 for processing of the audio signal, such that the voice print of the speaker included in the audio signal is removed or shielded.
- the communication module 166 of the system 150 is configured to receive the audio signal and forward the audio signal to the processor 152 (shown in ).
- the pre-processing module 156 of the processor 152 may receive the audio signal, such as the audio signal 250 and initiate pre-processing of the audio signal 250. The processing steps executed by the pre-processing module 156 are explained hereinafter.
- the audio signal 250 is received and converted into a digital form. More specifically, an analog-to-digital (A/D) converter may be used to convert the analog form of the audio signal 250 into a digital form (i.e., a digital audio signal). In an illustrative example, an A/D converter may be used to sample the audio signal 250 at a frequency of 8 kHz or 16 kHz to generate the digital form of the audio signal 250, also referred to herein as a ‘digital audio signal’.
- the digital audio signal is subjected to a next processing stage, referred to herein as ‘pre-emphasis’, wherein energies in the high-frequencies of the digital audio signal are amplified.
- Pre-emphasis aims to compensate for attenuation of high frequencies in the sampling process (i.e., the digitization process). More specifically, during pre-emphasis the high frequency components of the digital audio signal are emphasized and low frequency components are attenuated.
- the digital audio signal is passed through a high pass filter, which is usually a first-order Finite Impulse Response (FIR) filter to generate a pre-emphasized audio signal.
- FIR Finite Impulse Response
- the pre-emphasized audio signal is sliced into discrete time segments.
- the digital audio signal representing the digital form of the audio signal is segmented or blocked to configure a plurality of audio frames of typically 20-30 msec timeframe.
- since the speech signal varies slowly over time (i.e. it is quasi-stationary), short-time spectral analysis may be performed to capture a concise and exact speech feature representation in smaller audio frames. Segmentation is performed such that adjacent audio frames normally overlap each other (e.g., 30-50% overlap). The overlapping of adjacent audio frames ensures that no vital information of the original audio signal is lost due to windowing, which is explained later.
- a window function w(n) is applied on each audio frame to generate windowed audio frames 310 (e.g., A1, A2, ..., An).
- the window function represents a time window of a specific shape and is applied on an audio frame to stress pre-defined characteristics of the original audio signal in the audio frame.
- the discontinuity of speech signal at the beginning and end of each audio frame is tapered to zero or close to zero to minimize signal discontinuities between adjacent frames.
- a Hamming window is used to prevent edge effects (i.e., signal discontinuities) during windowing.
- the A/D conversion at 302, the pre-emphasis at 304, the frame blocking at 306 and the windowing at 308 performed by the pre-processing module 156 are configured to generate a plurality of audio frames 310.
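- For illustration only, the pre-processing stages described above could be sketched as follows; the 0.97 pre-emphasis factor, the 25 ms frame length and the 40% overlap are common illustrative choices rather than values mandated by this description.

```python
# Sketch of pre-emphasis, frame blocking and Hamming windowing (assumes the
# A/D stage already produced 'signal' as a 1-D NumPy array sampled at 16 kHz).
import numpy as np

def preprocess(signal, sr=16000, frame_ms=25, hop_ms=15, alpha=0.97):
    # Pre-emphasis: first-order FIR high-pass filter boosting high frequencies.
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    # Frame blocking: 20-30 ms frames, adjacent frames overlapping by ~30-50%.
    frame_len = int(sr * frame_ms / 1000)
    hop_len = int(sr * hop_ms / 1000)
    n_frames = 1 + (len(emphasized) - frame_len) // hop_len
    frames = np.stack([emphasized[i * hop_len:i * hop_len + frame_len]
                       for i in range(n_frames)])

    # Windowing: a Hamming window tapers frame edges to limit discontinuities.
    return frames * np.hamming(frame_len)
```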
- the audio frames 310 are provided to the speech feature extraction module 158.
- the processing of the audio frames 310 by the speech feature extraction module 158 is explained next with reference to .
- FIG. 4 is a block diagram 400 for illustrating generation of the set of extracted speech features 416, in accordance with an embodiment of the invention.
- the audio signal 250 is pre-processed to generate the audio frames 310.
- the audio frames 310 are provided to the speech feature extraction module 158.
- the speech feature extraction module 158 in conjunction with the instructions in the memory 154 (shown in ) is configured to generate a set of extracted speech features from the audio frames 310 generated in relation to the audio signal 250.
- the audio frames 310 may include a plurality of windowed audio frames A1, A2, ..., An.
- a Discrete Fourier Transform (DFT) is applied to an audio frame A1 to generate a spectral representation. While the resulting spectrum of the DFT contains information at each frequency, human hearing is less sensitive at frequencies above 1000 Hz. This concept also has a direct effect on the performance of speech recognition systems; therefore, the spectrum is warped using a logarithmic Mel scale.
- a bank of Mel filters known as triangular filters are constructed with filters distributed equally below 1000 Hz and spaced logarithmically above 1000 Hz.
- the magnitude of the DFT output is squared to obtain a DFT power spectrum.
- the DFT power spectrum corresponds to the power of the speech at each frequency.
- the triangular Mel-scale (or similar) filter banks are applied to transform the DFT power spectrum to Mel-scale power spectrum.
- the output for each Mel-scale power spectrum bin denotes the energy from the range of frequency bands that particular bin covers.
- the Mel-scale power spectrum is associated with 26 values (one per Mel filter) as a final output.
- a log function is applied to the Mel-scale power spectrum to determine log energies in each audio frame.
- a logarithm of the 26 values is taken to generate Mel spectrum coefficients, resulting in 26 log filter bank energies.
- a Discrete Cosine Transform (DCT) is then applied to the 26 log filter bank energies to generate the Mel-spectrum coefficients, also referred to as Mel-Frequency Cepstral Coefficients (MFCC), which are widely used as speech features in Automatic Speech Recognition (ASR) systems.
- the extraction of the first set of speech feature coefficients 408, such as c1-1, c1-2, c1-3, ..., c1-12, from an audio frame A1 is explained herein using MFCC as it closely represents the human auditory system.
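- A hedged per-frame sketch of these MFCC steps (DFT, power spectrum, triangular Mel filter bank, log energies, DCT) is given below; the 512-point FFT and the use of librosa.filters.mel to build the 26 triangular filters are assumptions made for the example.

```python
# Sketch of the MFCC steps for one windowed frame at 16 kHz.
import numpy as np
import librosa
from scipy.fftpack import dct

def mfcc_from_frame(frame, sr=16000, n_fft=512, n_mels=26, n_mfcc=12):
    # DFT of the windowed frame, then squared magnitude = power spectrum.
    spectrum = np.fft.rfft(frame, n=n_fft)
    power = np.abs(spectrum) ** 2

    # Triangular Mel filter bank: roughly linear below 1000 Hz, logarithmic above.
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    mel_power = mel_fb @ power                       # 26 Mel-band energies

    # Log energies, then DCT; keep the first 12 cepstral coefficients.
    log_mel = np.log(mel_power + 1e-10)
    return dct(log_mel, type=2, norm='ortho')[:n_mfcc]
```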
- in other embodiments, the feature representation of the audio signal (i.e., the first set of speech feature coefficients) may be extracted using other techniques, such as the Discrete Wavelet Transform (DWT), Linear Prediction Cepstral Coefficients (LPCC), Line Spectral Frequencies (LSF), Perceptual Linear Prediction (PLP), and the like.
- the first set of speech feature coefficients 408 is then provided by the speech feature extraction module 158 to the voice print randomization module 160.
- the voice print randomization module 160 is configured to add a random value ‘x’ (e.g., 0.21) without any public or private key to at least one feature coefficient from among the first set of speech feature coefficients 408 to generate a second set of speech feature coefficients 410.
- in an illustrative example, 12 MFCC coefficients, such as c1-1, c1-2, c1-3, ..., c1-12, configure the first set of speech feature coefficients 408 for the audio frame A1, and the generated second set of speech feature coefficients 410 for the audio frame A1 may be represented as c1-1+x, c1-2, c1-3+x, ..., c1-12.
- a pseudo random number generator is configured to generate the random value ‘x’.
- the choice of the random value may be configured to ensure that the voice print cannot be reproduced through any technique. This irreversible change randomizes the voice print, so all the biometric markers of an individual are lost. Moreover, though the speaker’s voice print is randomized, the audio data is not distorted to levels that are not understandable by a trained human listener.
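- A minimal sketch of this randomization step, assuming the first set of coefficients is held as an (n_frames x n_mfcc) array, is shown below; which coefficients receive the offset and the offset range are illustrative assumptions.

```python
# Sketch of the randomization step: a keyless pseudo-random offset is added to
# selected MFCCs of every frame. The perturbed indices (0 and 2, i.e. c1-1 and
# c1-3 in the example above) and the offset range are illustrative assumptions.
import numpy as np

rng = np.random.default_rng()              # pseudo-random number generator

def randomize_coefficients(first_set, coeff_indices=(0, 2)):
    """first_set: (n_frames, n_mfcc) array -> second set of speech feature coefficients."""
    second_set = first_set.copy()
    for i in range(second_set.shape[0]):
        x = rng.uniform(0.1, 0.5)          # e.g. x = 0.21; no key is stored, so irreversible
        second_set[i, list(coeff_indices)] += x
    return second_set
```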
- the second set of speech feature coefficients 410 with the randomized voice print is returned to the speech feature extraction module 158.
- the second set of speech feature coefficients 410 is subjected to a processing stage referred to herein as ‘derivatives’, wherein to capture the changes in speech between subsequent audio frames, the first-order and second-order derivative of the second set of speech feature coefficients 410 of each audio frame are calculated and used for generating 36-39 MFCCs (i.e., speech feature coefficients of a speech feature for each audio frame).
- the first-order derivative of speech features measures the changes in speech feature coefficients from the previous frame to the next frame (e.g., Audio frames A1 and A2).
- the second-order derivative of the speech features captures the dynamic changes of the first-order derivatives from the last frame to the next frame.
- the randomized MFCCs corresponding to the audio frames are transformed to configure vector representations, which represent the set of extracted speech features 416. More specifically, the second set of speech feature coefficients 410 along with the first-order derivative and the second-order derivative of the second set of speech feature coefficients of the audio frames A1, A2, ..., An together configure the set of extracted speech features.
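- The 'derivatives' stage and the stacking into the set of extracted speech features can be sketched as follows; the use of librosa.feature.delta for the derivative computation is an assumption, and 12 base coefficients plus their first- and second-order deltas give 36 values per frame, consistent with the 36-39 range mentioned above.

```python
# Sketch of the derivatives stage: first- and second-order deltas of the
# randomized coefficients are stacked with them (12 + 12 + 12 = 36 values per
# frame). Assumes librosa and an (n_frames x n_mfcc) input with at least ~9
# frames (librosa's default delta window width).
import numpy as np
import librosa

def extracted_speech_features(second_set):
    coeffs = second_set.T                            # librosa expects (n_mfcc, n_frames)
    delta1 = librosa.feature.delta(coeffs, order=1)  # change from frame to frame
    delta2 = librosa.feature.delta(coeffs, order=2)  # change of the first-order deltas
    return np.vstack([coeffs, delta1, delta2]).T     # (n_frames, 3 * n_mfcc)
```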
- the set of extracted speech features 416 represents a distorted form of the audio signal due to the introduction of the random value. However, it must be noted that even though the set of extracted speech features 416 is randomized and carries no biometric marker of a person, the audio information is retained and can be discerned by a trained human listener.
- the speech feature extraction process transforms the processed speech waveform (i.e., audio signal) into a concise logical representation where it extracts the most relevant and important portions of the speech with high reliability.
- the set of extracted speech features 416 is provided to the audio regeneration module 162 (shown in ).
- the audio regeneration module 162 is configured to generate a modified audio signal based on the set of extracted speech features 416 (shown in ).
- the modified audio signal serves as a voice print randomized representation of the audio signal 250.
- the audio regeneration module 162 is configured to use a time-series of the MFCCs audio signal to generate the modified audio signal. The generation of the modified audio signal is explained in detail with reference to .
- FIG. 5 is a block diagram 500 of the audio regeneration module 162 for illustrating processing of the set of extracted speech features 416 to generate a modified audio signal 510, in accordance with an embodiment of the invention.
- the set of extracted speech features 416 is processed to generate an approximate spectrogram. Thereafter, the modified audio signal 510 is recovered from the spectrogram using phase reconstruction techniques.
- the processing of the set of extracted speech features 416 by the audio regeneration module 162 is explained hereinafter. The processing starts at 502.
- an Inverse Discrete Cosine Transform and decibel-scaling of the set of extracted speech features 416 are performed to obtain an approximate Mel power spectrogram.
- a Non-Negative Least Squares (NNLS) solver is used to convert the Mel power spectrogram into a linear frequency power spectrogram.
- phase reconstruction techniques are used for estimating phase of the modified audio signal 510.
- an Inverse Short-Time Fourier transform is applied for constructing the modified audio signal 510 based on the linear frequency power spectrogram and the estimated phase.
- the Griffin-Lim algorithm is used to reconstruct the modified audio signal 510 from a spectrogram. This iterative algorithm attempts to find the signal whose Short-Time Fourier Transform has a magnitude part as close as possible to the linear frequency power spectrogram. In one embodiment, the Griffin-Lim algorithm is used to estimate the phase, and the resulting complex spectrogram is transformed to the time domain using the inverse Short-Time Fourier Transform (STFT).
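- The regeneration steps above map closely onto librosa's inverse-feature helpers; the sketch below shows that mapping. Using librosa here, and the specific parameter values, are assumptions made for illustration; the description itself does not prescribe a particular library.

```python
# One possible realization of the regeneration steps (an assumption; any
# equivalent IDCT + NNLS + Griffin-Lim implementation would do). Input:
# randomized MFCCs of shape (n_mfcc, n_frames) in librosa's convention.
import librosa

def regenerate_audio(randomized_mfcc, sr=16000, n_fft=512, n_mels=26):
    # Inverse DCT and dB-to-power scaling -> approximate Mel power spectrogram.
    mel_power = librosa.feature.inverse.mfcc_to_mel(randomized_mfcc, n_mels=n_mels)

    # Non-negative least squares inversion of the Mel filter bank
    # -> linear-frequency (STFT magnitude) spectrogram.
    linear_spec = librosa.feature.inverse.mel_to_stft(mel_power, sr=sr, n_fft=n_fft)

    # Griffin-Lim iteratively estimates the phase and applies the inverse STFT
    # to construct the time-domain modified audio signal.
    return librosa.griffinlim(linear_spec)
```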
- the modified audio signal 510 is voice print randomized, making the voice biometric of speakers indiscernible. Moreover, the modified audio signal 510 is at human recognizable intelligibility but cannot be processed by generic speech recognition tools. Such a modified audio signal protects the identity of the speaker, while precluding loss of audio data. Further, as the shielding of the voice print is irreversible by humans or machines, the concern related to the privacy of the speaker is addressed. Such generation of audio signals with shielded speaker voice print may be utilized in several ways, as will be explained hereinafter.
- FIG. 6 shows a representation 600 for illustrating an example use of a modified audio signal, in accordance with an embodiment of the invention.
- the modified audio signal serves as a voice print randomized representation of the audio signal.
- the representation 600 depicts an audio signal 602 being provided as an input to the system 150 explained with reference to FIGS. 2-5.
- the system 150 is configured to shield speaker voice print in the audio signal 602 as explained with reference to FIGS. 2-5 and generate a modified audio signal 604.
- the modified audio signal 604 is provided to a human agent 606 for manual transcription and review at 608.
- the human agent 606 is configured to generate a textual transcript 610 corresponding to a conversation (such as a voice interaction between the user 102 and the agent 104 shown in ) based on one or more modified audio signals, such as the modified audio signal 604, associated with the conversation.
- automated systems such as, the system 150 may be configured to generate a textual transcript based on modified audio signals (e.g., modified audio signal 604). More specifically, the automated systems may use speech recognition software for generating textual transcripts based on the audio signals.
- the textual transcript 610 is used along with a plurality of other similarly generated transcripts as an input for building a model 612, such as a model for training an Automatic Speech Recognition (ASR) engine.
- as the modified audio signal 604 shields the speaker voice print, the generated textual transcript 610, which is used for model building, also shields the identity of the speaker.
- the present invention offers anonymity to speakers in the data preparation step itself.
- Modified audio signals, in addition to being used in the data preparation phase, may also be used in the building and deployment stages of ASR engines, as will be explained next with reference to .
- FIG. 7 shows a representation 700 for illustrating another example use of a modified audio signal 702 in building an ASR engine, in accordance with an embodiment of the invention.
- the generation of the modified audio signal 702 is explained with reference to FIGS. 2 to 5.
- features are extracted from a plurality of modified audio signals, such as the modified audio signal 702.
- the generation of the set of extracted speech features may be performed as explained with reference to , while precluding the operation of adding the random value to at least one feature coefficient of the first set of speech feature coefficients 408 (performed by the voice print randomization module 160) to randomize the coefficients and thereby shield the speaker’s voice print.
- the operation performed by the voice print randomization module 160 may be skipped while executing the processing steps (explained with reference to ) to extract the speech features corresponding to the respective modified audio signal.
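- For illustration, the extraction reused at this stage could look like the sketch below, a variant of the earlier hypothetical helpers with the randomization step simply omitted; the function name and parameter values are assumptions.

```python
# Feature extraction for ASR training/inference on already-modified audio:
# identical to the earlier extraction sketch, minus the randomization step.
import numpy as np
import librosa

def features_for_asr(modified_audio, sr=16000, n_mfcc=13):
    mfcc = librosa.feature.mfcc(y=modified_audio, sr=sr, n_mfcc=n_mfcc)
    delta1 = librosa.feature.delta(mfcc, order=1)
    delta2 = librosa.feature.delta(mfcc, order=2)
    # No random offset is added here; the voice print was already randomized.
    return np.vstack([mfcc, delta1, delta2])         # (3 * n_mfcc, n_frames)
```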
- the speech features extracted from modified audio signals are depicted using a block 704 in the representation 700.
- the extracted speech features are used to train an acoustic model 706 and a language model 708.
- the training flow is shown using a block 710.
- the acoustic model 706 is used to represent the relationship between the audio data and the linguistic units such as words, phones, subparts of phones, etc.
- the likelihood of the observed spectral feature vectors is computed in given linguistic units.
- the different acoustic modeling approaches may use a combination of AI algorithms such as a Gaussian Mixture Model – Hidden Markov Model (GMM-HMM) system, a Subspace GMM-HMM (SGMM-HMM) system, a Deep Neural Network HMM (DNN-HMM) system, a Convolutional Neural Network (CNN), Sequence-to-Sequence (Seq2Seq) models, and the like.
- the language model 708 is used to derive the best sentence hypothesis over a sequence of words and provide context to distinguish words and phrases that sound similar.
- the language model 708 essentially models the transition between words by estimating the prior probabilities.
- the data obtained from the acoustic model 706 is matched with language model 708 in order to match the sounds with word sequences.
- the language model 708 can either use a statistical model where statistical techniques such as N-grams, HMM, etc., are used to learn the probability distribution of words or they can use AI algorithms such as Long Short Term Memory (LSTM), Recurrent Neural Network (RNN), etc.
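- As a toy, hedged illustration of the statistical (N-gram) option mentioned above, the snippet below estimates bigram transition probabilities from two made-up transcript fragments; it is not part of the described system, only a sketch of how such prior probabilities are obtained.

```python
# Toy bigram language model: P(word | previous word) from transcript counts.
from collections import Counter

transcripts = [["i", "need", "help"], ["i", "need", "my", "balance"]]

bigrams = Counter((a, b) for s in transcripts for a, b in zip(s, s[1:]))
unigrams = Counter(w for s in transcripts for w in s[:-1])

def bigram_prob(prev, word):
    # P(word | prev) = count(prev, word) / count(prev)
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

print(bigram_prob("i", "need"))   # 1.0 in this toy corpus
```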
- the acoustic model 706 and the language model 708 configure the main components of an ASR engine 750.
- the ASR engine 750 may be deployed in either cloud or locally to transcribe the modified audio signal 702.
- speech features 704 extracted from the modified audio signal 702 are fed to the acoustic model 706 first to find the matching text at phonetic/character/word level at 714.
- the output of the acoustic model 706 is fed to the language model 708, which performs matching at the symbol level at 716, to predict corresponding word/phrase and the sentence 718.
- the speech feature extraction and acoustic modeling are performed on the modified audio signals. This is achieved without compromising on the accuracy of the ASR engine 750. Such an approach ensures that no individual or machine involved in building the ASR engine 750 is exposed to the original speaker voice print, and the ASR engine 750 can identify/recognize any speech data from the modified audio signals, such as the modified audio signal 702.
- the ASR engine 750 is configured to work with only modified audio signals.
- the anonymity of the speakers is ensured.
- the identity of the speaker is protected at each stage namely at the data preparation stage, at the ASR engine building stage and at the ASR engine deployment stage.
- FIG. 8 shows a flow diagram of a method 800 for shielding speaker voice print in audio signals, in accordance with an embodiment of the invention.
- the various steps and/or operations of the flow diagram, and combinations of steps/operations in the flow diagram may be implemented by, for example, hardware, firmware, a processor, circuitry and/or by an apparatus such as the system 150 explained with reference to FIGS. 1 to 7 and/or by a different device associated with the execution of software that includes one or more computer program instructions.
- the method 800 starts at 802.
- an audio signal corresponding to a voice input of a speaker is received by a processor of a system, such as a processor 152 of the system 150 explained with reference to .
- the audio signal may be received from a database, such as the database 108 shown in , and includes a voice print of the speaker.
- a plurality of audio frames are generated from the audio signal by the processor. More specifically, the audio signal received from the database may be subjected to at least one pre-processing operation for shielding speaker voice print in the audio signal. As part of signal pre-processing, the audio signal is suitably adjusted or modified prior to actual processing of the audio signal.
- the pre-processing of the audio signal is configured to involve conversion of the analog form of the audio signal into a digital form, performing pre-emphasis of the digital audio signal, frame blocking and windowing of the pre-emphasized signal, i.e. segmenting the audio signal using a window function to generate the plurality of audio frames.
- the pre-processing of the audio signal, resulting in the generation of the plurality of audio frames, may be performed as explained in detail with reference to .
- a first set of speech feature coefficients is extracted by the processor in relation to each audio frame from among the plurality of audio frames.
- the extraction may involve applying a Discrete Fourier Transform (DFT) to each audio frame to generate a spectral representation.
- the spectral representation is filtered using a bank of Mel filters to generate a Mel-scale power spectrum.
- a log function is applied to the Mel-scale power spectrum to determine log energies in each audio frame.
- a Discrete Cosine Transform (DCT) is applied to the log energies in each audio frame to generate Mel-spectrum coefficients.
- the Mel-spectrum coefficients or Mel Frequency Cepstral Coefficients (MFCC) extracted in relation to each audio frame are referred to as the first set of speech feature coefficients.
- the extraction of the first set of speech feature coefficients may be performed as explained in detail with reference to .
- a second set of speech feature coefficients is generated by the processor by randomizing at least one speech feature coefficient from among the first set of speech feature coefficients extracted in relation to each audio frame. Randomizing at least one speech feature coefficient is configured to irreversibly randomize the voice print of the speaker. Though the voice print of the speaker is randomized, there is no associated loss of data. As such, only the information capable of revealing the identity of the speaker is irreversibly concealed.
- the generation of the second set of speech feature coefficients may be performed as explained in detail with reference to .
- a set of extracted speech features is generated based on the second set of speech feature coefficients by the processor.
- at least one of a first order derivative and a second order derivative of the second set of speech feature coefficients are determined.
- Vector representations of the second set of speech feature coefficients, the first order derivative and the second order derivative of the second set of speech feature coefficients together configure the set of extracted speech features.
- a modified audio signal is generated by the processor based on the set of extracted speech features.
- an inverse discrete cosine transform is performed on the set of extracted speech features for determining a power spectrogram, and the power spectrogram is converted to a linear frequency power spectrogram using a Non-Negative Least Squares (NNLS) solver.
- the linear frequency power spectrogram is transformed to the time domain for constructing the modified audio signal based, at least in part, on the inverse short-time Fourier transform.
- a phase of the modified audio signal is calculated from the linear frequency power spectrogram using phase reconstruction.
- the modified audio signal is configured to serve as a voice print randomized representation of the audio signal.
- the speaker corresponds to a customer of an enterprise and the voice input corresponds to utterances provided by the customer during a voice interaction with an agent of the enterprise.
- the modified audio signal corresponding to a customer’s voice interaction with the enterprise agent, is provided to a manual transcriber for generating a textual transcript of the voice interaction.
- the textual transcript is provided as part of a plurality of textual transcripts for training a machine learning model.
- the machine learning model is selected to be one of an acoustic model and language model, which in at least some embodiments, is trained to build, at least in part, an Automatic Speech Recognition (ASR) engine.
- FIG. 9 shows a flow diagram of a method 900 for shielding speaker voice print in audio signals, in accordance with an embodiment of the invention.
- the various steps and/or operations of the flow diagram, and combinations of steps/operations in the flow diagram may be implemented by, for example, hardware, firmware, a processor, circuitry and/or by an apparatus such as the server system 150 explained with reference to FIGS. 1 to 7.
- the method 900 starts at operation 902.
- an audio signal corresponding to a voice input of a customer of an enterprise is received by a processor of a system, such as the processor 152 of the system 150 explained with reference to .
- the voice input is provided by the customer during a voice interaction with an agent of the enterprise as exemplarily explained in .
- the audio signal corresponding to the voice input includes a voice print of the customer.
- a plurality of audio frames is generated from the audio signal.
- the generation of the plurality of audio frames may be performed as explained with reference to operation 804 of the method 800 of and is not explained again herein for sake of brevity.
- a first set of speech feature coefficients is extracted by the processor in relation to each audio frame from among the plurality of audio frames.
- a second set of speech feature coefficients is generated by the processor by randomizing at least one speech feature coefficient from among the first set of speech feature coefficients extracted in relation to each audio frame. Randomizing at least one speech feature coefficient is configured to irreversibly randomize the voice print of the speaker.
- a set of extracted speech features is generated based on the second set of speech feature coefficients by the processor.
- a modified audio signal is generated by the processor based on the set of extracted speech features.
- the modified audio signal is configured to serve as a voice print randomized representation of the audio signal.
- a textual transcript is generated by the processor based on the modified audio signal.
- the textual transcript is used, at least in part, to train a machine learning model for building an Automatic Speech Recognition (ASR) engine.
- the embodiments disclosed herein provide numerous advantages. More specifically, the embodiments disclosed herein suggest techniques for shielding speaker voice print in audio signals.
- the speaker’s voice print is irreversibly randomized, so all the biometric markers of the speaker are lost.
- though the speaker’s voice print is randomized, the audio information is not distorted to levels that are not understandable by a trained human listener. Such modification to the audio signal protects the identity of the speaker, while precluding the loss of audio data.
- as the shielding of the voice print is irreversible by humans or machines, the concern related to the privacy of the speaker is addressed.
- the identity of the speaker is protected at each stage namely at the data preparation stage, at the ASR engine building stage and at the ASR engine deployment stage, as explained with reference to FIGS. 6 and 7.
- CMOS complementary metal oxide semiconductor
- ASIC application specific integrated circuit
- DSP Digital Signal Processor
- the system 150 and its various components such as the processor 152, the memory 154, the I/O module 164, and the communication module 166 may be enabled using software and/or using transistors, logic gates, and electrical circuits (for example, integrated circuit circuitry such as ASIC circuitry).
- Various embodiments of the present invention may include one or more computer programs stored or otherwise embodied on a computer-readable medium, wherein the computer programs are configured to cause a processor or computer to perform one or more operations (for example, operations explained herein with reference to FIGS. 8 and 9).
- a computer-readable medium storing, embodying, or encoded with a computer program, or similar language may be embodied as a tangible data storage device storing one or more software programs that are configured to cause a processor or computer to perform one or more operations. Such operations may be, for example, any of the steps or operations described herein.
- the computer programs may be stored and provided to a computer using any type of non-transitory computer readable media.
- Non-transitory computer readable media include any type of tangible storage media.
- non-transitory computer readable media include magnetic storage media (such as floppy disks, magnetic tapes, hard disk drives, etc.), optical magnetic storage media (e.g., magneto-optical disks), CD-ROM (compact disc read only memory), CD-R (compact disc recordable), CD-R/W (compact disc rewritable), DVD (Digital Versatile Disc), BD (Blu-ray (registered trademark) Disc), and semiconductor memories (such as mask ROM, PROM (programmable ROM), EPROM (erasable PROM), flash ROM, RAM (random access memory), etc.).
- a tangible data storage device may be embodied as one or more volatile memory devices, one or more non-volatile memory devices, and/or a combination of one or more volatile memory devices and non-volatile memory devices.
- the computer programs may be provided to a computer using any type of transitory computer readable media. Examples of transitory computer readable media include electric signals, optical signals, and electromagnetic waves. Transitory computer readable media can provide the program to a computer via a wired communication line (e.g., electric wires, and optical fibers) or a wireless communication line.
Landscapes
- Engineering & Computer Science (AREA)
- Quality & Reliability (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Soundproofing, Sound Blocking, And Sound Damping (AREA)
Abstract
A method and a system for shielding a speaker voice print in audio signals are disclosed. An audio signal including a voice print of the speaker is received. A plurality of audio frames is generated from the audio signal. A first set of speech feature coefficients is extracted in relation to each audio frame. A second set of speech feature coefficients is generated by randomizing at least one speech feature coefficient from among the first set of speech feature coefficients. Randomizing the at least one speech feature coefficient is configured to irreversibly randomize the voice print of the speaker. A set of extracted speech features is generated based on the second set of speech feature coefficients. A modified audio signal is generated based on the set of extracted speech features. The modified audio signal is configured to serve as a voice print randomized representation of the audio signal.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
IN202041004497 | 2020-02-01 | ||
IN202041004497 | 2020-02-01 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021152566A1 (fr) | 2021-08-05 |
Family
ID=74844937
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/IB2021/050794 WO2021152566A1 (fr) | 2020-02-01 | 2021-02-01 | Système et procédé de protection d'empreinte vocale de locuteur dans des signaux audio |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2021152566A1 (fr) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
NO20220759A1 (en) * | 2022-07-01 | 2024-01-02 | Pexip AS | Method and audio processing device for voice anonymization |
CN117648717A (zh) * | 2024-01-29 | 2024-03-05 | 知学云(北京)科技股份有限公司 | 用于人工智能语音陪练的隐私保护方法 |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140278366A1 (en) * | 2013-03-12 | 2014-09-18 | Toytalk, Inc. | Feature extraction for anonymized speech recognition |
2021
- 2021-02-01 WO PCT/IB2021/050794 patent/WO2021152566A1/fr active Application Filing
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140278366A1 (en) * | 2013-03-12 | 2014-09-18 | Toytalk, Inc. | Feature extraction for anonymized speech recognition |
Non-Patent Citations (3)
Title |
---|
COHEN-HADRIA ALICE ET AL: "Voice Anonymization in Urban Sound Recordings", 2019 IEEE 29TH INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING (MLSP), IEEE, 13 October 2019 (2019-10-13), pages 1 - 6, XP033645862, DOI: 10.1109/MLSP.2019.8918913 * |
JOSE PATINO ET AL: "Speaker anonymisation using the McAdams coefficient", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 2 November 2020 (2020-11-02), XP081805642 * |
MAGARINOS CARMEN ET AL: "Piecewise linear definition of transformation functions for speaker de-identification", 2016 FIRST INTERNATIONAL WORKSHOP ON SENSING, PROCESSING AND LEARNING FOR INTELLIGENT MACHINES (SPLINE), IEEE, 6 July 2016 (2016-07-06), pages 1 - 5, XP032934636, DOI: 10.1109/SPLIM.2016.7528408 * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
NO20220759A1 (en) * | 2022-07-01 | 2024-01-02 | Pexip AS | Method and audio processing device for voice anonymization |
NO348059B1 (en) * | 2022-07-01 | 2024-07-08 | Pexip AS | Method and audio processing device for voice anonymization |
CN117648717A (zh) * | 2024-01-29 | 2024-03-05 | 知学云(北京)科技股份有限公司 | 用于人工智能语音陪练的隐私保护方法 |
CN117648717B (zh) * | 2024-01-29 | 2024-05-03 | 知学云(北京)科技股份有限公司 | 用于人工智能语音陪练的隐私保护方法 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
El-Moneim et al. | Text-independent speaker recognition using LSTM-RNN and speech enhancement | |
Singh et al. | Multimedia utilization of non-computerized disguised voice and acoustic similarity measurement | |
US8160877B1 (en) | Hierarchical real-time speaker recognition for biometric VoIP verification and targeting | |
Sadjadi et al. | Blind spectral weighting for robust speaker identification under reverberation mismatch | |
Ismail et al. | Mfcc-vq approach for qalqalahtajweed rule checking | |
WO2021152566A1 (fr) | Système et procédé de protection d'empreinte vocale de locuteur dans des signaux audio | |
Barua et al. | Neural network based recognition of speech using MFCC features | |
Shahnawazuddin et al. | Pitch-normalized acoustic features for robust children's speech recognition | |
Kumar et al. | Hindi speech recognition in noisy environment using hybrid technique | |
Revathy et al. | Performance comparison of speaker and emotion recognition | |
Koolagudi et al. | Speaker recognition in the case of emotional environment using transformation of speech features | |
CN116564279A (zh) | 一种语音关键词识别方法、装置及相关设备 | |
Jawarkar et al. | Effect of nonlinear compression function on the performance of the speaker identification system under noisy conditions | |
Abka et al. | Speech recognition features: Comparison studies on robustness against environmental distortions | |
Jain et al. | Speech features analysis and biometric person identification in multilingual environment | |
Upadhyay et al. | Robust recognition of English speech in noisy environments using frequency warped signal processing | |
CN113658599A (zh) | 基于语音识别的会议记录生成方法、装置、设备及介质 | |
Kaur et al. | Power-Normalized Cepstral Coefficients (PNCC) for Punjabi automatic speech recognition using phone based modelling in HTK | |
Shareef et al. | Comparison between features extraction techniques for impairments arabic speech | |
Dai et al. | 2D Psychoacoustic modeling of equivalent masking for automatic speech recognition | |
Maged et al. | Improving speaker identification system using discrete wavelet transform and AWGN | |
Nair et al. | A reliable speaker verification system based on LPCC and DTW | |
Sharma et al. | Speech recognition of Punjabi numerals using synergic HMM and DTW approach | |
Mukherjee | Speaker recognition using shifted MFCC | |
Bawa et al. | Spectral-warping based noise-robust enhanced children ASR system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 21709092 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
32PN | Ep: public notification in the ep bulletin as address of the addressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM DATED 19.12.2022) |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 21709092 Country of ref document: EP Kind code of ref document: A1 |