WO2021042537A1 - Voice recognition authentication method and system - Google Patents

Voice recognition authentication method and system

Info

Publication number
WO2021042537A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
short
information
speaker
audio information
Prior art date
Application number
PCT/CN2019/117554
Other languages
English (en)
Chinese (zh)
Inventor
王健宗
苏雪琦
彭话易
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021042537A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/18 - Artificial neural networks; Connectionist approaches
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band

Definitions

  • the embodiments of the present application relate to the field of speech recognition, and in particular, to a speech recognition authentication method, a speech recognition authentication system, a computer device, and a readable storage medium.
  • For example, a home intelligent voice robot executes received voice commands by recognizing the voices of family members, and a conference recording system records the speeches of participants at a meeting by recognizing their voices.
  • However, the inventor found that most existing speech recognition systems suffer from problems such as unclear speech recognition and speaker recognition errors. For example, the sound of typing on a keyboard may be treated as valid human speech, causing the speech recognition system to produce invalid output, or speaker A's speech may be recorded as speaker B's speech.
  • This application aims to solve the problems of unclear speech recognition, speaker recognition errors and low recognition accuracy.
  • an embodiment of the present application provides a voice recognition authentication method, and the method includes: acquiring audio information; preprocessing the audio information, so as to obtain voice information from the audio information according to the short-term energy and spectrum center of the audio information; performing voice feature extraction on the voice information; processing the voice features to obtain target voice features that are closer to the speaker; matching the target voice features with the speaker voice features stored in a database; and outputting, according to the matching result, the matched identity information of the speaker corresponding to the speaker voice feature, so as to obtain the speaker corresponding to the voice information.
  • an embodiment of the present application also provides a voice recognition authentication system, including:
  • a preprocessing module configured to preprocess the audio information, so as to obtain voice information from the audio information according to the short-term energy and spectrum center of the audio information;
  • the feature extraction module is used to perform voice feature extraction on the voice information
  • the processing module is used to process the voice features to obtain target voice features that are closer to the speaker;
  • the matching module is used to match the target voice feature with the speaker's voice feature stored in the database.
  • the output module is configured to output the matched identity information of the speaker corresponding to the voice feature of the speaker according to the matching result, so as to obtain the speaker corresponding to the voice information.
  • an embodiment of the present application also provides a computer device, which includes a memory, a processor, and computer-readable instructions stored on the memory and executable on the processor; when the computer-readable instructions are executed by the processor, the following steps are implemented: acquiring audio information; preprocessing the audio information, so as to obtain voice information from the audio information according to the short-term energy and spectrum center of the audio information; performing voice feature extraction on the voice information; processing the voice features to obtain target voice features that are closer to the speaker; matching the target voice features with the speaker voice features stored in a database; and outputting, according to the matching result, the matched identity information of the speaker corresponding to the speaker voice feature, so as to obtain the speaker corresponding to the voice information.
  • the embodiments of the present application also provide a non-volatile computer-readable storage medium that stores computer-readable instructions, and the computer-readable instructions can be executed by at least one processor, so that the at least one processor executes the following steps: acquiring audio information; preprocessing the audio information, so as to obtain voice information from the audio information according to the short-term energy and spectrum center of the audio information; performing voice feature extraction on the voice information; processing the voice features to obtain target voice features that are closer to the speaker; matching the target voice features with the speaker voice features stored in a database; and outputting, according to the matching result, the matched identity information of the speaker corresponding to the speaker voice feature, so as to obtain the speaker corresponding to the voice information.
  • the speech recognition authentication method, speech recognition authentication system, computer device, and readable storage medium provided by the embodiments of the present application preprocess the acquired audio information so as to obtain voice information from the audio information according to its short-term energy and spectrum center, extract voice features from the voice information, process the voice features to obtain target voice features closer to the speaker, match the target voice features with the speaker voice features stored in the database, and output the matched speaker identity information according to the matching result, thereby obtaining the speaker corresponding to the voice information.
  • the accuracy of the speech recognition technology can be improved, and the user experience can be greatly improved.
  • FIG. 1 is a flowchart of the steps of the voice recognition authentication method according to the first embodiment of the application.
  • FIG. 2 is a normalized audio feature splicing diagram in the first embodiment of the application.
  • FIG. 3 is a diagram of a specific splicing method in the first embodiment of the application.
  • FIG. 4 is a schematic diagram of the hardware architecture of the computer device according to the second embodiment of the application.
  • FIG. 5 is a schematic diagram of the program modules of the speech recognition authentication system according to the third embodiment of the application.
  • Referring to FIG. 1, there is shown a flowchart of the steps of the voice recognition authentication method according to the first embodiment of the present application. It can be understood that the flowchart in this method embodiment is not intended to limit the order in which the steps are executed. It should be noted that in this embodiment the computer device 2 is used as the execution subject for the exemplary description. The details are as follows:
  • Step S100: Acquire audio information.
  • Since the speaker's voice, silence, environmental noise, and non-environmental noise are all present in the environment, the voice recognition authentication system acquires these sounds, that is, the audio information.
  • non-ambient noise and the speech spoken by the speaker have different short-term energy and spectral centers.
  • Step S102: Preprocess the audio information, so as to obtain voice information from the audio information according to the short-term energy and spectrum center of the audio information.
  • After the audio information is acquired, the audio information needs to be processed to obtain the voice information from the audio information.
  • The silence refers to the part in which no sound is produced, for example when the speaker thinks or breathes while speaking, since the speaker makes no sound at those moments.
  • the environmental noise includes, but is not limited to, the sound generated by the opening and closing of doors and windows, and the collision of objects.
  • The non-environmental noise includes, but is not limited to, coughing, mouse clicking, and keyboard typing. Short-term energy and spectrum center are two important indicators of audio information in silence detection technology.
  • the short-term energy reflects the strength of signal energy and can distinguish silence and environmental noise in a segment of audio.
  • The spectrum center can distinguish the non-environmental noise portions.
  • the short-term energy and the center of the spectrum are combined to filter out effective audio, that is, voice information, from the audio information.
  • In this embodiment, when the audio information is preprocessed to obtain the voice information according to its short-term energy and spectrum center, multi-frame short-term signals are first extracted from the audio information according to a preset rule, wherein the preset rule includes a preset signal extraction time interval.
  • Then, the short-term energy and the spectrum center of the multi-frame short-term signals are calculated according to the silence detection algorithm.
  • The short-term energy is compared with a first preset value stored in the database, and the spectrum center is compared with a second preset value stored in the database. When the short-term energy is higher than the first preset value and the spectrum center is higher than the second preset value, the audio information is determined to be voice information, and the voice information is obtained.
  • The calculation formula of the short-term energy is E = Σ_{n=1}^{N} s(n)², where E represents the short-term energy, N represents the number of frames of the short-term signal (N ≥ 2), and s(n) represents the signal amplitude of the nth frame of the short-term signal in the time domain.
  • The short-term energy is the sum of the squares of each frame's signal, which reflects the strength of the signal energy. When the signal energy is too weak, the signal is determined to be silence or environmental noise.
  • Then, the spectrum center of the audio information is calculated according to the silence detection algorithm, wherein the calculation formula of the spectrum center is C = (Σ_{k=1}^{K} k·S(k)) / (Σ_{k=1}^{K} S(k)), where C represents the spectrum center, K represents the number of frequencies corresponding to the N frames s(n) (K ≥ 2 and an integer), and S(k) represents the spectral energy distribution obtained by applying the discrete Fourier transform to s(n) in the frequency domain.
  • The spectrum center is also called the first-order moment of the spectrum. The smaller the value of the spectrum center, the more the spectrum energy is concentrated in the low-frequency range. The spectrum center can be used to remove the non-environmental noise portions, for example the sound of coughing, clicking the mouse, or typing on the keyboard.
  • When both indicators are higher than their preset values, the audio information is effective audio, that is, the voice information of the speaker.
  • Most of the environmental noise and non-environmental noise are thereby removed, so that the retained voice information is purer and of higher quality, which removes many interference factors from the voice recognition process.
  • high-quality voice information is obtained by setting the first preset value and the second preset value to be higher than conventional values.
  • When the short-term energy is lower than the first preset value and/or the spectrum center is lower than the second preset value, the audio information is determined to be invalid audio information, and the audio information is deleted.
  • the invalid audio information includes at least: silence, environmental noise, and non-environmental noise.
  • If the short-term energy is lower than the first preset value, the audio information represents a quiet environment and is silence or environmental noise. If the spectrum center is lower than the second preset value, the audio information represents a non-quiet environment and is non-environmental noise.
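  • For illustration, this silence-detection step can be sketched in a few lines of NumPy. This is a minimal sketch, not the patent's implementation: the frame length, hop, and the two thresholds (stand-ins for the first and second preset values) are assumed placeholder values.

```python
import numpy as np

def frame_signal(x, frame_len=160, hop_len=160):
    """Split a 1-D signal into short-term frames (one frame per row)."""
    n_frames = max(0, 1 + (len(x) - frame_len) // hop_len)
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    return x[idx]

def short_term_energy(frames):
    """E = sum_n s(n)^2: the sum of squares reflects signal strength."""
    return np.sum(frames ** 2, axis=1)

def spectral_centroid(frames):
    """C = sum_k k*S(k) / sum_k S(k), with S(k) the DFT magnitude of a frame."""
    spec = np.abs(np.fft.rfft(frames, axis=1))
    k = np.arange(spec.shape[1])
    return (spec @ k) / (np.sum(spec, axis=1) + 1e-10)

def select_voice(x, energy_thr=1e-3, centroid_thr=5.0):
    """Keep only frames whose energy AND spectral centroid both exceed the
    preset values; frames failing either test are silence or noise."""
    frames = frame_signal(x)
    keep = (short_term_energy(frames) > energy_thr) & \
           (spectral_centroid(frames) > centroid_thr)
    return frames[keep]
```

  • In line with the embodiment, raising the two thresholds above conventional values discards more borderline frames and yields purer, higher-quality voice information.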
  • Step S104: Perform voice feature extraction on the voice information.
  • Specifically, the voice information is windowed using a Hamming window with a window length of 10 frames (100 milliseconds) and a frame skip distance of 3 frames (30 milliseconds), and then the corresponding voice features are extracted.
  • the voice features include, but are not limited to, spectrum features, sound quality features, and voiceprint features.
  • the frequency spectrum feature distinguishes different voice data, such as target voice and interference voice, according to the frequency of sound vibration.
  • The sound quality feature and the voiceprint feature identify the speaker corresponding to the voice data to be tested according to the voiceprint and the timbre of the voice. Since voice differentiation only needs to distinguish the target voice from the interfering voice in the voice data, obtaining the spectrum features of the voice information is sufficient to complete the differentiation.
  • the frequency spectrum is the abbreviation of frequency spectrum density, and the frequency spectrum characteristic is a parameter that reflects the frequency spectrum density.
  • the voice information includes a plurality of single frames of voice data.
  • In this embodiment, each single frame of voice data is first subjected to a fast Fourier transform to obtain the power spectrum of the voice information; a mel filter bank is then used to perform dimensionality reduction on the power spectrum to obtain a mel spectrum; finally, cepstrum analysis is performed on the mel spectrum to obtain the voice features.
  • Since the human auditory perception system is a complex nonlinear system, the acquired power spectrum cannot represent the nonlinear characteristics of the voice data well, so a mel filter bank is needed to reduce the dimensionality of the spectrum.
  • the frequency spectrum of the acquired voice data to be tested is closer to the frequency perceived by the human ear.
  • the Mel filter bank is composed of multiple overlapping triangular bandpass filters, and the triangular bandpass filter carries three frequencies: the lower limit frequency, the cutoff frequency and the center frequency.
  • the center frequencies of these triangular bandpass filters are equidistant on the mel scale.
  • The mel scale increases linearly below 1000 Hz and logarithmically above 1000 Hz.
  • Cepstrum refers to the inverse Fourier transform of the logarithm of a signal's Fourier transform spectrum. Since the general Fourier spectrum is a complex spectrum, the cepstrum is also called the complex cepstrum.
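  • The FFT / mel filter bank / cepstrum pipeline described above can be sketched as follows. This is an illustrative sketch only: the 512-point FFT, 26 mel bands, and 13 retained coefficients are common defaults assumed here, not values taken from the patent, and the DCT stands in for the inverse transform in the cepstrum step, as is customary.

```python
import numpy as np
import librosa
from scipy.fft import dct

def mel_cepstral_features(frame, sr=16000, n_fft=512, n_mels=26, n_ceps=13):
    """One frame -> power spectrum -> mel spectrum -> cepstrum."""
    # Hamming-window the frame, then take the power spectrum via the FFT.
    power = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n=n_fft)) ** 2
    # The mel filter bank (overlapping triangular band-pass filters whose
    # center frequencies are equidistant on the mel scale) reduces the
    # dimensionality toward the ear's frequency resolution.
    fbank = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    mel_spec = fbank @ power
    # Cepstrum analysis: logarithm of the mel spectrum, then a DCT.
    return dct(np.log(mel_spec + 1e-10), norm='ortho')[:n_ceps]
```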
  • Step S106: Process the voice features to obtain target voice features that are closer to the speaker.
  • Specifically, the step of processing the voice features to obtain target voice features closer to the speaker includes: normalizing the voice features using the Z-score standardization method in order to unify the voice features, wherein the normalization formula is x* = (x - μ) / σ, where μ is the mean value of the multiple pieces of voice information, σ is the standard deviation of the multiple pieces of voice information, x is the multiple single-frame voice data, and x* is the voice feature after normalization. Then, the normalized features are spliced to form spliced frames with long overlapping parts. Finally, the spliced frames are input into a neural network for training to obtain the target voice features, so as to reduce the loss of the voice information.
  • Specifically, as shown in FIG. 2, a Hamming window with a window length of 10 frames and a skip distance of 3 frames is used to splice the normalized features, with every 10 frames as one splicing unit, to form a 390-dimensional feature. Please refer to FIG. 3 for the specific splicing method.
  • Since each frame is 39-dimensional, splicing 10 frames together yields 390 dimensions. Since the skip distance is 3 frames, splicing starts from the first frame and then advances 3 frames, so the next group of frames to be spliced runs from the 4th frame to the 13th frame, and so on.
  • In this way, the embodiment of the application unifies the voice features, makes the data indicators comparable, reduces the distorting effect of singular sample data, helps to comprehensively compare and evaluate the voice features, and improves the voice training effect.
  • The features are spliced to form frames with longer overlapping parts, so as to capture transition information and reduce the loss of information over a longer duration.
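  • A sketch of the normalization-and-splicing step, assuming 39-dimensional per-frame features as in the example above; the array shapes and toy input are illustrative.

```python
import numpy as np

def zscore(features):
    """Z-score normalization x* = (x - mu) / sigma, per feature dimension."""
    return (features - features.mean(axis=0)) / (features.std(axis=0) + 1e-10)

def splice(features, unit=10, hop=3):
    """Concatenate every `unit` consecutive 39-dim frames into one 390-dim
    spliced frame, advancing `hop` frames each time so that neighbouring
    spliced frames overlap by unit - hop = 7 frames (the long overlap)."""
    return np.stack([features[i:i + unit].reshape(-1)
                     for i in range(0, len(features) - unit + 1, hop)])

feats = zscore(np.random.randn(100, 39))  # 100 frames of 39-dim features
spliced = splice(feats)                   # shape (31, 390), ready for the network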
  • In this embodiment, the target voice features are also input into a pre-trained speaker detection model and an intruder detection model. Then, according to the output results, it is verified whether the voice information is the voice of one of the multiple preset speakers stored in the speaker detection model, and when the voice information is the voice of a preset speaker, the voice information is acquired.
  • That is, after the speaker's voice features are extracted, it is verified whether the voice features belong to one of the preset speakers in the pre-trained speaker detection model, and the speaker is accepted or rejected according to the verification result. If the voice features are recognized as an intruder's imposture, the speaker's voice information is rejected.
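  • The patent does not specify the form of the two models; the sketch below simply assumes each exposes a hypothetical score() method returning a probability-like value, with an illustrative acceptance threshold.

```python
def verify_speaker(target_feature, speaker_model, intruder_model, threshold=0.5):
    """Accept the utterance only if the speaker detection model recognizes
    one of the preset speakers and the intruder detection model does not
    flag an imposture; otherwise the voice information is rejected."""
    if (speaker_model.score(target_feature) >= threshold
            and intruder_model.score(target_feature) < threshold):
        return True   # accept: voice of a preset speaker
    return False      # reject: likely intruder imposture
```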
  • Step S108: Match the target voice features with the speaker voice features stored in the database.
  • the processed voice feature is compared with the speaker's voice feature stored in the database to obtain the speaker's voice feature that matches the voice feature.
  • In this embodiment, the speaker's voice features are also collected in advance, and the voice features and the corresponding speaker identity information are stored in the database.
  • Since the environment is quiet while the speaker's voice features are being collected, the speaker's voice features are easy to obtain, and the voice features and the corresponding speaker identity information are saved in the database.
  • Step S110: According to the matching result, output the identity information of the matched speaker corresponding to the speaker voice feature, so as to obtain the speaker corresponding to the voice information.
  • For example, if the matched speaker voice feature corresponds to speaker A whose identity information is 1, the identity information 1 is output, and the speaker A represented by the identity information 1 is obtained.
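  • One plausible realization of this matching step is a nearest-neighbour search over the enrolled features; the cosine-similarity metric and the threshold below are assumptions for illustration, not mandated by the patent.

```python
import numpy as np

def match_speaker(target, database, threshold=0.7):
    """`database` maps identity information -> enrolled voice feature.
    Returns the best-matching identity, or None if nothing clears the threshold."""
    best_id, best_sim = None, threshold
    for identity, enrolled in database.items():
        sim = float(np.dot(target, enrolled) /
                    (np.linalg.norm(target) * np.linalg.norm(enrolled) + 1e-10))
        if sim > best_sim:
            best_id, best_sim = identity, sim
    return best_id

# e.g. database = {"1": feature_A, "2": feature_B}; a return value of "1"
# means the matched speaker is the one registered with identity information 1.
```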
  • the accuracy of the speech recognition technology can be improved, and the user experience can be greatly improved.
  • Referring to FIG. 4, there is shown a schematic diagram of the hardware architecture of the computer device according to the second embodiment of the present application.
  • The computer device 2 includes, but is not limited to, a memory 21, a processor 22, and a network interface 23 that can communicate with each other through a system bus.
  • FIG. 4 only shows the computer device 2 with components 21-23, but it should be understood that not all of the illustrated components are required to be implemented; more or fewer components may be implemented instead.
  • The memory 21 includes at least one type of readable storage medium, which includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disks, optical disks, etc.
  • the memory 21 may be an internal storage unit of the computer device 2, for example, a hard disk or a memory of the computer device 2.
  • In other embodiments, the memory may also be an external storage device of the computer device 2, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the computer device 2.
  • the memory 21 may also include both the internal storage unit of the computer device 2 and its external storage device.
  • the memory 21 is generally used to store the operating system and various application software installed in the computer device 2, such as the program code of the voice recognition authentication system 20.
  • the memory 21 can also be used to temporarily store various types of data that have been output or will be output.
  • the processor 22 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or other data processing chips.
  • the processor 22 is generally used to control the overall operation of the computer device 2.
  • the processor 22 is used to run the program code or process data stored in the memory 21, for example, to run the voice recognition authentication system 20 and so on.
  • the network interface 23 may include a wireless network interface or a wired network interface, and the network interface 23 is generally used to establish a communication connection between the computer device 2 and other electronic devices.
  • the network interface 23 is used to connect the computer device 2 with an external terminal through a network, and establish a data transmission channel and a communication connection between the computer device 2 and the external terminal.
  • The network may be an intranet, the Internet, a Global System for Mobile Communications (GSM) network, a Wideband Code Division Multiple Access (WCDMA) network, a 4G network, a 5G network, Bluetooth, Wi-Fi, or another wireless or wired network.
  • Referring to FIG. 5, there is shown a schematic diagram of the program modules of the voice recognition authentication system according to the third embodiment of the present application.
  • In this embodiment, the speech recognition authentication system 20 may include or be divided into one or more program modules, which are stored in a storage medium and executed by one or more processors to complete this application and implement the above-mentioned voice recognition authentication method.
  • The program module referred to in the embodiments of the present application refers to a series of computer-readable instruction segments capable of completing specific functions, which are more suitable than the program itself for describing the execution process of the voice recognition authentication system 20 in the storage medium. The following description specifically introduces the functions of each program module in this embodiment:
  • the obtaining module 201 is used to obtain audio information.
  • Since the speaker's voice, silence, environmental noise, and non-environmental noise are all present in the environment, the acquisition module 201 acquires these sounds, that is, the audio information.
  • non-ambient noise and the speech spoken by the speaker have different short-term energy and spectral centers.
  • the preprocessing module 202 is configured to preprocess the audio information, so as to obtain voice information from the audio information according to the short-term energy and spectrum center of the audio information.
  • After the audio information is acquired, the preprocessing module 202 needs to process the audio information to obtain the voice information from it.
  • The silence refers to the part in which no sound is produced; for example, the speaker thinks and breathes while speaking, and makes no sound when thinking and breathing.
  • the environmental noise includes, but is not limited to, the sound generated by the opening and closing of doors and windows, and the collision of objects.
  • The non-environmental noise includes, but is not limited to, coughing, mouse clicking, and keyboard typing. Short-term energy and spectrum center are two important indicators of audio information in silence detection technology.
  • The short-term energy reflects the strength of the signal energy and can distinguish silence and environmental noise in a segment of audio, while the spectrum center can distinguish the non-environmental noise portions.
  • the short-term energy and the center of the spectrum are combined to filter out effective audio, that is, voice information, from the audio information.
  • the preprocessing module 202 is further configured to extract a multi-frame short-term signal from the audio information according to a preset rule, wherein the preset rule includes a preset signal extraction time interval. Then, the short-term energy and the center of the spectrum are calculated according to the silence detection algorithm for the multi-frame short-term signal. Then, the short-term energy is compared with the first preset value stored in the database, and the center of the spectrum is compared with the second preset value stored in the database. When the short-term energy is higher than the first preset value and the frequency spectrum center is higher than the second preset value, determine that the audio information is voice information, and obtain the voice information.
  • The calculation formula of the short-term energy is E = Σ_{n=1}^{N} s(n)², where E represents the short-term energy, N represents the number of frames of the short-term signal (N ≥ 2), and s(n) represents the signal amplitude of the nth frame of the short-term signal in the time domain.
  • For example, the preprocessing module 202 extracts multiple frames of short-term signals s(1), s(2), s(3), s(4), ..., s(N) from the audio information according to a preset time interval (for example, 0.2 ms), and then calculates the short-term energy of the extracted multi-frame short-term signals to determine the energy intensity of the audio information.
  • The short-term energy is the sum of the squares of each frame's signal, which reflects the strength of the signal energy. When the signal energy is too weak, the signal is determined to be silence or environmental noise.
  • The preprocessing module 202 is further configured to obtain the frequencies corresponding to the multi-frame short-term signals and, according to the frequencies and the multi-frame short-term signals, calculate the spectrum center of the audio information according to the silence detection algorithm, wherein the calculation formula of the spectrum center is C = (Σ_{k=1}^{K} k·S(k)) / (Σ_{k=1}^{K} S(k)), where C represents the spectrum center, K represents the number of frequencies corresponding to the N frames s(n) (K ≥ 2 and an integer), and S(k) represents the spectral energy distribution obtained by applying the discrete Fourier transform to s(n) in the frequency domain.
  • The spectrum center is also called the first-order moment of the spectrum. The smaller the value of the spectrum center, the more the spectrum energy is concentrated in the low-frequency range. The spectrum center can be used to remove the non-environmental noise portions, for example the sound of coughing, clicking the mouse, or typing on the keyboard.
  • When both indicators are higher than their preset values, the audio information is effective audio, that is, the voice information of the speaker.
  • Most of the environmental noise and non-environmental noise are thereby removed, so that the retained voice information is purer and of higher quality, which removes many interference factors from the voice recognition process.
  • high-quality voice information is obtained by setting the first preset value and the second preset value to be higher than conventional values.
  • In this embodiment, the preprocessing module 202 is further configured to determine, when the short-term energy is lower than the first preset value and/or the spectrum center is lower than the second preset value, that the audio information is invalid audio information, and to delete the audio information.
  • the invalid audio information includes at least: silence, environmental noise, and non-environmental noise.
  • If the short-term energy is lower than the first preset value, the audio information represents a quiet environment and is silence or environmental noise. If the spectrum center is lower than the second preset value, the audio information represents a non-quiet environment and is non-environmental noise.
  • the feature extraction module 203 is configured to perform voice feature extraction on the voice information.
  • Specifically, the feature extraction module 203 performs windowing on the voice information using a Hamming window with a window length of 10 frames (100 milliseconds) and a frame skip distance of 3 frames (30 milliseconds), and then extracts the corresponding voice features.
  • the voice features include, but are not limited to, spectrum features, sound quality features, and voiceprint features.
  • the frequency spectrum feature distinguishes different voice data, such as target voice and interference voice, according to the frequency of sound vibration.
  • The sound quality feature and the voiceprint feature identify the speaker corresponding to the voice data to be tested according to the voiceprint and the timbre of the voice. Since voice differentiation only needs to distinguish the target voice from the interfering voice in the voice data, obtaining the spectrum features of the voice information is sufficient to complete the differentiation.
  • the frequency spectrum is the abbreviation of frequency spectrum density, and the frequency spectrum characteristic is a parameter that reflects the frequency spectrum density.
  • In this embodiment, the voice information includes multiple single frames of voice data, and the feature extraction module 203 is further configured to first perform a fast Fourier transform on each single frame of voice data to obtain the power spectrum of the voice information, then use a mel filter bank to perform dimensionality reduction on the power spectrum to obtain a mel spectrum, and finally perform cepstrum analysis on the mel spectrum to obtain the voice features.
  • Since the human auditory perception system is a complex nonlinear system, the acquired power spectrum cannot represent the nonlinear characteristics of the voice data well, so a mel filter bank is needed to reduce the dimensionality of the spectrum.
  • the frequency spectrum of the acquired voice data to be tested is closer to the frequency perceived by the human ear.
  • the Mel filter bank is composed of multiple overlapping triangular bandpass filters, and the triangular bandpass filter carries three frequencies: the lower limit frequency, the cutoff frequency and the center frequency.
  • the center frequencies of these triangular bandpass filters are equidistant on the mel scale.
  • The mel scale increases linearly below 1000 Hz and logarithmically above 1000 Hz.
  • Cepstrum refers to the inverse Fourier transform of the logarithm of a signal's Fourier transform spectrum. Since the general Fourier spectrum is a complex spectrum, the cepstrum is also called the complex cepstrum.
  • the processing module 204 is configured to process the voice features to obtain target voice features that are closer to the speaker.
  • the speech recognition authentication system further includes a normalization module 207, a splicing module 208, and a training module 209.
  • The normalization module 207 is configured to use the Z-score standardization method to normalize the voice features in order to unify them, wherein the normalization formula is x* = (x - μ) / σ, where μ is the mean value of the multiple pieces of voice information, σ is the standard deviation of the multiple pieces of voice information, x is the multiple single-frame voice data, and x* is the voice feature after normalization.
  • the splicing module 208 is used for splicing the normalized processing result features to form a spliced frame with a long overlapping part.
  • the training module 209 is configured to input the spliced frame into a neural network to train the spliced frame to obtain the target voice feature, so as to reduce the loss of the voice information.
  • Specifically, as shown in FIG. 2, a Hamming window with a window length of 10 frames and a skip distance of 3 frames is used to splice the normalized features, with every 10 frames as one splicing unit, to form a 390-dimensional feature. Please refer to FIG. 3 for the specific splicing method.
  • Since each frame is 39-dimensional, splicing 10 frames together yields 390 dimensions. Since the skip distance is 3 frames, splicing starts from the first frame and then advances 3 frames, so the next group of frames to be spliced runs from the 4th frame to the 13th frame, and so on.
  • In this way, the embodiment of the application unifies the voice features, makes the data indicators comparable, reduces the distorting effect of singular sample data, helps to comprehensively compare and evaluate the voice features, and improves the voice training effect.
  • The features are spliced to form frames with longer overlapping parts, so as to capture transition information and reduce the loss of information over a longer duration.
  • In this embodiment, the voice recognition authentication system further includes a voice verification module 210, which is used to input the voice features into a pre-trained speaker detection model and an intruder detection model, and to verify, according to the output results, whether the voice information is the voice of one of the multiple preset speakers stored in the speaker detection model; when the voice information is the voice of a preset speaker, the voice information is acquired.
  • That is, after the speaker's voice features are extracted, the voice verification module 210 verifies whether the voice features belong to one of the preset speakers in the pre-trained speaker detection model, and accepts or rejects the speaker according to the verification result. If the voice features are recognized as an intruder's imposture, the speaker's voice information is rejected.
  • the matching module 205 is configured to match the target voice feature with the speaker's voice feature stored in the database.
  • the matching module 205 compares the processed voice feature with the speaker's voice feature stored in the database to obtain the speaker's voice feature that matches the voice feature.
  • the voice recognition and authentication system 20 also collects the speaker's voice features in advance, and saves the voice features and the corresponding speaker's identity information in a database.
  • Since the environment is quiet while the speaker's voice features are being collected, the speaker's voice features are easy to obtain, and the voice features and the corresponding speaker identity information are saved in the database.
  • the output module 206 is configured to output the matched identity information of the speaker corresponding to the voice feature of the speaker according to the matching result, so as to obtain the speaker corresponding to the voice information.
  • For example, the output module 206 outputs the identity information 1 to obtain the speaker A represented by the identity information 1.
  • the accuracy of the speech recognition technology can be improved, and the user experience can be greatly improved.
  • This application also provides a computer device that can execute programs, such as a smartphone, a tablet computer, a notebook computer, a desktop computer, a rack server, a blade server, a tower server, or a cabinet server (including an independent server or a server cluster composed of multiple servers), and so on.
  • the computer device in this embodiment at least includes, but is not limited to, a memory, a processor, etc. that can be communicatively connected to each other through a system bus.
  • This embodiment also provides a non-volatile computer-readable storage medium, such as flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disks, optical disks, servers, App application malls, etc., on which computer-readable instructions are stored; the corresponding functions are realized when the instructions are executed by a processor.
  • The non-volatile computer-readable storage medium of this embodiment is used to store the voice recognition authentication system 20, which, when executed by a processor, implements the following steps:
  • acquiring audio information; preprocessing the audio information, so as to obtain voice information from the audio information according to the short-term energy and spectrum center of the audio information; performing voice feature extraction on the voice information; processing the voice features to obtain target voice features that are closer to the speaker; matching the target voice features with the speaker voice features stored in the database; and outputting, according to the matching result, the matched identity information of the speaker corresponding to the speaker voice feature, so as to obtain the speaker corresponding to the voice information.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The present invention relates to a voice recognition authentication method, a computer device, and a readable storage medium. The method comprises: acquiring audio information (S100); preprocessing the audio information to acquire voice information from the audio information according to the short-term energy and spectrum center of the audio information (S102); performing voice feature extraction on the voice information (S104); processing the voice features to acquire target voice features closer to a speaker (S106); matching the target voice features with speaker voice features stored in a database (S108); and outputting, according to a matching result, identity information of a matched speaker corresponding to the speaker voice feature, so as to acquire the speaker corresponding to the voice information (S110). By means of the aforementioned method, the accuracy of voice recognition technology can be improved, and the user experience is considerably enhanced.
PCT/CN2019/117554 2019-09-04 2019-11-12 Voice recognition authentication method and system WO2021042537A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910832042.4A CN110473552A (zh) 2019-09-04 2019-09-04 Voice recognition authentication method and system
CN201910832042.4 2019-09-04

Publications (1)

Publication Number Publication Date
WO2021042537A1 true WO2021042537A1 (fr) 2021-03-11

Family

ID=68514996

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/117554 WO2021042537A1 (fr) 2019-09-04 2019-11-12 Procédé et système d'authentification de reconnaissance vocale

Country Status (2)

Country Link
CN (1) CN110473552A (fr)
WO (1) WO2021042537A1 (fr)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112053695A (zh) * 2020-09-11 2020-12-08 北京三快在线科技有限公司 Voiceprint recognition method and apparatus, electronic device, and storage medium
CN112348527A (zh) * 2020-11-17 2021-02-09 上海桂垚信息科技有限公司 Identity authentication method based on voice recognition in a bank transaction system
CN112927680B (zh) * 2021-02-10 2022-06-17 中国工商银行股份有限公司 Method and apparatus for recognizing valid voiceprint speech based on a telephone channel
CN113879931B (zh) * 2021-09-13 2023-04-28 厦门市特种设备检验检测院 Elevator safety monitoring method
CN113716246A (zh) * 2021-09-16 2021-11-30 安徽世绿环保科技有限公司 Residential garbage disposal tracing system
CN114697759B (zh) * 2022-04-25 2024-04-09 中国平安人寿保险股份有限公司 Virtual avatar video generation method and system, electronic device, and storage medium
CN115214541B (zh) * 2022-08-10 2024-01-09 海南小鹏汽车科技有限公司 Vehicle control method, vehicle, and computer-readable storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1662956A (zh) * 2002-06-19 2005-08-31 皇家飞利浦电子股份有限公司 Mega speaker identification (ID) system and corresponding method
JP4392805B2 (ja) * 2008-04-28 2010-01-06 Kddi株式会社 Audio information classification device
CN102820033A (zh) * 2012-08-17 2012-12-12 南京大学 Voiceprint recognition method
CN104078039A (zh) * 2013-03-27 2014-10-01 广东工业大学 Speech recognition system of a home service robot based on a hidden Markov model
CN106782565A (zh) * 2016-11-29 2017-05-31 重庆重智机器人研究院有限公司 Voiceprint feature recognition method and system
CN108877775A (zh) * 2018-06-04 2018-11-23 平安科技(深圳)有限公司 Voice data processing method and apparatus, computer device, and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103705333A (zh) * 2013-08-30 2014-04-09 李峰 Intelligent snore-stopping method and device
CN104538036A (zh) * 2015-01-20 2015-04-22 浙江大学 Speaker recognition method based on a semantic cell mixture model
US10535000B2 (en) * 2016-08-08 2020-01-14 Interactive Intelligence Group, Inc. System and method for speaker change detection
CN106356052B (zh) * 2016-10-17 2019-03-15 腾讯科技(深圳)有限公司 Speech synthesis method and device
CN106782564B (zh) * 2016-11-18 2018-09-11 百度在线网络技术(北京)有限公司 Method and device for processing voice data


Also Published As

Publication number Publication date
CN110473552A (zh) 2019-11-19

Similar Documents

Publication Publication Date Title
WO2021042537A1 (fr) Procédé et système d'authentification de reconnaissance vocale
WO2021128741A1 (fr) Procédé et appareil d'analyse de fluctuation d'émotion dans la voix, et dispositif informatique et support de stockage
WO2020177380A1 (fr) Procédé, appareil et dispositif de détection d'empreinte vocale sur la base d'un texte court, et support d'enregistrement
WO2020181824A1 (fr) Procédé, appareil et dispositif de reconnaissance d'empreinte vocale et support de stockage lisible par ordinateur
EP2763134B1 (fr) Procédé et dispositif de reconnaissance de la parole
US8160877B1 (en) Hierarchical real-time speaker recognition for biometric VoIP verification and targeting
US20120143608A1 (en) Audio signal source verification system
CN110880329B Audio recognition method and device, and storage medium
WO2021051572A1 Speech recognition method and apparatus, and computer device
WO2021179717A1 Speech recognition front-end processing method and apparatus, and terminal device
CN109599117A Audio data recognition method and anti-replay recognition system for human voice
US20230401338A1 (en) Method for detecting an audio adversarial attack with respect to a voice input processed by an automatic speech recognition system, corresponding device, computer program product and computer-readable carrier medium
CN113823293B Speaker recognition method and system based on speech enhancement
CN113035202B Identity recognition method and device
WO2018095167A1 Voiceprint identification method and voiceprint identification system
CN113223536A Voiceprint recognition method and apparatus, and terminal device
CN110570870A Text-independent voiceprint recognition method, apparatus, and device
CN113112992B Speech recognition method and apparatus, storage medium, and server
CN114302301B Frequency response correction method and related product
CN112216285B Multi-person conversation detection method and system, mobile terminal, and storage medium
CN114171032A Cross-channel voiceprint model training method, recognition method, apparatus, and readable medium
CN113838469A Identity recognition method and system, and storage medium
Komlen et al. Text independent speaker recognition using LBG vector quantization
CN114512133A Sounding object recognition method and apparatus, server, and storage medium
Chakraborty et al. An improved approach to open set text-independent speaker identification (OSTI-SI)

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19943879

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19943879

Country of ref document: EP

Kind code of ref document: A1