WO2021051572A1 - Speech recognition method and apparatus, and computer device - Google Patents

Speech recognition method and apparatus, and computer device

Info

Publication number
WO2021051572A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
frame
voice
segment
voiceprint
Prior art date
Application number
PCT/CN2019/117761
Other languages
English (en)
Chinese (zh)
Inventor
王健宗
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2021051572A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification techniques
    • G10L17/04: Training, enrolment or model building
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/45: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window

Definitions

  • This application relates to the field of speech recognition technology, and in particular to a speech recognition method, device, computer equipment, and non-volatile computer-readable storage medium.
  • Voice recognition is a biometric technology that automatically identifies the user to whom a voice belongs, based on voice parameters in the speech waveform that reflect the speaker's physiological or behavioral characteristics.
  • Speech recognition of this kind generally relies on the voiceprint features in the voice signal.
  • Existing windowing processes apply windows such as the Hanning window, Hamming window, triangular window, or Gaussian window to the voice data.
  • The inventor realized that these existing windowing methods almost always modify the original speech signal, which loses part of the voiceprint feature information and reduces the accuracy of speech recognition.
  • In view of this, the present application proposes a speech recognition method, apparatus, computer device, and non-volatile computer-readable storage medium that acquire a speech segment and divide it into frames to obtain each frame of speech data, window each frame of speech data according to a preset smooth windowing algorithm to obtain windowed speech frames, extract the Mel-frequency cepstral coefficient (MFCC) feature vector of the windowed speech frames of the speech segment, and calculate the distance between the MFCC vector and a voiceprint discrimination vector; when the distance is less than a preset threshold, the recognition result of the speech segment is determined to be a pass.
  • the present application provides a voice recognition method, which includes:
  • acquiring a voice segment and dividing the voice segment into frames to obtain each frame of voice data; sequentially windowing each frame of voice data of the voice segment according to a preset smooth windowing algorithm to obtain the windowed speech frames of the voice segment; extracting the Mel-frequency cepstral coefficient (MFCC) feature vector of the windowed speech frames of the voice segment; calculating the distance between the MFCC vector and a voiceprint discrimination vector, wherein the voiceprint discrimination vector is obtained by inputting sampled voice information of the user into a voiceprint feature training model for training in advance; and, when the distance is less than a preset threshold, determining that the recognition result of the voice segment is a pass.
  • The present application further provides a voice recognition apparatus, which includes:
  • a framing module, used to acquire a voice segment and divide it into frames to obtain each frame of voice data;
  • a windowing module, used to sequentially window each frame of voice data of the voice segment according to a preset smooth windowing algorithm to obtain the windowed speech frames of the voice segment;
  • an extraction module, used to extract the Mel-frequency cepstral coefficient (MFCC) feature vector of the windowed speech frames of the voice segment;
  • a calculation module, used to calculate the distance between the MFCC vector and a voiceprint discrimination vector, wherein the voiceprint discrimination vector is obtained by inputting sampled voice information of the user into a voiceprint feature training model for training in advance; and
  • a recognition module, used to determine that the recognition result of the voice segment is a pass when the distance is less than a preset threshold.
  • In addition, this application also proposes a computer device, which includes a memory and a processor, the memory storing computer-readable instructions executable on the processor, and the computer-readable instructions, when executed by the processor, implementing the following steps:
  • acquiring a voice segment and dividing the voice segment into frames to obtain each frame of voice data; sequentially windowing each frame of voice data of the voice segment according to a preset smooth windowing algorithm to obtain the windowed speech frames of the voice segment; extracting the Mel-frequency cepstral coefficient (MFCC) feature vector of the windowed speech frames of the voice segment; calculating the distance between the MFCC vector and a voiceprint discrimination vector, wherein the voiceprint discrimination vector is obtained by inputting sampled voice information of the user into a voiceprint feature training model for training in advance; and, when the distance is less than a preset threshold, determining that the recognition result of the voice segment is a pass.
  • Furthermore, the present application also provides a non-volatile computer-readable storage medium storing computer-readable instructions, the computer-readable instructions being executable by at least one processor so that the at least one processor performs the following steps:
  • acquiring a voice segment and dividing the voice segment into frames to obtain each frame of voice data; sequentially windowing each frame of voice data of the voice segment according to a preset smooth windowing algorithm to obtain the windowed speech frames of the voice segment; extracting the Mel-frequency cepstral coefficient (MFCC) feature vector of the windowed speech frames of the voice segment; calculating the distance between the MFCC vector and a voiceprint discrimination vector, wherein the voiceprint discrimination vector is obtained by inputting sampled voice information of the user into a voiceprint feature training model for training in advance; and, when the distance is less than a preset threshold, determining that the recognition result of the voice segment is a pass.
  • The speech recognition method, apparatus, computer device, and non-volatile computer-readable storage medium proposed in this application acquire a speech segment and divide it into frames to obtain each frame of speech data, and then window each frame of speech data according to a preset smooth windowing algorithm to obtain windowed speech frames; next, they extract the Mel-frequency cepstral coefficient (MFCC) feature vector of the windowed speech frames of the speech segment and calculate the distance between the MFCC vector and the voiceprint discrimination vector; when the distance is less than a preset threshold, the recognition result of the speech segment is determined to be a pass.
  • In this way, the feature vector of the speech segment can be calculated more accurately while only slightly modifying the speech signal, thereby improving the accuracy of speech recognition.
  • FIG. 1 is a schematic diagram of an optional hardware architecture of the computer device of the present application;
  • FIG. 2 is a schematic diagram of the program modules of an embodiment of the speech recognition apparatus of the present application;
  • FIG. 3 is a schematic flowchart of an embodiment of the speech recognition method of the present application.
  • FIG. 1 is a schematic diagram of an optional hardware architecture of the computer device 1 of the present application.
  • the computer device 1 may include, but is not limited to, a memory 11, a processor 12, and a network interface 13 that can communicate with each other through a system bus.
  • The computer device 1 is connected to a network (not shown in FIG. 1) through the network interface 13, and is thereby connected to other terminal devices such as mobile terminals (Mobile Terminal), mobile telephones (Mobile Telephone), user equipment (User Equipment, UE), handsets, portable equipment, and PC terminals.
  • The network may be the Intranet, the Internet, a Global System for Mobile Communications (GSM) network, Wideband Code Division Multiple Access (WCDMA), a 4G network, a 5G network, Bluetooth, Wi-Fi, a telephone network, or another wireless or wired network.
  • FIG. 1 only shows the computer device 1 with components 11-13, but it should be understood that not all of the illustrated components are required; more or fewer components may be implemented instead.
  • The memory 11 includes at least one type of non-volatile computer-readable storage medium, which includes flash memory, hard disks, multimedia cards, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disks, optical discs, and the like.
  • the memory 11 may be an internal storage unit of the computer device 1, for example, a hard disk or a memory of the computer device 1.
  • The memory 11 may also be an external storage device of the computer device 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card.
  • the memory 11 may also include both the internal storage unit of the computer device 1 and its external storage device.
  • the memory 11 is generally used to store an operating system and various application software installed in the computer device 1, such as the program code of the voice recognition apparatus 200.
  • the memory 11 can also be used to temporarily store various types of data that have been output or will be output.
  • the memory 11 stores computer readable instructions, and the computer readable instructions can be executed by at least one processor, so that the at least one processor executes the steps:
  • acquiring a voice segment and dividing the voice segment into frames to obtain each frame of voice data; sequentially windowing each frame of voice data of the voice segment according to a preset smooth windowing algorithm to obtain the windowed speech frames of the voice segment; extracting the Mel-frequency cepstral coefficient (MFCC) feature vector of the windowed speech frames of the voice segment; calculating the distance between the MFCC vector and a voiceprint discrimination vector, wherein the voiceprint discrimination vector is obtained by inputting sampled voice information of the user into a voiceprint feature training model for training in advance; and, when the distance is less than a preset threshold, determining that the recognition result of the voice segment is a pass.
  • the processor 12 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or other data processing chips.
  • the processor 12 is generally used to control the overall operation of the computer device 1, such as performing data interaction or communication-related control and processing.
  • the processor 12 is configured to run the program code or process data stored in the memory 11, for example, to run the voice recognition device 200.
  • the network interface 13 may include a wireless network interface or a wired network interface.
  • The network interface 13 is usually used to establish a communication connection between the computer device 1 and other terminal devices such as mobile terminals, mobile telephones, user equipment, handsets, portable devices, and PC terminals.
  • When a voice recognition apparatus 200 is installed and running in the computer device 1, it can acquire a voice segment and divide it into frames to obtain each frame of voice data, and then window each frame of voice data according to the preset smooth windowing algorithm to obtain windowed speech frames; next, it extracts the Mel-frequency cepstral coefficient (MFCC) feature vector of the windowed speech frames of the voice segment and calculates the distance between the MFCC vector and the voiceprint discrimination vector; when the distance is less than a preset threshold, it determines that the recognition result of the voice information is a pass.
  • this application proposes a voice recognition device 200.
  • FIG. 2 is a program module diagram of an embodiment of the speech recognition device 200 of the present application.
  • The speech recognition apparatus 200 includes a series of computer-readable instructions stored in the memory 11; when these computer-readable instructions are executed by the processor 12, the speech recognition functions of the various embodiments of the present application can be implemented.
  • Based on the specific operations implemented by the various parts of the computer-readable instructions, the speech recognition apparatus 200 may be divided into one or more modules. For example, in FIG. 2, the speech recognition apparatus 200 is divided into a framing module 201, a windowing module 202, an extraction module 203, a calculation module 204, and a recognition module 205, wherein:
  • The framing module 201 is used to obtain a voice segment and divide it into frames to obtain each frame of voice data.
  • the computer device 1 is connected to a user terminal, such as a mobile phone, a mobile terminal, a PC terminal, etc., and then the user's voice information is obtained through the user terminal.
  • The computer device 1 may also be directly provided with a sound pickup unit to collect the user's voice data.
  • The voice data includes at least one voice segment, so the framing module 201 can obtain the voice segment. After obtaining the speech segment, the framing module 201 further divides it into frames to obtain each frame of speech data. Of course, owing to the physiological characteristics of the human vocal tract, the high-frequency part of the speech segment is often suppressed; therefore, in other embodiments, the framing module 201 also performs pre-emphasis processing on the speech segment to compensate for the suppressed high-frequency components, as in the sketch below.
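  • The following is a minimal Python sketch of the pre-emphasis and framing steps just described; the frame length, frame shift, and pre-emphasis coefficient are common defaults assumed for illustration, not values fixed by this application.

```python
import numpy as np

def pre_emphasize(signal: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    """First-order high-pass filter y[n] = x[n] - alpha * x[n-1], which
    boosts the high-frequency components suppressed by the vocal tract."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def frame_signal(signal: np.ndarray, sample_rate: int,
                 frame_ms: float = 25.0, shift_ms: float = 10.0) -> np.ndarray:
    """Split a speech segment into overlapping frames (one frame per row)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // shift)
    return np.stack([signal[i * shift: i * shift + frame_len]
                     for i in range(n_frames)])

# Usage: one second of stand-in audio at 16 kHz yields a (98, 400) frame matrix.
speech = np.random.randn(16000)
frames = frame_signal(pre_emphasize(speech), 16000)
```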
  • the windowing module 202 is configured to sequentially window each frame of speech data of the speech segment according to a preset smooth windowing algorithm to obtain a windowed speech frame of the speech segment.
  • the windowing module 202 further performs windowing on each frame of speech data of the speech segment.
  • the windowing module 202 sequentially windows each frame of speech data of the speech segment according to a preset smooth windowing algorithm, and then obtains the windowed speech frame of the speech segment.
  • In this embodiment, the smooth windowing algorithm is defined by a piecewise window function w(t) (one plausible form is sketched after this passage), wherein:
  • T1 is the time length of the windowed speech frame;
  • w(t) represents the weighting value applied to the speech signal at time t within the time span of the speech frame.
  • When the computer device 1 windows each frame of speech data, it first obtains the frequency distribution information of the environmental noise in the speech data, automatically adjusts the variable K accordingly, and then windows the frame in segments according to the variable K.
  • Segmented windowing includes applying a cosine-like waveform at the beginning and end of the speech frame, to reduce environmental noise interference in the low-frequency part, and applying a rectangle-like window to the middle part of the speech frame, to avoid the high-frequency noise that an abrupt transition would generate.
  • For example, the computer device 1 may randomly select two voice sub-frames from the voice segment in advance, convert them to the frequency domain by Fourier transform, detect the frequency distribution of the environmental noise there, and then set K·T1 above the maximum frequency of the environmental noise.
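  • The exact window formula appears only as an image in the original publication and is not reproduced in this text. Based on the description above (cosine-like tapers at both ends whose extent is controlled by K, with a flat middle), the window closely resembles a Tukey (tapered-cosine) window; the sketch below, which continues the framing example, rests on that assumption.

```python
import numpy as np

def smooth_window(n_samples: int, k: float = 0.1) -> np.ndarray:
    """Assumed piecewise window: cosine-like tapers over the first and last
    fraction k of the frame, rectangle-like (flat) in the middle. Its shape
    matches scipy.signal.windows.tukey(n_samples, alpha=2 * k)."""
    t = np.arange(n_samples) / n_samples           # normalized time in [0, 1)
    w = np.ones(n_samples)
    rising = t < k                                 # taper up at the start
    w[rising] = 0.5 * (1 - np.cos(np.pi * t[rising] / k))
    falling = t > 1 - k                            # taper down at the end
    w[falling] = 0.5 * (1 - np.cos(np.pi * (1 - t[falling]) / k))
    return w

# Apply the window to every frame obtained in the previous sketch.
windowed_frames = frames * smooth_window(frames.shape[1])
```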
  • the extraction module 203 is configured to extract the Mel frequency cepstrum feature vector MFCC of the windowed speech frame of the speech segment.
  • After the windowing module 202 has windowed all the voice sub-frames of the voice segment, the extraction module 203 further processes the windowed voice frames of the voice segment to extract the Mel-frequency cepstral coefficient (MFCC) feature vector.
  • Specifically, the extraction module 203 first performs a discrete Fourier transform on the windowed speech frame to move it from the time domain to the frequency domain, and then maps the resulting spectrum onto the Mel scale and derives the cepstral coefficients, as sketched below.
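  • The Mel-mapping formula itself is given as an image in the original publication; the steps named (discrete Fourier transform, Mel-scale mapping, cepstrum) match the standard MFCC pipeline, in which the Mel scale is usually m = 2595·log10(1 + f/700). A minimal sketch under that assumption, continuing the examples above:

```python
import numpy as np
from scipy.fft import rfft
from scipy.fftpack import dct

def mel_filterbank(n_filters: int, n_fft: int, sr: int) -> np.ndarray:
    """Triangular filters spaced evenly on the Mel scale m = 2595*log10(1 + f/700)."""
    mel_max = 2595 * np.log10(1 + (sr / 2) / 700)
    hz_pts = 700 * (10 ** (np.linspace(0, mel_max, n_filters + 2) / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fb[i - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    return fb

def mfcc(frames: np.ndarray, sr: int, n_mfcc: int = 13, n_fft: int = 512) -> np.ndarray:
    power = np.abs(rfft(frames, n_fft)) ** 2              # DFT -> power spectrum
    mel_energy = power @ mel_filterbank(26, n_fft, sr).T  # Mel filterbank energies
    log_energy = np.log(mel_energy + 1e-10)               # log compression
    return dct(log_energy, type=2, axis=1, norm='ortho')[:, :n_mfcc]

# One 13-dimensional MFCC row per windowed frame.
features = mfcc(windowed_frames, 16000)
```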
  • the calculation module 204 is configured to calculate the distance between the MFCC and a voiceprint identification vector, where the voiceprint identification vector is obtained by pre-inputting sampled voice information of the user into a voiceprint feature training model for training.
  • the recognition module 205 is configured to determine that the recognition result of the speech segment is passed when the distance is less than a preset threshold.
  • Specifically, the computer device 1 samples the user's voice information in advance and inputs the sampled voice information into a voiceprint feature training model for training, so as to obtain the voiceprint discrimination vector corresponding to the user. Therefore, after the extraction module 203 extracts the MFCC vector of the speech segment, the calculation module 204 further calculates the distance between the MFCC vector and the voiceprint discrimination vector.
  • In this embodiment, the distance is the cosine distance, computed from the cosine similarity cos(x, y) = (x·y)/(‖x‖‖y‖), where:
  • x represents the standard voiceprint discrimination vector
  • y represents the current voiceprint discrimination vector.
  • In this way, the calculation module 204 uses the cosine distance formula to calculate the distance between the MFCC vector of the speech segment and the preset voiceprint discrimination vector, and the recognition module 205 then compares that distance with the preset threshold; when the distance is less than the threshold, it determines that the recognition result of the speech segment is a pass. A minimal sketch of this comparison follows.
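  • In the sketch below, averaging the frame-level MFCC rows (the `features` matrix from the previous sketch) into one utterance vector, the random stand-in enrolled vector, and the 0.3 threshold are all assumptions for illustration, as this application does not fix them.

```python
import numpy as np

def cosine_similarity(x: np.ndarray, y: np.ndarray) -> float:
    """cos(x, y) = x.y / (||x|| * ||y||); values near 1 mean similar voices."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

utterance_vec = features.mean(axis=0)   # assumed pooling of frame-level MFCCs
enrolled_vec = np.random.randn(13)      # stand-in for the trained voiceprint vector
distance = 1.0 - cosine_similarity(utterance_vec, enrolled_vec)
passed = distance < 0.3                 # 0.3 stands in for the preset threshold
```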
  • In another embodiment, the computer device 1 trains voiceprint discrimination vectors for different users through a GMM in advance and calculates the distance from the MFCC vector to each of them, thereby selecting the first voiceprint discrimination vector corresponding to the smallest distance that is less than the preset threshold; the first user corresponding to that first voiceprint discrimination vector is taken as the target user corresponding to the voice segment.
  • In addition, the computer device 1 also pre-trains a higher-accuracy GMM (Gaussian Mixture Model), where the GMM serves as a Universal Background Model (UBM) that can be used to extract the voiceprint discrimination vector from speech; the GMM is trained on a series of sample data so as to improve the training accuracy of the voiceprint discrimination vector.
  • GMM: Gaussian Mixture Model
  • UBM: Universal Background Model
  • Each voice data sample can be collected from the voices of different people in different environments (each sample corresponding to a voiceprint discrimination vector); such voice data samples are used to train a general background model that can characterize general speech characteristics.
  • Each voice data sample is processed separately to extract the preset type of voiceprint feature corresponding to it, and the voiceprint feature vector of each voice data sample is constructed from that preset type of voiceprint feature;
  • If the accuracy reaches the preset accuracy rate, the model training ends; otherwise, the number of voice data samples is increased, and the above steps B2, B3, B4, and B5 are re-executed on the enlarged sample set.
  • The preset accuracy rate is, for example, 98.5%.
  • In this way, the computer device 1 first processes the collected user voice information with the trained GMM to obtain the corresponding voiceprint discrimination vector, and the calculation module 204 then uses that voiceprint discrimination vector to calculate the distance to the MFCC vector of the voice segment, thereby improving accuracy. A minimal sketch of the UBM training loop follows.
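  • The sketch below illustrates the training loop of steps B2 to B5 using scikit-learn's GaussianMixture as the UBM; the component count, feature dimensions, and the random stand-in data are assumptions, and the accuracy check against the preset rate is only indicated in comments.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Each sample holds the voiceprint feature vectors (e.g., MFCC rows) of one
# recording of one speaker in one environment; random data stands in here.
rng = np.random.default_rng(0)
samples = [rng.normal(size=(200, 13)) for _ in range(50)]

# Steps B2/B3: pool the per-sample feature vectors and fit the UBM-GMM.
pooled = np.vstack(samples)
ubm = GaussianMixture(n_components=64, covariance_type='diag',
                      max_iter=200).fit(pooled)

# Steps B4/B5 (sketched): evaluate the model on held-out recordings; if the
# accuracy falls below the preset rate (e.g., 98.5%), add more voice data
# samples and refit. Here we just report the held-out log-likelihood.
held_out = rng.normal(size=(200, 13))
print('average held-out log-likelihood:', ubm.score(held_out))
```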
  • Through the program modules above, the computer device 1 can acquire a voice segment and divide it into frames to obtain each frame of voice data, and then window each frame of voice data according to the preset smooth windowing algorithm to obtain windowed speech frames; next, it extracts the Mel-frequency cepstral coefficient (MFCC) feature vector of the windowed speech frames of the voice segment and calculates the distance between the MFCC vector and the voiceprint discrimination vector; when the distance is less than the preset threshold, it determines that the recognition result of the voice segment is a pass.
  • In this way, the feature vector of the speech segment can be calculated more accurately while only slightly modifying the speech signal, thereby improving the accuracy of speech recognition.
  • In addition, this application also proposes a voice recognition method, which is applied to a computer device.
  • FIG. 3 is a schematic flowchart of an embodiment of a speech recognition method according to the present application.
  • the execution order of the steps in the flowchart shown in FIG. 3 can be changed, and some steps can be omitted.
  • Step S500 Acquire a voice segment, divide the voice segment into frames, and obtain each frame of voice data.
  • the computer device is connected to a user terminal, such as a mobile phone, a mobile terminal, a PC terminal, and other devices, and then the user's voice information is obtained through the user terminal.
  • The computer device may also be directly provided with a sound pickup unit to collect the user's voice data, and the voice data includes at least one voice segment; therefore, the computer device can obtain the voice segment. After acquiring the voice segment, the computer device further divides it into frames to obtain each frame of voice data.
  • Owing to the physiological characteristics of the human vocal tract, the high-frequency part of the speech segment is often suppressed; the computer device therefore also performs pre-emphasis processing on the speech segment to compensate for the suppressed high-frequency components.
  • Step S502: Sequentially window each frame of voice data of the voice segment according to a preset smooth windowing algorithm to obtain the windowed speech frames of the voice segment.
  • After the computer device divides the speech segment into frames, it further performs windowing on each frame of speech data of the speech segment.
  • the computer device sequentially windows each frame of speech data of the speech segment according to a preset smooth windowing algorithm, and then obtains a windowed speech frame of the speech segment.
  • In this embodiment, the smooth windowing algorithm is the piecewise window function w(t) described above, wherein:
  • T1 is the time length of the windowed speech frame;
  • w(t) represents the weighting value applied to the speech signal at time t within the time span of the speech frame.
  • When the computer device windows each frame of speech data, it first obtains the frequency distribution information of the environmental noise in the speech data, automatically adjusts the variable K accordingly, and then windows the frame in segments according to the variable K.
  • Segmented windowing includes applying a cosine-like waveform at the start and end of the speech frame, to reduce environmental noise interference in the low-frequency part, and applying a rectangle-like window to the middle part of the speech frame, to avoid the high-frequency noise that an abrupt transition would generate.
  • For example, the computer device may randomly select two voice sub-frames from the voice segment in advance, convert them to the frequency domain by Fourier transform, detect the frequency distribution of the environmental noise there, and then set K·T1 above the maximum frequency of the environmental noise, as in the sketch below.
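  • A minimal sketch of this noise check, continuing the earlier examples; treating any spectral bin above one percent of the peak magnitude as environmental noise is purely an assumed heuristic, and the mapping from the detected noise ceiling to K is not spelled out in this application.

```python
import numpy as np
from scipy.fft import rfft, rfftfreq

def noise_max_frequency(frames: np.ndarray, sr: int,
                        floor: float = 0.01) -> float:
    """Estimate the highest frequency at which environmental noise is present,
    from two randomly chosen sub-frames of the segment."""
    rng = np.random.default_rng()
    picks = frames[rng.choice(len(frames), size=2, replace=False)]
    spectrum = np.abs(rfft(picks, axis=1)).mean(axis=0)  # averaged magnitudes
    freqs = rfftfreq(frames.shape[1], d=1.0 / sr)
    active = freqs[spectrum > floor * spectrum.max()]
    return float(active.max()) if active.size else 0.0

# The taper extent K*T1 would then be chosen above this noise ceiling.
f_noise = noise_max_frequency(frames, 16000)
```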
  • Step S504 Extract the Mel frequency cepstrum feature vector MFCC of the windowed speech frame of the speech segment.
  • After the computer device performs windowing on all voice sub-frames of the voice segment, it further processes the windowed speech frames of the voice segment to extract the Mel-frequency cepstral coefficient (MFCC) feature vector.
  • Specifically, the computer device first performs a discrete Fourier transform on the windowed speech frame to move it from the time domain to the frequency domain, and then maps the resulting spectrum onto the Mel scale and derives the cepstral coefficients, as in the sketch given earlier.
  • Step S506 Calculate the distance between the MFCC and the voiceprint identification vector, where the voiceprint identification vector is obtained by pre-inputting sampled voice information of the user into a voiceprint feature training model for training.
  • Step S508 When the distance is less than a preset threshold, it is determined that the recognition result of the speech segment is passed.
  • The computer device samples the user's voice information in advance and inputs the sampled voice information into the voiceprint feature training model for training, so as to obtain the voiceprint discrimination vector corresponding to the user. Therefore, after the computer device extracts the MFCC vector of the speech segment, it further calculates the distance between the MFCC vector and the voiceprint discrimination vector.
  • In this embodiment, the distance is the cosine distance, computed from the cosine similarity cos(x, y) = (x·y)/(‖x‖‖y‖), where:
  • x represents the standard voiceprint discrimination vector
  • y represents the current voiceprint discrimination vector.
  • In this way, the computer device uses the cosine distance formula to calculate the distance between the MFCC vector of the speech segment and the preset voiceprint discrimination vector, and then compares that distance with the preset threshold; when the distance is less than the threshold, it determines that the recognition result of the speech segment is a pass.
  • In another embodiment, the computer device trains voiceprint discrimination vectors for different users through the GMM in advance and calculates the distance from the MFCC vector to each of them, thereby selecting the first voiceprint discrimination vector corresponding to the smallest distance that is less than the preset threshold, and taking the first user corresponding to that first voiceprint discrimination vector as the target user corresponding to the voice segment.
  • In addition, the computer device also pre-trains a higher-accuracy GMM (Gaussian Mixture Model), where the GMM serves as a Universal Background Model (UBM) that can be used to extract the voiceprint discrimination vector from speech; the GMM is trained on a series of sample data so as to improve the training accuracy of the voiceprint discrimination vector.
  • GMM: Gaussian Mixture Model
  • UBM: Universal Background Model
  • Each voice data sample can be collected from the voices of different people in different environments (each sample corresponding to a voiceprint discrimination vector); such voice data samples are used to train a general background model that can characterize general speech characteristics.
  • Each voice data sample is processed separately to extract the preset type of voiceprint feature corresponding to it, and the voiceprint feature vector of each voice data sample is constructed from that preset type of voiceprint feature;
  • If the accuracy reaches the preset accuracy rate, the model training ends; otherwise, the number of voice data samples is increased, and the above steps B2, B3, B4, and B5 are re-executed on the enlarged sample set.
  • The preset accuracy rate is, for example, 98.5%.
  • In this way, the computer device first processes the collected user voice information with the trained GMM to obtain the corresponding voiceprint discrimination vector, and then uses that voiceprint discrimination vector to calculate the distance to the MFCC vector of the voice segment, thereby improving accuracy.
  • The speech recognition method proposed in this embodiment acquires a speech segment and divides it into frames to obtain each frame of speech data, then windows each frame of speech data according to a preset smooth windowing algorithm to obtain windowed speech frames; it then extracts the Mel-frequency cepstral coefficient (MFCC) feature vector of the windowed speech frames of the speech segment and calculates the distance between the MFCC vector and the voiceprint discrimination vector; when the distance is less than a preset threshold, it determines that the recognition result of the speech segment is a pass.
  • In this way, the feature vector of the speech segment can be calculated more accurately while only slightly modifying the speech signal, thereby improving the accuracy of speech recognition. The steps are drawn together in the sketch below.
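  • Putting steps S500 to S508 together, the following end-to-end verification sketch reuses the helper functions sketched earlier; the pooling choice and the threshold remain assumptions for illustration.

```python
import numpy as np

def verify_speaker(speech: np.ndarray, sr: int,
                   enrolled_vec: np.ndarray, threshold: float = 0.3) -> bool:
    """S500 frame -> S502 window -> S504 MFCC -> S506 distance -> S508 decide."""
    frames = frame_signal(pre_emphasize(speech), sr)        # S500
    windowed = frames * smooth_window(frames.shape[1])      # S502
    feats = mfcc(windowed, sr)                              # S504
    utterance_vec = feats.mean(axis=0)                      # assumed pooling
    distance = 1.0 - cosine_similarity(utterance_vec, enrolled_vec)  # S506
    return distance < threshold                             # S508
```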
  • The technical solution of this application, in essence or in the part that contributes to the existing technology, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions for enabling a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the method described in each embodiment of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The present application relates to a speech recognition method and apparatus, a computer device, and a non-volatile computer-readable storage medium. The method comprises the steps of: acquiring a speech segment and dividing it into frames to obtain each frame of speech data (S500); sequentially windowing each frame of speech data of the speech segment according to a preset smooth windowing algorithm to obtain the windowed speech frames of the speech segment (S502); extracting the Mel-frequency cepstral coefficient (MFCC) feature vector of the windowed speech frames of the speech segment (S504); calculating the distance between the MFCC vector and a voiceprint discrimination vector (S506); and, when the distance is less than a preset threshold, determining that the recognition result of the speech segment is a pass (S508). The speech recognition method can calculate the feature vector of a speech segment more accurately, thereby improving the accuracy of speech recognition.
PCT/CN2019/117761 2019-09-16 2019-11-13 Speech recognition method and apparatus, and computer device WO2021051572A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910871726.5 2019-09-16
CN201910871726.5A CN110556126B (zh) 2019-09-16 Speech recognition method and device, and computer equipment

Publications (1)

Publication Number Publication Date
WO2021051572A1 true WO2021051572A1 (fr) 2021-03-25

Family

ID=68740361

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/117761 WO2021051572A1 (fr) 2019-09-16 2019-11-13 Speech recognition method and apparatus, and computer device

Country Status (2)

Country Link
CN (1) CN110556126B (fr)
WO (1) WO2021051572A1 (fr)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111210829A (zh) * 2020-02-19 2020-05-29 腾讯科技(深圳)有限公司 语音识别方法、装置、系统、设备和计算机可读存储介质
CN111508498B (zh) * 2020-04-09 2024-01-30 携程计算机技术(上海)有限公司 对话式语音识别方法、系统、电子设备和存储介质
CN111933153B (zh) * 2020-07-07 2024-03-08 北京捷通华声科技股份有限公司 一种语音分割点的确定方法和装置
CN113098850A (zh) * 2021-03-24 2021-07-09 北京嘀嘀无限科技发展有限公司 一种语音验证方法、装置和电子设备
CN115129923B (zh) * 2022-05-17 2023-10-20 荣耀终端有限公司 语音搜索方法、设备及存储介质
CN114945099B (zh) * 2022-05-18 2024-04-26 广州博冠信息科技有限公司 语音监控方法、装置、电子设备及计算机可读介质

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107993071A (zh) * 2017-11-21 2018-05-04 平安科技(深圳)有限公司 电子装置、基于声纹的身份验证方法及存储介质
CN108899032A (zh) * 2018-06-06 2018-11-27 平安科技(深圳)有限公司 声纹识别方法、装置、计算机设备及存储介质
CN110047490A (zh) * 2019-03-12 2019-07-23 平安科技(深圳)有限公司 声纹识别方法、装置、设备以及计算机可读存储介质

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8160877B1 (en) * 2009-08-06 2012-04-17 Narus, Inc. Hierarchical real-time speaker recognition for biometric VoIP verification and targeting
US20120158809A1 (en) * 2010-12-17 2012-06-21 Toshifumi Yamamoto Compensation Filtering Device and Method Thereof
CN105232064A (zh) * 2015-10-30 2016-01-13 科大讯飞股份有限公司 一种预测音乐对驾驶员行为影响的系统和方法
CN107527620A (zh) * 2017-07-25 2017-12-29 平安科技(深圳)有限公司 电子装置、身份验证的方法及计算机可读存储介质
CN109040913A (zh) * 2018-08-06 2018-12-18 中国船舶科学研究中心(中国船舶重工集团公司第七0二研究所) 窗函数加权电声换能器发射阵列的波束成形方法
CN110197657A (zh) * 2019-05-22 2019-09-03 大连海事大学 一种基于余弦相似度的动态音声特征提取方法

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113744759A (zh) * 2021-09-17 2021-12-03 广州酷狗计算机科技有限公司 音色模板定制方法及其装置、设备、介质、产品
CN113744759B (zh) * 2021-09-17 2023-09-22 广州酷狗计算机科技有限公司 音色模板定制方法及其装置、设备、介质、产品
CN117577137A (zh) * 2024-01-15 2024-02-20 宁德时代新能源科技股份有限公司 切刀健康评估方法、装置、设备及存储介质
CN117577137B (zh) * 2024-01-15 2024-05-28 宁德时代新能源科技股份有限公司 切刀健康评估方法、装置、设备及存储介质

Also Published As

Publication number Publication date
CN110556126B (zh) 2024-01-05
CN110556126A (zh) 2019-12-10

Similar Documents

Publication Publication Date Title
WO2021051572A1 (fr) Procédé et appareil de reconnaissance vocale et dispositif informatique
JP6621536B2 (ja) 電子装置、身元認証方法、システム及びコンピュータ読み取り可能な記憶媒体
WO2021128741A1 (fr) Procédé et appareil d'analyse de fluctuation d'émotion dans la voix, et dispositif informatique et support de stockage
CN106683680B (zh) 说话人识别方法及装置、计算机设备及计算机可读介质
JP6649474B2 (ja) 声紋識別方法、装置及びバックグラウンドサーバ
WO2019100606A1 (fr) Dispositif électronique, procédé et système de vérification d'identité à base d'empreinte vocale, et support de stockage
WO2018166187A1 (fr) Serveur, procédé et système de vérification d'identité, et support d'informations lisible par ordinateur
US9343067B2 (en) Speaker verification
WO2020181824A1 (fr) Procédé, appareil et dispositif de reconnaissance d'empreinte vocale et support de stockage lisible par ordinateur
WO2021042537A1 (fr) Procédé et système d'authentification de reconnaissance vocale
WO2019136912A1 (fr) Dispositif électronique, procédé et système d'authentification d'identité, et support de stockage
WO2019136909A1 (fr) Procédé de détection de corps vivant vocal basé sur un apprentissage profond, serveur et support de stockage
WO2019136911A1 (fr) Procédé et appareil de reconnaissance vocale, dispositif terminal et support de stockage
CN113223536B (zh) 声纹识别方法、装置及终端设备
WO2019232826A1 (fr) Procédé d'extraction de vecteur i, procédé et appareil d'identification de locuteur, dispositif, et support
EP3989217A1 (fr) Procédé pour détecter une attaque audio adverse par rapport à une entrée vocale traitée par un système de reconnaissance vocale automatique, dispositif correspondant, produit programme informatique et support lisible par ordinateur
WO2019196305A1 (fr) Dispositif électronique, procédé de vérification d'identité, et support de stockage
WO2019218512A1 (fr) Serveur, procédé de vérification d'empreinte vocale et support d'informations
CN113035202A (zh) 一种身份识别方法和装置
CN108630208B (zh) 服务器、基于声纹的身份验证方法及存储介质
CN116312559A (zh) 跨信道声纹识别模型的训练方法、声纹识别方法及装置
CN114171032A (zh) 跨信道声纹模型训练方法、识别方法、装置及可读介质
CN113035230A (zh) 认证模型的训练方法、装置及电子设备
CN112992174A (zh) 一种语音分析方法及其语音记录装置
WO2021196458A1 (fr) Procédé d'entrée de prêt intelligent, appareil et support de stockage

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19945851

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19945851

Country of ref document: EP

Kind code of ref document: A1