WO2019179033A1 - Speaker authentication method, server and computer readable storage medium - Google Patents

Speaker authentication method, server and computer readable storage medium

Info

Publication number
WO2019179033A1
Authority
WO
WIPO (PCT)
Prior art keywords
speaker
neural network
convolutional neural
network architecture
voice
Prior art date
Application number
PCT/CN2018/102203
Other languages
English (en)
Chinese (zh)
Inventor
王义文
王健宗
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2019179033A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification
    • G10L17/02 - Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 - Training, enrolment or model building
    • G10L17/18 - Artificial neural networks; Connectionist approaches
    • G10L17/22 - Interactive procedures; Man-machine interfaces

Definitions

  • The present application relates to the field of identity authentication, and in particular to a speaker authentication method, a server, and a computer readable storage medium.
  • For information security, most smart devices are equipped with an authentication password.
  • The usual basis of identity authentication is a fingerprint, a numeric password, or a pattern password, but entering these by key or touch screen is often not the most efficient approach; voice input is more convenient.
  • Current voice recognition mainly works by having the user speak a specific text: when the smart device recognizes the corresponding content, identity verification succeeds. However, using a specific utterance as a password is easy to crack and poses a security risk.
  • In view of this, the present application provides a speaker authentication method, a server, and a computer readable storage medium.
  • The present application provides a speaker authentication method, which is applied to a server and includes:
  • acquiring voice information of a preset speaker, wherein the content of the voice information is not limited; constructing a 3D convolutional neural network architecture, and inputting the speaker's voice information into it; creating and storing the speaker's voice model through the 3D convolutional neural network architecture;
  • when a test utterance is received, comparing the test utterance information with the stored speech model of the speaker;
  • calculating the similarity between the test utterance information and the speaker's voice model; when the similarity is greater than a preset value, the speaker authentication succeeds, and when the similarity is less than the preset value, the speaker authentication fails.
  • The present application further provides a server, where the server includes a memory and a processor, the memory storing a speaker authentication system operable on the processor.
  • When executed by the processor, the speaker authentication system implements the steps of the speaker authentication method described above, including:
  • when a test utterance is received, comparing the test utterance information with the stored speech model of the speaker.
  • the step of creating and storing the speaker's voice model by using the 3D convolutional neural network architecture includes:
  • generating the speaker's speech model from an average vector of the audio stack frames belonging to the speaker.
  • In addition, the present application further provides a computer readable storage medium storing a speaker authentication system, the speaker authentication system being executable by at least one processor, such that the at least one processor performs the steps of the speaker authentication method as described above.
  • FIG. 1 is a schematic diagram of an optional hardware architecture of the server of the present application.
  • FIG. 2 is a schematic diagram of the program modules of a first embodiment of a speaker authentication system of the present application.
  • FIG. 3 is a schematic diagram of parsing speaker speech into audio stream stack frames.
  • FIG. 4 is a schematic flow chart of a first embodiment of a speaker authentication method according to the present application.
  • FIG. 5 is a schematic diagram of a specific process of step S303 in the first embodiment of the speaker authentication method of the present application.
  • Reference numerals: server 2; memory 11; processor 12; network interface 13; speaker authentication system 200; acquisition module 201; building module 202; input module 203; comparison module 204; calculation module 205; parsing module 206.
  • The server 2 may include, but is not limited to, a memory 11, a processor 12, and a network interface 13, which are communicably connected to one another through a system bus. Note that FIG. 1 shows only the server 2 with components 11-13; it should be understood that not all illustrated components are required, and more or fewer components may be implemented instead.
  • The memory 11 includes at least one type of readable storage medium, including flash memory, hard disks, multimedia cards, card-type memory (e.g., SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disks, optical disks, and the like.
  • the memory 11 may be an internal storage unit of the server 2, such as a hard disk or memory of the server 2.
  • The memory 11 may also be an external storage device of the server 2, such as a plug-in hard disk equipped on the server 2, a smart memory card (SMC), a Secure Digital (SD) card, a flash card, etc.
  • the memory 11 can also include both the internal storage unit of the server 2 and its external storage device.
  • the memory 11 is generally used to store an operating system installed on the server 2 and various types of application software, such as program codes of the speaker authentication system 200. Further, the memory 11 can also be used to temporarily store various types of data that have been output or are to be output.
  • the processor 12 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments.
  • the processor 12 is typically used to control the overall operation of the server 2, such as performing control and processing related to data interaction or communication with the terminal device 1.
  • the processor 12 is configured to run program code or process data stored in the memory 11, such as running the speaker authentication system 200 and the like.
  • the network interface 13 may comprise a wireless network interface or a wired network interface, which is typically used to establish a communication connection between the server 2 and other electronic devices.
  • the present application proposes a speaker authentication system 200.
  • FIG. 2 it is a program block diagram of the first embodiment of the speaker authentication system 200 of the present application.
  • The speaker authentication system 200 includes a series of computer program instructions stored in the memory 11, and when the computer program instructions are executed by the processor 12, the speaker authentication operations of the embodiments of the present application can be implemented.
  • The speaker authentication system 200 can be divided into one or more modules based on the particular operations implemented by the various portions of the computer program instructions. For example, in FIG. 2, the speaker authentication system 200 can be divided into an acquisition module 201, a building module 202, an input module 203, a comparison module 204, and a calculation module 205. Among them:
  • The acquisition module 201 is configured to acquire voice information of a preset speaker, where the content of the voice information is not limited.
  • There are two ways of using acoustic features for speaker authentication: one is to compute long-term statistics of acoustic feature parameters, and the other is to analyze several specific sounds.
  • Long-term statistics of acoustic feature parameters do not depend on what the speaker says, that is, they are unrelated to the text; this is called text-independent speaker recognition.
  • Analyzing specific sounds requires limiting the content of the speech: the speaker must utter certain specific words, so it is related to the text; this is called text-dependent speaker recognition.
  • When voice is used as the password of the server 2, a specific utterance used as the password is easy to crack, which poses a security risk. Therefore, in the present embodiment, text-independent speaker verification is employed.
  • Specifically, the server 2 acquires the voice information of the speaker through the acquisition module 201, and the content of the voice information is not restricted, that is, it is independent of the text.
  • Take voice passwords as an example to illustrate text-dependent versus text-independent applications.
  • Text-dependent means that the content of the voice is pre-defined. For example, if the content is limited to "study well", the password is considered correct only when the user says "study well".
  • Text-independent means that the voice content is not limited: regardless of whether the user says "study well" or "make progress every day", the password is considered correct as long as the voice matches the speaker's voice model stored by the server.
  • The stored speaker voice model is detailed below.
  • The building module 202 is configured to construct a 3D (three-dimensional) convolutional neural network architecture, and the speaker's voice information is input into the 3D convolutional neural network architecture through the input module 203.
  • Specifically, the server 2 constructs the 3D convolutional neural network architecture through the building module 202.
  • The 3D convolutional neural network architecture (3D-CNN) includes, in order from the input end: a hardwired layer H1, a convolutional layer, a downsampling layer, a convolutional layer, a downsampling layer, a convolutional layer, a fully connected layer, and a classification layer.
  • The speaker's voice information is fed to the input end of the 3D convolutional neural network.
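  • As an illustration only, the following minimal PyTorch sketch shows one plausible way to stack the layers listed above for inputs of shape n×80×40 (utterances × frames × MFEC features, described below). PyTorch itself, the channel counts, kernel sizes, and the number of speakers are assumptions of the sketch, not values taken from this application; the hardwired layer H1 is treated here as input preprocessing and omitted.

        import torch
        import torch.nn as nn

        class Speaker3DCNN(nn.Module):
            """Sketch of the stated layer order: conv, downsample, conv, downsample,
            conv, fully connected, classification. Hyperparameters are illustrative."""
            def __init__(self, num_speakers: int = 100):
                super().__init__()
                self.features = nn.Sequential(
                    nn.Conv3d(1, 16, kernel_size=3, padding=1),   # convolutional layer
                    nn.ReLU(),
                    nn.MaxPool3d(kernel_size=(1, 2, 2)),          # downsampling layer
                    nn.Conv3d(16, 32, kernel_size=3, padding=1),  # convolutional layer
                    nn.ReLU(),
                    nn.MaxPool3d(kernel_size=(1, 2, 2)),          # downsampling layer
                    nn.Conv3d(32, 64, kernel_size=3, padding=1),  # convolutional layer
                    nn.ReLU(),
                    nn.AdaptiveAvgPool3d(1),                      # collapse to one 64-dim vector
                )
                self.fc = nn.Linear(64, 128)                      # fully connected layer (d-vector)
                self.classifier = nn.Linear(128, num_speakers)    # classification layer

            def forward(self, x: torch.Tensor) -> torch.Tensor:
                h = self.features(x).flatten(1)
                return self.classifier(self.fc(h))

        # Example: batch of 2 inputs, each with n=3 utterances of 80 frames x 40 features.
        logits = Speaker3DCNN()(torch.randn(2, 1, 3, 80, 40))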
  • The building module 202 is further configured to create and store the speaker's voice model through the 3D convolutional neural network architecture.
  • If the server 2 wants to confirm a person's identity, for example to confirm whether the person is an administrator or someone with the authority to open the server, the server 2 must internally store that speaker's voice model. That is, the server 2 has to collect the speaker's voice and build his model, also called the target model.
  • In this embodiment, the building module 202 creates the speaker's voice model from the acquired voice information through the 3D convolutional neural network architecture and stores it in the internal storage of the server 2.
  • The 3D convolutional neural network architecture can analyze the speaker's voiceprint information. A voiceprint can be recognized because each person's oral cavity, nasal cavity, and vocal tract structure differ uniquely. From the acquired voice information, the voiceprint information is analyzed, indirectly reflecting these differences of the vocal organs, to determine the identity of the speaker.
  • The comparison module 204 is configured to compare the test utterance information with the stored speech model of the speaker when the test utterance information is received.
  • When the server 2 sets a voice password, it can be unlocked only by an authenticated administrator or a person with the authority to open the server.
  • When the server 2 receives test utterance information, for example the utterance information of A, the server 2 acquires A's voice information through the comparison module 204 and extracts the voiceprint information from it. The voiceprint information of A is then compared with the speaker's voice model stored in the server 2 to verify whether A is an administrator or a person with the authority to open the server.
  • The calculation module 205 is configured to calculate the similarity between the test utterance information and the speaker's voice model. When the similarity is greater than a preset value, the speaker authentication succeeds; when the similarity is less than the preset value, the speaker authentication fails.
  • Specifically, the server 2 obtains a similarity score, that is, the similarity, by having the calculation module 205 compute the cosine similarity between the speaker's voice model and the test utterance information. It is then judged from the similarity whether the current speaker is an administrator or a person with the authority to open the server.
  • the speaker authentication system 200 further includes a parsing module 206, wherein:
  • The parsing module 206 is configured to parse the obtained voice information of the speaker into audio stack frames.
  • FIG. 3 is a schematic diagram of parsing speaker speech into audio stream stack frames according to the present application.
  • MFCC (Mel Frequency Cepstral Coefficient) features are considered first.
  • However, the DCT operation in the final stage of MFCC generation makes the resulting features non-local, which conflicts with the local features assumed by the convolution operation.
  • Therefore, the logarithmic-energy features, namely MFEC, are used instead.
  • The features extracted as MFEC are essentially the features obtained by discarding the DCT operation. Temporally, overlapping 20 ms windows with a 10 ms stride are used to generate the spectral features (the audio stack).
  • 80 temporal feature sets (each consisting of 40 MFEC features) can be obtained from the input speech feature map.
  • The dimensions of each input feature are n×80×40, consisting of 80 input frames and their corresponding feature maps, where n represents the number of utterances used in the 3D convolutional neural network architecture.
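  • As a minimal sketch of this parsing step, the following Python function computes 40 log mel-filterbank energies (MFEC, i.e. MFCC with the DCT discarded) over 20 ms windows with a 10 ms stride and stacks them into 80-frame segments. The use of librosa, the 16 kHz sample rate, and the flooring constant are assumptions of the sketch, not details from this application.

        import numpy as np
        import librosa

        def mfec_stack(wav_path: str, sr: int = 16000) -> np.ndarray:
            """Parse a speech file into an n x 80 x 40 audio stack."""
            y, sr = librosa.load(wav_path, sr=sr)
            mel = librosa.feature.melspectrogram(
                y=y, sr=sr,
                n_fft=int(0.020 * sr),       # 20 ms analysis window
                hop_length=int(0.010 * sr),  # 10 ms stride (windows overlap)
                n_mels=40)                   # 40 MFEC features per frame
            logmel = np.log(mel + 1e-6).T    # log energies, shape (frames, 40)
            n = logmel.shape[0] // 80        # number of 80-frame segments
            return logmel[: n * 80].reshape(n, 80, 40)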
  • The input module 203 is further configured to input the audio stack frames into the 3D convolutional neural network architecture.
  • The building module 202 is further configured to generate a vector for each utterance of the audio stack frames, and to generate the speaker's voice model from the average vector of the audio stack frames belonging to the speaker.
  • Specifically, the server 2 parses the acquired speaker speech into stacked frames of the audio stream using the parsing module 206, and inputs the audio stack frames into the 3D convolutional neural network architecture through the input module 203.
  • Finally, through the building module 202, each utterance directly generates a d-vector, and the average d-vector of the speaker's utterances is used to generate the speaker model.
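  • Enrollment then reduces to averaging, as in this small sketch; the length normalization of the averaged d-vector is an assumption added for numerical convenience, not a step stated in this application.

        import numpy as np

        def enroll_speaker(d_vectors: list) -> np.ndarray:
            """Build the speaker's voice model (target model) as the average of the
            d-vectors the network produces for each enrollment utterance."""
            model = np.mean(np.stack(d_vectors), axis=0)
            return model / np.linalg.norm(model)  # length-normalized (assumption)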
  • The server 2 may also acquire a plurality of different voice recordings of the same speaker, parse each of them into a feature map, and superimpose the feature maps together.
  • The superimposed feature maps are converted into an input to the 3D convolutional neural network architecture to generate the speaker's speech model.
  • The similarity is calculated using the cosine similarity formula:
  • cos(D1, D2) = (D1 · D2) / (‖D1‖ × ‖D2‖)
  • where D1 represents the vector of the test utterance information, D2 represents the vector of the speaker's speech model, the numerator is the dot product of the two vectors, and the denominator is the product of the moduli of the two vectors.
  • The server 2 presets a threshold value; when the calculated similarity is greater than the preset value, the speaker verification succeeds, that is, A is an administrator or a person with the authority to open the server. Likewise, when the calculated similarity is less than the preset value, the speaker authentication fails.
  • In that case, the server 2 locks itself or issues an alarm, improving the security of server use.
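  • A minimal sketch of this decision, assuming the d-vectors above; the 0.8 threshold is an illustrative preset value, not one specified in this application.

        import numpy as np

        def verify(test_d: np.ndarray, speaker_model: np.ndarray,
                   preset: float = 0.8) -> bool:
            """Cosine similarity of the test utterance vector (D1) and the stored
            model (D2): dot product over the product of the moduli."""
            sim = np.dot(test_d, speaker_model) / (
                np.linalg.norm(test_d) * np.linalg.norm(speaker_model))
            return sim > preset  # True: authentication succeeds; False: lock or alarm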
  • The speaker authentication system 200 proposed by the present application first acquires voice information of a preset speaker, the content of which is not restricted; then constructs a 3D convolutional neural network architecture and inputs the speaker's voice information into it; then creates and stores the speaker's voice model through the 3D convolutional neural network architecture; then, when a test utterance is received, compares the test utterance information with the stored speech model of the speaker; and finally calculates the similarity between the test utterance information and the speaker's speech model. When the similarity is greater than a preset value, the speaker authentication succeeds; when the similarity is less than the preset value, the speaker authentication fails.
  • the present application also proposes a speaker authentication method.
  • FIG. 4 it is a schematic flowchart of the first embodiment of the speaker authentication method of the present application.
  • the order of execution of the steps in the flowchart shown in FIG. 4 may be changed according to different requirements, and some steps may be omitted.
  • Step S301: acquire voice information of a preset speaker, wherein the content of the voice information is not limited.
  • There are two ways of using acoustic features for speaker authentication: one is to compute long-term statistics of acoustic feature parameters, and the other is to analyze several specific sounds.
  • Long-term statistics of acoustic feature parameters do not depend on what the speaker says, that is, they are unrelated to the text; this is called text-independent speaker recognition.
  • Analyzing specific sounds requires limiting the content of the speech: the speaker must utter certain specific words, so it is related to the text; this is called text-dependent speaker recognition.
  • When voice is used as the password of the server, a specific utterance used as the password is easy to crack, which poses a security risk. Therefore, in the present embodiment, text-independent speaker verification is employed.
  • Specifically, the server 2 acquires the speaker's voice information, whose content is not restricted, that is, it is independent of the text.
  • Text-dependent means that the content of the voice is pre-defined. For example, if the content is limited to "study well", the password is considered correct only when the user says "study well".
  • Text-independent means that the voice content is not limited: regardless of whether the user says "study well" or "make progress every day", the password is considered correct as long as the voice matches the speaker's voice model stored by the server.
  • The stored speaker voice model is detailed below.
  • Step S302: construct a 3D convolutional neural network architecture, and input the speaker's voice information into the 3D convolutional neural network architecture through the input module 203.
  • Specifically, the server 2 constructs a 3D convolutional neural network architecture.
  • The 3D convolutional neural network architecture (3D-CNN) includes, in order from the input end: a hardwired layer H1, a convolutional layer, a downsampling layer, a convolutional layer, a downsampling layer, a convolutional layer, a fully connected layer, and a classification layer.
  • The speaker's voice information is fed to the input end of the 3D convolutional neural network.
  • Step S303: create and store the speaker's voice model through the 3D convolutional neural network architecture.
  • If the server 2 wants to confirm a person's identity, for example to confirm whether the person is an administrator or someone with the authority to open the server, the server 2 must internally store that speaker's voice model. That is, the server 2 has to collect the speaker's voice and build his model, also called the target model. In this embodiment, the server 2 creates the speaker's voice model from the acquired voice information through the 3D convolutional neural network architecture and stores it in the internal storage of the server 2.
  • Further, step S303, creating and storing the speaker's voice model through the 3D convolutional neural network architecture, specifically includes steps S401-S403.
  • Step S401: parse the obtained voice information of the speaker into audio stack frames.
  • FIG. 3 is a schematic diagram of parsing speaker speech into audio stream stack frames according to the present application.
  • MFCC (Mel Frequency Cepstral Coefficient) features are considered first.
  • However, the DCT operation in the final stage of MFCC generation makes the resulting features non-local, which conflicts with the local features assumed by the convolution operation.
  • Therefore, the logarithmic-energy features, namely MFEC, are used instead.
  • The features extracted as MFEC are essentially the features obtained by discarding the DCT operation. Temporally, overlapping 20 ms windows with a 10 ms stride are used to generate the spectral features (the audio stack).
  • 80 temporal feature sets (each consisting of 40 MFEC features) can be obtained from the input speech feature map.
  • The dimensions of each input feature are n×80×40, consisting of 80 input frames and their corresponding feature maps, where n represents the number of utterances used in the 3D convolutional neural network architecture.
  • Step S402: input the audio stack frames into the 3D convolutional neural network architecture.
  • Step S403: generate a vector for each utterance of the audio stack frames, and generate the speaker's voice model from the average vector of the audio stack frames belonging to the speaker.
  • Specifically, the server 2 parses the acquired speaker speech into stacked frames of audio streams, inputs the audio stack frames into the 3D convolutional neural network architecture, and finally each utterance directly generates a d-vector; the average d-vector of the speaker's utterances is used to generate the speaker model.
  • In addition, the server 2 may also acquire a plurality of different voice recordings of the same speaker, parse each of them into a feature map, and superimpose the feature maps together, as sketched below.
  • The superimposed feature maps are converted into an input to the 3D convolutional neural network architecture to generate the speaker's speech model.
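  • A small sketch of this superimposition step, assuming each utterance has already been parsed into an 80×40 MFEC feature map (for instance by the mfec_stack sketch above):

        import numpy as np

        def stack_utterances(feature_maps: list) -> np.ndarray:
            """Superimpose several 80x40 feature maps of the same speaker into one
            n x 80 x 40 tensor, where n is the number of utterances."""
            assert all(fm.shape == (80, 40) for fm in feature_maps)
            return np.stack(feature_maps)  # shape: (n, 80, 40)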
  • Step S304: when test utterance information is received, compare the test utterance with the stored speech model of the speaker.
  • When the server 2 sets a voice password, it can be unlocked only by an authenticated administrator or a person with the authority to open the server.
  • Specifically, when the server 2 receives test utterance information, for example the utterance information of A, it extracts the voiceprint information from A's voice information, and then compares A's voiceprint information with the speaker's voice model stored inside the server 2 to verify whether A is an administrator or a person with the authority to open the server.
  • Step S305: calculate the similarity between the test utterance information and the speaker's voice model. When the similarity is greater than a preset value, the speaker authentication succeeds; when the similarity is less than the preset value, the speaker authentication fails.
  • Specifically, the server 2 computes the cosine similarity between the speaker's speech model and the test utterance information to obtain a similarity score, that is, the similarity. It is then judged from the similarity whether the current speaker is an administrator or a person with the authority to open the server.
  • The similarity is calculated using the following cosine similarity formula:
  • cos(D1, D2) = (D1 · D2) / (‖D1‖ × ‖D2‖)
  • where D1 represents the vector of the test utterance information, D2 represents the vector of the speaker's speech model, the numerator is the dot product of the two vectors, and the denominator is the product of the moduli of the two vectors.
  • The server 2 presets a threshold value; when the calculated similarity is greater than the preset value, the speaker verification succeeds, that is, A is an administrator or a person with the authority to open the server. Likewise, when the calculated similarity is less than the preset value, the speaker authentication fails.
  • In that case, the server 2 locks itself or issues an alarm, improving the security of server use.
  • The speaker authentication method proposed by the present application first acquires voice information of a preset speaker, the content of which is not restricted; then constructs a 3D convolutional neural network architecture and inputs the speaker's voice information into it; then creates and stores the speaker's voice model through the 3D convolutional neural network architecture; then, when a test utterance is received, compares the test utterance information with the stored speech model of the speaker; and finally calculates the similarity between the test utterance information and the speaker's speech model. When the similarity is greater than a preset value, the speaker authentication succeeds; when the similarity is less than the preset value, the speaker authentication fails.
  • By using a text-independent speaker model as the password, the system is not easy to crack, which improves server security.
  • Through the description of the above embodiments, it can be seen that the foregoing method can be implemented by software plus a necessary general-purpose hardware platform, or of course by hardware, though in many cases the former is the better implementation.
  • Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc), including a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the methods described in the various embodiments of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Telephonic Communication Services (AREA)

Abstract

Disclosed is a speaker authentication method, comprising the steps of: acquiring voice information of a preset speaker, the content of the voice information being unrestricted (S301); constructing a 3D convolutional neural network architecture and inputting the voice information of the speaker into the 3D convolutional neural network architecture (S302); creating and storing a voice model of the speaker by means of the 3D convolutional neural network architecture (S303); when a test utterance is received, comparing the test utterance information with the stored voice model of the speaker (S304); and calculating the degree of similarity between the test utterance information and the voice model of the speaker; when the degree of similarity is greater than a preset value, the speaker is authenticated successfully; when the degree of similarity is less than the preset value, the speaker authentication fails (S305). The present invention also relates to a server and a computer readable storage medium.
PCT/CN2018/102203 2018-03-23 2018-08-24 Speaker authentication method, server and computer readable storage medium WO2019179033A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810246497.3A CN108597523B (zh) 2018-03-23 2018-03-23 Speaker authentication method, server and computer readable storage medium
CN201810246497.3 2018-03-23

Publications (1)

Publication Number Publication Date
WO2019179033A1 (fr) 2019-09-26

Family

ID=63627358

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/102203 WO2019179033A1 (fr) 2018-08-24 2018-03-23 Speaker authentication method, server and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN108597523B (fr)
WO (1) WO2019179033A1 (fr)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109771944B (zh) * 2018-12-19 2022-07-12 武汉西山艺创文化有限公司 Game sound effect generation method, apparatus, device and storage medium
CN109979467B (zh) * 2019-01-25 2021-02-23 出门问问信息科技有限公司 Human voice filtering method, apparatus, device and storage medium
CN110415708A (zh) * 2019-07-04 2019-11-05 平安科技(深圳)有限公司 Neural network-based speaker verification method, apparatus, device and storage medium
CN111048097B (zh) * 2019-12-19 2022-11-29 中国人民解放军空军研究院通信与导航研究所 Siamese network voiceprint recognition method based on 3D convolution
CN112562685A (zh) * 2020-12-10 2021-03-26 上海雷盎云智能技术有限公司 Voice interaction method and device for a service robot

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150294670A1 (en) * 2014-04-09 2015-10-15 Google Inc. Text-dependent speaker identification
CN105575388A (zh) * 2014-07-28 2016-05-11 索尼电脑娱乐公司 Emotional speech processing
US20170069327A1 (en) * 2015-09-04 2017-03-09 Google Inc. Neural Networks For Speaker Verification
CN107358951A (zh) * 2017-06-29 2017-11-17 阿里巴巴集团控股有限公司 Voice wake-up method, device and electronic device
CN107404381A (zh) * 2016-05-19 2017-11-28 阿里巴巴集团控股有限公司 Identity authentication method and device
CN107464568A (zh) * 2017-09-25 2017-12-12 四川长虹电器股份有限公司 Text-independent speaker recognition method and system based on a three-dimensional convolutional neural network

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104485102A (zh) * 2014-12-23 2015-04-01 智慧眼(湖南)科技发展有限公司 Voiceprint recognition method and device
CN106971724A (zh) * 2016-01-14 2017-07-21 芋头科技(杭州)有限公司 Anti-interference voiceprint recognition method and system
CN107220237A (zh) * 2017-05-24 2017-09-29 南京大学 Method for extracting enterprise entity relations based on a convolutional neural network
CN107357875B (zh) * 2017-07-04 2021-09-10 北京奇艺世纪科技有限公司 Voice search method, device and electronic device


Also Published As

Publication number Publication date
CN108597523B (zh) 2019-05-17
CN108597523A (zh) 2018-09-28

Similar Documents

Publication Publication Date Title
WO2019179033A1 (fr) Speaker authentication method, server and computer readable storage medium
JP6621536B2 (ja) Electronic device, identity authentication method, system, and computer readable storage medium
US10476872B2 (en) Joint speaker authentication and key phrase identification
US10013985B2 (en) Systems and methods for audio command recognition with speaker authentication
JP6567040B2 (ja) Artificial intelligence-based voiceprint login method and device
WO2017197953A1 (fr) Voiceprint-based identity recognition method and device
WO2018113243A1 (fr) Speech segmentation method, device and apparatus, and computer storage medium
US9343067B2 (en) Speaker verification
US9183367B2 (en) Voice based biometric authentication method and apparatus
WO2019179036A1 (fr) Deep neural network model, electronic device, identity authentication method, and storage medium
WO2017113658A1 (fr) Artificial intelligence-based voiceprint authentication method and device
KR102210775B1 (ko) Technique for using the ability to speak as a human interactive proof
US20120143608A1 (en) Audio signal source verification system
KR20180034507A (ko) Method, apparatus and system for building a user voiceprint model
US20080154599A1 (en) Spoken free-form passwords for light-weight speaker verification using standard speech recognition engines
WO2006109515A1 (fr) Operator recognition device, operator recognition method, and operator recognition program
WO2019179029A1 (fr) Electronic device, identity verification method, and computer readable storage medium
WO2021042537A1 (fr) Voice recognition authentication method and system
EP2879130A1 (fr) Methods and systems for splitting a digital signal
US9947323B2 (en) Synthetic oversampling to enhance speaker identification or verification
JP2006235623A (ja) System and method for speaker verification using short utterance enrollment
US20140188468A1 (en) Apparatus, system and method for calculating passphrase variability
CN108694952B (zh) Electronic device, identity verification method and storage medium
WO2020189432A1 (fr) Authentication system and method
CN116013324A (zh) Robot voice control authority management method based on voiceprint recognition

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18910903

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 15.01.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 18910903

Country of ref document: EP

Kind code of ref document: A1