WO2022057283A1 - Voiceprint Registration Method, Apparatus, and Computer-Readable Storage Medium - Google Patents


Info

Publication number
WO2022057283A1
WO2022057283A1 (PCT/CN2021/093285; CN2021093285W)
Authority
WO
WIPO (PCT)
Prior art keywords
voice
text
reading
read
aloud
Application number
PCT/CN2021/093285
Other languages
English (en)
French (fr)
Inventor
童颖
Original Assignee
北京沃东天骏信息技术有限公司
北京京东世纪贸易有限公司
Application filed by 北京沃东天骏信息技术有限公司 and 北京京东世纪贸易有限公司
Publication of WO2022057283A1


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00: Speaker identification or verification
    • G10L 17/04: Training, enrolment or model building
    • G10L 17/18: Artificial neural networks; Connectionist approaches
    • G10L 17/20: Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
    • G10L 17/22: Interactive procedures; Man-machine interfaces
    • G10L 17/24: Interactive procedures; Man-machine interfaces, the user being prompted to utter a password or a predefined phrase

Definitions

  • the present application is based on CN application No. 202010996045.4, filed on September 21, 2020, and claims its priority; the disclosure of that CN application is hereby incorporated into the present application in its entirety.
  • the present invention relates to the field of voice technology, and in particular, to a voiceprint registration method, device and computer-readable storage medium.
  • Voiceprint recognition technology determines, from a given piece of audio, which person the audio comes from, and requires prior information such as a speaker model. In voiceprint recognition, information about one or more designated persons must be obtained in advance; when unknown audio is obtained, it is determined whether that audio belongs to one of the designated speakers.
  • the basic steps of voiceprint recognition can include the following: first, training a voiceprint recognition model using the voices of a large number of speakers; then registration, in which the device records a speaker's audio and generates the speaker's model from it; and finally testing, in which unknown test audio is matched against the speaker model to determine whether the test audio belongs to the designated speaker.
  • Voiceprint recognition is different from speech recognition. Voiceprint recognition does not need to determine the corresponding text information based on audio, and does not need to determine the speaker's age, gender and other prior information.
  • Voiceprint recognition can be applied to everyday smart devices to provide personalized services; it can also be used in the fields of finance and security to confirm identity. This requires voiceprint recognition to have relatively high accuracy and relatively high attack resistance.
  • the speaker inputs audio corresponding to the text information prompted by the device, and the number of recorded audio clips is one or more.
  • a voiceprint registration method, including: generating guidance information according to the read-aloud text and the flow text for guiding a user to speak the read-aloud text using a preset sound attribute; acquiring the user's read-aloud voice after the guidance information is output to the user; determining the sound attribute of the read-aloud voice and identifying the text content corresponding to the read-aloud voice; determining that the read-aloud voice is available when its sound attribute is the preset sound attribute and its corresponding text content matches the read-aloud text; and, for guidance information used for voiceprint registration, storing the correspondence between the user and the voiceprint information extracted from the read-aloud voice when the read-aloud voice is available.
  • the guidance information is guidance speech.
  • a guidance voice with preset sound attributes is generated according to the spoken text and the flow text.
  • the reading speech is available when the sound attribute of the reading speech is a preset sound attribute, the text content corresponding to the reading speech matches the reading text, and the reading speech is not noise.
  • the voiceprint registration method further includes: determining that the read-aloud voice is unavailable when its sound attribute is not the preset sound attribute, or its corresponding text content does not match the read-aloud text, or the read-aloud voice is noise; and, when the read-aloud voice is unavailable, outputting corresponding reading correction information according to the unavailable type of the read-aloud voice, wherein the unavailable types include a sound attribute mismatch, incomplete read-aloud content, and the read-aloud voice being noise.
  • when the text content corresponding to the read-aloud voice includes the read-aloud text, or when the phoneme sequence corresponding to that text content includes the phoneme sequence corresponding to the read-aloud text, the text content corresponding to the read-aloud voice matches the read-aloud text.
  • the voiceprint registration method further includes: acquiring the sound attribute in the registration information input by the user as the preset sound attribute; or, before the guidance information is generated, collecting the user's voice and determining the sound attribute of the collected voice as the preset sound attribute.
  • determining the sound attribute of the spoken voice includes: inputting the voice feature of the spoken voice into a preset sound attribute classification model to obtain the sound attribute of the spoken voice.
  • determining the sound attribute of the read-aloud speech includes: inputting the speech features of the read-aloud speech into a preset neural network model and obtaining the speech embedding feature vector extracted by a hidden layer of the neural network model; calculating the distance between this speech embedding feature vector and the preset speech embedding feature vector of each sound attribute, and determining the shortest of these distances; if the shortest distance is not greater than a preset distance threshold, determining the sound attribute corresponding to the shortest distance as the sound attribute of the read-aloud speech; and, if the shortest distance is greater than the preset distance threshold, determining the sound attribute of the read-aloud speech as an unknown attribute.
  • identifying the text content corresponding to the read-aloud speech includes: determining the text content corresponding to the read-aloud speech by using a speech recognition model corresponding to a sound attribute of the read-aloud speech.
  • the voiceprint registration method further includes: randomly selecting a preset number of words from the character library to form a candidate text, and determining the phoneme combination of the candidate text; detecting the occurrence frequency of the phoneme combination in a preset text database; and, when the occurrence frequency of the phoneme combination is lower than a preset frequency, using the candidate text as the read-aloud text.
  • an isolated word recognition model corresponding to a preset sound attribute is used to recognize the spoken speech, and the text content corresponding to the spoken speech is obtained.
  • the guidance information is guidance voice
  • the flow text includes description text corresponding to each word selected at random.
  • the sound attribute is a dialect type.
  • the user's identity verification is passed.
  • a voiceprint registration device, comprising: a guidance information generation module configured to generate guidance information according to the read-aloud text and the flow text for guiding a user to speak the read-aloud text using a preset sound attribute; a read-aloud voice acquisition module configured to acquire the user's read-aloud voice after the guidance information is output to the user; a read-aloud voice parsing module configured to determine the sound attribute of the read-aloud voice and identify the text content corresponding to it; an availability determination module configured to determine that the read-aloud voice is available when its sound attribute is the preset sound attribute and its corresponding text content matches the read-aloud text; and a storage module configured to, for guidance information used for voiceprint registration, store the correspondence between the user and the voiceprint information extracted from the read-aloud voice when the read-aloud voice is available.
  • a voiceprint registration apparatus comprising: a memory; and a processor coupled to the memory, the processor being configured to execute any one of the foregoing based on instructions stored in the memory A voiceprint registration method.
  • a voiceprint registration system comprising: any of the foregoing voiceprint registration apparatuses; an output device configured to output guidance information generated by the voiceprint registration apparatus; and a recording device , which is configured to record the user's spoken speech.
  • the output device is a sound output device.
  • a non-transitory computer-readable storage medium on which a computer program is stored, wherein when the program is executed by a processor, any one of the aforementioned voiceprint registration methods is implemented.
  • the embodiments of the present invention can guide the user to speak the preset reading text with preset sound attributes through automatically generated guide information.
  • verification and voice collection can be performed according to the user's speaking habits, and voice recognition can be performed under the condition of knowing the sound attributes used by the user, which improves the recognition accuracy. Therefore, the applicable population and the convenience of use of voiceprint registration are improved, and the efficiency of the registration process is also improved.
  • FIG. 1A shows a schematic flowchart of a voiceprint registration method according to some embodiments of the present invention.
  • FIG. 1B shows a schematic flowchart of a voiceprint verification method according to some embodiments of the present invention.
  • FIG. 2 shows a schematic flowchart of a method for outputting guidance speech according to some embodiments of the present invention.
  • FIG. 3 shows a schematic flowchart of a reading correction method according to some embodiments of the present invention.
  • FIG. 4 shows a schematic flowchart of a method for generating read-aloud text according to some embodiments of the present invention.
  • FIG. 5 shows a schematic structural diagram of a voiceprint registration apparatus according to some embodiments of the present invention.
  • FIG. 6 shows a schematic structural diagram of a voiceprint registration system according to some embodiments of the present invention.
  • FIG. 7 shows a schematic structural diagram of a voiceprint registration apparatus according to other embodiments of the present invention.
  • FIG. 8 shows a schematic structural diagram of a voiceprint registration apparatus according to further embodiments of the present invention.
  • the inventor found that, in the related technologies, voiceprint recognition only requires the user to speak audio corresponding to a text, and some users do not speak Mandarin, which may lead to inaccurate recognition or an unsmooth registration process.
  • a technical problem to be solved by the embodiments of the present invention is: how to improve the applicability of voiceprint registration.
  • FIG. 1A shows a schematic flowchart of a voiceprint registration method according to some embodiments of the present invention. As shown in FIG. 1A , the voiceprint registration method of this embodiment includes steps S102 to S110.
  • step S102 guidance information is generated according to the read-aloud text and the flow text for guiding the user to speak the read-aloud text using the preset sound attribute.
  • the sound attribute refers to the dialect type.
  • when the user only speaks a dialect but not Mandarin, the user is guided to speak the preset content in that dialect; or, when the user can speak Mandarin, the user is guided to speak the preset content in Mandarin. Therefore, a personalized voiceprint registration method can be provided even for different dialects of the same language, which improves the applicability of voiceprint registration and the convenience for the user.
  • the sound attribute in the registration information input by the user is acquired as a preset sound attribute.
  • the user's voice is collected, and a sound attribute of the collected user's voice is determined as a preset sound attribute. For example, before generating the guidance information, guidance information such as "please say a few words" is played or displayed, so that the user can say something casually in a relaxed state, and the sound attributes used by the user are recognized. Thus, the sound attributes that the user is more accustomed to use can be detected more accurately.
  • the flow text includes some introductory phrases, such as "please say it", "please repeat", etc.
  • the flow text may also include text synthesized from the description text of a preset sound attribute and a preset guide template.
  • the description text for Cantonese is "Cantonese".
  • the guide template is, for example, "Please use <description text> to say", where "<description text>" is a placeholder to be replaced by the description text.
  • the synthesized text is "Please use Cantonese to say".
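The template substitution above can be sketched in a few lines of Python. The function name and the placeholder token are illustrative, not taken from the patent:

```python
# Hypothetical sketch of flow-text synthesis: the guide template's
# placeholder is replaced by the description text of the preset
# sound attribute. Names and the placeholder token are assumptions.
GUIDE_TEMPLATE = "Please use <description text> to say"

def synthesize_flow_text(description_text: str,
                         template: str = GUIDE_TEMPLATE) -> str:
    # Substitute the placeholder with the dialect's description text.
    return template.replace("<description text>", description_text)
```

For Cantonese this yields "Please use Cantonese to say"; appending the read-aloud text then gives the full guidance text.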
  • the instruction information is used for voiceprint registration, or for voiceprint verification.
  • for example, each type of guidance information corresponds to a preset flow text, so that the user can tell whether the current process is registration or verification.
  • step S104 after the guidance information is output to the user, the reading voice of the user is acquired.
  • An exemplary process for guiding a user to speak the read-aloud text is as follows. Suppose that, for a user who speaks Sichuan dialect, the user is to say "good morning" in Sichuan dialect during registration; the read-aloud text is then "good morning", and the flow text is, for example, "please use Sichuan dialect to say". The content played or displayed through the user's terminal is "please use Sichuan dialect to say good morning", so after hearing the played content the user will say "good morning" in Sichuan dialect.
  • step S106 the sound attribute of the read-aloud voice is determined, and the text content corresponding to the read-aloud voice is identified.
  • the speech features of the spoken speech are input into a preset sound attribute classification model to obtain the sound attributes of the spoken speech.
  • the sound attribute classification model is, for example, a deep neural network model trained by using labeled sound samples.
  • the sound attribute classification model can identify preset categories of sound attributes, and identify any input as one of the preset categories.
  • the speech features of the read-aloud speech are input into a preset neural network model to obtain a speech embedding feature vector extracted by a hidden layer of the neural network model, for example, the feature vector output by a layer, or two layers, after a pooling layer; the distance between this speech embedding feature vector and the preset speech embedding feature vector of each sound attribute is calculated, and the shortest distance among them is determined; when the shortest distance is not greater than a preset distance threshold, the sound attribute corresponding to the shortest distance is determined as the sound attribute of the read-aloud voice; when the shortest distance is greater than the preset distance threshold, the sound attribute of the read-aloud voice is determined as an unknown attribute. In this way, when the sound attribute used by the user does not belong to any of the preset sound attributes, it is determined as an unknown attribute.
  • the classification methods of the above two sound attributes can also be applied to the recognition of other voices. For example, before generating the guidance information, the user's voice is collected, and the sound of the collected user's voice is determined by one of the above two methods. Attributes.
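The nearest-centroid decision described above can be sketched as follows. The Euclidean metric, the data layout, and all names are assumptions for illustration, since the patent does not fix a distance measure:

```python
import math

def classify_sound_attribute(embedding, centroids, distance_threshold):
    """Assign a sound attribute by nearest centroid.

    `centroids` maps each preset sound attribute (e.g. a dialect
    name) to its preset speech embedding vector. When even the
    shortest distance exceeds the threshold, the attribute is
    reported as "unknown", mirroring the step described above."""
    def euclidean(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    best_attr, best_dist = None, float("inf")
    for attr, centroid in centroids.items():
        d = euclidean(embedding, centroid)
        if d < best_dist:
            best_attr, best_dist = attr, d
    return "unknown" if best_dist > distance_threshold else best_attr
```

In practice the embeddings would come from the hidden layer of the trained neural network; here plain lists stand in for them.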
  • a speech recognition model corresponding to the sound attribute of the spoken speech is used to determine the text content corresponding to the spoken speech. For example, if the user speaks aloud speech in Mandarin, a speech recognition model corresponding to Mandarin or a general speech recognition model is used; if the user speaks aloud speech in Sichuan dialect, the speech recognition model corresponding to Sichuan dialect is used. Therefore, the content of the speech of the user can be more accurately recognized.
  • step S108 if the sound attribute of the read-aloud voice is the preset sound attribute, and the text content corresponding to the read-aloud voice matches the read-aloud text, it is determined that the read-aloud voice is available.
  • when the text content corresponding to the read-aloud voice includes the read-aloud text, or when the phoneme sequence corresponding to that text content includes the phoneme sequence corresponding to the read-aloud text, the text content corresponding to the read-aloud voice matches the read-aloud text.
  • for example, the read-aloud text includes the word "extremely", while the corresponding content in the recognition result of the user's read-aloud voice is "jealous"; the two have the same pronunciation in Chinese. Although the text in the recognition result is inconsistent with the read-aloud text, the pronunciation is consistent, so the text content corresponding to the read-aloud voice can still be considered to match the read-aloud text.
  • here, the main purpose of speech recognition is to perform liveness detection and to collect the user's pronunciation of various phonemes as fully as possible; the requirement on the accuracy of the text itself is not high. Therefore, confirming text matching through phoneme matching can improve registration efficiency.
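Phoneme-level matching as described can be sketched as below. The grapheme-to-phoneme table `g2p` is an assumed input; in practice it would come from a pronunciation lexicon:

```python
def phonemes_match(recognized_text, read_text, g2p):
    """Accept the reading when the phoneme sequence of the read-aloud
    text appears as a contiguous run inside the phoneme sequence of
    the recognized text, so homophones still count as a match.

    `g2p` maps each character to one phoneme string (a simplifying
    assumption; real lexicons are context dependent)."""
    rec = [g2p[ch] for ch in recognized_text]
    ref = [g2p[ch] for ch in read_text]
    m = len(ref)
    return any(rec[i:i + m] == ref for i in range(len(rec) - m + 1))
```

With this check, a recognition result that is a homophone of the read-aloud text (like the "extremely"/"jealous" example above) still passes.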
  • the registered sound attribute can be kept consistent with the sound attribute habitually used by the user or the sound attribute selected by the user.
  • the recognition accuracy can be improved.
  • the method of determining whether the read-aloud voice is available according to the speech recognition result can also enable the voice spoken by the user to cover more phonemes, so as to generate a more comprehensive speaker model.
  • the reading speech is available when the sound attribute of the reading speech is a preset sound attribute, the text content corresponding to the reading speech matches the reading text, and the reading speech is not noise. For example, whether the collected sound is speech or noise is identified through a voice endpoint detection (Voice Activity Detection, VAD for short) model.
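The patent leaves the VAD model unspecified; as a stand-in, a crude energy-based check over audio frames illustrates the speech-versus-noise decision. The thresholds and frame layout are invented for the sketch, not taken from the patent:

```python
def is_speech(frames, energy_threshold=0.01, min_active_ratio=0.2):
    """Very rough VAD stand-in: treat the clip as speech when enough
    frames carry energy above a threshold; otherwise classify it as
    noise or silence. `frames` is a list of lists of float samples."""
    if not frames:
        return False

    def energy(frame):
        # Mean squared amplitude of one frame.
        return sum(s * s for s in frame) / len(frame)

    active = sum(1 for f in frames if energy(f) > energy_threshold)
    return active / len(frames) >= min_active_ratio
```

A production system would use a trained VAD model instead; this sketch only shows where the speech/noise decision plugs into the availability check.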
  • step S110 for guidance information used for voiceprint registration, in the case that the read-aloud voice is available, the correspondence between the user and the voiceprint information extracted from the read-aloud voice is stored, thereby completing registration.
  • the user can be guided to speak the preset reading text with the preset sound attribute through the automatically generated guide information.
  • verification and voice collection can be performed according to the user's speaking habits, and voice recognition can be performed under the condition of knowing the sound attributes used by the user, which improves the recognition accuracy. Therefore, the above-mentioned embodiment improves the applicable population and the convenience of use of the voiceprint registration, and also improves the efficiency of the registration process.
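Steps S102 to S110 can be sketched end to end as below. Every dependency (recording, attribute classification, recognition, voiceprint extraction) is injected as a callable, since the patent does not prescribe concrete implementations; the guidance-string format is likewise only illustrative:

```python
def register_voiceprint(user, read_text, preset_attr,
                        get_reading, classify_attr, recognize,
                        extract, store):
    """Control flow of the registration method (steps S102-S110).

    Returns True when the reading is available and the voiceprint
    is stored, False when the reading is rejected."""
    # S102: generate guidance information from flow text + read text.
    guidance = f"Please use {preset_attr} to say {read_text}"
    # S104: output guidance and acquire the user's read-aloud voice.
    reading = get_reading(guidance)
    # S106: determine sound attribute and recognize text content.
    attr = classify_attr(reading)
    content = recognize(reading, attr)
    # S108: the reading is available only if both checks pass.
    if attr != preset_attr or read_text not in content:
        return False
    # S110: store the user -> voiceprint correspondence.
    store[user] = extract(reading)
    return True
```

The noise check of the three-way availability test is omitted here for brevity; it would be one more condition in the S108 branch.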
  • during voiceprint verification, the user likewise speaks using the preset sound attribute, and the voiceprint verification method of the present invention determines whether the speaker is a registered user.
  • FIG. 1B shows a schematic flowchart of a voiceprint verification method according to some embodiments of the present invention.
  • the voiceprint verification method of this embodiment includes steps S112 to S120 , wherein the specific implementations of steps S112 to S118 refer to steps S102 to S108 in the embodiment of FIG. 1A , which will not be repeated here.
  • step S120 for guidance information used for voiceprint verification, in the case that the read-aloud voice is available, if the extracted voiceprint matches the stored voiceprint corresponding to the user, the user's identity verification is passed.
  • the automatically generated guide information can also be used to guide the user to speak the preset reading text with the preset voice attribute in the voiceprint verification stage.
  • verification and voice collection can be performed according to the user's speaking habits, and voice recognition can be performed under the condition of knowing the sound attributes used by the user, which improves the recognition accuracy. Therefore, the above-mentioned embodiment improves the applicability of the voiceprint-based verification process and the convenience of use, and also improves the efficiency of the verification process.
  • the guidance information is guidance voice, so as to prompt the user to speak the preset content using the corresponding sound attribute in a voice manner.
  • An embodiment of the guidance voice output method is described below with reference to FIG. 2 .
  • FIG. 2 shows a schematic flowchart of a method for outputting guidance speech according to some embodiments of the present invention.
  • the guidance voice output method of this embodiment includes steps S202-S206.
  • step S202 the guidance text is generated according to the read-aloud text and the flow text for guiding the user to speak the read-aloud text using the preset sound attribute.
  • step S204 the guidance text is converted into guidance speech.
  • step S206 the guidance voice is played to the user.
  • the instructional text is converted into instructional speech using a text-to-speech (Text To Speech, TTS for short) technique.
  • the generated guidance speech is speech with a certain emotion and intonation, so as to enhance the fun and interactivity of the speech guidance process.
  • the generated guidance speech is speech with preset sound attributes, so that the user using the corresponding sound attribute can be more helpful to understand the content to be spoken, and the corresponding sound attribute can be used to speak more accurately.
  • by playing the guidance information as voice, users with reading difficulties, such as those who are illiterate, recognize only some characters, have poor eyesight, or are blind, can also obtain the information to be read aloud, thereby improving the applicability of voiceprint recognition applications.
  • corresponding correction information may be given according to specific circumstances, so as to guide the user to speak the preset content more accurately.
  • the following describes an embodiment of the read-aloud correction method of the present invention with reference to FIG. 3 .
  • FIG. 3 shows a schematic flowchart of a reading correction method according to some embodiments of the present invention. As shown in FIG. 3 , the reading correction method of this embodiment includes steps S302-S312.
  • step S302 after the guidance information is output to the user, the reading voice of the user is acquired.
  • step S304 the sound attribute of the read-aloud voice is determined, and the text content corresponding to the read-aloud voice is identified, and whether the read-aloud voice is noise is detected.
  • step S306 when the sound attribute of the read-aloud voice is a preset voice attribute, the text content corresponding to the read-aloud voice matches the read-aloud text, and the read-aloud voice is not noise, it is determined that the read-aloud voice is available.
  • step S308 the correspondence between the user and the voiceprint information extracted from the spoken voice is stored.
  • step S310 when the sound attribute of the read-aloud voice is not the preset sound attribute, or the text content corresponding to the read-aloud voice does not match the read-aloud text, or the read-aloud voice is noise, it is determined that the read-aloud voice is unavailable.
  • step S312 in the case that the reading voice is unavailable, according to the unavailable type of the reading voice, output corresponding reading correction information as the guide information, wherein the unavailable type includes the sound attribute mismatch, the content of the reading voice is incomplete, The spoken speech is noise. Then, go back to step S302 to continue to acquire the reading voice of the user.
  • for example, when the sound attribute does not match, the reading correction information is "please repeat it again in Mandarin" or similar; when the user only speaks part of the read-aloud text, the reading correction information is "the content is missing, please say it again completely" or similar; when the read-aloud voice is noise, the reading correction information is "can't hear clearly, please go to a quiet place and say it again".
  • Each correction information may also include information corresponding to the read-aloud text.
  • the reading correction information is output to the user by means of text display or voice playback.
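A minimal dispatch from unavailable type to correction message might look like this. The type keys are invented labels, and the messages are adapted from the examples above:

```python
# Hypothetical mapping from unavailable type to correction message;
# the keys are invented labels for the three types named above.
CORRECTION_MESSAGES = {
    "attribute_mismatch": "Please repeat it again in Mandarin.",
    "incomplete_content": "The content is missing, please say it again completely.",
    "noise": "Can't hear clearly, please go to a quiet place and say it again.",
}

def correction_info(unavailable_type, read_text=None):
    """Pick the correction message for an unavailable reading and,
    optionally, restate the read-aloud text so the user knows what
    to repeat (as the embodiment above allows)."""
    message = CORRECTION_MESSAGES[unavailable_type]
    if read_text:
        message += f' The text to read is "{read_text}".'
    return message
```

The returned string would then be displayed as text or played back through TTS, per the output options above.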
  • guidance and correction information can be automatically generated during the voiceprint application process such as registration and recognition of the user's voiceprint, so as to assist the user to complete operations related to the voiceprint application more quickly and accurately.
  • characters, words, or sentences with an occurrence frequency lower than a preset value are selected, to reduce the possibility of an attacker recording them in advance for an attack.
  • meaningless "self-made words” can also be automatically generated to further reduce the possibility of being attacked.
  • FIG. 4 shows a schematic flowchart of a method for generating read-aloud text according to some embodiments of the present invention. As shown in FIG. 4 , the method for generating a reading text in this embodiment includes steps S402 to S406.
  • step S402 a preset number of words are randomly selected from the word library to form candidate texts, and phoneme combinations of the candidate texts are determined.
  • step S404 the occurrence frequency of the phoneme combination in the preset text library is detected.
  • step S406 when the frequency of occurrence of the phoneme combination is lower than the preset frequency, the candidate text is used as the read-aloud text.
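Steps S402 to S406 can be sketched as follows; the character library, the grapheme-to-phoneme table, and the corpus phoneme counts are all assumed inputs:

```python
import random

def generate_read_text(char_library, g2p, corpus_phoneme_counts,
                       length=2, max_frequency=5, max_tries=100):
    """Randomly compose a candidate text from the character library
    (S402), look up its phoneme combination and check how often that
    combination occurs in the text corpus (S404), and keep the
    candidate only when the combination is rare (S406).

    `corpus_phoneme_counts` maps phoneme tuples to occurrence
    counts; returns None when no rare combination is found."""
    for _ in range(max_tries):
        chars = random.sample(char_library, length)
        phonemes = tuple(g2p[ch] for ch in chars)
        if corpus_phoneme_counts.get(phonemes, 0) < max_frequency:
            return "".join(chars)
    return None
```

The `length`, `max_frequency`, and retry budget are illustrative parameters; the patent only requires that the chosen phoneme combination be infrequent in the corpus.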
  • the description text of each randomly selected word is searched as part of the flow text, so as to convert the description text into voice playback.
  • for example, the coined word is "Jiaoqi", and description texts of its two characters, such as "jiao as in comparison" and "qi as in strange", can be used as the description text of the coined word; the content of the guidance voice is then, for example, "Please say Jiaoqi, where jiao is the jiao in comparison and qi is the qi in strange". In this way, the user can be helped to understand the content to be read aloud more vividly, so that the user speaks the preset content more accurately.
  • an isolated word recognition model corresponding to a preset sound attribute is used to recognize the reading speech, and the text content corresponding to the reading speech is obtained.
  • in general, a speech recognition model includes a multi-gram language model, such as a 2-Gram (bigram) or 3-Gram (trigram) model.
  • an isolated word recognition model can be used in the recognition process.
  • the isolated word recognition model uses a 1-Gram (unigram) language model, or no language model at all, so that recognition is performed purely from the perspective of phonetics, without considering the influence of semantics on the recognition results.
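To make the n-gram distinction concrete, here is a toy n-gram counter; a 1-Gram model scores words independently, while 2-/3-Gram models condition on preceding words, which is exactly the semantic context the isolated word recognizer avoids. This illustrates n-grams generally, not the patent's recognizer:

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams of a token sequence. With n=1 each token is
    counted on its own; with n >= 2 the counts capture word order and
    hence some semantic context."""
    return Counter(tuple(tokens[i:i + n])
                   for i in range(len(tokens) - n + 1))
```

Normalizing these counts yields the probabilities a language model would assign during decoding.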
  • FIG. 5 shows a schematic structural diagram of a voiceprint registration apparatus according to some embodiments of the present invention.
  • the voiceprint registration device 500 of this embodiment includes: a guidance information generation module 5100, configured to generate guidance information according to the read-aloud text and the flow text for guiding a user to speak the read-aloud text using a preset sound attribute; a read-aloud voice acquisition module 5200, configured to acquire the user's read-aloud voice after the guidance information is output to the user; a read-aloud voice parsing module 5300, configured to determine the sound attribute of the read-aloud voice and identify the text content corresponding to it; an availability determination module 5400, configured to determine that the read-aloud voice is available when its sound attribute is the preset sound attribute and its corresponding text content matches the read-aloud text; and a storage module 5500, configured to, for guidance information used for voiceprint registration, store the correspondence between the user and the voiceprint information extracted from the read-aloud voice when the read-aloud voice is available.
  • the guidance information is guidance speech.
  • the guidance information generation module 5100 is further configured to generate guidance speech with preset sound attributes according to the spoken text and the flow text.
  • the availability determination module 5400 is further configured to determine that the read-aloud voice is available when its sound attribute is the preset sound attribute, its corresponding text content matches the read-aloud text, and the read-aloud voice is not noise.
  • the availability determination module 5400 is further configured to determine that the read-aloud voice is unavailable when its sound attribute is not the preset sound attribute, or its corresponding text content does not match the read-aloud text, or the read-aloud voice is noise; the guidance information generation module 5100 is further configured to, when the read-aloud voice is unavailable, generate corresponding reading correction information according to the unavailable type of the read-aloud voice and add it to the guidance information, wherein the unavailable types include a sound attribute mismatch, incomplete read-aloud content, and the read-aloud voice being noise.
  • the text content corresponding to the read-aloud speech matches the read-aloud text when the text content includes the read-aloud text, or when the phoneme sequence corresponding to the text content includes the phoneme sequence corresponding to the read-aloud text.
  • the voiceprint registration apparatus 500 further includes: a sound attribute acquisition module 5600 configured to acquire the sound attribute in registration information input by the user as the preset sound attribute; or, before the guidance information is generated, to collect the user's speech and determine the sound attribute of the collected speech as the preset sound attribute.
  • the reading speech parsing module 5300 is further configured to input the speech features of the reading speech into a preset sound attribute classification model to obtain the sound attribute of the reading speech.
  • the read-aloud speech parsing module 5300 is further configured to input speech features of the read-aloud speech into a preset neural network model to obtain a speech embedding feature vector extracted by a hidden layer of the neural network model; compute the distance between the speech embedding feature vector and the preset speech embedding feature vector of each sound attribute, and determine the shortest distance among them; when the shortest distance is not greater than a preset distance threshold, determine the sound attribute corresponding to the shortest distance as the sound attribute of the read-aloud speech; and when the shortest distance is greater than the preset distance threshold, determine the sound attribute of the read-aloud speech as an unknown attribute.
  • the reading speech parsing module 5300 is further configured to use a speech recognition model corresponding to the sound attribute of the reading speech to determine the text content corresponding to the reading speech.
  • the voiceprint registration apparatus 500 further includes: a read-aloud text generation module 5700 configured to randomly select a preset number of characters from a character library to form a candidate text and determine the phoneme combination of the candidate text; detect the frequency of occurrence of the phoneme combination in a preset text corpus; and, when the frequency of occurrence of the phoneme combination is lower than a preset frequency, use the candidate text as the read-aloud text.
  • the reading speech parsing module 5300 is further configured to recognize the reading speech by using an isolated word recognition model corresponding to a preset sound attribute, and obtain the text content corresponding to the reading speech.
  • the guidance information is guidance speech, and the flow text includes description text corresponding to each randomly selected character.
  • the sound attribute is a dialect type.
  • the voiceprint registration apparatus 500 further includes a verification module 5800, which is configured to, for guidance information used for voiceprint verification, in a case where the read-aloud speech is usable, pass the user's identity verification if the extracted voiceprint matches the stored voiceprint corresponding to the user.
  • FIG. 6 shows a schematic structural diagram of a voiceprint registration system according to some embodiments of the present invention.
  • the voiceprint registration system 60 of this embodiment includes: a voiceprint registration device 610, whose specific implementation may refer to the voiceprint registration device 500; an output device 620 configured to output the guidance information generated by the voiceprint registration device 610; and a recording device 630 configured to record the user's read-aloud speech.
  • output device 620 is a sound output device.
  • FIG. 7 shows a schematic structural diagram of a voiceprint registration apparatus according to other embodiments of the present invention.
  • the voiceprint registration apparatus 70 of this embodiment includes: a memory 710 and a processor 720 coupled to the memory 710, the processor 720 being configured to execute the voiceprint registration method in any one of the foregoing embodiments based on the instructions stored in the memory 710.
  • the memory 710 may include, for example, a system memory, a fixed non-volatile storage medium, and the like.
  • the system memory stores, for example, an operating system, an application program, a boot loader (Boot Loader), and other programs.
  • FIG. 8 shows a schematic structural diagram of a voiceprint registration apparatus according to further embodiments of the present invention.
  • the voiceprint registration apparatus 80 in this embodiment includes: a memory 810 and a processor 820, and may further include an input/output interface 830, a network interface 840, a storage interface 850, and the like. These interfaces 830 , 840 , 850 and the memory 810 and the processor 820 can be connected, for example, through a bus 860 .
  • the input and output interface 830 provides a connection interface for input and output devices such as a display, a mouse, a keyboard, and a touch screen.
  • Network interface 840 provides a connection interface for various networked devices.
  • the storage interface 850 provides a connection interface for external storage devices such as SD cards and U disks.
  • Embodiments of the present invention also provide a computer-readable storage medium on which a computer program is stored, characterized in that, when the program is executed by a processor, any one of the aforementioned voiceprint registration methods is implemented.
  • embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Abstract

A voiceprint registration method, apparatus (500), and computer-readable storage medium. The voiceprint registration method includes: generating guidance information according to a read-aloud text and a flow text for guiding a user to speak the read-aloud text with a preset sound attribute (S102); after the guidance information is output to the user, acquiring the user's read-aloud speech (S104); determining the sound attribute of the read-aloud speech and recognizing the text content corresponding to the read-aloud speech (S106); determining that the read-aloud speech is usable in a case where the sound attribute of the read-aloud speech is the preset sound attribute and the text content corresponding to the read-aloud speech matches the read-aloud text (S108); and, in a case where the read-aloud speech is usable, storing the correspondence between the user and the voiceprint information extracted from the read-aloud speech (S110). The voiceprint registration method broadens the population to which voiceprint registration applies, improves ease of use, and makes the registration flow more efficient.

Description

Voiceprint registration method and apparatus, and computer-readable storage medium
CROSS-REFERENCE TO RELATED APPLICATION
This application is based on and claims priority to CN application No. 202010996045.4, filed on September 21, 2020, the disclosure of which is hereby incorporated into this application in its entirety.
TECHNICAL FIELD
The present invention relates to the field of speech technology, and in particular to a voiceprint registration method and apparatus and a computer-readable storage medium.
BACKGROUND
Voiceprint recognition determines, from a piece of audio, which person the audio comes from, and requires prior information such as a speaker model. In voiceprint recognition, information about one or more designated persons must be obtained in advance; when unknown audio is received, the system determines whether the audio belongs to one of the designated speakers.
The basic steps of voiceprint recognition may include the following. First, a voiceprint recognition model is trained with a large amount of speaker speech. Next comes enrollment: the device records audio of a particular speaker and thereby generates a speaker model for that speaker. Then comes testing: unknown test audio is matched against the speaker model to determine whether the test audio belongs to the designated speaker. Voiceprint recognition differs from speech recognition in that it does not need to derive textual information from the audio, nor does it need to determine prior information such as the speaker's age or gender.
Voiceprint recognition can be applied to everyday smart devices to provide personalized services, and can also be used in the finance and security fields for identity confirmation. This requires voiceprint recognition to have relatively high accuracy and strong resistance to attacks.
In the related art, during the enrollment stage of voiceprint recognition, the speaker records audio corresponding to text information prompted by the device, and one or more audio recordings are collected.
发明内容
According to a first aspect of some embodiments of the present invention, a voiceprint registration method is provided, including: generating guidance information according to a read-aloud text and a flow text for guiding a user to speak the read-aloud text with a preset sound attribute; after the guidance information is output to the user, acquiring the user's read-aloud speech; determining the sound attribute of the read-aloud speech and recognizing the text content corresponding to the read-aloud speech; determining that the read-aloud speech is usable in a case where the sound attribute of the read-aloud speech is the preset sound attribute and the text content corresponding to the read-aloud speech matches the read-aloud text; and, for guidance information used for voiceprint registration, in a case where the read-aloud speech is usable, storing the correspondence between the user and the voiceprint information extracted from the read-aloud speech.
In some embodiments, the guidance information is guidance speech.
In some embodiments, guidance speech with the preset sound attribute is generated according to the read-aloud text and the flow text.
In some embodiments, the read-aloud speech is determined to be usable when the sound attribute of the read-aloud speech is the preset sound attribute, the text content corresponding to the read-aloud speech matches the read-aloud text, and the read-aloud speech is not noise.
In some embodiments, the voiceprint registration method further includes: determining that the read-aloud speech is unusable when its sound attribute is not the preset sound attribute, or its corresponding text content does not match the read-aloud text, or the read-aloud speech is noise; and, when the read-aloud speech is unusable, outputting corresponding reading-correction information according to the unusable type of the read-aloud speech, where the unusable types include a mismatched sound attribute, incomplete content of the read-aloud speech, and the read-aloud speech being noise.
In some embodiments, the text content corresponding to the read-aloud speech matches the read-aloud text when the text content includes the read-aloud text, or when the phoneme sequence corresponding to the text content includes the phoneme sequence corresponding to the read-aloud text.
In some embodiments, the voiceprint registration method further includes: acquiring the sound attribute in registration information input by the user as the preset sound attribute; or, before generating the guidance information, collecting the user's speech and determining the sound attribute of the collected speech as the preset sound attribute.
In some embodiments, determining the sound attribute of the read-aloud speech includes: inputting speech features of the read-aloud speech into a preset sound-attribute classification model to obtain the sound attribute of the read-aloud speech.
In some embodiments, determining the sound attribute of the read-aloud speech includes: inputting speech features of the read-aloud speech into a preset neural network model to obtain a speech embedding feature vector extracted by a hidden layer of the neural network model; computing the distance between the speech embedding feature vector and the preset speech embedding feature vector of each sound attribute, and determining the shortest distance among them; when the shortest distance is not greater than a preset distance threshold, determining the sound attribute corresponding to the shortest distance as the sound attribute of the read-aloud speech; and when the shortest distance is greater than the preset distance threshold, determining the sound attribute of the read-aloud speech as an unknown attribute.
In some embodiments, recognizing the text content corresponding to the read-aloud speech includes: determining the text content corresponding to the read-aloud speech by using a speech recognition model corresponding to the sound attribute of the read-aloud speech.
In some embodiments, the voiceprint registration method further includes: randomly selecting a preset number of characters from a character library to form a candidate text, and determining the phoneme combination of the candidate text; detecting the frequency of occurrence of the phoneme combination in a preset text corpus; and, when the frequency of occurrence of the phoneme combination is lower than a preset frequency, using the candidate text as the read-aloud text.
In some embodiments, the read-aloud speech is recognized by using an isolated-word recognition model corresponding to the preset sound attribute, and the text content corresponding to the read-aloud speech is obtained.
In some embodiments, the guidance information is guidance speech, and the flow text includes description text corresponding to each randomly selected character.
In some embodiments, the sound attribute is a dialect type.
In some embodiments, for guidance information used for voiceprint verification, in a case where the read-aloud speech is usable, the user passes identity verification if the extracted voiceprint matches the stored voiceprint corresponding to the user.
According to a second aspect of some embodiments of the present invention, a voiceprint registration apparatus is provided, including: a guidance information generation module configured to generate guidance information according to a read-aloud text and a flow text for guiding a user to speak the read-aloud text with a preset sound attribute; a read-aloud speech acquisition module configured to acquire the user's read-aloud speech after the guidance information is output to the user; a read-aloud speech parsing module configured to determine the sound attribute of the read-aloud speech and recognize the text content corresponding to the read-aloud speech; a usability determination module configured to determine that the read-aloud speech is usable in a case where the sound attribute of the read-aloud speech is the preset sound attribute and the text content corresponding to the read-aloud speech matches the read-aloud text; and a storage module configured to, for guidance information used for voiceprint registration, store the correspondence between the user and the voiceprint information extracted from the read-aloud speech in a case where the read-aloud speech is usable.
According to a third aspect of some embodiments of the present invention, a voiceprint registration apparatus is provided, including: a memory; and a processor coupled to the memory, the processor being configured to execute any one of the foregoing voiceprint registration methods based on instructions stored in the memory.
According to a fourth aspect of some embodiments of the present invention, a voiceprint registration system is provided, including: any one of the foregoing voiceprint registration apparatuses; an output device configured to output the guidance information generated by the voiceprint registration apparatus; and a recording device configured to record the user's read-aloud speech.
In some embodiments, the output device is a sound output device.
According to a fifth aspect of some embodiments of the present invention, a non-transitory computer-readable storage medium is provided, on which a computer program is stored, where the program, when executed by a processor, implements any one of the foregoing voiceprint registration methods.
Some of the above embodiments have the following advantages or beneficial effects: the embodiments of the present invention can use automatically generated guidance information to guide a user to speak a preset read-aloud text with a preset sound attribute. Moreover, verification and speech collection can be performed according to the user's speaking habits, and speech recognition can be performed with knowledge of the sound attribute the user uses, improving recognition accuracy. This broadens the population to which voiceprint registration applies, improves ease of use, and makes the registration flow more efficient.
Other features and advantages of the present invention will become clear from the following detailed description of exemplary embodiments of the present invention with reference to the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
To explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1A shows a schematic flowchart of a voiceprint registration method according to some embodiments of the present invention.
FIG. 1B shows a schematic flowchart of a voiceprint verification method according to some embodiments of the present invention.
FIG. 2 shows a schematic flowchart of a guidance speech output method according to some embodiments of the present invention.
FIG. 3 shows a schematic flowchart of a reading-correction method according to some embodiments of the present invention.
FIG. 4 shows a schematic flowchart of a read-aloud text generation method according to some embodiments of the present invention.
FIG. 5 shows a schematic structural diagram of a voiceprint registration apparatus according to some embodiments of the present invention.
FIG. 6 shows a schematic structural diagram of a voiceprint registration system according to some embodiments of the present invention.
FIG. 7 shows a schematic structural diagram of a voiceprint registration apparatus according to other embodiments of the present invention.
FIG. 8 shows a schematic structural diagram of a voiceprint registration apparatus according to still other embodiments of the present invention.
DETAILED DESCRIPTION
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. The following description of at least one exemplary embodiment is merely illustrative and is in no way intended to limit the present invention or its application or use. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Unless otherwise specified, the relative arrangement of components and steps, numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present invention.
Meanwhile, it should be understood that, for ease of description, the dimensions of the parts shown in the drawings are not drawn to actual scale.
Techniques, methods, and devices known to those of ordinary skill in the relevant art may not be discussed in detail, but, where appropriate, such techniques, methods, and devices should be regarded as part of the granted specification.
In all examples shown and discussed here, any specific value should be interpreted as merely exemplary and not limiting. Therefore, other examples of the exemplary embodiments may have different values.
It should be noted that similar reference numerals and letters denote similar items in the following drawings; therefore, once an item is defined in one drawing, it need not be further discussed in subsequent drawings.
After analyzing the related art, the inventors found that, in the related art, voiceprint recognition only requires the user to speak audio corresponding to text, while some users speak a non-Mandarin variety, which may lead to insufficiently accurate recognition or a registration process that does not go smoothly.
A technical problem to be solved by the embodiments of the present invention is: how to broaden the applicability of voiceprint registration.
FIG. 1A shows a schematic flowchart of a voiceprint registration method according to some embodiments of the present invention. As shown in FIG. 1A, the voiceprint registration method of this embodiment includes steps S102 to S110.
In step S102, guidance information is generated according to a read-aloud text and a flow text for guiding the user to speak the read-aloud text with a preset sound attribute.
In some embodiments, the sound attribute refers to a dialect type. When a user can only speak a dialect and cannot speak Mandarin, the user is guided to speak the preset content in the dialect; or, when the user can speak Mandarin, the user is guided to speak the preset content in Mandarin. Thus, a personalized voiceprint registration approach can be provided even for different dialects of the same language, broadening the applicability of voiceprint registration and improving ease of use.
In some embodiments, the sound attribute in registration information input by the user is acquired as the preset sound attribute.
In some embodiments, before the guidance information is generated, the user's speech is collected, and the sound attribute of the collected speech is determined as the preset sound attribute. For example, before generating the guidance information, a prompt such as "please say a few words" is played or displayed, so that the user casually says something in a relaxed state and the sound attribute the user uses can be identified. In this way, the sound attribute the user is more accustomed to can be detected more accurately.
The flow text includes guiding phrases, such as "please say" or "please repeat". In some embodiments, the flow text includes text synthesized from description text of the preset sound attribute and a preset guidance template. For example, the description text for Cantonese is "Cantonese", and the guidance template is "please say in <description text>", where "<>" is a placeholder to be replaced by the description text. The synthesized text is "please say in Cantonese".
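The template substitution described above can be sketched in a few lines. This is a minimal illustration, not the patent's implementation; the template string, placeholder name, and dialect labels are illustrative assumptions.

```python
# Hypothetical guidance template with a placeholder, as in the example above.
GUIDANCE_TEMPLATE = "Please say in <description text>: "

def compose_guidance_text(description_text: str, read_aloud_text: str) -> str:
    """Replace the placeholder with the dialect description text,
    then append the read-aloud text to form the full guidance text."""
    flow_text = GUIDANCE_TEMPLATE.replace("<description text>", description_text)
    return flow_text + read_aloud_text

print(compose_guidance_text("Cantonese", "good morning"))
# Please say in Cantonese: good morning
```

In a full system the composed guidance text would then be displayed, or handed to a TTS engine when the guidance information takes the form of guidance speech.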
In some embodiments, the guidance information is used for voiceprint registration or for voiceprint verification. Each type of guidance information corresponds, for example, to a preset flow text, so that the user knows whether the current process is registration or verification.
In step S104, after the guidance information is output to the user, the user's read-aloud speech is acquired.
An exemplary process of guiding the user to speak the read-aloud text is as follows. For a user who speaks Sichuanese, if the user is to say "good morning" in Sichuanese during registration, the read-aloud text is "good morning" and the flow text is, for example, "please say in Sichuanese". The content played or displayed on the user's terminal is "please say good morning in Sichuanese", so that the user, after hearing it, will say "good morning" in Sichuanese.
In step S106, the sound attribute of the read-aloud speech is determined, and the text content corresponding to the read-aloud speech is recognized.
In some embodiments, speech features of the read-aloud speech are input into a preset sound-attribute classification model to obtain the sound attribute of the read-aloud speech. The sound-attribute classification model is, for example, a deep neural network model trained with labeled sound samples. The model can recognize preset kinds of sound attributes and classifies any input as one of the preset kinds.
In some embodiments, speech features of the read-aloud speech are input into a preset neural network model to obtain a speech embedding feature vector extracted by a hidden layer of the neural network model, for example, the feature vector output by one or two layers after the pooling layer; the distance between the speech embedding feature vector and the preset speech embedding feature vector of each sound attribute is computed, and the shortest distance among them is determined; when the shortest distance is not greater than a preset distance threshold, the sound attribute corresponding to the shortest distance is determined as the sound attribute of the read-aloud speech; when the shortest distance is greater than the preset distance threshold, the sound attribute of the read-aloud speech is determined as an unknown attribute. In this way, when the sound attribute used by the user does not belong to any preset sound attribute, it is judged as unknown.
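The nearest-reference-with-threshold decision rule described above can be sketched as follows. This is a hedged illustration only: the reference embeddings, the attribute names, the use of Euclidean distance, and the threshold value are all stand-in assumptions; a real system would obtain the embedding from a trained network's hidden layer.

```python
import math

# Illustrative preset speech embedding vectors, one per sound attribute.
REFERENCE_EMBEDDINGS = {
    "mandarin":   [1.0, 0.0, 0.0],
    "cantonese":  [0.0, 1.0, 0.0],
    "sichuanese": [0.0, 0.0, 1.0],
}
DISTANCE_THRESHOLD = 0.8  # illustrative preset distance threshold

def classify_sound_attribute(embedding: list[float]) -> str:
    # Euclidean distance from the input embedding to each preset embedding.
    distances = {attr: math.dist(embedding, ref)
                 for attr, ref in REFERENCE_EMBEDDINGS.items()}
    best_attr = min(distances, key=distances.get)
    # If even the nearest reference is farther than the threshold,
    # the sound attribute is judged to be unknown.
    if distances[best_attr] > DISTANCE_THRESHOLD:
        return "unknown"
    return best_attr
```

The "unknown" branch is what lets the system handle a dialect outside the preset set instead of forcing a wrong label.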
The above two classification approaches for sound attributes can also be applied to recognizing other speech; for example, before the guidance information is generated, the user's speech is collected and the sound attribute of the collected speech is determined by one of the two approaches.
In some embodiments, the text content corresponding to the read-aloud speech is determined using a speech recognition model corresponding to the sound attribute of the read-aloud speech. For example, if the user speaks the read-aloud speech in Mandarin, the speech recognition model for Mandarin or a general-purpose speech recognition model is used; if the user speaks in Sichuanese, the speech recognition model for Sichuanese is used. Thus, the content of the user's speech can be recognized more accurately.
In step S108, when the sound attribute of the read-aloud speech is the preset sound attribute and the text content corresponding to the read-aloud speech matches the read-aloud text, it is determined that the read-aloud speech is usable.
In some embodiments, the text content corresponding to the read-aloud speech matches the read-aloud text when the text content includes the read-aloud text.
In some embodiments, the text content corresponding to the read-aloud speech matches the read-aloud text when the phoneme sequence corresponding to the text content includes the phoneme sequence corresponding to the read-aloud text. For example, the read-aloud text contains the word "极度" ("extremely"), while the corresponding content in the recognition result of the user's read-aloud speech is "嫉妒" ("jealousy"); the two are pronounced identically in Chinese. Although the characters in the recognition result differ from the read-aloud text, the pronunciation is the same, so the text content corresponding to the read-aloud speech can still be considered to match the read-aloud text.
Since the main purposes of speech recognition in the voiceprint registration stage are liveness detection and adequately sampling the user's pronunciation of various phonemes, the accuracy requirements for the textual information are not high. Therefore, confirming the text match via phoneme matching can improve registration efficiency.
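The phoneme-level match just described can be sketched as a contiguous-subsequence check over phoneme sequences. The tiny character-to-pinyin table below is a toy stand-in (a real system would use a full grapheme-to-phoneme lexicon), and the function names are illustrative.

```python
# Toy pinyin table; "极度" and "嫉妒" map to the same phoneme sequence.
PINYIN = {"极": "ji2", "度": "du4", "嫉": "ji2", "妒": "du4", "请": "qing3"}

def to_phonemes(text: str) -> list[str]:
    """Look up each character's phoneme; skip characters not in the toy table."""
    return [PINYIN[ch] for ch in text if ch in PINYIN]

def matches(recognized: str, read_aloud: str) -> bool:
    """True if the read-aloud text's phoneme sequence occurs contiguously
    inside the recognized text's phoneme sequence."""
    rec, ref = to_phonemes(recognized), to_phonemes(read_aloud)
    return any(rec[i:i + len(ref)] == ref
               for i in range(len(rec) - len(ref) + 1))
```

With this rule, a recognition result of "嫉妒" still matches a read-aloud text of "极度", because only pronunciation, not spelling, is compared.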
By confirming that the sound attribute of the read-aloud speech is the preset sound attribute, the registered sound attribute can be kept consistent with the sound attribute the user habitually uses or has selected. When each sound attribute corresponds to one speech recognition model, recognition accuracy can be improved.
By checking whether the text content corresponding to the read-aloud speech matches the read-aloud text, liveness detection can be achieved, reducing the possibility that the speaker is not a real person but a recording, thereby improving registration security. Moreover, when the read-aloud text contains many phonemes, determining usability from the speech recognition result also makes the user's speech cover more phonemes, so that a more comprehensive speaker model can be generated.
In some embodiments, the read-aloud speech is determined to be usable when its sound attribute is the preset sound attribute, its corresponding text content matches the read-aloud text, and the read-aloud speech is not noise. For example, a Voice Activity Detection (VAD) model is used to identify whether the collected sound is speech or noise.
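The noise check could be sketched with a simple frame-energy heuristic standing in for the trained VAD model mentioned above. All numbers here (frame size, energy threshold, speech-frame ratio) are illustrative assumptions, not values from the patent.

```python
def is_noise(samples: list[float], frame_size: int = 160,
             energy_threshold: float = 0.01, speech_ratio: float = 0.2) -> bool:
    """Crude VAD stand-in: split the signal into frames and count
    frames whose mean energy exceeds a threshold; too few such frames
    means the recording is treated as noise/silence rather than speech."""
    frames = [samples[i:i + frame_size]
              for i in range(0, len(samples), frame_size)]
    speech_frames = sum(
        1 for f in frames
        if sum(x * x for x in f) / len(f) > energy_threshold
    )
    return speech_frames / max(len(frames), 1) < speech_ratio
```

A production system would replace this heuristic with a trained VAD model, but the usable/unusable decision it feeds into is the same.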
In step S110, for guidance information used for voiceprint registration, when the read-aloud speech is usable, the correspondence between the user and the voiceprint information extracted from the read-aloud speech is stored through registration.
Through the above embodiments, automatically generated guidance information can guide the user to speak the preset read-aloud text with the preset sound attribute. Moreover, verification and speech collection can be performed according to the user's speaking habits, and speech recognition can be performed with knowledge of the sound attribute the user uses, improving recognition accuracy. The above embodiments therefore broaden the population to which voiceprint registration applies, improve ease of use, and make the registration flow more efficient.
In some embodiments, after registration is completed, when the user later uses a related application or product, the user speaks with the preset sound attribute. By extracting the voiceprint from the user's speech and comparing it with the voiceprint stored in the registration stage, it can be determined whether the speaker is the registered user. An embodiment of the voiceprint verification method of the present invention is described below with reference to FIG. 1B.
FIG. 1B shows a schematic flowchart of a voiceprint verification method according to some embodiments of the present invention. As shown in FIG. 1B, the voiceprint verification method of this embodiment includes steps S112 to S120, where steps S112 to S118 are implemented as described for steps S102 to S108 of the FIG. 1A embodiment, which is not repeated here.
In step S120, for guidance information used for voiceprint verification, when the read-aloud speech is usable, the user passes identity verification if the extracted voiceprint matches the stored voiceprint corresponding to the user.
Through the above embodiments, automatically generated guidance information can also guide the user, in the voiceprint verification stage, to speak the preset read-aloud text with the preset sound attribute. Moreover, verification and speech collection can be performed according to the user's speaking habits, and speech recognition can be performed with knowledge of the sound attribute the user uses, improving recognition accuracy. The above embodiments therefore broaden the population to which voiceprint-based verification applies, improve ease of use, and make the verification flow more efficient.
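The store-then-compare relationship between registration (store the user-to-voiceprint correspondence) and verification (match a freshly extracted voiceprint against the stored one) can be sketched as below. The cosine-similarity comparison and its threshold are illustrative assumptions; the patent does not prescribe a particular matching metric.

```python
import math

SIMILARITY_THRESHOLD = 0.8  # illustrative matching threshold
enrolled: dict[str, list[float]] = {}  # user id -> stored voiceprint vector

def enroll(user_id: str, voiceprint: list[float]) -> None:
    """Registration: store the correspondence between user and voiceprint."""
    enrolled[user_id] = voiceprint

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def verify(user_id: str, voiceprint: list[float]) -> bool:
    """Verification: pass only if a voiceprint is stored for this user and
    the new voiceprint is similar enough to it."""
    stored = enrolled.get(user_id)
    return stored is not None and cosine(stored, voiceprint) >= SIMILARITY_THRESHOLD
```

The usability checks of steps S106–S108 would run before either call, so only usable read-aloud speech ever reaches `enroll` or `verify`.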
In some embodiments, the guidance information is guidance speech, so that the user is prompted by voice to speak the preset content with the corresponding sound attribute. An embodiment of the guidance speech output method is described below with reference to FIG. 2.
FIG. 2 shows a schematic flowchart of a guidance speech output method according to some embodiments of the present invention. As shown in FIG. 2, the guidance speech output method of this embodiment includes steps S202 to S206.
In step S202, guidance text is generated according to the read-aloud text and the flow text for guiding the user to speak the read-aloud text with the preset sound attribute.
In step S204, the guidance text is converted into guidance speech.
In step S206, the guidance speech is played to the user.
In some embodiments, Text To Speech (TTS) technology is used to convert the guidance text into guidance speech. In some embodiments, the generated guidance speech carries a certain emotion or intonation, which can make the voice guidance process more engaging and interactive. In some embodiments, the generated guidance speech has the preset sound attribute, which better helps users of the corresponding sound attribute understand what is to be said and speak more accurately with that attribute.
By playing the guidance information as speech, users with reading difficulties, such as those who are illiterate, partially literate, visually impaired, or blind, can also obtain the information to be read aloud, broadening the applicability of voiceprint recognition applications.
In some embodiments, when the user's read-aloud speech is found not to meet the requirements, corresponding correction information can be given according to the specific situation, to guide the user to speak the preset content more accurately. An embodiment of the reading-correction method of the present invention is described below with reference to FIG. 3.
FIG. 3 shows a schematic flowchart of a reading-correction method according to some embodiments of the present invention. As shown in FIG. 3, the reading-correction method of this embodiment includes steps S302 to S312.
In step S302, after the guidance information is output to the user, the user's read-aloud speech is acquired. For the generation of the guidance information, refer to the foregoing embodiments, which are not repeated here.
In step S304, the sound attribute of the read-aloud speech is determined, the text content corresponding to the read-aloud speech is recognized, and whether the read-aloud speech is noise is detected.
In step S306, when the sound attribute of the read-aloud speech is the preset sound attribute, the text content corresponding to the read-aloud speech matches the read-aloud text, and the read-aloud speech is not noise, it is determined that the read-aloud speech is usable.
In step S308, the correspondence between the user and the voiceprint information extracted from the read-aloud speech is stored.
In step S310, when the sound attribute of the read-aloud speech is not the preset sound attribute, or the text content corresponding to the read-aloud speech does not match the read-aloud text, or the read-aloud speech is noise, it is determined that the read-aloud speech is unusable.
In step S312, when the read-aloud speech is unusable, corresponding reading-correction information is output as guidance information according to the unusable type of the read-aloud speech, where the unusable types include a mismatched sound attribute, incomplete content of the read-aloud speech, and the read-aloud speech being noise. The method then returns to step S302 to continue acquiring the user's read-aloud speech.
For example, when the preset dialect type is Mandarin but the user speaks the read-aloud speech in Sichuanese, the reading-correction information is something like "please repeat in Mandarin"; when the user has spoken only part of the read-aloud text, the correction information is something like "content is missing, please say it again in full"; when the read-aloud speech is noise, the correction information is "cannot hear clearly, please go to a quiet place and say it again". Each kind of correction information may also include information corresponding to the read-aloud text. The reading-correction information is output to the user by text display, voice playback, or the like.
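The mapping from unusable type to correction message in the examples above can be sketched as a small lookup table. The type keys and message strings are illustrative paraphrases, not exact strings from the patent.

```python
# Hypothetical unusable-type keys mapped to the example correction messages.
CORRECTION_MESSAGES = {
    "attribute_mismatch": "Please repeat in {dialect}.",
    "incomplete_content": "Content is missing; please say it again in full.",
    "noise": "Cannot hear clearly; please go to a quiet place and say it again.",
}

def make_correction(unusable_type: str, dialect: str = "Mandarin") -> str:
    """Pick the correction message for the given unusable type; the dialect
    name is only substituted into messages that reference it."""
    return CORRECTION_MESSAGES[unusable_type].format(dialect=dialect)
```

The returned message would then be fed back into the guidance information (displayed or synthesized to speech) before the flow loops back to re-acquire the user's read-aloud speech.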
Through the method of the above embodiments, guidance and correction information can be automatically generated during voiceprint applications such as registration and recognition of the user's voiceprint, helping the user complete voiceprint-related operations more quickly and accurately.
In some embodiments, when generating the read-aloud text, characters, words, or sentences whose frequency of occurrence is below a preset value are selected, reducing the possibility that an attacker records audio in advance to mount an attack. In some embodiments, meaningless "made-up words" can also be automatically generated, to further reduce the possibility of attack. An embodiment of the read-aloud text generation method is described below with reference to FIG. 4.
FIG. 4 shows a schematic flowchart of a read-aloud text generation method according to some embodiments of the present invention. As shown in FIG. 4, the read-aloud text generation method of this embodiment includes steps S402 to S406.
In step S402, a preset number of characters are randomly selected from a character library to form a candidate text, and the phoneme combination of the candidate text is determined.
In step S404, the frequency of occurrence of the phoneme combination in a preset text corpus is detected.
In step S406, when the frequency of occurrence of the phoneme combination is lower than a preset frequency, the candidate text is used as the read-aloud text.
For example, when "鸣" (míng) and "添" (tiān) are randomly selected to form a candidate text, although the existing lexicon does not contain the word "鸣添", that word contains exactly the same phonemes as the common word "明天" ("tomorrow"), so the phoneme combination of this candidate text occurs with high frequency.
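Steps S402 to S406 can be sketched as follows, including the "鸣添 sounds like 明天" rejection case from the example above. The pinyin table, toy corpus, and frequency threshold are illustrative stand-ins for the character library, text corpus, and preset frequency.

```python
import random

# Toy pinyin table and corpus; note "鸣添" shares its phonemes with "明天".
PINYIN = {"鸣": "ming2", "添": "tian1", "明": "ming2",
          "天": "tian1", "较": "jiao4", "奇": "qi2"}
CORPUS = ["明天", "明天", "明天", "奇怪"]  # each entry counts one occurrence

def phoneme_combo(text: str) -> tuple:
    return tuple(PINYIN[ch] for ch in text)

def combo_frequency(combo: tuple, corpus: list[str]) -> float:
    """Fraction of corpus entries whose phoneme combination equals `combo`."""
    hits = sum(1 for w in corpus
               if tuple(PINYIN.get(c, "?") for c in w) == combo)
    return hits / len(corpus)

def pick_read_aloud_text(chars: list[str], n: int = 2,
                         max_freq: float = 0.25, rng=random) -> str:
    """Keep sampling random candidate texts until one has a rare enough
    phoneme combination (step S406)."""
    while True:
        candidate = "".join(rng.sample(chars, n))
        if combo_frequency(phoneme_combo(candidate), CORPUS) < max_freq:
            return candidate
```

Under this sketch the candidate "鸣添" is rejected because its phoneme combination is as frequent as "明天", while a made-up word such as "较奇" passes.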
In some embodiments, when the guidance information takes the form of guidance speech, the description text of each randomly selected character is looked up as part of the flow text, so that the description text can be converted into speech for playback. For example, if the made-up word is "较奇", then "较 as in 比较 (compare)" and "奇 as in 奇怪 (strange)" can serve as the description texts of the two characters, and the content of the guidance speech is, for example, "please say 较奇: 较 as in 比较, 奇 as in 奇怪". In this way, the user can be helped to understand the content to be read aloud more vividly, so that the user can speak the preset content more accurately.
In some embodiments, when a made-up word is used as the read-aloud text and the user's read-aloud speech is recognized, an isolated-word recognition model corresponding to the preset sound attribute is used to recognize the read-aloud speech and obtain the corresponding text content. Usually, a speech recognition model contains an n-gram language model, for example a 2-gram (bigram) or 3-gram (trigram) model. When made-up words are used, an isolated-word recognition model can be used during recognition; such a model uses, for example, a 1-gram (unigram) model or no language model at all, so that recognition proceeds from the acoustic perspective only, without semantics affecting the recognition result.
Through the method of the above embodiments, low-arousal text can be automatically synthesized, improving security.
An embodiment of the voiceprint registration apparatus is described below with reference to FIG. 5.
FIG. 5 shows a schematic structural diagram of a voiceprint registration apparatus according to some embodiments of the present invention. As shown in FIG. 5, the voiceprint registration apparatus 500 of this embodiment includes: a guidance information generation module 5100 configured to generate guidance information according to a read-aloud text and a flow text for guiding a user to speak the read-aloud text with a preset sound attribute; a read-aloud speech acquisition module 5200 configured to acquire the user's read-aloud speech after the guidance information is output to the user; a read-aloud speech parsing module 5300 configured to determine the sound attribute of the read-aloud speech and recognize the text content corresponding to the read-aloud speech; a usability determination module 5400 configured to determine that the read-aloud speech is usable in a case where the sound attribute of the read-aloud speech is the preset sound attribute and the text content corresponding to the read-aloud speech matches the read-aloud text; and a storage module 5500 configured to, for guidance information used for voiceprint registration, store the correspondence between the user and the voiceprint information extracted from the read-aloud speech in a case where the read-aloud speech is usable.
In some embodiments, the guidance information is guidance speech.
In some embodiments, the guidance information generation module 5100 is further configured to generate guidance speech with the preset sound attribute according to the read-aloud text and the flow text.
In some embodiments, the usability determination module 5400 is further configured to determine that the read-aloud speech is usable when the sound attribute of the read-aloud speech is the preset sound attribute, the text content corresponding to the read-aloud speech matches the read-aloud text, and the read-aloud speech is not noise.
In some embodiments, the usability determination module 5400 is further configured to determine that the read-aloud speech is unusable when its sound attribute is not the preset sound attribute, or its corresponding text content does not match the read-aloud text, or the read-aloud speech is noise; the guidance information generation module 5100 is further configured to, when the read-aloud speech is unusable, generate corresponding reading-correction information according to the unusable type of the read-aloud speech and add it to the guidance information, where the unusable types include a mismatched sound attribute, incomplete content of the read-aloud speech, and the read-aloud speech being noise.
In some embodiments, the text content corresponding to the read-aloud speech matches the read-aloud text when the text content includes the read-aloud text, or when the phoneme sequence corresponding to the text content includes the phoneme sequence corresponding to the read-aloud text.
In some embodiments, the voiceprint registration apparatus 500 further includes: a sound attribute acquisition module 5600 configured to acquire the sound attribute in registration information input by the user as the preset sound attribute; or, before the guidance information is generated, to collect the user's speech and determine the sound attribute of the collected speech as the preset sound attribute.
In some embodiments, the read-aloud speech parsing module 5300 is further configured to input speech features of the read-aloud speech into a preset sound-attribute classification model to obtain the sound attribute of the read-aloud speech.
In some embodiments, the read-aloud speech parsing module 5300 is further configured to input speech features of the read-aloud speech into a preset neural network model to obtain a speech embedding feature vector extracted by a hidden layer of the neural network model; compute the distance between the speech embedding feature vector and the preset speech embedding feature vector of each sound attribute, and determine the shortest distance among them; when the shortest distance is not greater than a preset distance threshold, determine the sound attribute corresponding to the shortest distance as the sound attribute of the read-aloud speech; and when the shortest distance is greater than the preset distance threshold, determine the sound attribute of the read-aloud speech as an unknown attribute.
In some embodiments, the read-aloud speech parsing module 5300 is further configured to determine the text content corresponding to the read-aloud speech by using a speech recognition model corresponding to the sound attribute of the read-aloud speech.
In some embodiments, the voiceprint registration apparatus 500 further includes: a read-aloud text generation module 5700 configured to randomly select a preset number of characters from a character library to form a candidate text and determine the phoneme combination of the candidate text; detect the frequency of occurrence of the phoneme combination in a preset text corpus; and, when the frequency of occurrence of the phoneme combination is lower than a preset frequency, use the candidate text as the read-aloud text.
In some embodiments, the read-aloud speech parsing module 5300 is further configured to recognize the read-aloud speech by using an isolated-word recognition model corresponding to the preset sound attribute, and obtain the text content corresponding to the read-aloud speech.
In some embodiments, the guidance information is guidance speech, and the flow text includes description text corresponding to each randomly selected character.
In some embodiments, the sound attribute is a dialect type.
In some embodiments, the voiceprint registration apparatus 500 further includes a verification module 5800 configured to, for guidance information used for voiceprint verification, in a case where the read-aloud speech is usable, pass the user's identity verification if the extracted voiceprint matches the stored voiceprint corresponding to the user.
An embodiment of the voiceprint registration system of the present invention is described below with reference to FIG. 6.
FIG. 6 shows a schematic structural diagram of a voiceprint registration system according to some embodiments of the present invention. As shown in FIG. 6, the voiceprint registration system 60 of this embodiment includes: a voiceprint registration device 610, whose specific implementation may refer to the voiceprint registration device 500; an output device 620 configured to output the guidance information generated by the voiceprint registration device; and a recording device 630 configured to record the user's read-aloud speech.
In some embodiments, the output device 620 is a sound output device.
FIG. 7 shows a schematic structural diagram of a voiceprint registration apparatus according to other embodiments of the present invention. As shown in FIG. 7, the voiceprint registration apparatus 70 of this embodiment includes: a memory 710 and a processor 720 coupled to the memory 710, the processor 720 being configured to execute the voiceprint registration method in any one of the foregoing embodiments based on instructions stored in the memory 710.
The memory 710 may include, for example, a system memory, a fixed non-volatile storage medium, and the like. The system memory stores, for example, an operating system, application programs, a boot loader (Boot Loader), and other programs.
FIG. 8 shows a schematic structural diagram of a voiceprint registration apparatus according to still other embodiments of the present invention. As shown in FIG. 8, the voiceprint registration apparatus 80 of this embodiment includes: a memory 810 and a processor 820, and may further include an input/output interface 830, a network interface 840, a storage interface 850, and the like. These interfaces 830, 840, 850 and the memory 810 and the processor 820 may be connected, for example, through a bus 860. The input/output interface 830 provides a connection interface for input/output devices such as a display, a mouse, a keyboard, and a touch screen. The network interface 840 provides a connection interface for various networked devices. The storage interface 850 provides a connection interface for external storage devices such as SD cards and USB flash drives.
Embodiments of the present invention also provide a computer-readable storage medium on which a computer program is stored, where the program, when executed by a processor, implements any one of the foregoing voiceprint registration methods.
Those skilled in the art should understand that the embodiments of the present invention may be provided as a method, a system, or a computer program product. Therefore, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable non-transitory storage media (including but not limited to disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to work in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operation steps are executed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
The above are only preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.

Claims (20)

  1. A voiceprint registration method, comprising:
    generating guidance information according to a read-aloud text and a flow text for guiding a user to speak the read-aloud text with a preset sound attribute;
    after the guidance information is output to the user, acquiring the user's read-aloud speech;
    determining the sound attribute of the read-aloud speech, and recognizing text content corresponding to the read-aloud speech;
    in a case where the sound attribute of the read-aloud speech is the preset sound attribute and the text content corresponding to the read-aloud speech matches the read-aloud text, determining that the read-aloud speech is usable; and
    for guidance information used for voiceprint registration, in a case where the read-aloud speech is usable, storing a correspondence between the user and voiceprint information extracted from the read-aloud speech.
  2. The voiceprint registration method according to claim 1, wherein the guidance information is guidance speech.
  3. The voiceprint registration method according to claim 2, wherein guidance speech having the preset sound attribute is generated according to the read-aloud text and the flow text.
  4. The voiceprint registration method according to claim 1, wherein
    in a case where the sound attribute of the read-aloud speech is the preset sound attribute, the text content corresponding to the read-aloud speech matches the read-aloud text, and the read-aloud speech is not noise, the read-aloud speech is determined to be usable.
  5. The voiceprint registration method according to claim 1, further comprising:
    in a case where the sound attribute of the read-aloud speech is not the preset sound attribute, or the text content corresponding to the read-aloud speech does not match the read-aloud text, or the read-aloud speech is noise, determining that the read-aloud speech is unusable; and
    in a case where the read-aloud speech is unusable, outputting corresponding reading-correction information according to an unusable type of the read-aloud speech, wherein the unusable types comprise a mismatched sound attribute, incomplete content of the read-aloud speech, and the read-aloud speech being noise.
  6. The voiceprint registration method according to any one of claims 1 to 5, wherein the text content corresponding to the read-aloud speech matches the read-aloud text in a case where the text content corresponding to the read-aloud speech includes the read-aloud text, or in a case where a phoneme sequence corresponding to the text content corresponding to the read-aloud speech includes a phoneme sequence corresponding to the read-aloud text.
  7. The voiceprint registration method according to claim 1, further comprising:
    acquiring a sound attribute in registration information input by the user as the preset sound attribute; or,
    before generating the guidance information, collecting the user's speech, and determining the sound attribute of the collected speech as the preset sound attribute.
  8. The voiceprint registration method according to claim 1, wherein the determining the sound attribute of the read-aloud speech comprises:
    inputting speech features of the read-aloud speech into a preset sound-attribute classification model to obtain the sound attribute of the read-aloud speech.
  9. The voiceprint registration method according to claim 1, wherein the determining the sound attribute of the read-aloud speech comprises:
    inputting speech features of the read-aloud speech into a preset neural network model to obtain a speech embedding feature vector extracted by a hidden layer of the neural network model;
    computing distances between the speech embedding feature vector and a preset speech embedding feature vector of each sound attribute, and determining the shortest distance among them;
    in a case where the shortest distance is not greater than a preset distance threshold, determining the sound attribute corresponding to the shortest distance as the sound attribute of the read-aloud speech; and
    in a case where the shortest distance is greater than the preset distance threshold, determining the sound attribute of the read-aloud speech as an unknown attribute.
  10. The voiceprint registration method according to claim 1, wherein the recognizing the text content corresponding to the read-aloud speech comprises:
    determining the text content corresponding to the read-aloud speech by using a speech recognition model corresponding to the sound attribute of the read-aloud speech.
  11. The voiceprint registration method according to claim 1, further comprising:
    randomly selecting a preset number of characters from a character library to form a candidate text, and determining a phoneme combination of the candidate text;
    detecting a frequency of occurrence of the phoneme combination in a preset text corpus; and
    in a case where the frequency of occurrence of the phoneme combination is lower than a preset frequency, using the candidate text as the read-aloud text.
  12. The voiceprint registration method according to claim 11, wherein the read-aloud speech is recognized by using an isolated-word recognition model corresponding to the preset sound attribute, and the text content corresponding to the read-aloud speech is obtained.
  13. The voiceprint registration method according to claim 11, wherein the guidance information is guidance speech, and the flow text includes description text corresponding to each randomly selected character.
  14. The voiceprint registration method according to claim 1, wherein the sound attribute is a dialect type.
  15. The voiceprint registration method according to any one of claims 1 to 5, further comprising:
    for guidance information used for voiceprint verification, in a case where the read-aloud speech is usable, passing identity verification of the user if the extracted voiceprint matches a stored voiceprint corresponding to the user.
  16. A voiceprint registration apparatus, comprising:
    a guidance information generation module configured to generate guidance information according to a read-aloud text and a flow text for guiding a user to speak the read-aloud text with a preset sound attribute;
    a read-aloud speech acquisition module configured to acquire the user's read-aloud speech after the guidance information is output to the user;
    a read-aloud speech parsing module configured to determine the sound attribute of the read-aloud speech and recognize text content corresponding to the read-aloud speech;
    a usability determination module configured to determine that the read-aloud speech is usable in a case where the sound attribute of the read-aloud speech is the preset sound attribute and the text content corresponding to the read-aloud speech matches the read-aloud text; and
    a storage module configured to, for guidance information used for voiceprint registration, store a correspondence between the user and voiceprint information extracted from the read-aloud speech in a case where the read-aloud speech is usable.
  17. A voiceprint registration system, comprising:
    the voiceprint registration apparatus according to claim 16;
    an output device configured to output the guidance information generated by the voiceprint registration apparatus; and
    a recording device configured to record the user's read-aloud speech.
  18. The voiceprint registration system according to claim 17, wherein the output device is a sound output device.
  19. A voiceprint registration apparatus, comprising:
    a memory; and
    a processor coupled to the memory, the processor being configured to execute the voiceprint registration method according to any one of claims 1 to 15 based on instructions stored in the memory.
  20. A non-transitory computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the voiceprint registration method according to any one of claims 1 to 15.
PCT/CN2021/093285 2020-09-21 2021-05-12 Voiceprint registration method and apparatus, and computer-readable storage medium WO2022057283A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010996045.4A CN112309406A (zh) 2020-09-21 2020-09-21 Voiceprint registration method and apparatus, and computer-readable storage medium
CN202010996045.4 2020-09-21

Publications (1)

Publication Number Publication Date
WO2022057283A1 true WO2022057283A1 (zh) 2022-03-24

Family

ID=74488609

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/093285 WO2022057283A1 (zh) 2020-09-21 2021-05-12 Voiceprint registration method and apparatus, and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN112309406A (zh)
WO (1) WO2022057283A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112309406A (zh) 2020-09-21 2021-02-02 北京沃东天骏信息技术有限公司 Voiceprint registration method and apparatus, and computer-readable storage medium
CN113570754B * 2021-07-01 2022-04-29 汉王科技股份有限公司 Voiceprint lock control method and apparatus, and electronic device
CN116612766B * 2023-07-14 2023-11-17 北京中电慧声科技有限公司 Conference system with voiceprint registration function, and voiceprint registration method

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160035349A1 (en) * 2014-07-29 2016-02-04 Samsung Electronics Co., Ltd. Electronic apparatus and method of speech recognition thereof
CN108231090A (zh) * 2018-01-02 2018-06-29 深圳市酷开网络科技有限公司 Text reading-aloud level evaluation method and apparatus, and computer-readable storage medium
CN108989341A (zh) * 2018-08-21 2018-12-11 平安科技(深圳)有限公司 Autonomous voice registration method and apparatus, computer device, and storage medium
CN109473106A (zh) * 2018-11-12 2019-03-15 平安科技(深圳)有限公司 Voiceprint sample collection method and apparatus, computer device, and storage medium
CN109510844A (zh) * 2019-01-16 2019-03-22 中民乡邻投资控股有限公司 Voiceprint-based conversational account registration method and apparatus
CN111090846A (zh) * 2019-12-06 2020-05-01 中信银行股份有限公司 Login authentication method and apparatus, electronic device, and computer-readable storage medium
CN111161746A (zh) * 2019-12-31 2020-05-15 苏州思必驰信息科技有限公司 Voiceprint registration method and system
CN111161706A (zh) * 2018-10-22 2020-05-15 阿里巴巴集团控股有限公司 Interaction method, apparatus, device, and system
CN111477234A (zh) * 2020-03-05 2020-07-31 厦门快商通科技股份有限公司 Voiceprint data registration method, apparatus, and device
CN112309406A (zh) * 2020-09-21 2021-02-02 北京沃东天骏信息技术有限公司 Voiceprint registration method and apparatus, and computer-readable storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004171174A (ja) * 2002-11-19 2004-06-17 Brother Ind Ltd Text reading-aloud device, program for reading aloud, and recording medium
CN107346568B (zh) * 2016-05-05 2020-04-17 阿里巴巴集团控股有限公司 Authentication method and apparatus for an access control system
EP3744152A4 (en) * 2018-01-22 2021-07-21 Nokia Technologies Oy DEVICE AND PROCEDURE FOR PRIVACY-PRESERVING VOICEPRINT AUTHENTICATION
CN109473108A (zh) * 2018-12-15 2019-03-15 深圳壹账通智能科技有限公司 Voiceprint-recognition-based identity verification method, apparatus, device, and storage medium
CN111091837A (zh) * 2019-12-27 2020-05-01 中国人民解放军陆军工程大学 Time-varying voiceprint authentication method and system based on online learning


Also Published As

Publication number Publication date
CN112309406A (zh) 2021-02-02

Similar Documents

Publication Publication Date Title
WO2022057283A1 (zh) Voiceprint registration method and apparatus, and computer-readable storage medium
Lei et al. Dialect classification via text-independent training and testing for Arabic, Spanish, and Chinese
JP4085130B2 (ja) 感情認識装置
US11676572B2 (en) Instantaneous learning in text-to-speech during dialog
Singh Forensic and Automatic Speaker Recognition System.
Khelifa et al. Constructing accurate and robust HMM/GMM models for an Arabic speech recognition system
Kopparapu Non-linguistic analysis of call center conversations
Lazaridis et al. Swiss French Regional Accent Identification.
Kanabur et al. An extensive review of feature extraction techniques, challenges and trends in automatic speech recognition
US20180012602A1 (en) System and methods for pronunciation analysis-based speaker verification
Li et al. Cost-sensitive learning for emotion robust speaker recognition
US11615787B2 (en) Dialogue system and method of controlling the same
JP2010197644A (ja) Speech recognition system
Huang et al. Unsupervised discriminative training with application to dialect classification
KR20130126570A Apparatus for discriminative training of acoustic models considering phoneme error results in keywords, and computer-readable recording medium recording a method therefor
Ziedan et al. A unified approach for arabic language dialect detection
Mengistu Automatic text independent amharic language speaker recognition in noisy environment using hybrid approaches of LPCC, MFCC and GFCC
Rao et al. Language identification using excitation source features
KR101598950B1 Pronunciation evaluation apparatus, and computer-readable recording medium recording a program for a pronunciation evaluation method using the same
Bisikalo et al. Precision Automated Phonetic Analysis of Speech Signals for Information Technology of Text-dependent Authentication of a Person by Voice.
Kruspe et al. A GMM approach to singing language identification
CN112420022B (zh) Noise extraction method, apparatus, device, and storage medium
JP6517417B1 Evaluation system, speech recognition device, evaluation program, and speech recognition program
KR20180057315A System and method for discriminating natural-language utterance speech
Hosier et al. Disambiguation and Error Resolution in Call Transcripts

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21868121

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21868121

Country of ref document: EP

Kind code of ref document: A1