WO2023185004A1 - Timbre switching method and device - Google Patents

Timbre switching method and device

Info

Publication number
WO2023185004A1
Authority
WO
WIPO (PCT)
Prior art keywords
voiceprint
user
voice command
timbre
target voice
Prior art date
Application number
PCT/CN2022/132585
Other languages
English (en)
French (fr)
Inventor
张凯月
张桂芳
Original Assignee
青岛海尔空调器有限总公司
青岛海尔空调电子有限公司
海尔智家股份有限公司
Priority date
Filing date
Publication date
Application filed by 青岛海尔空调器有限总公司, 青岛海尔空调电子有限公司, 海尔智家股份有限公司
Publication of WO2023185004A1

Links

Images

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003 - Changing voice quality, e.g. pitch or formants
    • G10L21/007 - Changing voice quality, e.g. pitch or formants characterised by the process used
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/06 - Decision making techniques; Pattern matching strategies
    • G10L17/14 - Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 - Execution procedure of a spoken command

Definitions

  • the present application relates to the field of artificial intelligence technology, and in particular to a timbre switching method.
  • the current existing timbre switching method requires users to use a mobile terminal to open an application (Application, APP) to perform manual switching.
  • the air conditioner is shared by several members of the same family, and each person has to adjust the timbre before use, which is very cumbersome.
  • This application provides a timbre switching method and device to solve the defects of timbre switching in the prior art and realize convenient and intelligent timbre switching.
  • This application provides a timbre switching method, including: receiving a target voice command; performing voiceprint recognition on the target voice command to obtain a voiceprint recognition result; and setting a response timbre mode according to the voiceprint recognition result.
  • performing voiceprint recognition on the target voice command and obtaining the voiceprint recognition result includes: determining the voiceprint features of the target voice command; comparing the voiceprint features with the features of all entered voiceprints; if the object sending the target voice instruction is a target registered user, determining the first age information in the registration information of the target registered user; and, according to the first age information, determining the user category of the target registered user as the voiceprint recognition result.
  • according to a timbre switching method, after comparing the voiceprint features with the features of all entered voiceprints, the method further includes: if the object sending the target voice instruction is not a registered user, performing age analysis on the voiceprint features to determine the second age information of that object; and, according to the second age information, determining the user category of the object sending the target voice instruction as the voiceprint recognition result.
  • before comparing the voiceprint features with the features of all entered voiceprints, the method further includes: receiving a voiceprint entry instruction; generating a voiceprint entry prompt; upon receiving a voiceprint test voice from any user, determining that user's entered voiceprint and extracting its features; generating an age entry prompt; and determining the user's registration information from the entered voiceprint features and the entered age, and generating an entry completion prompt. The entered age is input by the user in response to the age entry prompt.
  • setting a response timbre mode according to the voiceprint recognition result includes: if the user category is determined to be a child, setting the response timbre mode to a child timbre mode; if the user category is determined to be an adult, setting the response timbre mode to the default timbre mode; and if the user category is determined to be an elderly person, setting the response timbre mode to an elderly timbre mode.
  • determining the voiceprint features of the target voice command includes: pre-emphasizing the target voice command; framing the pre-emphasized voice command; windowing the framed voice command; and performing voiceprint extraction on the windowed voice command to obtain the voiceprint features of the target voice command.
  • This application also provides a timbre switching device, including:
  • the receiving unit receives the target voice command
  • the acquisition unit performs voiceprint recognition on the target voice command and obtains the voiceprint recognition result
  • the determining unit sets the response tone mode according to the voiceprint recognition result.
  • This application also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor.
  • when the processor executes the program, it implements any one of the above timbre switching methods.
  • the present application also provides a non-transitory computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, any one of the above timbre switching methods is implemented.
  • the present application also provides a computer program product, which includes a computer program; when the computer program is executed by a processor, it implements any one of the above timbre switching methods.
  • the timbre switching method and device provided by this application can identify different user attributes by analyzing the user's voice and using voiceprint recognition, and automatically switch to the user's preferred response timbre to achieve convenient and intelligent voice switching.
  • FIG. 1 is one of the flow diagrams of the timbre switching method provided by this application.
  • FIG. 2 is the second schematic flow chart of the timbre switching method provided by this application.
  • FIG. 3 is a schematic structural diagram of the timbre switching device provided by this application.
  • Figure 4 is a schematic structural diagram of an electronic device provided by this application.
  • the execution subject may be an electronic device or a software or functional module or functional entity in the electronic device that can implement the timbre switching method.
  • the electronic device includes but is not limited to smart air conditioning equipment. It should be noted that the above execution entities do not constitute a limitation on this application.
  • Figure 1 is one of the flow diagrams of the timbre switching method provided by this application. As shown in Figure 1, it includes but is not limited to the following steps:
  • in step S1, a target voice command is received.
  • the user who sends the target voice command can be a registered user who has entered a voiceprint, or an unregistered user who has not entered a voiceprint.
  • in step S2, voiceprint recognition is performed on the target voice instruction to obtain a voiceprint recognition result.
  • after the target voice command is obtained, it is pre-processed by pre-emphasis, framing and windowing, and the pre-processed command is converted into a voiceprint feature map.
  • the voiceprint feature map can be a Mel energy spectrogram.
  • the Mel energy spectrogram is obtained by processing the spectrogram (a description of the human vocal system) with a Mel filter bank (which simulates the human cochlea); it is therefore a description of the human auditory system.
  • the Mel energy spectrogram represents the frequency distribution of sounds that people can hear, which is the deep feature people use to distinguish things by sound. This distribution characteristic in the Mel frequency domain is well suited to building a speaker recognition system.
  • through such a conversion, the speech signal becomes an image carrying voiceprint information.
  • for a single signal, its Mel energy spectrogram is black and white and can be understood as a single-channel feature map.
  • the voiceprint feature map is input into the pre-trained age recognition neural network model to obtain the age information of the user who sends the target voice command. This achieves intelligent recognition of people and turns a user-operated air conditioner into one that actively serves its users, which is extremely convenient.
  • the age recognition neural network model has been trained with a large amount of sample data.
  • the sample data includes the age information of the sample user and the voiceprint feature map of the sample user. Therefore, the user's age information can be output after inputting the user's voiceprint feature map.
  • the user category corresponding to the target voice command can be determined, and the user category is used as the voiceprint recognition result.
  • User categories can include: children, adults, and seniors.
  • in step S3, a response timbre mode is set according to the voiceprint recognition result.
  • according to the user category in the voiceprint recognition result, the text-to-speech (TTS) broadcast timbre best suited to children or the elderly is selected.
  • the timbre switching method provided by this application can identify different user attributes by analyzing the user's voice and using voiceprint recognition, and automatically switches to the user's preferred response timbre to achieve convenient and intelligent voice switching.
  • determining the voiceprint features of the target voice instruction includes pre-emphasis, framing, windowing and voiceprint extraction.
  • because the average power spectrum of the speech signal is affected by glottal excitation and oral-nasal radiation, the high-frequency end attenuates at about 6 decibels per octave (dB/oct) above roughly 800 Hz, so the high-frequency part has to be boosted before analysis.
  • digital filters can be used to pre-emphasize the target voice command.
  • the voiceprint signal is divided into frames at intervals of 10 to 20 milliseconds (ms), with one frame as a basic unit, to frame the pre-emphasized voice command.
  • a Hamming window function is used to window the framed speech instructions.
  • before comparing the voiceprint features with all entered voiceprint features, the method also includes voiceprint enrolment.
  • the entered age is input by any user in response to the age entry prompt.
  • after receiving the voiceprint entry instruction, the smart air conditioner switches to the voiceprint entry mode and issues a voice prompt reminding the user to record the voiceprint test voice.
  • the user repeats the voiceprint test voice two or more times.
  • after each utterance, the filter-bank (Fbank) feature information of that segment of test voice is extracted, and the voiceprint recognition model converts the Fbank feature information into the voiceprint features of that segment of speech.
  • the voiceprint features of the individual utterances are averaged and used as the features of the voiceprint entered by the user; the smart air conditioner generates the age entry prompt, and after receiving the entered age sent by the user, it stores the entered voiceprint and the entered age as the user's registration information, and the voice broadcast module announces that the entry is successful.
  • the voiceprint recognition model is a deep neural network model trained on thousands of hours of Chinese corpus, with strong noise resistance and robustness.
  • performing voiceprint recognition on the target voice command and obtaining the voiceprint recognition result includes the following.
  • if the object sending the target voice instruction is a target registered user, the first age information is determined from the registration information of the target registered user.
  • according to the first age information, the user category of the target registered user is determined as the voiceprint recognition result.
  • the Fbank feature information of the target voice command is extracted and input into the voiceprint recognition model, whose output is the voiceprint feature of the target voice command.
  • similarity is computed between the voiceprint feature of the target voice command and the entered voiceprint features stored for all registered users. If the highest similarity obtained is above the set voiceprint threshold, the user whose entered voiceprint feature yields that highest similarity is determined to be the user who issued the target voice command.
  • the age information can then be determined from that user's registration information and the voiceprint recognition result of the target voice command can be generated.
  • if the highest similarity is below the set voiceprint threshold, it is determined that the object sending the target voice command is not a registered user.
  • after comparing the voiceprint features with the features of all entered voiceprints, the method further includes the following.
  • if the object sending the target voice instruction is not a registered user, age analysis is performed on the voiceprint features, and the user category of that object is determined as the voiceprint recognition result according to the second age information.
  • both registration-based voiceprint entry and direct recognition of the age attribute of unregistered voiceprints support automatic recognition of the user's role and automatic switching of the timbre.
  • the voiceprint feature map is input into the pre-trained age recognition neural network model to obtain the age information of the user who sends the target voice command, and the voiceprint recognition result of the target voice command is generated.
  • setting a response timbre mode according to the voiceprint recognition result includes the following.
  • if the user category is a child, the response timbre mode is set to the child timbre mode.
  • if the user category is an adult, the response timbre mode is set to the default timbre mode.
  • if the user category is an elderly person, the response timbre mode is set to the elderly timbre mode.
  • when the response timbre mode is the child timbre mode, the child timbre is used for voice interaction and responses; when the response timbre mode is the default timbre mode, the timbre remains unchanged; when the response timbre mode is the elderly timbre mode, the elderly timbre is used for voice interaction and responses.
  • Figure 2 is the second schematic flow chart of the timbre switching method provided by this application. As shown in Figure 2, it includes:
  • the target voice command sent by the user is obtained
  • voiceprint recognition is performed on the target voice command.
  • the voiceprint recognition result shows that the user is a child
  • the voice is automatically switched to the child's voice
  • the voiceprint recognition result shows that the user is an adult
  • the voice remains unchanged
  • the voiceprint recognition result shows that the user is an elderly person, it will automatically switch to the elderly voice.
  • the timbre switching device provided by the present application will be described below.
  • the timbre switching device described below and the timbre switching method described above can be referenced correspondingly.
  • FIG 3 is a schematic structural diagram of the timbre switching device provided by this application. As shown in Figure 3, it includes:
  • the receiving unit 301 receives the target voice command
  • the acquisition unit 302 performs voiceprint recognition on the target voice command and obtains the voiceprint recognition result
  • the determining unit 303 sets the response timbre mode according to the voiceprint recognition result.
  • the receiving unit 301 receives the target voice instruction.
  • the user who sends the target voice command can be a registered user who has entered a voiceprint, or an unregistered user who has not entered a voiceprint.
  • the obtaining unit 302 performs voiceprint recognition on the target voice instruction and obtains the voiceprint recognition result.
  • after the target voice command is obtained, it is pre-processed by pre-emphasis, framing and windowing, and the pre-processed command is converted into a voiceprint feature map.
  • the voiceprint feature map can be a Mel energy spectrogram.
  • the Mel energy spectrogram is obtained by processing the spectrogram (a description of the human vocal system) with a Mel filter bank (which simulates the human cochlea); it is therefore a description of the human auditory system.
  • the Mel energy spectrogram represents the frequency distribution of sounds that people can hear, which is the deep feature people use to distinguish things by sound. This distribution characteristic in the Mel frequency domain is well suited to building a speaker recognition system.
  • through such a conversion, the speech signal becomes an image carrying voiceprint information.
  • for a single signal, its Mel energy spectrogram is black and white and can be understood as a single-channel feature map.
  • the voiceprint feature map is input into a pre-trained age recognition neural network model to obtain the age information of the target user. This achieves intelligent recognition of people and turns a user-operated air conditioner into one that actively serves its users, which is extremely convenient.
  • the age recognition neural network model has been trained with a large amount of sample data.
  • the sample data includes the sample user's voice signal and the sample user's voiceprint feature map. Therefore, the user's age information can be output after inputting the user's voiceprint feature map.
  • the user category corresponding to the target voice command can be determined, and the user category is used as the voiceprint recognition result.
  • User categories can include: children, adults, and seniors.
  • the determining unit 303 sets the response tone mode according to the voiceprint recognition result.
  • according to the user category in the voiceprint recognition result, the text-to-speech (TTS) broadcast timbre best suited to children or the elderly is selected.
  • the timbre switching device provided by this application can identify different user attributes by analyzing the user's voice and using voiceprint recognition, and automatically switches to the user's preferred response timbre to achieve convenient and intelligent voice switching.
  • FIG 4 is a schematic structural diagram of an electronic device provided by this application.
  • the electronic device may include: a processor (processor) 410, a communications interface (Communications Interface) 420, a memory (memory) 430 and a communication bus 440.
  • the processor 410, the communication interface 420, and the memory 430 complete communication with each other through the communication bus 440.
  • the processor 410 can call logical instructions in the memory 430 to execute a timbre switching method.
  • the method includes: receiving a target voice instruction; performing voiceprint recognition on the target voice instruction to obtain a voiceprint recognition result; and setting a response timbre mode according to the voiceprint recognition result.
  • the above-mentioned logical instructions in the memory 430 can be implemented in the form of software functional units and can be stored in a computer-readable storage medium when sold or used as an independent product.
  • the technical solution of the present application, in essence, or the part that contributes to the prior art, or part of the technical solution, can be embodied in the form of a software product.
  • the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the embodiments of this application.
  • the aforementioned storage media include media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.
  • the present application also provides a computer program product.
  • the computer program product includes a computer program.
  • the computer program can be stored on a non-transitory computer-readable storage medium.
  • when the computer program is executed by a processor, the computer can execute the timbre switching method provided by the above methods, which includes: receiving a target voice command; performing voiceprint recognition on the target voice command to obtain a voiceprint recognition result; and setting a response timbre mode according to the voiceprint recognition result.
  • the present application also provides a non-transitory computer-readable storage medium on which a computer program is stored.
  • when executed by a processor, the computer program implements the timbre switching method provided by the above methods.
  • the method includes: receiving the target voice command; performing voiceprint recognition on the target voice command to obtain the voiceprint recognition result; and setting the response timbre mode according to the voiceprint recognition result.
  • the device embodiments described above are only illustrative.
  • the units described as separate components may or may not be physically separated.
  • the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. A person of ordinary skill in the art can understand and implement it without creative effort.
  • each embodiment can be implemented by software plus a necessary general hardware platform, and of course, it can also be implemented by hardware.
  • the computer software product can be stored in a computer-readable storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., including a number of instructions to cause a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods described in various embodiments or certain parts of the embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Business, Economics & Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Game Theory and Decision Science (AREA)
  • Telephonic Communication Services (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

A timbre switching method, device, electronic device, readable storage medium and program product. The method includes: receiving a target voice instruction (S1); performing voiceprint recognition on the target voice instruction to obtain a voiceprint recognition result (S2); and setting a response timbre mode according to the voiceprint recognition result (S3). By analyzing the user's voice and using voiceprint recognition, the method can distinguish different user attributes and automatically switch to the response timbre mode preferred by the user, achieving convenient and intelligent voice switching.

Description

Timbre switching method and device
Cross-Reference to Related Applications
This application claims priority to Chinese patent application No. 202210322472.3, entitled "Timbre switching method and device" and filed on March 29, 2022, which is incorporated herein by reference in its entirety.
Technical Field
The present application relates to the field of artificial intelligence technology, and in particular to a timbre switching method.
Background
Users of different ages prefer different speech timbres.
In the existing timbre switching approach, the user has to open an application (APP) on a mobile terminal and switch the timbre manually.
However, an air conditioner is shared by several members of a family, and each person has to adjust the timbre before use, which is very cumbersome.
Summary
The present application provides a timbre switching method and device to overcome the shortcomings of timbre switching in the prior art and to achieve convenient and intelligent timbre switching.
The present application provides a timbre switching method, including:
receiving a target voice instruction;
performing voiceprint recognition on the target voice instruction to obtain a voiceprint recognition result;
setting a response timbre mode according to the voiceprint recognition result.
According to a timbre switching method provided by the present application, performing voiceprint recognition on the target voice instruction to obtain the voiceprint recognition result includes:
determining voiceprint features of the target voice instruction;
comparing the voiceprint features with the features of all entered voiceprints;
in a case where the object sending the target voice instruction is a target registered user, determining first age information in the registration information of the target registered user;
determining, according to the first age information, the user category of the target registered user as the voiceprint recognition result.
According to a timbre switching method provided by the present application, after comparing the voiceprint features with the features of all entered voiceprints, the method further includes:
in a case where the object sending the target voice instruction is not a registered user, performing age analysis on the voiceprint features to determine second age information of the object sending the target voice instruction;
determining, according to the second age information, the user category of the object sending the target voice instruction as the voiceprint recognition result.
According to a timbre switching method provided by the present application, before comparing the voiceprint features with the features of all entered voiceprints, the method further includes:
receiving a voiceprint entry instruction;
generating a voiceprint entry prompt according to the voiceprint entry instruction;
in a case where a voiceprint test voice sent by any user is received, determining the entered voiceprint of that user and extracting the features of the entered voiceprint;
generating an age entry prompt according to the features of the entered voiceprint of that user;
determining the registration information of that user according to the features of the entered voiceprint and the entered age of that user, and generating an entry completion prompt;
the entered age being input by that user in response to the age entry prompt.
According to a timbre switching method provided by the present application, setting a response timbre mode according to the voiceprint recognition result includes:
in a case where the user category is determined to be a child, setting the response timbre mode to a child timbre mode;
in a case where the user category is determined to be an adult, setting the response timbre mode to a default timbre mode;
in a case where the user category is determined to be an elderly person, setting the response timbre mode to an elderly timbre mode.
According to a timbre switching method provided by the present application, determining the voiceprint features of the target voice instruction includes:
pre-emphasizing the target voice instruction to determine a pre-emphasized voice instruction;
framing the pre-emphasized voice instruction to determine framed voice instructions;
windowing the framed voice instructions to obtain windowed voice instructions;
performing voiceprint extraction on the windowed voice instructions to obtain the voiceprint features of the target voice instruction.
The present application further provides a timbre switching device, including:
a receiving unit, which receives a target voice instruction;
an acquisition unit, which performs voiceprint recognition on the target voice instruction to obtain a voiceprint recognition result;
a determining unit, which sets a response timbre mode according to the voiceprint recognition result.
The present application further provides an electronic device, including a memory, a processor and a computer program stored in the memory and executable on the processor, where the processor, when executing the program, implements any one of the timbre switching methods described above.
The present application further provides a non-transitory computer-readable storage medium on which a computer program is stored; the computer program, when executed by a processor, implements any one of the timbre switching methods described above.
The present application further provides a computer program product including a computer program; the computer program, when executed by a processor, implements any one of the timbre switching methods described above.
With the timbre switching method and device provided by the present application, different user attributes can be distinguished by analyzing the user's voice with voiceprint recognition, and the response timbre is automatically switched to the one preferred by the user, achieving convenient and intelligent voice switching.
Brief Description of the Drawings
To describe the technical solutions of the present application or the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Figure 1 is the first schematic flowchart of the timbre switching method provided by the present application;
Figure 2 is the second schematic flowchart of the timbre switching method provided by the present application;
Figure 3 is a schematic structural diagram of the timbre switching device provided by the present application;
Figure 4 is a schematic structural diagram of the electronic device provided by the present application.
Detailed Description
To make the objectives, technical solutions and advantages of the present application clearer, the technical solutions of the present application are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some rather than all of the embodiments of the present application. Based on the embodiments of the present application, all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of the present application.
Existing voice-enabled networked appliances cannot satisfy the different timbre preferences of a whole family with a single appliance; the user has to switch the timbre manually, which is very troublesome.
The timbre switching method and device provided by the embodiments of the present application are described below with reference to Figures 1 to 4.
The execution subject of the timbre switching method provided by the embodiments of the present application may be an electronic device, or a software module, functional module or functional entity in the electronic device that can implement the timbre switching method; in the embodiments of the present application the electronic device includes, but is not limited to, smart air-conditioning equipment. It should be noted that the above execution subjects do not constitute a limitation on the present application.
Figure 1 is the first schematic flowchart of the timbre switching method provided by the present application. As shown in Figure 1, the method includes, but is not limited to, the following steps.
First, in step S1, a target voice instruction is received.
The target voice instruction sent by a user is received.
The user sending the target voice instruction may be a registered user whose voiceprint has been entered, or an unregistered user whose voiceprint has not been entered.
Further, in step S2, voiceprint recognition is performed on the target voice instruction to obtain a voiceprint recognition result.
After the target voice instruction is obtained, it is pre-processed by pre-emphasis, framing, windowing and the like, and the pre-processed target voice instruction is converted into a voiceprint feature map. The voiceprint feature map may be a Mel energy spectrogram. The Mel energy spectrogram is obtained by processing the spectrogram (a description of the human vocal system) with a Mel filter bank (which simulates the human cochlea), and is therefore a description of the human auditory system. The Mel energy spectrogram represents the frequency distribution of sounds that people can hear, which is the deep feature people use to distinguish things by sound; exploiting this distribution characteristic in the Mel frequency domain is better suited to building a speaker recognition system. Through such a conversion, the speech signal becomes an image carrying voiceprint information; for a single signal, its Mel energy spectrogram is black and white and can be understood as a single-channel feature map.
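As a purely illustrative sketch, and not part of the disclosed method, the following Python code shows one way such a single-channel log-Mel feature map could be computed; the use of librosa, the 16 kHz sampling rate, the 25 ms window, the 10 ms hop and the 64 Mel bands are assumptions of this sketch.

```python
# Illustrative only: build a single-channel log-Mel "voiceprint feature map"
# from one recorded voice command. Parameter values are assumed, not specified
# by this application.
import numpy as np
import librosa

def mel_feature_map(wav_path: str, sr: int = 16000, n_mels: int = 64) -> np.ndarray:
    """Return an (n_mels, frames) log-Mel energy map for one utterance."""
    y, sr = librosa.load(wav_path, sr=sr)            # mono waveform
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr,
        n_fft=400,        # 25 ms analysis window at 16 kHz
        hop_length=160,   # 10 ms frame shift
        n_mels=n_mels,    # Mel filter bank simulating the cochlea
    )
    return librosa.power_to_db(mel, ref=np.max)      # log compression

# feature_map = mel_feature_map("command.wav")       # one "black-and-white" channel
```

The resulting two-dimensional array plays the role of the single-channel image described above.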
The voiceprint feature map is input into a pre-trained age recognition neural network model to obtain the age information of the user who sent the target voice instruction. This achieves intelligent recognition of people and turns the user operating the air conditioner into the air conditioner actively serving the user, which is extremely convenient.
The age recognition neural network model has been trained with a large amount of sample data. The sample data include the age information of sample users and the voiceprint feature maps of the sample users, so the user's age information can be output once the user's voiceprint feature map is input.
According to the age information, the user category corresponding to the target voice instruction can be determined, and the user category is used as the voiceprint recognition result. The user categories may include children, adults and the elderly.
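A minimal PyTorch sketch of what such an age recognition network might look like is given below; the layer sizes, the adaptive pooling and the three output classes (child, adult, elderly) are assumptions made for illustration, since the application does not disclose a specific architecture.

```python
# Assumed architecture for illustration: a small CNN over the single-channel
# Mel feature map that outputs logits for three user categories.
import torch
import torch.nn as nn

class AgeRecognizer(nn.Module):
    def __init__(self, n_classes: int = 3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),      # copes with variable-length input
        )
        self.classifier = nn.Linear(32 * 4 * 4, n_classes)

    def forward(self, x):                      # x: (batch, 1, n_mels, frames)
        z = self.features(x).flatten(1)
        return self.classifier(z)              # logits over {child, adult, elderly}

# logits = AgeRecognizer()(torch.randn(1, 1, 64, 300))
# category = ["child", "adult", "elderly"][logits.argmax(dim=1).item()]
```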
Further, in step S3, a response timbre mode is set according to the voiceprint recognition result.
According to the user category in the voiceprint recognition result, the text-to-speech (TTS) broadcast timbre best suited to children or the elderly is derived.
With the timbre switching method provided by the present application, different user attributes can be distinguished by analyzing the user's voice with voiceprint recognition, and the response timbre is automatically switched to the one preferred by the user, achieving convenient and intelligent voice switching.
Optionally, determining the voiceprint features of the target voice instruction includes:
pre-emphasizing the target voice instruction to determine a pre-emphasized voice instruction;
framing the pre-emphasized voice instruction to determine framed voice instructions;
windowing the framed voice instructions to obtain windowed voice instructions;
performing voiceprint extraction on the windowed voice instructions to obtain the voiceprint features of the target voice instruction.
Because the average power spectrum of the speech signal is affected by glottal excitation and oral-nasal radiation, the high-frequency end attenuates at about 6 decibels per octave (dB/oct) above roughly 800 hertz (Hz); the higher the frequency, the smaller the corresponding component. The high-frequency part therefore has to be boosted before the speech signal is analyzed. A digital filter can be used to pre-emphasize the target voice instruction.
The voiceprint signal is divided into frames at intervals of 10 to 20 milliseconds (ms), with one frame as a basic unit, so as to frame the pre-emphasized voice instruction.
A Hamming window function is used to window the framed voice instructions.
Pre-emphasizing, framing and windowing the target voice instruction removes the effects on speech-signal quality of aliasing, higher-harmonic distortion, high-frequency components and other factors introduced by the human vocal organs themselves and by the equipment used to capture the speech signal. This keeps the signal obtained by subsequent speech processing as uniform and smooth as possible, provides high-quality parameters for signal parameter extraction, and improves the quality of speech processing.
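The pre-processing chain described above (pre-emphasis, framing at 10 to 20 ms intervals, Hamming windowing) can be sketched in NumPy as follows; the pre-emphasis coefficient of 0.97 and the 20 ms frame with 10 ms hop are assumed values, and the sketch assumes the signal is at least one frame long.

```python
# Illustrative NumPy sketch of pre-emphasis, framing and Hamming windowing.
import numpy as np

def preprocess(signal: np.ndarray, sr: int = 16000,
               pre_emphasis: float = 0.97,
               frame_ms: float = 20.0, hop_ms: float = 10.0) -> np.ndarray:
    # Pre-emphasis: first-order digital filter boosting the high-frequency end.
    emphasized = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])

    # Framing: split the signal into overlapping frames, one frame per basic unit.
    frame_len = int(sr * frame_ms / 1000)
    hop_len = int(sr * hop_ms / 1000)
    n_frames = 1 + (len(emphasized) - frame_len) // hop_len   # assumes len >= frame_len
    frames = np.stack([emphasized[i * hop_len: i * hop_len + frame_len]
                       for i in range(n_frames)])

    # Windowing: apply a Hamming window to every frame.
    return frames * np.hamming(frame_len)

# windowed = preprocess(waveform)   # shape: (n_frames, frame_len)
```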
Optionally, before comparing the voiceprint features with the features of all entered voiceprints, the method further includes:
receiving a voiceprint entry instruction;
generating a voiceprint entry prompt according to the voiceprint entry instruction;
in a case where a voiceprint test voice sent by any user is received, determining the entered voiceprint of that user and extracting the features of the entered voiceprint;
generating an age entry prompt according to the features of the entered voiceprint of that user;
determining the registration information of that user according to the features of the entered voiceprint and the entered age of that user, and generating an entry completion prompt;
the entered age being input by that user in response to the age entry prompt.
After receiving the voiceprint entry instruction, the smart air conditioner switches to a voiceprint entry mode and issues a voice prompt reminding the user to record the voiceprint test voice.
The user repeats the voiceprint test voice two or more times. After each utterance, the filter-bank (Fbank) feature information of that segment of test voice is extracted, and the voiceprint recognition model converts the Fbank feature information into the voiceprint features of that segment of speech. Finally, the voiceprint features obtained from the individual utterances are averaged and used as the features of the voiceprint entered by the user. The smart air conditioner then generates the age entry prompt; after receiving the entered age sent by the user, it stores the entered voiceprint and the entered age as the user's registration information, and the voice broadcast module announces that the entry is successful.
The voiceprint recognition model is a deep neural network model trained on thousands of hours of Chinese corpus, and it has strong noise resistance and robustness.
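The enrolment procedure just described can be sketched as follows; the `speaker_model.embed` interface, the 40-band Fbank settings and the dictionary layout of the registration record are assumptions of this sketch rather than the application's implementation.

```python
# Illustrative enrolment: Fbank features from two or more repetitions of the
# test phrase are turned into voiceprints by an assumed speaker model and averaged.
import numpy as np
import librosa

def fbank(y: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Log Mel filter-bank (Fbank) feature matrix for one utterance."""
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400,
                                         hop_length=160, n_mels=40)
    return librosa.power_to_db(mel).T            # (frames, 40)

def enrol(utterances, speaker_model, age: int) -> dict:
    """Average the per-utterance voiceprints and store them with the entered age."""
    embeddings = [speaker_model.embed(fbank(y)) for y in utterances]
    voiceprint = np.mean(embeddings, axis=0)      # averaged enrolment voiceprint
    return {"voiceprint": voiceprint, "age": age} # registration information
```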
Optionally, performing voiceprint recognition on the target voice instruction to obtain the voiceprint recognition result includes:
determining the voiceprint features of the target voice instruction;
comparing the voiceprint features with the features of all entered voiceprints;
in a case where the object sending the target voice instruction is a target registered user, determining first age information in the registration information of the target registered user;
determining, according to the first age information, the user category of the target registered user as the voiceprint recognition result.
The Fbank feature information of the target voice instruction is extracted and input into the voiceprint recognition model, and the output is the voiceprint feature of the target voice instruction. Similarity is computed between the voiceprint feature of the target voice instruction and the entered voiceprint features stored for all registered users. If the highest similarity obtained is above the set voiceprint threshold, the user whose entered voiceprint feature yields that highest similarity is determined to be the user who issued the target voice instruction; the age information can then be determined from that user's registration information, and the voiceprint recognition result of the target voice instruction is generated. If the highest similarity is below the set voiceprint threshold, it is determined that the object sending the target voice instruction is not a registered user.
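A hedged sketch of the comparison against all stored enrolment voiceprints follows; cosine similarity and the 0.75 threshold are assumptions, as the application does not specify the similarity measure or the threshold value.

```python
# Illustrative matching of a command's voiceprint against all enrolled voiceprints.
import numpy as np

def identify(query: np.ndarray, registered: dict, threshold: float = 0.75):
    """Return the best-matching user name, or None when the best score is below threshold."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    scores = {user: cosine(query, info["voiceprint"])
              for user, info in registered.items()}
    if not scores:
        return None                               # no registered users at all
    best_user = max(scores, key=scores.get)
    return best_user if scores[best_user] >= threshold else None
```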
Optionally, after comparing the voiceprint features with the features of all entered voiceprints, the method further includes:
in a case where the object sending the target voice instruction is not a registered user, performing age analysis on the voiceprint features to determine second age information of the object sending the target voice instruction;
determining, according to the second age information, the user category of the object sending the target voice instruction as the voiceprint recognition result.
Both registration-based voiceprint entry and direct recognition of the age attribute of unregistered voiceprints support automatic recognition of the user's role and automatic switching of the timbre.
The voiceprint feature map is input into the pre-trained age recognition neural network model to obtain the age information of the user sending the target voice instruction, and the voiceprint recognition result of the target voice instruction is generated.
Optionally, setting a response timbre mode according to the voiceprint recognition result includes:
in a case where the user category is determined to be a child, setting the response timbre mode to a child timbre mode;
in a case where the user category is determined to be an adult, setting the response timbre mode to a default timbre mode;
in a case where the user category is determined to be an elderly person, setting the response timbre mode to an elderly timbre mode.
Children prefer a livelier, cuter child timbre; the elderly, whose hearing has declined, prefer an age-adapted timbre that is slower, clearer and louder. The child timbre mode and the elderly timbre mode are personalized timbres customized for the characteristics of children and the elderly respectively.
When the response timbre mode is the child timbre mode, the child timbre is used for voice interaction and responses; when the response timbre mode is the default timbre mode, the timbre remains unchanged; when the response timbre mode is the elderly timbre mode, the elderly timbre is used for voice interaction and responses.
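For illustration only, the selection of the response timbre mode from the recognised user category could be organised as in the sketch below; the mode names, speaking rates and volume gains are invented placeholders, since the application leaves the concrete TTS parameters to the implementation.

```python
# Assumed mapping from user category to a response timbre mode; the elderly
# mode is slower and louder, mirroring the preferences described above.
from dataclasses import dataclass

@dataclass
class TimbreMode:
    name: str
    speaking_rate: float   # 1.0 = default speed
    volume_gain_db: float

TIMBRE_MODES = {
    "child":   TimbreMode("child_timbre",   speaking_rate=1.0, volume_gain_db=0.0),
    "adult":   TimbreMode("default_timbre", speaking_rate=1.0, volume_gain_db=0.0),
    "elderly": TimbreMode("elderly_timbre", speaking_rate=0.8, volume_gain_db=6.0),
}

def set_response_timbre(user_category: str) -> TimbreMode:
    """Select the TTS timbre mode from the voiceprint recognition result."""
    return TIMBRE_MODES.get(user_category, TIMBRE_MODES["adult"])
```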
Figure 2 is the second schematic flowchart of the timbre switching method provided by the present application. As shown in Figure 2, the method includes the following.
First, through voice interaction, the target voice instruction sent by the user is obtained.
Further, voiceprint recognition is performed on the target voice instruction. When the voiceprint recognition result shows that the user is a child, the timbre is automatically switched to the child timbre; when the voiceprint recognition result shows that the user is an adult, the timbre remains unchanged; when the voiceprint recognition result shows that the user is an elderly person, the timbre is automatically switched to the elderly timbre.
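The decision flow of Figure 2 can be condensed into the following self-contained sketch; the age thresholds used to map an age to the child, adult and elderly categories are assumptions, as no cut-off values are given in this application.

```python
# Illustrative decision flow: registered users use the stored (first) age
# information, unregistered users use the model's (second) age estimate.
from typing import Optional

def category_from_age(age: int) -> str:
    if age < 14:           # assumed child threshold
        return "child"
    if age >= 60:          # assumed elderly threshold
        return "elderly"
    return "adult"

def choose_timbre(registered_age: Optional[int], predicted_age: Optional[int]) -> str:
    age = registered_age if registered_age is not None else predicted_age
    if age is None:
        return "default_timbre"                 # fall back when no age is available
    return {"child": "child_timbre",
            "adult": "default_timbre",
            "elderly": "elderly_timbre"}[category_from_age(age)]

# choose_timbre(registered_age=8,  predicted_age=None)   -> "child_timbre"
# choose_timbre(registered_age=None, predicted_age=67)   -> "elderly_timbre"
```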
The timbre switching device provided by the present application is described below; the timbre switching device described below and the timbre switching method described above may be referred to correspondingly.
Figure 3 is a schematic structural diagram of the timbre switching device provided by the present application. As shown in Figure 3, the device includes:
a receiving unit 301, which receives a target voice instruction;
an acquisition unit 302, which performs voiceprint recognition on the target voice instruction to obtain a voiceprint recognition result;
a determining unit 303, which sets a response timbre mode according to the voiceprint recognition result.
First, the receiving unit 301 receives the target voice instruction.
The target voice instruction sent by a user is received.
The user sending the target voice instruction may be a registered user whose voiceprint has been entered, or an unregistered user whose voiceprint has not been entered.
Further, the acquisition unit 302 performs voiceprint recognition on the target voice instruction and obtains the voiceprint recognition result.
After the target voice instruction is obtained, it is pre-processed by pre-emphasis, framing, windowing and the like, and the pre-processed target voice instruction is converted into a voiceprint feature map. The voiceprint feature map may be a Mel energy spectrogram. The Mel energy spectrogram is obtained by processing the spectrogram (a description of the human vocal system) with a Mel filter bank (which simulates the human cochlea), and is therefore a description of the human auditory system. The Mel energy spectrogram represents the frequency distribution of sounds that people can hear, which is the deep feature people use to distinguish things by sound; exploiting this distribution characteristic in the Mel frequency domain is better suited to building a speaker recognition system. Through such a conversion, the speech signal becomes an image carrying voiceprint information; for a single signal, its Mel energy spectrogram is black and white and can be understood as a single-channel feature map.
The voiceprint feature map is input into the pre-trained age recognition neural network model to obtain the age information of the target user. This achieves intelligent recognition of people and turns the user operating the air conditioner into the air conditioner actively serving the user, which is extremely convenient.
The age recognition neural network model has been trained with a large amount of sample data. The sample data include the voice signals of sample users and the voiceprint feature maps of the sample users, so the user's age information can be output once the user's voiceprint feature map is input.
According to the age information, the user category corresponding to the target voice instruction can be determined, and the user category is used as the voiceprint recognition result. The user categories may include children, adults and the elderly.
Further, the determining unit 303 sets the response timbre mode according to the voiceprint recognition result.
According to the user category in the voiceprint recognition result, the text-to-speech (TTS) broadcast timbre best suited to children or the elderly is derived.
With the timbre switching device provided by the present application, different user attributes can be distinguished by analyzing the user's voice with voiceprint recognition, and the response timbre is automatically switched to the one preferred by the user, achieving convenient and intelligent voice switching.
Figure 4 is a schematic structural diagram of the electronic device provided by the present application. As shown in Figure 4, the electronic device may include a processor 410, a communications interface 420, a memory 430 and a communication bus 440, where the processor 410, the communications interface 420 and the memory 430 communicate with each other through the communication bus 440. The processor 410 can call logical instructions in the memory 430 to execute the timbre switching method, which includes: receiving a target voice instruction; performing voiceprint recognition on the target voice instruction to obtain a voiceprint recognition result; and setting a response timbre mode according to the voiceprint recognition result.
In addition, the above logical instructions in the memory 430 may be implemented in the form of software functional units and, when sold or used as an independent product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part that contributes to the prior art, or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.
In another aspect, the present application further provides a computer program product. The computer program product includes a computer program, which can be stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, the computer can execute the timbre switching method provided by the above methods, which includes: receiving a target voice instruction; performing voiceprint recognition on the target voice instruction to obtain a voiceprint recognition result; and setting a response timbre mode according to the voiceprint recognition result.
In yet another aspect, the present application further provides a non-transitory computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements the timbre switching method provided by the above methods, which includes: receiving a target voice instruction; performing voiceprint recognition on the target voice instruction to obtain a voiceprint recognition result; and setting a response timbre mode according to the voiceprint recognition result.
The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. A person of ordinary skill in the art can understand and implement it without creative effort.
From the description of the above embodiments, a person skilled in the art can clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly also by hardware. Based on this understanding, the above technical solution, in essence, or the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk or an optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) to execute the methods described in the embodiments or in certain parts of the embodiments.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present application and are not intended to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some of the technical features therein can be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (10)

  1. A timbre switching method, comprising:
    receiving a target voice instruction;
    performing voiceprint recognition on the target voice instruction to obtain a voiceprint recognition result;
    setting a response timbre mode according to the voiceprint recognition result.
  2. The timbre switching method according to claim 1, wherein performing voiceprint recognition on the target voice instruction to obtain the voiceprint recognition result comprises:
    determining voiceprint features of the target voice instruction;
    comparing the voiceprint features with features of all entered voiceprints;
    in a case where an object sending the target voice instruction is a target registered user, determining first age information in registration information of the target registered user;
    determining, according to the first age information, a user category of the target registered user as the voiceprint recognition result.
  3. The timbre switching method according to claim 2, wherein after comparing the voiceprint features with the features of all entered voiceprints, the method further comprises:
    in a case where the object sending the target voice instruction is not a registered user, performing age analysis on the voiceprint features to determine second age information of the object sending the target voice instruction;
    determining, according to the second age information, a user category of the object sending the target voice instruction as the voiceprint recognition result.
  4. The timbre switching method according to claim 2, wherein before comparing the voiceprint features with the features of all entered voiceprints, the method further comprises:
    receiving a voiceprint entry instruction;
    generating a voiceprint entry prompt according to the voiceprint entry instruction;
    in a case where a voiceprint test voice sent by any user is received, determining an entered voiceprint of the user and extracting features of the entered voiceprint;
    generating an age entry prompt according to the features of the entered voiceprint of the user;
    determining registration information of the user according to the features of the entered voiceprint and an entered age of the user, and generating an entry completion prompt;
    wherein the entered age is input by the user in response to the age entry prompt.
  5. The timbre switching method according to claim 2 or 3, wherein setting a response timbre mode according to the voiceprint recognition result comprises:
    in a case where the user category is determined to be a child, setting the response timbre mode to a child timbre mode;
    in a case where the user category is determined to be an adult, setting the response timbre mode to a default timbre mode;
    in a case where the user category is determined to be an elderly person, setting the response timbre mode to an elderly timbre mode.
  6. The timbre switching method according to claim 2, wherein determining the voiceprint features of the target voice instruction comprises:
    pre-emphasizing the target voice instruction to determine a pre-emphasized voice instruction;
    framing the pre-emphasized voice instruction to determine framed voice instructions;
    windowing the framed voice instructions to obtain windowed voice instructions;
    performing voiceprint extraction on the windowed voice instructions to obtain the voiceprint features of the target voice instruction.
  7. A timbre switching device, comprising:
    a receiving unit, which receives a target voice instruction;
    an acquisition unit, which performs voiceprint recognition on the target voice instruction to obtain a voiceprint recognition result;
    a determining unit, which sets a response timbre mode according to the voiceprint recognition result.
  8. An electronic device, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the timbre switching method according to any one of claims 1 to 6.
  9. A non-transitory computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the timbre switching method according to any one of claims 1 to 6.
  10. A computer program product, comprising a computer program, wherein the computer program, when executed by a processor, implements the timbre switching method according to any one of claims 1 to 6.
PCT/CN2022/132585 2022-03-29 2022-11-17 Timbre switching method and device WO2023185004A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210322472.3A CN114708875A (zh) 2022-03-29 2022-03-29 Timbre switching method and device
CN202210322472.3 2022-03-29

Publications (1)

Publication Number Publication Date
WO2023185004A1 true WO2023185004A1 (zh) 2023-10-05

Family

ID=82170565

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/132585 WO2023185004A1 (zh) 2022-03-29 2022-11-17 Timbre switching method and device

Country Status (2)

Country Link
CN (1) CN114708875A (zh)
WO (1) WO2023185004A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114708875A (zh) * 2022-03-29 2022-07-05 青岛海尔空调器有限总公司 一种音色切换方法及装置

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014024751A1 (ja) * 2012-08-10 2014-02-13 エイディシーテクノロジー株式会社 Voice response device
CN109272984A (zh) * 2018-10-17 2019-01-25 百度在线网络技术(北京)有限公司 Method and apparatus for voice interaction
CN110336723A (zh) * 2019-07-23 2019-10-15 珠海格力电器股份有限公司 Control method and device for smart household appliances, and smart household appliance equipment
CN111599367A (zh) * 2020-05-18 2020-08-28 珠海格力电器股份有限公司 Control method, apparatus, device and medium for smart home devices
CN112185344A (zh) * 2020-09-27 2021-01-05 北京捷通华声科技股份有限公司 Voice interaction method and apparatus, computer-readable storage medium and processor
CN114141247A (zh) * 2021-11-18 2022-03-04 青岛海尔科技有限公司 Device control method and apparatus, storage medium and electronic apparatus
CN114708875A (zh) * 2022-03-29 2022-07-05 青岛海尔空调器有限总公司 Timbre switching method and device

Also Published As

Publication number Publication date
CN114708875A (zh) 2022-07-05


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22934822

Country of ref document: EP

Kind code of ref document: A1