WO2021082572A1 - Wake-up model generation method, smart terminal wake-up method and device - Google Patents

Wake-up model generation method, smart terminal wake-up method and device Download PDF

Info

Publication number
WO2021082572A1
WO2021082572A1 (PCT Application No. PCT/CN2020/105998)
Authority
WO
WIPO (PCT)
Prior art keywords
audio
wake
word
frame
label
Prior art date
Application number
PCT/CN2020/105998
Other languages
English (en)
French (fr)
Inventor
白二伟
倪合强
宋志
姚寿柏
Original Assignee
苏宁云计算有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏宁云计算有限公司
Priority to CA3158930A1 (en)
Publication of WO2021082572A1 (zh)

Links

Images

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/225: Feedback of the input speech
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/18: Speech or voice analysis techniques in which the extracted parameters are spectral information of each sub-band
    • G10L25/27: Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques using neural networks
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00: Reducing energy consumption in communication networks
    • Y02D30/70: Reducing energy consumption in wireless communication networks

Definitions

  • The present invention relates to the technical field of data security, and in particular to a method for generating a wake-up model and a method and device for waking up a smart terminal.
  • Voice wake-up has a wide range of applications, for example in robots, mobile phones, wearable devices, smart homes, and vehicles.
  • Different smart terminals have different wake-up words.
  • When a user speaks the specific wake-up word, the smart terminal can switch from the standby state to the working state. Only when this state switch is completed quickly and accurately can the user use the terminal's other functions almost without perceiving a delay; improving the wake-up effect is therefore very important.
  • In the prior art, smart terminals are mainly woken using neural-network-based wake-up technology.
  • In the data preparation stage, positive sample data must be manually trimmed to a fixed duration t, and the recorded wake-up words cannot exceed this duration t; this greatly increases labor cost, and wake-up speech with a slower speaking rate cannot be recognized.
  • In the terminal wake-up stage, the neural network has to process the audio of duration t in terminal memory every time, so a large amount of repeated data has to be processed between two adjacent windows of length t, which increases the terminal's computation time and power consumption.
  • The present invention aims to solve at least one of the technical problems existing in the prior art or related technologies. To this end, the present invention provides a wake-up model generation method, a smart terminal wake-up method, and corresponding devices.
  • In a first aspect, a method for generating a wake-up model includes:
  • training a recurrent neural network with the multiple audio training samples to generate a wake-up model.
  • Labeling the start and end times of each wake-up word contained in the wake-up word audio of the sample audio set to obtain the labeled wake-up word audio includes:
  • the start and end times of each wake-up word are labeled respectively, to obtain the labeled audio.
  • Adding noise to the labeled wake-up word audio using negative sample audio containing background noise to obtain positive sample audio includes:
  • the amplitude mean of the negative sample audio segment is adjusted, and the adjusted negative sample audio segment is mixed with the labeled audio to add noise, to obtain the positive sample audio.
  • The frame labels include a positive label, a negative label, and an intermediate label.
  • Labeling the positive sample audio and the negative sample audio with frame labels to obtain multiple audio training samples includes:
  • the audio frame is marked with the positive label; otherwise, the audio frame is marked with the negative label;
  • each audio frame of the negative sample audio is marked with the negative label.
  • In a second aspect, a smart terminal wake-up method includes:
  • the smart terminal acquires real-time audio at the current moment;
  • the wake-up model is generated using the wake-up model generation method described in the first aspect.
  • In a third aspect, a device for generating a wake-up model includes:
  • a first labeling module, used to label the start and end times of each wake-up word contained in the wake-up word audio of the sample audio set to obtain labeled wake-up word audio, wherein the duration of the wake-up word audio is not fixed;
  • a noise-adding module, configured to add noise to the labeled wake-up word audio using negative sample audio containing background noise, to obtain positive sample audio;
  • a feature extraction module, configured to extract multiple audio frame features from the positive sample audio and the negative sample audio respectively;
  • a second labeling module, configured to label the positive sample audio and the negative sample audio with frame labels, to obtain multiple audio training samples;
  • a model generation module, used to train a recurrent neural network with the multiple audio training samples to generate a wake-up model.
  • The first labeling module is specifically configured to:
  • label the start and end times of each wake-up word respectively, to obtain the labeled audio.
  • The noise-adding module is specifically configured to:
  • adjust the amplitude mean of the negative sample audio segment, and mix the adjusted negative sample audio segment with the labeled audio to add noise, to obtain the positive sample audio.
  • The frame labels include a positive label, a negative label, and an intermediate label.
  • The second labeling module is specifically configured to:
  • mark the audio frame with the positive label; otherwise, mark the audio frame with the negative label;
  • mark each audio frame of the negative sample audio with the negative label.
  • In a fourth aspect, a device for waking up a smart terminal includes:
  • an audio acquisition module, used for the smart terminal to acquire real-time audio at the current moment;
  • a feature extraction module, for extracting multiple audio frame features from the real-time audio;
  • a model recognition module, used to input the extracted audio frame features one by one into the pre-deployed wake-up model and compute in combination with the state saved by the wake-up model at the previous moment, to obtain a wake-up result indicating whether the real-time audio contains the wake-up word;
  • the wake-up model is generated using the wake-up model generation method described in the first aspect.
  • Since the sample audio set can contain long audio, the RNN can be trained without interruption, improving the recognition accuracy of the wake-up word, which helps improve the wake-up effect of the smart terminal;
  • FIG. 1 shows a schematic flowchart of a method for generating a wake-up model according to an embodiment of the present invention
  • FIG. 2 shows a schematic diagram of marking the start and end times of a wake-up word provided by an embodiment of the present invention
  • FIG. 3 shows a schematic diagram of MFCC feature vector acquisition provided by an embodiment of the present invention
  • FIG. 4 shows a schematic diagram of labeling frame labels provided by an embodiment of the present invention
  • FIG. 5 shows a schematic flowchart of a method for waking up a smart terminal according to an embodiment of the present invention;
  • FIG. 6a shows a schematic diagram of the wake-up process in terminal memory at t=1 according to an embodiment of the present invention;
  • FIG. 6b shows a schematic diagram of the wake-up process in terminal memory at t=M according to an embodiment of the present invention;
  • FIG. 7 shows a schematic structural diagram of an apparatus for generating a wake-up model according to an embodiment of the present invention;
  • FIG. 8 shows a schematic structural diagram of a device for waking up an intelligent terminal according to an embodiment of the present invention.
  • the embodiment of the present invention provides a method for generating a wake-up model.
  • the method can be applied to a server. As shown in FIG. 1, the method may include the steps:
  • the sample audio set contains multiple wake-up word audios, and each wake-up word audio includes at least one wake-up word.
  • When recording one wake-up word audio, a certain time interval must be kept between adjacent wake-up words, and every wake-up word has the same content, such as "小biu小biu".
  • the time length of each wake-up word audio is approximately a few seconds to several minutes, and the time length of the wake-up word is approximately 1 second.
  • identifying at least one key audio segment in the wake-up word audio that only includes the wake-up word and labeling the start and end time of each wake-up word according to the respective start and end times of each key audio segment, to obtain the labeled audio.
  • the start and end time of each wake-up word in the wake-up word audio can be marked manually on the server to obtain the labeled wake-up word audio.
  • the start and end time includes the start time and the end time.
  • the start time and end time of the wake word are marked.
  • start_N and end_N can be used as the start time and end time of the N-th wake-up word, respectively, as shown in FIG. 2.
  • FIG. 2 shows a schematic diagram of marking the start and end times of wake-up words provided by an embodiment of the present invention, in which the black parts represent wake-up words.
  • background noise in different scenes can be pre-recorded to obtain negative sample audio, where different scenes can be various scenes, for example, scenes when playing TV, scenes when cooking, or other scenes.
  • the negative sample audio segment with the same duration as the labeled wake-up word audio is intercepted from the negative sample audio, the amplitude average of the negative sample audio segment is adjusted, and the adjusted negative sample audio segment is used to mix and add the labeled audio. Noise, get positive sample audio.
  • Specifically, the amplitude mean of the negative sample audio segment can first be adjusted to equal the amplitude mean of the labeled audio, and then reduced to a preset percentage of that mean, where the preset percentage can be between 5% and 10%.
  • N negative sample audio may be used to add noise to each of the M wake word audios to obtain N*M positive sample audios.
  • the process may include:
  • The audio frame feature can specifically be the Mel-frequency cepstral coefficient (MFCC) feature, and the feature spectrogram is the MFCC spectrogram, i.e. the Mel cepstrogram; each feature vector in it represents the MFCC feature vector of one audio frame.
  • Fig. 3 shows a schematic diagram of MFCC feature vector acquisition provided by an embodiment of the present invention.
  • For each positive sample audio and each negative sample audio, the MFCC features can be computed with the preset window width W, moving step S, and number of Mel-frequency cepstral coefficients C_Mel, to generate the Mel cepstrogram.
  • the positive sample audio and the negative sample audio are labeled with frame labels, where the frame label includes a positive label, a negative label, and an intermediate label.
  • the process may include:
  • the positive label, the negative label, and the middle label may be represented as "Positive”, “Negative”, “Middle” or "1", “-1", “0”, respectively.
  • Fig. 4 shows a schematic diagram of labeling a frame label provided by an embodiment of the present invention.
  • Assume the start time of the window is denoted t and the window width is w.
  • If an audio frame falls outside the start-end period of every wake-up word, the audio frame is marked "Negative", that is: (end_{N-1} < t) && (t+w < start_N); if part or all of the audio frame falls within the start-end period of any wake-up word, the audio frame is marked "Middle", that is: (start_N < t+w) && (t < end_N); if the previous audio frame falls within the start-end period of a wake-up word and the current audio frame is the first that no longer contains the end time of the wake-up word, that is: (end_N ≤ t) && (t-1 < end_N), then the audio frame is marked "Positive".
  • each audio frame of the negative sample audio is marked as "Negative".
  • For the N-th audio frame of each audio training sample, the frame feature of the audio frame is used as the input of the input layer of the recurrent neural network at time t, and the frame label of the audio frame is used as the output of the output layer at time t.
  • Combined with the hidden-layer state value S_{t-1} at the previous moment, the hidden-layer state value S_t of the recurrent neural network at time t is computed; the state values of the hidden layer at each moment are computed in turn, and the wake-up model is generated.
  • the wake-up model can be deployed on the smart terminal, so as to use the wake-up model to perform wake-up processing on the smart terminal.
  • The embodiment of the present invention provides a method for generating a wake-up model. Since the duration of the wake-up word audio is not fixed, the wake-up word audio is used as variable-length input data to train the recurrent neural network (RNN), which avoids manually trimming data and saves labor cost, and also makes slow-speech data recognizable; at the same time, since the sample audio set can contain long audio, the RNN can be trained without interruption, improving the recognition accuracy of wake-up words and helping improve the wake-up effect of smart terminals.
  • The embodiment of the present invention provides a smart terminal wake-up method, which can be applied to a smart terminal pre-deployed with a wake-up model generated by the wake-up model generation method in the first embodiment. As shown in FIG. 5,
  • the method may include the steps:
  • the smart terminal obtains real-time audio at the current moment.
  • the smart terminal can use a microphone to collect real-time audio at the current moment in the scene.
  • smart terminals include, but are not limited to, robots, smart phones, wearable devices, smart homes, and vehicle-mounted terminals.
  • The Mel-frequency cepstral coefficient features are extracted from each audio frame of the real-time audio to obtain multiple audio frame features.
  • the method provided in the embodiment of the present invention may further include:
  • preprocessing the real-time audio at the current moment, where the preprocessing includes but is not limited to echo cancellation and noise reduction.
  • Specifically, each audio frame feature is input into the wake-up model in turn and computed in combination with the state saved by the wake-up model at the previous moment; from the output of the wake-up model, the frame labels corresponding to the multiple audio frames of the current real-time audio and the current state of the wake-up model are obtained, and the current state of the wake-up model is saved.
  • The wake-up result indicating whether the real-time audio contains the wake-up word is obtained from the frame labels corresponding to the multiple audio frames; when these frame labels include a positive label, it is determined that the real-time audio contains the wake-up word.
  • Suppose the memory of the smart terminal can store only N frames of data at a time.
  • Since most current smart terminals use low-end chips, the capacity of terminal memory is limited.
  • In the prior art, the neural network has to process the audio of duration t in terminal memory every time, so a large amount of repeated data has to be processed between two adjacent windows of length t, which increases the terminal's computation time and power consumption.
  • The present invention uses the variable-length-input RNN wake-up model to determine whether the real-time audio contains the wake-up word without recomputing old data, thereby reducing the amount of computation, speeding up processing, and lowering power consumption.
  • an embodiment of the present invention provides a wake-up model generation device. As shown in FIG. 7, the device includes:
  • the first labeling module 71 is configured to label the start and end times of each wake-up word included in the wake-up word audio in the sample audio set to obtain the labeled wake-up word audio, wherein the time length of the wake-up word audio is not fixed;
  • the noise adding processing module 72 is configured to use negative sample audio containing background noise to add noise to the labeled wake-up word audio to obtain positive sample audio;
  • the feature extraction module 73 is used to extract multiple audio frame features from the positive sample audio and the negative sample audio respectively;
  • the second labeling module 74 is configured to label the positive sample audio and the negative sample audio with frame labels to obtain multiple audio training samples;
  • the model generation module 75 is used to train the recurrent neural network with multiple audio training samples to generate a wake-up model.
  • the first labeling module 71 is specifically configured to:
  • the start and end times of each wake-up word are marked to obtain the marked audio.
  • noise adding processing module 72 is specifically used for:
  • Adjust the amplitude mean value of the negative sample audio segment and use the adjusted negative sample audio segment to mix and add noise to the labeled audio to obtain the positive sample audio.
  • the frame label includes a positive label, a negative label, and an intermediate label
  • the second labeling module 74 is specifically configured to:
  • the audio frame is marked as a negative label.
  • The wake-up model generation device provided by the embodiment of the present invention belongs to the same inventive concept as the wake-up model generation method provided in the first embodiment of the present invention.
  • It can execute the wake-up model generation method provided in any embodiment of the present invention, and has the functional modules and beneficial effects corresponding to executing that method.
  • For technical details not described in detail in the embodiment of the present invention, refer to the wake-up model generation method provided in the embodiments of the present invention; they are not repeated here.
  • an embodiment of the present invention provides a smart terminal wake-up device. As shown in FIG. 8, the device includes:
  • the audio acquisition module 81 is used for the smart terminal to acquire the real-time audio at the current moment;
  • the feature extraction module 82 is used to extract multiple audio frame features from real-time audio
  • the model recognition module 83 is used to input the extracted audio frame features one by one into the pre-deployed wake-up model and compute in combination with the state saved by the wake-up model at the previous moment, to obtain a wake-up result indicating whether the real-time audio contains the wake-up word;
  • the wake-up model is generated by using the wake-up model generation method in the first embodiment.
  • the device may further include:
  • the preprocessing module is used to preprocess the real-time audio at the current moment, where the preprocessing includes but is not limited to echo cancellation and noise reduction processing.
  • the feature extraction module 82 is also used to extract multiple audio frame features from the preprocessed real-time audio.
  • The smart terminal wake-up device provided in the embodiment of the present invention belongs to the same inventive concept as the smart terminal wake-up method provided in the second embodiment of the present invention. It can execute the smart terminal wake-up method provided in any embodiment of the present invention, and has the functional modules and beneficial effects corresponding to executing that method. For technical details not described in detail in the embodiment of the present invention, refer to the smart terminal wake-up method provided in the embodiments of the present invention; they are not repeated here.
  • another embodiment of the present invention also provides a computer device, including:
  • one or more processors;
  • a memory; and a program stored in the memory which, when executed by the one or more processors, causes the processors to execute the steps of the wake-up model generation method described in the foregoing embodiment.
  • another embodiment of the present invention also provides a computer device, including:
  • one or more processors;
  • a memory; and a program stored in the memory which, when executed by the one or more processors, causes the processors to execute the steps of the smart terminal wake-up method described in the foregoing embodiment.
  • Another embodiment of the present invention also provides a computer-readable storage medium.
  • The computer-readable storage medium stores a program.
  • When the program is executed by a processor, the processor executes the steps of the wake-up model generation method described in the above embodiment.
  • Another embodiment of the present invention also provides a computer-readable storage medium.
  • The computer-readable storage medium stores a program.
  • When the program is executed by a processor, the processor executes the steps of the smart terminal wake-up method described in the above embodiment.
  • Those skilled in the art should understand that the embodiments of the present invention may be provided as methods, devices, or computer program products. Therefore, the embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the embodiments of the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
  • These computer program instructions can be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions can also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to work in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing device, so that a series of operating steps are executed on the computer or other programmable device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Electrically Operated Instructional Devices (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)
  • Telephone Function (AREA)

Abstract

A wake-up model generation method, a smart terminal wake-up method, and corresponding devices, belonging to the technical field of voice wake-up. The wake-up model generation method includes: labeling the start and end times of each wake-up word contained in the wake-up word audio of a sample audio set to obtain labeled wake-up word audio, wherein the duration of the wake-up word audio is not fixed (101); adding noise to the labeled wake-up word audio using negative sample audio containing background noise to obtain positive sample audio (102); extracting multiple audio frame features from the positive sample audio and the negative sample audio respectively, and labeling the positive sample audio and the negative sample audio with frame labels to obtain multiple audio training samples (103); and training a recurrent neural network with the multiple audio training samples to generate a wake-up model (104). By training the model with a variable-length-input recurrent neural network, the embodiments avoid manually trimming samples and help improve the wake-up effect of smart terminals.

Description

Wake-up model generation method, smart terminal wake-up method and device
Technical Field
The present invention relates to the technical field of data security, and in particular to a wake-up model generation method, a smart terminal wake-up method, and corresponding devices.
Background Art
At present, voice wake-up is used in a wide range of applications, for example robots, mobile phones, wearable devices, smart homes, and vehicles. Different smart terminals have different wake-up words. When a user speaks the specific wake-up word, the smart terminal can switch from the standby state to the working state; only when this state switch is completed quickly and accurately can the user go on to use the terminal's other functions almost without perceiving a delay, so improving the wake-up effect is essential.
In the prior art, smart terminals are mainly woken using neural-network-based wake-up technology. In the data preparation stage, positive sample data must be manually trimmed to a fixed duration t, and the recorded wake-up words may not exceed this duration t, which greatly increases labor cost and makes wake-up speech with a slower speaking rate unrecognizable. In addition, because the duration of a wake-up word may be short, the neural network may be insufficiently trained, which ultimately degrades the wake-up performance of the smart terminal. Furthermore, in the terminal wake-up stage, the neural network has to process the audio of duration t in terminal memory every time, so a large amount of repeated data has to be processed between two adjacent windows of length t, which increases the terminal's computation time and power consumption.
Summary of the Invention
The present invention aims to solve at least one of the technical problems existing in the prior art or related technologies. To this end, the present invention provides a wake-up model generation method, a smart terminal wake-up method, and corresponding devices.
The specific technical solutions provided by the embodiments of the present invention are as follows:
In a first aspect, a wake-up model generation method is provided, the method comprising:
labeling the start and end times of each wake-up word contained in the wake-up word audio of a sample audio set, to obtain labeled wake-up word audio, wherein the duration of the wake-up word audio is not fixed;
adding noise to the labeled wake-up word audio using negative sample audio containing background noise, to obtain positive sample audio;
extracting multiple audio frame features from the positive sample audio and the negative sample audio respectively, and labeling the positive sample audio and the negative sample audio with frame labels, to obtain multiple audio training samples;
training a recurrent neural network with the multiple audio training samples to generate a wake-up model.
Further, labeling the start and end times of each wake-up word contained in the wake-up word audio of the sample audio set to obtain the labeled wake-up word audio comprises:
identifying at least one key audio segment in the wake-up word audio that contains only the wake-up word;
labeling the start and end times of each wake-up word according to the respective start and end times of each key audio segment, to obtain the labeled audio.
Further, adding noise to the labeled wake-up word audio using negative sample audio containing background noise to obtain positive sample audio comprises:
intercepting from the negative sample audio a negative sample audio segment with the same duration as the labeled wake-up word audio;
adjusting the amplitude mean of the negative sample audio segment, and mixing the adjusted negative sample audio segment with the labeled audio to add noise, to obtain the positive sample audio.
Further, the frame labels include a positive label, a negative label, and an intermediate label, and labeling the positive sample audio and the negative sample audio with frame labels to obtain multiple audio training samples comprises:
for each audio frame of the positive sample audio, judging whether part or all of the audio frame falls within the start-end period of any wake-up word, and if so, marking the audio frame with the intermediate label;
if not, judging whether the previous audio frame falls within the start-end period of any wake-up word and the current audio frame is the first that no longer contains the end time of the wake-up word; if so, marking the audio frame with the positive label, otherwise marking it with the negative label;
for each audio frame of the negative sample audio, marking the audio frame with the negative label.
In a second aspect, a smart terminal wake-up method is provided, the method comprising:
acquiring, by the smart terminal, real-time audio at the current moment;
extracting multiple audio frame features from the real-time audio;
inputting the extracted audio frame features one by one into a pre-deployed wake-up model, and computing in combination with the state saved by the wake-up model at the previous moment, to obtain a wake-up result indicating whether the real-time audio contains the wake-up word;
wherein the wake-up model is generated by the wake-up model generation method described in the first aspect.
In a third aspect, a wake-up model generation device is provided, the device comprising:
a first labeling module, configured to label the start and end times of each wake-up word contained in the wake-up word audio of a sample audio set to obtain labeled wake-up word audio, wherein the duration of the wake-up word audio is not fixed;
a noise-adding module, configured to add noise to the labeled wake-up word audio using negative sample audio containing background noise, to obtain positive sample audio;
a feature extraction module, configured to extract multiple audio frame features from the positive sample audio and the negative sample audio respectively;
a second labeling module, configured to label the positive sample audio and the negative sample audio with frame labels, to obtain multiple audio training samples;
a model generation module, configured to train a recurrent neural network with the multiple audio training samples to generate a wake-up model.
Further, the first labeling module is specifically configured to:
identify at least one key audio segment in the wake-up word audio that contains only the wake-up word;
label the start and end times of each wake-up word according to the respective start and end times of each key audio segment, to obtain the labeled audio.
Further, the noise-adding module is specifically configured to:
intercept from the negative sample audio a negative sample audio segment with the same duration as the labeled wake-up word audio;
adjust the amplitude mean of the negative sample audio segment, and mix the adjusted negative sample audio segment with the labeled audio to add noise, to obtain the positive sample audio.
Further, the frame labels include a positive label, a negative label, and an intermediate label, and the second labeling module is specifically configured to:
for each audio frame of the positive sample audio, judge whether part or all of the audio frame falls within the start-end period of any wake-up word, and if so, mark the audio frame with the intermediate label;
if not, judge whether the previous audio frame falls within the start-end period of any wake-up word and the current audio frame is the first that no longer contains the end time of the wake-up word; if so, mark the audio frame with the positive label, otherwise mark it with the negative label;
for each audio frame of the negative sample audio, mark the audio frame with the negative label.
In a fourth aspect, a smart terminal wake-up device is provided, the device comprising:
an audio acquisition module, configured for the smart terminal to acquire real-time audio at the current moment;
a feature extraction module, configured to extract multiple audio frame features from the real-time audio;
a model recognition module, configured to input the extracted audio frame features one by one into a pre-deployed wake-up model and compute in combination with the state saved by the wake-up model at the previous moment, to obtain a wake-up result indicating whether the real-time audio contains the wake-up word;
wherein the wake-up model is generated by the wake-up model generation method described in the first aspect.
The technical solutions provided by the embodiments of the present invention have the following beneficial effects:
1. Since the duration of the wake-up word audio is not fixed, the wake-up word audio is used as variable-length input data to train the recurrent neural network (RNN), which avoids manually trimming data, reduces manual data processing, saves labor cost, and also makes wake-up speech with a slower speaking rate recognizable;
2. Since the sample audio set can contain long audio, the RNN can be trained without interruption, which improves the recognition accuracy of the wake-up word and helps improve the wake-up effect of the smart terminal;
3. During terminal wake-up, for each new audio frame added to terminal memory, old data does not need to be recomputed, which reduces the terminal's computation time and power consumption.
Brief Description of the Drawings
To explain the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of a wake-up model generation method provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram of labeling the start and end times of wake-up words provided by an embodiment of the present invention;
FIG. 3 is a schematic diagram of MFCC feature vector extraction provided by an embodiment of the present invention;
FIG. 4 is a schematic diagram of frame labeling provided by an embodiment of the present invention;
FIG. 5 is a schematic flowchart of a smart terminal wake-up method provided by an embodiment of the present invention;
FIG. 6a is a schematic diagram of the wake-up process in terminal memory at t=1 provided by an embodiment of the present invention;
FIG. 6b is a schematic diagram of the wake-up process in terminal memory at t=M provided by an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a wake-up model generation device provided by an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a smart terminal wake-up device provided by an embodiment of the present invention.
Detailed Description
To make the objectives, technical solutions, and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
In the description of this application, it should be understood that the terms "first", "second", and the like are used for descriptive purposes only and cannot be understood as indicating or implying relative importance. In addition, in the description of this application, unless otherwise specified, "multiple" means two or more.
Embodiment 1
An embodiment of the present invention provides a wake-up model generation method, which can be applied to a server. As shown in FIG. 1, the method may include the following steps:
101: Label the start and end times of each wake-up word contained in the wake-up word audio of a sample audio set, to obtain labeled wake-up word audio, wherein the duration of the wake-up word audio is not fixed.
The sample audio set contains multiple wake-up word audios, and each wake-up word audio contains at least one wake-up word. In a specific implementation, multiple wake-up word audios containing the wake-up word can be recorded in a quiet environment; when recording one wake-up word audio, a certain time interval must be kept between adjacent wake-up words, and every wake-up word has the same content, for example "小biu小biu". In this embodiment, the duration of each wake-up word audio is roughly a few seconds to a few minutes, and the duration of a wake-up word is roughly 1 second.
Specifically, at least one key audio segment containing only the wake-up word is identified in the wake-up word audio, and the start and end times of each wake-up word are labeled according to the respective start and end times of the key audio segments, to obtain the labeled audio. In a specific implementation, the start and end times of each wake-up word in the wake-up word audio can be labeled manually on the server, to obtain the labeled wake-up word audio.
Here the start and end times include a start time and an end time, and both are labeled for each wake-up word; for example, start_N and end_N can be used as the start time and end time of the N-th wake-up word, as shown in FIG. 2, which is a schematic diagram of labeling the start and end times of wake-up words provided by an embodiment of the present invention, where the black parts represent wake-up words.
102: Add noise to the labeled wake-up word audio using negative sample audio containing background noise, to obtain positive sample audio.
Background noise in different scenes can be recorded in advance to obtain the negative sample audio, where the different scenes can be any scenes, for example the scene while a TV is playing, the scene while cooking, or other scenes.
Specifically, a negative sample audio segment with the same duration as the labeled wake-up word audio is intercepted from the negative sample audio, the amplitude mean of the negative sample audio segment is adjusted, and the adjusted negative sample audio segment is mixed with the labeled audio to add noise, to obtain the positive sample audio.
In a specific implementation, the amplitude mean of the negative sample audio segment can first be adjusted to equal the amplitude mean of the labeled audio, and then reduced to a preset percentage of that amplitude mean, where the preset percentage can be between 5% and 10%.
In this embodiment, to augment the positive sample audio data set, N negative sample audios can be used to add noise to each of M wake-up word audios, yielding N*M positive sample audios.
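As an illustration of the noise-adding and augmentation steps just described, the following is a minimal Python sketch; it is not part of the original disclosure, and the function names, the use of NumPy, and the fixed 8% scaling factor (one value inside the 5% to 10% range mentioned above) are assumptions made only for this example.

    import numpy as np

    def mix_noise(labeled_audio: np.ndarray, negative_audio: np.ndarray,
                  percent: float = 0.08) -> np.ndarray:
        # Cut a noise segment as long as the labeled audio (the negative
        # audio is assumed to be at least as long as the labeled audio).
        n = len(labeled_audio)
        start = np.random.randint(0, len(negative_audio) - n + 1)
        segment = negative_audio[start:start + n].astype(np.float64)

        # First match the segment's mean amplitude to that of the labeled
        # audio, then reduce it to the preset percentage of that mean.
        target_mean = np.abs(labeled_audio).mean()
        seg_mean = np.abs(segment).mean() + 1e-12
        segment *= (target_mean / seg_mean) * percent

        # Mix the scaled background noise into the labeled wake-word audio.
        return labeled_audio + segment

    def augment(wake_word_audios, negative_audios, percent=0.08):
        # N negative audios x M wake-word audios -> N*M positive samples.
        return [mix_noise(wav, neg, percent)
                for neg in negative_audios
                for wav in wake_word_audios]

Because the noise level is expressed relative to the labeled audio's own mean amplitude, the same procedure works regardless of the recording level of the wake-word audio.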
103: Extract multiple audio frame features from the positive sample audio and the negative sample audio respectively, and label the positive sample audio and the negative sample audio with frame labels, to obtain multiple audio training samples.
Specifically, extracting multiple audio frame features from the positive sample audio and the negative sample audio respectively may include:
extracting multiple audio frame features from each audio frame of the positive sample audio and of the negative sample audio, and generating a feature spectrogram of the positive sample audio and a feature spectrogram of the negative sample audio, where the audio frame feature can specifically be the Mel-frequency cepstral coefficient (MFCC) feature, and the feature spectrogram is the Mel cepstrogram, that is, the MFCC spectrogram; each feature vector in the Mel cepstrogram represents the MFCC feature vector of one audio frame.
FIG. 3 is a schematic diagram of MFCC feature vector extraction provided by an embodiment of the present invention. As shown in FIG. 3, for each positive sample audio and each negative sample audio, the MFCC features can be computed with a preset window width W, moving step S, and number of Mel-frequency cepstral coefficients C_Mel, to generate the Mel cepstrogram.
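The patent does not prescribe a particular MFCC implementation; as one possible rendering of this step, the sketch below uses the librosa library, mapping the window width W, moving step S, and coefficient count C_Mel onto librosa's win_length, hop_length, and n_mfcc parameters. The concrete values (16 kHz audio, a 25 ms window, a 10 ms step, 13 coefficients) are assumptions for the example only.

    import librosa
    import numpy as np

    def mfcc_spectrogram(audio: np.ndarray, sr: int = 16000,
                         window_w: int = 400,  # W: 25 ms at 16 kHz (assumed)
                         step_s: int = 160,    # S: 10 ms at 16 kHz (assumed)
                         c_mel: int = 13):     # C_Mel: number of coefficients
        # librosa returns an array of shape (n_mfcc, n_frames); transpose it
        # so that each row is the MFCC feature vector of one audio frame,
        # matching the Mel cepstrogram description above.
        mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=c_mel,
                                    n_fft=window_w, win_length=window_w,
                                    hop_length=step_s)
        return mfcc.T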
Specifically, the positive sample audio and the negative sample audio are labeled with frame labels, where the frame labels include a positive label, a negative label, and an intermediate label; this process may include:
for each audio frame of the positive sample audio, judging whether part or all of the audio frame falls within the start-end period of any wake-up word; if so, the audio frame is marked with the intermediate label; if not, judging whether the previous audio frame falls within the start-end period of any wake-up word and the current audio frame is the first that no longer contains the end time of the wake-up word; if so, the audio frame is marked with the positive label, otherwise with the negative label; for each audio frame of the negative sample audio, the audio frame is marked with the negative label.
In this embodiment, the positive label, negative label, and intermediate label can be represented as "Positive", "Negative", and "Middle", or as "1", "-1", and "0", respectively.
FIG. 4 is a schematic diagram of frame labeling provided by an embodiment of the present invention. As shown in FIG. 4, let the start time of a window be denoted t and the window width be w. For each audio frame of the positive sample audio: if the frame falls outside the start-end period of every wake-up word, it is marked "Negative", that is, (end_{N-1} < t) && (t+w < start_N); if part or all of the frame falls within the start-end period of any wake-up word, it is marked "Middle", that is, (start_N < t+w) && (t < end_N); if the previous frame falls within the start-end period of a wake-up word and the current frame is the first that no longer contains the end time of the wake-up word, that is, (end_N ≤ t) && (t-1 < end_N), it is marked "Positive".
It can be understood that every audio frame of the negative sample audio is marked "Negative".
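The three labeling rules can be transcribed almost directly into code. The sketch below is a plain-Python rendering of the conditions in FIG. 4 and is not taken from the patent; instead of the (end_N ≤ t) && (t-1 < end_N) arithmetic, it tracks whether the previous frame overlapped a wake-up word, which marks the same frame, namely the first frame past a word's end time.

    NEGATIVE, MIDDLE, POSITIVE = -1, 0, 1   # the "-1", "0", "1" encoding above

    def label_frames(frame_starts, w, wake_words):
        # frame_starts: window start times t of each frame, in order.
        # w:            window width.
        # wake_words:   list of (start_N, end_N) intervals, in order.
        labels = []
        prev_in_word = False
        for t in frame_starts:
            # "Middle": part or all of the frame falls inside a wake word,
            # i.e. (start_N < t + w) && (t < end_N) for some word N.
            in_word = any(s < t + w and t < e for s, e in wake_words)
            if in_word:
                labels.append(MIDDLE)
            elif prev_in_word:
                # First frame wholly past a word's end time: "Positive".
                labels.append(POSITIVE)
            else:
                # Outside every wake word: "Negative"; this is also the
                # label given to every frame of the negative sample audio.
                labels.append(NEGATIVE)
            prev_in_word = in_word
        return labels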
104: Train a recurrent neural network with the multiple audio training samples, to generate a wake-up model.
Specifically, for the N-th audio frame of each audio training sample, the frame feature of the audio frame is used as the input of the input layer of the recurrent neural network at time t, the frame label of the audio frame is used as the output of the output layer at time t, and the hidden-layer state value S_t of the recurrent neural network at time t is computed in combination with the hidden-layer state value S_{t-1} at the previous moment; the state values of the hidden layer at all moments are computed in turn, and the wake-up model is generated.
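The patent only requires a recurrent network whose hidden state S_t is computed from the frame feature at time t together with S_{t-1}. The following training sketch is one assumed concretization: PyTorch, a single-layer GRU, and cross-entropy over the three frame labels (re-encoded here as 0 = negative, 1 = middle, 2 = positive); none of these choices come from the original disclosure.

    import torch
    import torch.nn as nn

    class WakeModel(nn.Module):
        # The GRU hidden state plays the role of the state value S_t.
        def __init__(self, n_mfcc=13, hidden=64, n_labels=3):
            super().__init__()
            self.rnn = nn.GRU(n_mfcc, hidden, batch_first=True)
            self.out = nn.Linear(hidden, n_labels)

        def forward(self, x, state=None):
            # x: (batch, frames, n_mfcc); state carries S_{t-1} across calls.
            h, state = self.rnn(x, state)
            return self.out(h), state

    def train_step(model, optimizer, features, labels):
        # One variable-length sample: features (1, T, n_mfcc), labels (1, T)
        # holding integer classes 0/1/2 for negative/middle/positive.
        logits, _ = model(features)
        loss = nn.functional.cross_entropy(logits.squeeze(0),
                                           labels.squeeze(0))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

Because the GRU consumes a sequence of any length T, no trimming of the training audio to a fixed duration is needed, which is exactly the variable-length-input property the method relies on.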
It should be noted that, after the wake-up model is generated in this embodiment of the present invention, the wake-up model can be deployed on a smart terminal, so that the smart terminal can be woken using the wake-up model.
This embodiment of the present invention provides a wake-up model generation method. Since the duration of the wake-up word audio is not fixed, the wake-up word audio is used as variable-length input data to train the recurrent neural network (RNN), which avoids manually trimming data and saves labor cost, and also makes slow-speech data recognizable. At the same time, since the sample audio set can contain long audio, the RNN can be trained without interruption, which improves the recognition accuracy of the wake-up word and helps improve the wake-up effect of the smart terminal.
Embodiment 2
An embodiment of the present invention provides a smart terminal wake-up method, which can be applied to a smart terminal on which a wake-up model generated by the wake-up model generation method of Embodiment 1 is pre-deployed. As shown in FIG. 5, the method may include the following steps:
501: The smart terminal acquires real-time audio at the current moment.
Specifically, the smart terminal can use a microphone to capture the real-time audio of the scene at the current moment. Smart terminals include, but are not limited to, robots, smartphones, wearable devices, smart home devices, vehicle-mounted terminals, and the like.
502: Extract multiple audio frame features from the real-time audio.
Specifically, with the preset window width W, moving step S, and number of Mel-frequency cepstral coefficients C_Mel, the MFCC feature is extracted from each audio frame of the real-time audio, to obtain multiple audio frame features.
Further, to improve the recognition accuracy of the wake-up word and the wake-up effect, before step 502 is executed, the method provided by this embodiment of the present invention may further include:
preprocessing the real-time audio at the current moment, where the preprocessing includes but is not limited to echo cancellation and noise reduction.
503: Input the extracted audio frame features one by one into the pre-deployed wake-up model, and compute in combination with the state saved by the wake-up model at the previous moment, to obtain a wake-up result indicating whether the real-time audio contains the wake-up word.
Specifically, the audio frame features are input into the wake-up model one by one in their temporal order in the real-time audio, and the computation is performed in combination with the state saved by the wake-up model at the previous moment. From the output of the wake-up model, the frame labels corresponding to the multiple audio frames of the current real-time audio and the current state of the wake-up model are obtained; the current state of the wake-up model is saved; and the wake-up result indicating whether the real-time audio contains the wake-up word is obtained from the frame labels corresponding to the multiple audio frames. When the frame labels corresponding to the multiple audio frames include a positive label, it is determined that the real-time audio contains the wake-up word.
The smart terminal wake-up method of this embodiment of the present invention is further described below with reference to FIG. 6a and FIG. 6b.
Suppose the memory of the smart terminal can store only N frames of data at a time. As shown in FIG. 6a, when the smart terminal is first powered on, the real-time audio at time t=1 is loaded into memory; the previous state S_0 of the RNN in the wake-up model is 0, and the real-time audio feature at t=1 is input into the RNN of the wake-up model to obtain the RNN state S_1 at t=1 and output the recognition result. As shown in FIG. 6b, at any moment after power-on, say t=M with M greater than 1, only the real-time audio frame feature newly added to memory at t=M needs to be input into the RNN of the wake-up model and computed in combination with the state S_{M-1} saved by the RNN at the previous moment; all the data in memory does not need to be recomputed.
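To make the FIG. 6a/6b walkthrough concrete, here is a streaming-inference sketch reusing the hypothetical WakeModel sketched under Embodiment 1: at each step only the newly arrived frame feature is fed in, together with the state saved from the previous step, so no old data is recomputed. This is an illustrative assumption, not the patent's reference implementation.

    import torch

    @torch.no_grad()
    def stream_detect(model, frame_features):
        state = None                          # S_0 = 0 at first power-on
        for feat in frame_features:           # feat: tensor of shape (n_mfcc,)
            x = feat.view(1, 1, -1)           # one new frame only
            logits, state = model(x, state)   # combine with saved S_{t-1}
            if logits.argmax(dim=-1).item() == 2:   # 2 = positive label
                return True                   # wake-up word detected
        return False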
In this embodiment, since most current smart terminals use low-end chips, the capacity of terminal memory is limited. In the prior art, during the terminal wake-up stage the neural network has to process the audio of duration t in terminal memory every time, so a large amount of repeated data has to be processed between two adjacent windows of length t, which increases the terminal's computation time and power consumption. The present invention uses the variable-length-input RNN wake-up model to judge whether the real-time audio contains the wake-up word without recomputing old data, which reduces the amount of computation, speeds up processing, and lowers power consumption.
Embodiment 3
As an implementation of the wake-up model generation method provided in Embodiment 1, an embodiment of the present invention provides a wake-up model generation device. As shown in FIG. 7, the device includes:
a first labeling module 71, configured to label the start and end times of each wake-up word contained in the wake-up word audio of the sample audio set to obtain labeled wake-up word audio, wherein the duration of the wake-up word audio is not fixed;
a noise-adding module 72, configured to add noise to the labeled wake-up word audio using negative sample audio containing background noise, to obtain positive sample audio;
a feature extraction module 73, configured to extract multiple audio frame features from the positive sample audio and the negative sample audio respectively;
a second labeling module 74, configured to label the positive sample audio and the negative sample audio with frame labels, to obtain multiple audio training samples;
a model generation module 75, configured to train a recurrent neural network with the multiple audio training samples to generate a wake-up model.
Further, the first labeling module 71 is specifically configured to:
identify at least one key audio segment in the wake-up word audio that contains only the wake-up word;
label the start and end times of each wake-up word according to the respective start and end times of each key audio segment, to obtain the labeled audio.
Further, the noise-adding module 72 is specifically configured to:
intercept from the negative sample audio a negative sample audio segment with the same duration as the labeled wake-up word audio;
adjust the amplitude mean of the negative sample audio segment, and mix the adjusted negative sample audio segment with the labeled audio to add noise, to obtain the positive sample audio.
Further, the frame labels include a positive label, a negative label, and an intermediate label, and the second labeling module 74 is specifically configured to:
for each audio frame of the positive sample audio, judge whether part or all of the audio frame falls within the start-end period of any wake-up word, and if so, mark the audio frame with the intermediate label;
if not, judge whether the previous audio frame falls within the start-end period of any wake-up word and the current audio frame is the first that no longer contains the end time of the wake-up word; if so, mark the audio frame with the positive label, otherwise mark it with the negative label;
for each audio frame of the negative sample audio, mark the audio frame with the negative label.
The wake-up model generation device provided by this embodiment of the present invention belongs to the same inventive concept as the wake-up model generation method provided in Embodiment 1 of the present invention; it can execute the wake-up model generation method provided in any embodiment of the present invention and has the functional modules and beneficial effects corresponding to executing that method. For technical details not described in detail in this embodiment, refer to the wake-up model generation method provided by the embodiments of the present invention; they are not repeated here.
Embodiment 4
As an implementation of the smart terminal wake-up method provided in Embodiment 2, an embodiment of the present invention provides a smart terminal wake-up device. As shown in FIG. 8, the device includes:
an audio acquisition module 81, configured for the smart terminal to acquire real-time audio at the current moment;
a feature extraction module 82, configured to extract multiple audio frame features from the real-time audio;
a model recognition module 83, configured to input the extracted audio frame features one by one into a pre-deployed wake-up model and compute in combination with the state saved by the wake-up model at the previous moment, to obtain a wake-up result indicating whether the real-time audio contains the wake-up word;
wherein the wake-up model is generated by the wake-up model generation method of Embodiment 1.
Further, to improve the recognition accuracy of the wake-up word and the wake-up effect, the device may further include:
a preprocessing module, configured to preprocess the real-time audio at the current moment, where the preprocessing includes but is not limited to echo cancellation and noise reduction.
The feature extraction module 82 is further configured to extract multiple audio frame features from the preprocessed real-time audio.
The smart terminal wake-up device provided by this embodiment of the present invention belongs to the same inventive concept as the smart terminal wake-up method provided in Embodiment 2 of the present invention; it can execute the smart terminal wake-up method provided in any embodiment of the present invention and has the functional modules and beneficial effects corresponding to executing that method. For technical details not described in detail in this embodiment, refer to the smart terminal wake-up method provided by the embodiments of the present invention; they are not repeated here.
In addition, another embodiment of the present invention further provides a computer device, including:
one or more processors;
a memory;
a program stored in the memory which, when executed by the one or more processors, causes the processors to execute the steps of the wake-up model generation method described in the foregoing embodiments.
In addition, another embodiment of the present invention further provides a computer device, including:
one or more processors;
a memory;
a program stored in the memory which, when executed by the one or more processors, causes the processors to execute the steps of the smart terminal wake-up method described in the foregoing embodiments.
In addition, another embodiment of the present invention further provides a computer-readable storage medium storing a program which, when executed by a processor, causes the processor to execute the steps of the wake-up model generation method described in the foregoing embodiments.
In addition, another embodiment of the present invention further provides a computer-readable storage medium storing a program which, when executed by a processor, causes the processor to execute the steps of the smart terminal wake-up method described in the foregoing embodiments.
Those skilled in the art should understand that the embodiments of the present invention may be provided as methods, devices, or computer program products. Therefore, the embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the embodiments of the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.
The embodiments of the present invention are described with reference to flowcharts and/or block diagrams of the methods, devices (systems), and computer program products according to the embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions can also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to work in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions can also be loaded onto a computer or other programmable data processing device, so that a series of operating steps are executed on the computer or other programmable device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
Although preferred embodiments of the present invention have been described, those skilled in the art can make further changes and modifications to these embodiments once they learn of the basic inventive concept. Therefore, the appended claims are intended to be construed as including the preferred embodiments and all changes and modifications falling within the scope of the embodiments of the present invention.
Obviously, those skilled in the art can make various changes and variations to the present invention without departing from its spirit and scope. If these modifications and variations fall within the scope of the claims of the present invention and their equivalent technologies, the present invention is also intended to include them.

Claims (10)

  1. A wake-up model generation method, characterized in that the method comprises:
    labeling the start and end times of each wake-up word contained in the wake-up word audio of a sample audio set, to obtain labeled wake-up word audio, wherein the duration of the wake-up word audio is not fixed;
    adding noise to the labeled wake-up word audio using negative sample audio containing background noise, to obtain positive sample audio;
    extracting multiple audio frame features from the positive sample audio and the negative sample audio respectively, and labeling the positive sample audio and the negative sample audio with frame labels, to obtain multiple audio training samples;
    training a recurrent neural network with the multiple audio training samples to generate a wake-up model.
  2. The method according to claim 1, characterized in that labeling the start and end times of each wake-up word contained in the wake-up word audio of the sample audio set to obtain the labeled wake-up word audio comprises:
    identifying at least one key audio segment in the wake-up word audio that contains only the wake-up word;
    labeling the start and end times of each wake-up word according to the respective start and end times of each key audio segment, to obtain the labeled audio.
  3. The method according to claim 1, characterized in that adding noise to the labeled wake-up word audio using negative sample audio containing background noise to obtain positive sample audio comprises:
    intercepting from the negative sample audio a negative sample audio segment with the same duration as the labeled wake-up word audio;
    adjusting the amplitude mean of the negative sample audio segment, and mixing the adjusted negative sample audio segment with the labeled audio to add noise, to obtain the positive sample audio.
  4. The method according to any one of claims 1 to 3, characterized in that the frame labels include a positive label, a negative label, and an intermediate label, and labeling the positive sample audio and the negative sample audio with frame labels to obtain multiple audio training samples comprises:
    for each audio frame of the positive sample audio, judging whether part or all of the audio frame falls within the start-end period of any wake-up word, and if so, marking the audio frame with the intermediate label;
    if not, judging whether the previous audio frame falls within the start-end period of any wake-up word and the current audio frame is the first that no longer contains the end time of the wake-up word; if so, marking the audio frame with the positive label, otherwise marking it with the negative label;
    for each audio frame of the negative sample audio, marking the audio frame with the negative label.
  5. A smart terminal wake-up method, characterized in that the method comprises:
    acquiring, by the smart terminal, real-time audio at the current moment;
    extracting multiple audio frame features from the real-time audio;
    inputting the extracted audio frame features one by one into a pre-deployed wake-up model, and computing in combination with the state saved by the wake-up model at the previous moment, to obtain a wake-up result indicating whether the real-time audio contains the wake-up word;
    wherein the wake-up model is generated by the wake-up model generation method according to any one of claims 1 to 4.
  6. A wake-up model generation device, characterized in that the device comprises:
    a first labeling module, configured to label the start and end times of each wake-up word contained in the wake-up word audio of a sample audio set to obtain labeled wake-up word audio, wherein the duration of the wake-up word audio is not fixed;
    a noise-adding module, configured to add noise to the labeled wake-up word audio using negative sample audio containing background noise, to obtain positive sample audio;
    a feature extraction module, configured to extract multiple audio frame features from the positive sample audio and the negative sample audio respectively;
    a second labeling module, configured to label the positive sample audio and the negative sample audio with frame labels, to obtain multiple audio training samples;
    a model generation module, configured to train a recurrent neural network with the multiple audio training samples to generate a wake-up model.
  7. The device according to claim 6, characterized in that the first labeling module is specifically configured to:
    identify at least one key audio segment in the wake-up word audio that contains only the wake-up word;
    label the start and end times of each wake-up word according to the respective start and end times of each key audio segment, to obtain the labeled audio.
  8. The device according to claim 6, characterized in that the noise-adding module is specifically configured to:
    intercept from the negative sample audio a negative sample audio segment with the same duration as the labeled wake-up word audio;
    adjust the amplitude mean of the negative sample audio segment, and mix the adjusted negative sample audio segment with the labeled audio to add noise, to obtain the positive sample audio.
  9. The device according to any one of claims 6 to 8, characterized in that the frame labels include a positive label, a negative label, and an intermediate label, and the second labeling module is specifically configured to:
    for each audio frame of the positive sample audio, judge whether part or all of the audio frame falls within the start-end period of any wake-up word, and if so, mark the audio frame with the intermediate label;
    if not, judge whether the previous audio frame falls within the start-end period of any wake-up word and the current audio frame is the first that no longer contains the end time of the wake-up word; if so, mark the audio frame with the positive label, otherwise mark it with the negative label;
    for each audio frame of the negative sample audio, mark the audio frame with the negative label.
  10. A smart terminal wake-up device, characterized in that the device comprises:
    an audio acquisition module, configured for the smart terminal to acquire real-time audio at the current moment;
    a feature extraction module, configured to extract multiple audio frame features from the real-time audio;
    a model recognition module, configured to input the extracted audio frame features one by one into a pre-deployed wake-up model and compute in combination with the state saved by the wake-up model at the previous moment, to obtain a wake-up result indicating whether the real-time audio contains the wake-up word;
    wherein the wake-up model is generated by the wake-up model generation method according to any one of claims 1 to 4.
PCT/CN2020/105998 2019-10-28 2020-07-30 Wake-up model generation method, smart terminal wake-up method and device WO2021082572A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CA3158930A CA3158930A1 (en) 2019-10-28 2020-07-30 Arousal model generating method, intelligent terminal arousing method, and corresponding devices

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911028892.5A 2019-10-28 Wake-up model generation method, smart terminal wake-up method and device
CN201911028892.5 2019-10-28

Publications (1)

Publication Number Publication Date
WO2021082572A1 (zh)

Family

ID=70029890

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/105998 WO2021082572A1 (zh) 2019-10-28 2020-07-30 一种唤醒模型生成方法、智能终端唤醒方法及装置

Country Status (3)

Country Link
CN (1) CN110970016B (zh)
CA (1) CA3158930A1 (zh)
WO (1) WO2021082572A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113903334A (zh) * 2021-09-13 2022-01-07 Training of a sound source localization model, and sound source localization method and device
CN116110112A (zh) * 2023-04-12 2023-05-12 Adaptive adjustment method and device for a smart switch based on face recognition

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110970016B (zh) * 2019-10-28 2022-08-19 苏宁云计算有限公司 Wake-up model generation method, smart terminal wake-up method and device
CN111653274B (zh) * 2020-04-17 2023-08-04 北京声智科技有限公司 Wake-up word recognition method, device, and storage medium
CN111833902A (zh) * 2020-07-07 2020-10-27 Oppo广东移动通信有限公司 Wake-up model training method, wake-up word recognition method, device, and electronic apparatus
CN112201239B (zh) * 2020-09-25 2024-05-24 海尔优家智能科技(北京)有限公司 Method and device for determining a target device, storage medium, and electronic device
CN112259085A (zh) * 2020-09-28 2021-01-22 上海声瀚信息科技有限公司 Two-stage voice wake-up algorithm based on a model fusion framework
CN113223499B (zh) * 2021-04-12 2022-11-04 青岛信芯微电子科技股份有限公司 Method and device for generating audio negative samples

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108281137A (zh) * 2017-01-03 2018-07-13 Universal voice wake-up recognition method and system under a full-phoneme framework
CN108694940A (zh) * 2017-04-10 2018-10-23 Speech recognition method, device, and electronic apparatus
CN109036393A (zh) * 2018-06-19 2018-12-18 Wake-up word training method and device for household appliances, and household appliance
CN110097876A (zh) * 2018-01-30 2019-08-06 Voice wake-up processing method and woken device
CN110364147A (zh) * 2019-08-29 2019-10-22 Wake-up training word collection system and method
CN110970016A (zh) * 2019-10-28 2020-04-07 Wake-up model generation method, smart terminal wake-up method and device

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10719115B2 (en) * 2014-12-30 2020-07-21 Avago Technologies International Sales Pte. Limited Isolated word training and detection using generated phoneme concatenation models of audio inputs
WO2017217978A1 (en) * 2016-06-15 2017-12-21 Nuance Communications, Inc. Techniques for wake-up word recognition and related systems and methods
CN107358951A (zh) * 2017-06-29 2017-11-17 阿里巴巴集团控股有限公司 Voice wake-up method, device, and electronic apparatus
CN109215647A (zh) * 2018-08-30 2019-01-15 出门问问信息科技有限公司 Voice wake-up method, electronic device, and non-transitory computer-readable storage medium
CN110428808B (zh) * 2018-10-25 2022-08-19 腾讯科技(深圳)有限公司 Speech recognition method and device
CN109448725A (zh) * 2019-01-11 2019-03-08 百度在线网络技术(北京)有限公司 Voice interaction device wake-up method, apparatus, device, and storage medium
CN109785850A (zh) * 2019-01-18 2019-05-21 腾讯音乐娱乐科技(深圳)有限公司 Noise detection method, device, and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108281137A (zh) * 2017-01-03 2018-07-13 Universal voice wake-up recognition method and system under a full-phoneme framework
CN108694940A (zh) * 2017-04-10 2018-10-23 Speech recognition method, device, and electronic apparatus
CN109036393A (zh) * 2018-06-19 2018-12-18 Wake-up word training method and device for household appliances, and household appliance
CN110097876A (zh) * 2018-01-30 2019-08-06 Voice wake-up processing method and woken device
CN110364147A (zh) * 2019-08-29 2019-10-22 Wake-up training word collection system and method
CN110970016A (zh) * 2019-10-28 2020-04-07 Wake-up model generation method, smart terminal wake-up method and device

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113903334A (zh) * 2021-09-13 2022-01-07 Training of a sound source localization model, and sound source localization method and device
CN113903334B (zh) * 2021-09-13 2022-09-23 Training of a sound source localization model, and sound source localization method and device
CN116110112A (zh) * 2023-04-12 2023-05-12 Adaptive adjustment method and device for a smart switch based on face recognition
CN116110112B (zh) * 2023-04-12 2023-06-16 Adaptive adjustment method and device for a smart switch based on face recognition

Also Published As

Publication number Publication date
CN110970016B (zh) 2022-08-19
CN110970016A (zh) 2020-04-07
CA3158930A1 (en) 2021-05-06

Similar Documents

Publication Publication Date Title
WO2021082572A1 (zh) Wake-up model generation method, smart terminal wake-up method and device
DE102018010463B3 (de) Portable device, computer-readable storage medium, method, and apparatus for energy-efficient and low-power distributed automatic speech recognition
CN107103903B (zh) Artificial-intelligence-based acoustic model training method, device, and storage medium
CN105632486B (zh) Voice wake-up method and device for smart hardware
CN110364143B (zh) Voice wake-up method and device, and smart electronic apparatus thereof
WO2021128741A1 (zh) Voice emotion fluctuation analysis method and device, computer equipment, and storage medium
US8731936B2 (en) Energy-efficient unobtrusive identification of a speaker
CN110570873B (zh) Voiceprint wake-up method and device, computer equipment, and storage medium
CN111161714B (zh) Voice information processing method, electronic device, and storage medium
CN104575504A (zh) Method for personalized TV voice wake-up using voiceprint and speech recognition
KR20160007527A (ko) Method and apparatus for detecting a target keyword
EP3739582A1 (en) Voice detection
CN109741753A (zh) Voice interaction method, device, terminal, and server
CN112562742B (zh) Speech processing method and device
CN109697978B (zh) Method and device for generating a model
CN110706707B (zh) Method, device, equipment, and computer-readable storage medium for voice interaction
CN108877779B (zh) Method and device for detecting the tail point of speech
WO2023030235A1 (zh) Target audio output method and system, readable storage medium, and electronic device
WO2018095167A1 (zh) Voiceprint recognition method and voiceprint recognition system
CN106531195B (zh) Dialogue conflict detection method and device
CN111145763A (zh) GRU-based method and system for recognizing the human voice in audio
CN108962226B (zh) Method and device for detecting endpoints of speech
CN112802498B (zh) Voice detection method and device, computer equipment, and storage medium
CN111179913B (zh) Voice processing method and device
US11769491B1 (en) Performing utterance detection using convolution

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20881623

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 3158930

Country of ref document: CA

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20881623

Country of ref document: EP

Kind code of ref document: A1