WO2017166651A1 - Speech recognition model training method, speaker type recognition method and apparatus - Google Patents

Speech recognition model training method, speaker type recognition method and apparatus Download PDF

Info

Publication number
WO2017166651A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
speaker
voice
type
recognized
Prior art date
Application number
PCT/CN2016/096986
Other languages
English (en)
French (fr)
Inventor
张俊博
Original Assignee
乐视控股(北京)有限公司
乐视致新电子科技(天津)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 乐视控股(北京)有限公司, 乐视致新电子科技(天津)有限公司 filed Critical 乐视控股(北京)有限公司
Publication of WO2017166651A1 publication Critical patent/WO2017166651A1/zh

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/065 Adaptation
    • G10L15/07 Adaptation to the speaker

Definitions

  • the present invention relates to the field of speech recognition technology, and in particular, to a speech recognition model training method for speaker type recognition, a speech recognition model training device, a speaker type recognition method and device.
  • current information playback devices, such as televisions and computers, are equipped with voice recognition modules, but these modules are generally only used to extract the language-related information of a voice signal, recognize keywords, and support information search and the like;
  • they cannot distinguish between user types. Therefore, how to provide a speaker type identification scheme that identifies the user type has become a technical problem to be solved by those skilled in the art.
  • the invention provides a speech recognition model training method, a speech recognition model training device, and a speaker type recognition method and device, which are used to solve the problem in the prior art that the user type cannot be identified.
  • the embodiment of the invention provides a speech recognition model training method, which comprises:
  • using the acoustic features, training obtains a feature recognizer for extracting speaker features; wherein different user types correspond to different speaker features;
  • the speaker features corresponding to the different user types and the feature recognizer are used as a speaker type recognition model, and the speaker type recognition model is configured to extract the speaker feature of the voice to be recognized by combining the feature recognizer with the sound features of the voice to be recognized, match the speaker feature of the voice to be recognized against the speaker features corresponding to the different user types, and identify the user type corresponding to the speaker feature with the highest matching degree as the user type of the voice to be recognized.
  • An embodiment of the present invention provides a speaker type identification method, including:
  • the speaker type recognition model includes a feature recognizer and a speaker feature corresponding to different user types;
  • the feature recognizer is obtained by training on the acoustic features of the training speech;
  • the speaker feature corresponding to each of the different user types is extracted, using the feature recognizer, from the target voice corresponding to that user type;
  • the user type corresponding to the speaker feature with the highest matching degree is identified as the user type of the voice to be recognized.
  • the embodiment of the invention provides a speech recognition model training device, which comprises:
  • a first extraction module configured to acquire training speech and extract an acoustic feature of the training speech, where the training speech includes voices of different user types;
  • a training module configured to use the acoustic features to train and obtain a feature recognizer for extracting speaker features; wherein different user types correspond to different speaker features;
  • a second extraction module configured to extract, by using the feature recognizer, a speaker feature from the target voice corresponding to each user type, as the speaker feature corresponding to that user type;
  • a model generating module configured to use the speaker features corresponding to different user types and the feature recognizer as a speaker type recognition model, wherein the speaker type recognition model is configured to combine the feature recognizer with the sound features of the voice to be recognized to extract the speaker feature of the voice to be recognized, match the speaker feature of the voice to be recognized against the speaker features corresponding to different user types, and identify the user type corresponding to the speaker feature with the highest matching degree as the user type of the voice to be recognized.
  • An embodiment of the present invention provides a speaker type identification apparatus, including:
  • a third extraction module configured to acquire a voice to be recognized and extract acoustic features of the voice to be recognized;
  • a fourth extraction module configured to extract the speaker feature of the voice to be recognized by using the feature recognizer in the speaker type recognition model together with the acoustic features;
  • wherein the speaker type recognition model includes the feature recognizer and the speaker features corresponding to different user types;
  • the feature recognizer is obtained by training on the acoustic features of the training speech, and the speaker feature corresponding to each of the different user types is extracted, using the feature recognizer, from the target voice corresponding to that user type;
  • a matching degree calculation module configured to separately calculate the matching degree between the speaker feature of the voice to be recognized and the speaker features corresponding to different user types in the speaker type recognition model;
  • the identification module is configured to identify the user type corresponding to the speaker feature with the highest matching degree as the user type of the to-be-recognized voice.
  • the embodiment of the present invention further provides a non-transitory computer-readable storage medium, wherein the non-transitory computer-readable storage medium stores computer-executable instructions for performing any of the above speech recognition model training methods, or any of the above speaker type recognition methods.
  • An embodiment of the present invention further provides an electronic device, including: one or more processors; and a memory; wherein the memory stores instructions executable by the one or more processors, the instructions being configured to perform any of the above speech recognition model training methods, or any of the above speaker type recognition methods.
  • Embodiments of the present invention also provide a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform any of the above speech recognition model training methods, or any of the above speaker type recognition methods.
  • With the speech recognition model training method, speech recognition model training device, and speaker type recognition method and device provided by the embodiments of the invention, training speech is acquired and its acoustic features are extracted, the training speech including voices of different user types; using the acoustic features, a feature recognizer for extracting speaker features is obtained by training, wherein different user types correspond to different speaker features, and the feature recognizer is used to extract a speaker feature from the target voice corresponding to each user type
  • as the speaker feature corresponding to that user type; the speaker features corresponding to different user types and the feature recognizer are used as a speaker type recognition model. Thus, when speaker type recognition is performed, the feature recognizer in the speaker type recognition model is combined with the sound features of the voice to be recognized, the speaker feature of the voice to be recognized can be extracted and matched against the speaker features corresponding to the different user types, and the user type corresponding to the speaker feature with the highest matching degree is the user type of the voice to be recognized, thereby realizing recognition of the user type.
  • FIG. 1 is a flowchart of an embodiment of a speech recognition model training method according to an embodiment of the present invention;
  • FIG. 2 is a flowchart of an embodiment of a speaker type identification method according to an embodiment of the present invention;
  • FIG. 3 is a schematic structural diagram of an embodiment of a speech recognition model training apparatus according to an embodiment of the present invention.
  • FIG. 4 is a schematic structural diagram of an embodiment of a speaker type identification apparatus according to an embodiment of the present invention.
  • FIG. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
  • the technical solution of the present invention is applicable to a voice recognition scenario, and is used to distinguish different user types.
  • the user type may include an adult male, an adult female, an elderly person, or a child.
  • the user type distinction may be applied to different application scenarios, such as a smart TV. By distinguishing user types, different movie content and the like can be displayed to users of different user types.
  • in order to distinguish different user types, model training is first performed: training speech is acquired and its acoustic features are extracted, the training speech including voices of different user types; using the acoustic features, a feature recognizer for extracting speaker features is obtained by training, wherein different user types correspond to different speaker features; the feature recognizer is used to extract a speaker feature from the target voice corresponding to each user type as the speaker feature corresponding to that user type; and the speaker features corresponding to different user types together with the feature recognizer are used as a speaker type recognition model. Thus, when speaker type recognition is performed,
  • the feature recognizer in the speaker type recognition model is combined with the sound features of the voice to be recognized, so the speaker feature of the voice to be recognized can be extracted and matched against the speaker features corresponding to different user types,
  • and the user type corresponding to the speaker feature with the highest matching degree is the user type of the voice to be recognized, enabling identification of the user type.
  • FIG. 1 is a flowchart of an embodiment of a speech recognition model training method according to an embodiment of the present invention; the method may include the following steps:
  • 101: Acquire training speech and extract acoustic features of the training speech; the training speech includes voices of different user types.
  • Large-scale training speech is usually chosen, generally more than 50 hours.
  • Different user types may include adult males, adult females, elderly people, or children, and the amount of voice corresponding to different user types is the same or similar.
  • the acoustic features are first extracted, which may be MFCC (Mel Frequency Cepstrum Coefficient) features.
  • 102: Using the acoustic features, train to obtain a feature recognizer for extracting speaker features.
  • the speaker feature is a text-independent feature, computed from the acoustic features; thus, the acoustic features can be used to train a feature recognizer for extracting speaker features.
  • the speaker feature can be a fundamental frequency feature.
  • the inventors found in their research that the fundamental frequency of the human voice generally lies between 140 Hz (hertz) and 300 Hz; usually a female voice has a higher fundamental frequency than a male voice, and a child's is higher than an adult's, so the fundamental frequency feature can be used to distinguish different user types.
  • the speaker feature may be an i-Vector feature.
  • the i-Vector feature can reflect the speaker's acoustic differences so that different user types can be distinguished.
  • the feature recognizer for extracting speaker features can be trained using the acoustic features of the training speech;
  • when the speaker feature is an i-Vector feature, the feature recognizer is specifically a T matrix.
  • using the acoustic features, training to obtain the feature recognizer for extracting speaker features may specifically be:
  • first training a UBM (Universal Background Model) from the acoustic features, and then using the UBM to train the feature recognizer for extracting speaker features.
  • the target voice may be a target voice collected in an application environment for training.
  • for example, when applied to a television set, the target voice of each user type may be collected with the television set's microphone.
  • after the target voices are obtained, the feature recognizer trained in step 102 can be used to extract the speaker features.
  • to improve recognition accuracy, each user type may have multiple target voices, so the feature recognizer may be used to extract a speaker feature from each of the multiple target voices of a user type,
  • and the average of the extracted speaker features is taken as the speaker feature corresponding to that user type.
  • the speaker features corresponding to different user types and the feature recognizer are used as a speaker type recognition model.
  • that is, the feature recognizer obtained by training, together with the speaker feature corresponding to each user type extracted from the target voices by the feature recognizer, serves as the speaker type recognition model.
  • when speaker type recognition is performed, the speaker type recognition model can be utilized:
  • the feature recognizer is combined with the sound features of the voice to be recognized to extract the speaker feature of the voice to be recognized, the speaker feature of the voice to be recognized is matched against the speaker features corresponding to different user types,
  • and the user type corresponding to the speaker feature with the highest matching degree is identified as the user type of the voice to be recognized.
  • the speaker type recognition model obtained by the training realizes the purpose of identifying the user type, thereby realizing the distinction between different user types.
  • the user type is determined, so that relevant information corresponding to the user type can be pushed to the user in a targeted manner.
  • FIG. 2 is a flowchart of an embodiment of a speaker type identification method according to an embodiment of the present invention. The method may include the following steps:
  • 201: Acquire a voice to be recognized, and extract acoustic features of the voice to be recognized.
  • in practical applications, the voice to be recognized may be speech input by a user and collected by the device; the voice to be recognized is recognized in order to determine the user type of that user.
  • the speaker type recognition model includes the feature recognizer and the speaker features corresponding to different user types; the feature recognizer is obtained by training on the acoustic features of the training speech; and the speaker features corresponding to the different user types are extracted, using the feature recognizer, from the target voices of the different user types.
  • that is, the user type corresponding to the speaker feature with the highest matching degree is identified as the user type of the voice to be recognized.
  • when the speaker feature is an i-Vector feature, calculating the matching degree between the speaker feature of the voice to be recognized and the speaker features corresponding to the different user types in the speaker type recognition model may specifically be:
  • separately calculating, as the matching degree, the distance between the i-Vector feature of the voice to be recognized and the i-Vector features of the different user types in the speaker type recognition model, where the smaller the distance, the higher the matching degree.
  • the calculated distance may specifically be a cosine distance.
  • thus, the user type corresponding to the minimum distance is identified as the user type of the voice to be recognized.
  • the determination of the user type is implemented, thereby realizing the purpose of distinguishing different user types according to the voice.
  • FIG. 3 is a schematic structural diagram of an embodiment of a speech recognition model training apparatus according to an embodiment of the present invention, where the apparatus may include:
  • the first extraction module 301 is configured to acquire training speech and extract an acoustic feature of the training speech.
  • the training speech includes speech of different user types.
  • Different user types may include adult males, adult females, elderly people or children.
  • the acoustic features are first extracted, which may be MFCC features.
  • the training module 302 is configured to use the acoustic feature to train to obtain a feature recognizer for extracting speaker features.
  • the speaker feature is a text-independent feature, computed from the acoustic features; thus, the acoustic features can be used to train a feature recognizer for extracting speaker features.
  • the speaker feature can be a fundamental frequency feature.
  • the vocal fundamental frequency is generally between 140 Hz (hertz) and 300 Hz.
  • usually a female voice has a higher fundamental frequency than a male voice, and a child's is higher than an adult's, so the fundamental frequency feature can be used to distinguish different user types.
  • the speaker feature may be an i-Vector feature.
  • the i-Vector feature can reflect the speaker's acoustic differences so that different user types can be distinguished.
  • the feature recognizer for extracting speaker features can be trained using the acoustic features of the training speech;
  • when the speaker feature is an i-Vector feature, the feature recognizer is specifically a T matrix.
  • the training module may include:
  • a first training unit configured to train and obtain a universal background model by using the acoustic features;
  • a second training unit configured to use the universal background model to train and obtain a feature recognizer for extracting speaker features.
  • a second extraction module 303, configured to use the feature recognizer to extract a speaker feature from the target voice corresponding to each user type, as the speaker feature corresponding to that user type.
  • the target voice may be a target voice collected in an application environment for training.
  • for example, when applied to a television set, the target voice of each user type may be collected with the television set's microphone.
  • the target voice of each user type may include multiple voices; therefore, as yet another embodiment, the second extraction module is specifically configured to use the feature recognizer to extract a speaker feature from each of the multiple target voices of each user type,
  • and to take the average of the extracted speaker features as the speaker feature corresponding to that user type.
  • the model generation module 304 is configured to use the speaker features corresponding to different user types and the feature recognizer as a speaker type recognition model.
  • that is, the feature recognizer obtained by training, together with the speaker feature corresponding to each user type extracted from the target voices by the feature recognizer, serves as the speaker type recognition model.
  • when speaker type recognition is performed, the feature recognizer of the speaker type recognition model may be combined with the sound features of the voice to be recognized to extract the speaker feature of the voice to be recognized, the speaker feature of the voice to be recognized
  • is matched against the speaker features corresponding to different user types, and the user type corresponding to the speaker feature with the highest matching degree is identified as the user type of the voice to be recognized.
  • the speaker type recognition model obtained by the training realizes the purpose of identifying the user type, thereby realizing the distinction between different user types.
  • the user type is determined, so that relevant information corresponding to the user type can be pushed to the user in a targeted manner.
  • FIG. 4 is a schematic structural diagram of an embodiment of a speaker type identification device according to an embodiment of the present invention; the device may include:
  • the third extraction module 401 is configured to acquire a voice to be recognized, and extract an acoustic feature of the voice to be recognized.
  • in practical applications, the voice to be recognized may be speech input by a user and collected by the device; the voice to be recognized is recognized in order to determine the user type of that user.
  • the fourth extraction module 402 is configured to extract a speaker feature of the to-be-recognized speech by using a feature identifier in the speaker type recognition model and the acoustic feature.
  • the speaker type recognition model includes the feature recognizer and the speaker features corresponding to different user types; the feature recognizer is obtained by training on the acoustic features of the training speech; and the speaker features corresponding to the different user types are extracted, using the feature recognizer, from the target voices of the different user types.
  • the matching degree calculation module 403 is configured to separately calculate a matching degree between the speaker feature of the to-be-recognized speech and the speaker feature corresponding to different user types in the speaker type recognition model.
  • the identification module 404 is configured to identify the user type corresponding to the speaker feature with the highest matching degree as the user type of the to-be-recognized voice.
  • that is, the user type corresponding to the speaker feature with the highest matching degree is identified as the user type of the voice to be recognized.
  • the matching degree calculation module is specifically configured to:
  • the distance between the i-Vector feature of the speech to be recognized and the i-Vector feature of different user types in the speaker type recognition model is separately calculated as a matching degree; wherein the smaller the distance, the greater the matching degree.
  • the calculated distance between the i-Vector feature of the voice to be recognized and the i-Vector features of the different user types in the speaker type recognition model may specifically be a cosine distance.
  • thus, the user type corresponding to the minimum distance is identified as the user type of the voice to be recognized.
  • the determination of the user type is implemented, thereby realizing the purpose of distinguishing different user types according to the voice.
  • the speaker type identification device shown in FIG. 4 can be configured in an intelligent electronic device such as a smart TV, a mobile phone, or a tablet computer, to perform user type recognition on the voice input by the user, so that different information can be pushed or displayed for different user types.
  • the embodiment of the present application further provides a non-transitory computer-readable storage medium storing computer-executable instructions; the computer-executable instructions can execute the speech recognition model training method in any of the foregoing method embodiments, or the speaker type recognition method in any of the method embodiments.
  • FIG. 5 is a schematic diagram of a hardware structure of an electronic device for performing a speech recognition model training method and/or a speaker type identification method according to an embodiment of the present application. As shown in FIG. 5, the device includes:
  • one or more processors 510 and a memory 520; one processor 510 is taken as an example in FIG. 5.
  • the apparatus for performing the speech recognition model training method and/or the speaker type identification method may further include: an input device 530 and an output device 540.
  • the processor 510, the memory 520, the input device 530, and the output device 540 may be connected by a bus or other means, as exemplified by a bus connection in FIG.
  • the memory 520, as a non-transitory computer-readable storage medium, can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the speech recognition model training method and/or the speaker type identification method in the embodiments of the present application (for example, the first extraction module 301, the training module 302, the second extraction module 303, and the model generation module 304 shown in FIG. 3, or the third extraction module 401, the fourth extraction module 402, the matching degree calculation module 403, and the identification module 404 shown in FIG. 4).
  • the processor 510 executes various functional applications and data processing of the electronic device by running the non-volatile software programs, instructions, and modules stored in the memory 520, that is, it implements the speech recognition model training method and/or the speaker type identification method of the above method embodiments.
  • the memory 520 may include a program storage area and a data storage area, wherein the program storage area may store an operating system and applications required by at least one function, and the data storage area may store data created through use of the speech recognition model training device (as in FIG. 3) and/or the speaker type identification device (as in FIG. 4). Furthermore, the memory 520 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the memory 520 may optionally include memory located remotely from the processor 510; such remote memory may be connected to the speech recognition model training device and/or speaker type identification device via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • Input device 530 can receive input numeric or character information and generate key signal inputs related to user settings and function control of the speech recognition model training device and/or speaker type recognition device.
  • the output device 540 can include a display device such as a display screen.
  • the one or more modules are stored in the memory 520 and, when executed by the one or more processors 510, perform the speech recognition model training method and/or speaker type identification method in any of the above method embodiments.
  • the electronic device of the embodiment of the invention exists in various forms, including but not limited to:
  • (1) Mobile communication devices: these devices are characterized by mobile communication functions, with voice and data communication as their main goal. Such terminals include smart phones (such as the iPhone), multimedia phones, feature phones, and low-end phones.
  • (2) Ultra-mobile personal computer devices: these devices belong to the category of personal computers, have computing and processing functions, and generally also have mobile Internet access. Such terminals include PDA, MID, and UMPC devices, such as the iPad.
  • (3) Portable entertainment devices: these devices can display and play multimedia content. Such devices include audio and video players (such as the iPod), handheld game consoles, e-book readers, smart toys, and portable car navigation devices.
  • (4) Servers: devices that provide computing services. A server consists of a processor, hard disk, memory, system bus, and so on; a server is similar in architecture to a general-purpose computer, but because highly reliable services must be provided, the requirements on processing capability, stability, reliability, security, scalability, and manageability are high.
  • (5) Other electronic devices with data interaction functions.
  • the program when executed, may include the flow of an embodiment of the methods as described above.
  • the storage medium may be a magnetic disk, an optical disk, a read only memory (ROM), or a random access memory (RAM).
  • the device embodiments described above are merely illustrative; units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, i.e., they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solutions of the embodiments. They can be understood and implemented by those of ordinary skill in the art without creative effort.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

A speech recognition model training method, a speaker type recognition method and an apparatus. Training speech is acquired and its acoustic features are extracted (101); using the acoustic features, a feature recognizer for extracting speaker features is obtained by training (102); the feature recognizer is used to extract a speaker feature from the target voice corresponding to each user type, as the speaker feature corresponding to that user type (103); the speaker features corresponding to different user types and the feature recognizer are used as a speaker type recognition model (104). The feature recognizer in the speaker type recognition model is combined with the sound features of a voice to be recognized to extract the speaker feature of the voice to be recognized, the speaker feature of the voice to be recognized is matched against the speaker features corresponding to different user types, and the user type corresponding to the speaker feature with the highest matching degree is identified as the user type of the voice to be recognized.

Description

Speech recognition model training method, speaker type recognition method and apparatus
This application claims priority to Chinese Patent Application No. 201610195561.0, filed with the Chinese Patent Office on March 30, 2016 and entitled "Speech recognition model training method, speaker type recognition method and apparatus", the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates to the field of speech recognition technology, and in particular, to a speech recognition model training method for speaker type recognition, a speech recognition model training device, and a speaker type recognition method and device.
Background
With the diversification of information types, for example the diversification of film and television content, different users have different information needs: children, adults, and the elderly want different film and television content, and adult men and adult women want different content as well. Pushing or displaying different information content according to the user type can therefore greatly improve the user experience, and to do so the user types must be distinguished.
Current information playback devices, such as televisions and computers, are equipped with voice recognition modules, but these modules are generally only used to extract the language-related information of a voice signal, recognize keywords, and support information search and the like; they cannot distinguish between user types. Therefore, how to provide a speaker type identification scheme that identifies the user type has become a technical problem to be solved by those skilled in the art.
Summary of the Invention
The present invention provides a speech recognition model training method, a speech recognition model training device, and a speaker type recognition method and device, which are used to solve the problem in the prior art that user type identification cannot be achieved.
An embodiment of the present invention provides a speech recognition model training method, including:
acquiring training speech and extracting acoustic features of the training speech, the training speech including voices of different user types;
using the acoustic features, training to obtain a feature recognizer for extracting speaker features, wherein different user types correspond to different speaker features;
using the feature recognizer to extract a speaker feature from the target voice corresponding to each user type, as the speaker feature corresponding to that user type;
using the speaker features corresponding to the different user types and the feature recognizer as a speaker type recognition model, the speaker type recognition model being configured to extract the speaker feature of a voice to be recognized by combining the feature recognizer with sound features of the voice to be recognized, match the speaker feature of the voice to be recognized against the speaker features corresponding to the different user types, and identify the user type corresponding to the speaker feature with the highest matching degree as the user type of the voice to be recognized.
An embodiment of the present invention provides a speaker type recognition method, including:
acquiring a voice to be recognized, and extracting acoustic features of the voice to be recognized;
extracting the speaker feature of the voice to be recognized by using the feature recognizer in a speaker type recognition model together with the acoustic features, wherein the speaker type recognition model includes the feature recognizer and speaker features corresponding to different user types, the feature recognizer is obtained by training on the acoustic features of training speech, and the speaker feature corresponding to each user type is extracted, using the feature recognizer, from the target voice corresponding to that user type;
separately calculating the matching degree between the speaker feature of the voice to be recognized and the speaker features corresponding to the different user types in the speaker type recognition model;
identifying the user type corresponding to the speaker feature with the highest matching degree as the user type of the voice to be recognized.
An embodiment of the present invention provides a speech recognition model training device, including:
a first extraction module configured to acquire training speech and extract acoustic features of the training speech, the training speech including voices of different user types;
a training module configured to use the acoustic features to train and obtain a feature recognizer for extracting speaker features, wherein different user types correspond to different speaker features;
a second extraction module configured to use the feature recognizer to extract a speaker feature from the target voice corresponding to each user type, as the speaker feature corresponding to that user type;
a model generation module configured to use the speaker features corresponding to the different user types and the feature recognizer as a speaker type recognition model, the speaker type recognition model being configured to extract the speaker feature of a voice to be recognized by combining the feature recognizer with sound features of the voice to be recognized, match the speaker feature of the voice to be recognized against the speaker features corresponding to the different user types, and identify the user type corresponding to the speaker feature with the highest matching degree as the user type of the voice to be recognized.
An embodiment of the present invention provides a speaker type recognition device, including:
a third extraction module configured to acquire a voice to be recognized and extract acoustic features of the voice to be recognized;
a fourth extraction module configured to extract the speaker feature of the voice to be recognized by using the feature recognizer in a speaker type recognition model together with the acoustic features, wherein the speaker type recognition model includes the feature recognizer and speaker features corresponding to different user types, the feature recognizer is obtained by training on the acoustic features of training speech, and the speaker features corresponding to the different user types are extracted, using the feature recognizer, from the target voices corresponding to the different user types;
a matching degree calculation module configured to separately calculate the matching degree between the speaker feature of the voice to be recognized and the speaker features corresponding to the different user types in the speaker type recognition model;
an identification module configured to identify the user type corresponding to the speaker feature with the highest matching degree as the user type of the voice to be recognized.
An embodiment of the present invention further provides a non-transitory computer-readable storage medium storing computer-executable instructions for performing any of the above speech recognition model training methods, or any of the above speaker type recognition methods.
An embodiment of the present invention further provides an electronic device, including: one or more processors; and a memory; wherein the memory stores instructions executable by the one or more processors, the instructions being configured to perform any of the above speech recognition model training methods, or any of the above speaker type recognition methods.
An embodiment of the present invention further provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform any of the above speech recognition model training methods, or any of the above speaker type recognition methods.
With the speech recognition model training method, speech recognition model training device, and speaker type recognition method and device provided by the embodiments of the present invention, training speech is acquired and its acoustic features are extracted, the training speech including voices of different user types; using the acoustic features, a feature recognizer for extracting speaker features is obtained by training, wherein different user types correspond to different speaker features; the feature recognizer is used to extract a speaker feature from the target voice corresponding to each user type, as the speaker feature corresponding to that user type; and the speaker features corresponding to the different user types and the feature recognizer are used as a speaker type recognition model. Thus, when speaker type recognition is performed, the feature recognizer in the speaker type recognition model is combined with the sound features of the voice to be recognized to extract the speaker feature of the voice to be recognized, the speaker feature of the voice to be recognized is matched against the speaker features corresponding to different user types, and the user type corresponding to the speaker feature with the highest matching degree is the user type of the voice to be recognized, thereby realizing recognition of the user type.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the accompanying drawings described below are some embodiments of the present invention, and those of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a flowchart of an embodiment of a speech recognition model training method according to an embodiment of the present invention;
FIG. 2 is a flowchart of an embodiment of a speaker type recognition method according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of an embodiment of a speech recognition model training device according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an embodiment of a speaker type recognition device according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description of the Embodiments
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings of the embodiments of the present invention. Obviously, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
The technical solution of the present invention is applicable to speech recognition scenarios and is used to distinguish different user types. The user types may include adult male, adult female, elderly, or child. Distinguishing user types can be applied in different application scenarios; for example, by distinguishing user types, a smart TV can show users of different user types different film and television content, and so on.
In the embodiments of the present invention, in order to distinguish different user types, model training is performed first: training speech is acquired and its acoustic features are extracted, the training speech including voices of different user types; using the acoustic features, a feature recognizer for extracting speaker features is obtained by training, wherein different user types correspond to different speaker features; the feature recognizer is used to extract a speaker feature from the target voice corresponding to each user type, as the speaker feature corresponding to that user type; and the speaker features corresponding to the different user types and the feature recognizer are used as a speaker type recognition model. Thus, when speaker type recognition is performed, the feature recognizer in the speaker type recognition model is combined with the sound features of the voice to be recognized to extract the speaker feature of the voice to be recognized, the speaker feature of the voice to be recognized is matched against the speaker features corresponding to different user types, and the user type corresponding to the speaker feature with the highest matching degree is the user type of the voice to be recognized, thereby realizing recognition of the user type.
The technical solution of the present invention is described in detail below with reference to the accompanying drawings.
FIG. 1 is a flowchart of an embodiment of a speech recognition model training method according to an embodiment of the present invention; the method may include the following steps:
101: Acquire training speech and extract acoustic features of the training speech.
The training speech includes voices of different user types.
Large-scale training speech is usually chosen, generally more than 50 hours.
The different user types may include adult male, adult female, elderly, or child, and the amounts of speech corresponding to the different user types are the same or similar.
For the large amount of training speech, acoustic features are extracted first; the acoustic features may be MFCC (Mel-frequency cepstral coefficient) features.
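As an illustration of this extraction step, the sketch below computes MFCC features with the librosa library; the patent names no toolkit, so the library choice, sampling rate, and number of coefficients are assumptions.

```python
# Minimal MFCC extraction sketch. librosa, the 16 kHz sampling rate, and
# 13 coefficients are illustrative assumptions; the patent only specifies
# that the acoustic features may be MFCC features.
import librosa

def extract_mfcc(wav_path, n_mfcc=13):
    y, sr = librosa.load(wav_path, sr=16000)               # load speech audio
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc) # (n_mfcc, frames)
    return mfcc.T                                          # (frames, n_mfcc)
```

Frames of this kind, pooled over all of the training speech, would feed the training in step 102.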
102: Using the acoustic features, train to obtain a feature recognizer for extracting speaker features.
Different user types correspond to different speaker features.
The speaker feature is a text-independent feature, computed from the acoustic features. Therefore, the acoustic features can be used to train a feature recognizer for extracting speaker features.
The speaker feature may be a fundamental frequency feature. The inventors found in their research that the fundamental frequency of the human voice generally lies between 140 Hz (hertz) and 300 Hz; usually a female voice has a higher fundamental frequency than a male voice, and a child's is higher than an adult's, so the fundamental frequency feature can be used to distinguish different user types.
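For instance, the fundamental frequency cue could be estimated per utterance as in the sketch below; librosa's pYIN tracker and the 80-400 Hz search band (chosen simply to bracket the 140-300 Hz range cited above) are assumptions, not choices made by the patent.

```python
# Estimate an utterance's median fundamental frequency with pYIN; a higher
# median points toward the female or child user types per the observation
# above. The tracker and the search band are illustrative assumptions.
import numpy as np
import librosa

def median_f0(y, sr):
    f0, voiced_flag, _ = librosa.pyin(y, fmin=80, fmax=400, sr=sr)
    return float(np.nanmedian(f0[voiced_flag]))  # median over voiced frames
```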
Of course, to further improve recognition accuracy, the speaker feature may be an i-Vector feature. The i-Vector feature can reflect acoustic differences between speakers, and thus can be used to distinguish different user types.
The feature recognizer for extracting speaker features can be trained using the acoustic features of the training speech. When the speaker feature is an i-Vector feature, the feature recognizer is specifically a T matrix.
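For reference, the T matrix here is the total variability matrix of the standard i-Vector model, which is well established in the speaker recognition literature although not written out in the patent: the GMM mean supervector of an utterance is decomposed as M = m + Tw, where M is the utterance-dependent mean supervector, m is the speaker-independent UBM mean supervector, T is the low-rank total variability matrix, and w is the i-Vector extracted for the utterance.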
Using the acoustic features, training to obtain the feature recognizer for extracting speaker features may specifically be:
first using the acoustic features to train and obtain a UBM (Universal Background Model), and then using the UBM to train and obtain the feature recognizer for extracting speaker features.
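A minimal sketch of the first stage follows: the UBM is fitted as a diagonal-covariance Gaussian mixture over the pooled MFCC frames. The second stage, the EM training of the T matrix against this UBM, is omitted here and is typically delegated to a dedicated toolkit such as Kaldi; the library and hyperparameters below are assumptions.

```python
# Stage 1 of this step: fit a UBM, i.e. a diagonal-covariance GMM over MFCC
# frames pooled from the training speech of all user types. Stage 2 (training
# the total variability matrix T on Baum-Welch statistics gathered against
# this UBM) is left to a dedicated toolkit.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(pooled_mfcc_frames, n_components=256):
    # pooled_mfcc_frames: array of shape (num_frames, n_mfcc)
    ubm = GaussianMixture(n_components=n_components,
                          covariance_type="diag", max_iter=50)
    ubm.fit(pooled_mfcc_frames)
    return ubm
```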
103: Use the feature recognizer to extract a speaker feature from the target voice corresponding to each user type, as the speaker feature corresponding to that user type.
The target voice may be target voice collected in the application environment, used for training.
For example, when applied to a television set, the target voice of each user type may be collected with the television set's microphone.
These target voices have a certain duration, usually at least one hour, to improve recognition accuracy.
After the target voices are obtained, the feature recognizer trained in step 102 can be used to extract the speaker features.
To improve recognition accuracy, each user type may have multiple target voices; specifically, the feature recognizer may be used to extract a speaker feature from each of the multiple target voices of each user type, and the average of the extracted speaker features is taken as the speaker feature corresponding to that user type.
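A sketch of this enrollment-by-averaging step is given below; `extract_ivector` stands in for the trained feature recognizer (the T matrix) and is a hypothetical placeholder, not an API defined by the patent.

```python
# Average the speaker features (i-Vectors) extracted from a user type's
# multiple target voices; the mean becomes that user type's enrolled
# speaker feature. `extract_ivector` is a hypothetical placeholder for
# the trained feature recognizer.
import numpy as np

def enroll_user_type(target_voice_features, extract_ivector):
    ivectors = [extract_ivector(feats) for feats in target_voice_features]
    return np.mean(ivectors, axis=0)
```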
104: Use the speaker features corresponding to the different user types and the feature recognizer as a speaker type recognition model.
The feature recognizer obtained by training, together with the speaker feature corresponding to each user type extracted from the target voices by the feature recognizer, serves as the speaker type recognition model.
When speaker type recognition is performed, the feature recognizer of the speaker type recognition model can be combined with the sound features of the voice to be recognized to extract the speaker feature of the voice to be recognized; the speaker feature of the voice to be recognized is matched against the speaker features corresponding to different user types, and the user type corresponding to the speaker feature with the highest matching degree is identified as the user type of the voice to be recognized.
In this embodiment, the speaker type recognition model obtained by training achieves the purpose of identifying the user type, thereby distinguishing different user types.
In practical applications, the user's speech is recognized to determine the user type, so that relevant information corresponding to the user type can be pushed to the user in a targeted manner.
FIG. 2 is a flowchart of an embodiment of a speaker type recognition method according to an embodiment of the present invention; the method may include the following steps:
201: Acquire a voice to be recognized, and extract acoustic features of the voice to be recognized.
In practical applications, the voice to be recognized may be speech input by a user and collected by the device; the voice to be recognized is recognized in order to determine the user type of that user.
202: Extract the speaker feature of the voice to be recognized by using the feature recognizer in the speaker type recognition model together with the acoustic features.
The speaker type recognition model includes the feature recognizer and speaker features corresponding to different user types; the feature recognizer is obtained by training on the acoustic features of training speech; and the speaker features corresponding to the different user types are extracted, using the feature recognizer, from the target voices of the different user types.
For the specific training process of the speaker type recognition model, reference may be made to the embodiment corresponding to FIG. 1, and details are not repeated here.
203: Separately calculate the matching degree between the speaker feature of the voice to be recognized and the speaker features corresponding to the different user types in the speaker type recognition model.
204: Identify the user type corresponding to the speaker feature with the highest matching degree as the user type of the voice to be recognized.
That is, the user type corresponding to the speaker feature with the highest matching degree is identified as the user type of the voice to be recognized.
When the speaker feature is an i-Vector feature, calculating the matching degree between the speaker feature of the voice to be recognized and the speaker features corresponding to the different user types in the speaker type recognition model may specifically be:
separately calculating, as the matching degree, the distance between the i-Vector feature of the voice to be recognized and the i-Vector features of the different user types in the speaker type recognition model, where the smaller the distance, the higher the matching degree.
The calculated distance between the i-Vector feature of the voice to be recognized and the i-Vector features of the different user types in the speaker type recognition model may specifically be a cosine distance.
Thus, the user type corresponding to the minimum distance is identified as the user type of the voice to be recognized.
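Steps 203 and 204 can then be sketched as follows, with cosine distance as the matching measure; `enrolled` is assumed to map each user type to the averaged i-Vector produced during training.

```python
# Match the test utterance's i-Vector against each user type's enrolled
# i-Vector by cosine distance; the smallest distance (highest matching
# degree) decides the user type (steps 203-204).
import numpy as np

def identify_user_type(test_ivector, enrolled):
    def cosine_distance(a, b):
        return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return min(enrolled, key=lambda t: cosine_distance(test_ivector, enrolled[t]))
```

For example, `identify_user_type(w, {"adult_male": w_m, "child": w_c})` returns the key whose enrolled i-Vector lies closest to `w` in cosine distance.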
Through this embodiment, the determination of the user type is achieved, thereby achieving the purpose of distinguishing different user types according to speech.
FIG. 3 is a schematic structural diagram of an embodiment of a speech recognition model training device according to an embodiment of the present invention; the device may include:
a first extraction module 301, configured to acquire training speech and extract acoustic features of the training speech.
The training speech includes voices of different user types.
The different user types may include adult male, adult female, elderly, or child.
For the large amount of training speech, acoustic features are extracted first; the acoustic features may be MFCC features.
a training module 302, configured to use the acoustic features to train and obtain a feature recognizer for extracting speaker features.
Different user types correspond to different speaker features.
The speaker feature is a text-independent feature, computed from the acoustic features. Therefore, the acoustic features can be used to train a feature recognizer for extracting speaker features.
The speaker feature may be a fundamental frequency feature. The fundamental frequency of the human voice generally lies between 140 Hz (hertz) and 300 Hz; usually a female voice has a higher fundamental frequency than a male voice, and a child's is higher than an adult's, so the fundamental frequency feature can be used to distinguish different user types.
Of course, to further improve recognition accuracy, the speaker feature may be an i-Vector feature. The i-Vector feature can reflect acoustic differences between speakers, and thus can be used to distinguish different user types.
The feature recognizer for extracting speaker features can be trained using the acoustic features of the training speech. When the speaker feature is an i-Vector feature, the feature recognizer is specifically a T matrix.
As yet another embodiment, the training module may include:
a first training unit, configured to use the acoustic features to train and obtain a universal background model;
a second training unit, configured to use the universal background model to train and obtain a feature recognizer for extracting speaker features.
a second extraction module 303, configured to use the feature recognizer to extract a speaker feature from the target voice corresponding to each user type, as the speaker feature corresponding to that user type.
The target voice may be target voice collected in the application environment, used for training.
For example, when applied to a television set, the target voice of each user type may be collected with the television set's microphone.
To improve recognition accuracy, each user type may have multiple target voices; therefore, as yet another embodiment, the second extraction module is specifically configured to use the feature recognizer to extract a speaker feature from each of the multiple target voices of each user type, and to take the average of the extracted speaker features as the speaker feature corresponding to that user type.
a model generation module 304, configured to use the speaker features corresponding to the different user types and the feature recognizer as a speaker type recognition model.
The feature recognizer obtained by training, together with the speaker feature corresponding to each user type extracted from the target voices by the feature recognizer, serves as the speaker type recognition model.
When speaker type recognition is performed, the feature recognizer of the speaker type recognition model can be combined with the sound features of the voice to be recognized to extract the speaker feature of the voice to be recognized; the speaker feature of the voice to be recognized is matched against the speaker features corresponding to different user types, and the user type corresponding to the speaker feature with the highest matching degree is identified as the user type of the voice to be recognized.
In this embodiment, the speaker type recognition model obtained by training achieves the purpose of identifying the user type, thereby distinguishing different user types.
In practical applications, the user's speech is recognized to determine the user type, so that relevant information corresponding to the user type can be pushed to the user in a targeted manner.
FIG. 4 is a schematic structural diagram of an embodiment of a speaker type recognition device according to an embodiment of the present invention; the device may include:
a third extraction module 401, configured to acquire a voice to be recognized and extract acoustic features of the voice to be recognized.
In practical applications, the voice to be recognized may be speech input by a user and collected by the device; the voice to be recognized is recognized in order to determine the user type of that user.
a fourth extraction module 402, configured to extract the speaker feature of the voice to be recognized by using the feature recognizer in the speaker type recognition model together with the acoustic features.
The speaker type recognition model includes the feature recognizer and speaker features corresponding to different user types; the feature recognizer is obtained by training on the acoustic features of training speech; and the speaker features corresponding to the different user types are extracted, using the feature recognizer, from the target voices of the different user types.
For the specific training process of the speaker type recognition model, reference may be made to the above embodiments, and details are not repeated here.
a matching degree calculation module 403, configured to separately calculate the matching degree between the speaker feature of the voice to be recognized and the speaker features corresponding to the different user types in the speaker type recognition model.
an identification module 404, configured to identify the user type corresponding to the speaker feature with the highest matching degree as the user type of the voice to be recognized.
That is, the user type corresponding to the speaker feature with the highest matching degree is identified as the user type of the voice to be recognized.
When the speaker feature is an i-Vector feature, the matching degree calculation module is specifically configured to:
separately calculate, as the matching degree, the distance between the i-Vector feature of the voice to be recognized and the i-Vector features of the different user types in the speaker type recognition model, where the smaller the distance, the higher the matching degree.
The calculated distance between the i-Vector feature of the voice to be recognized and the i-Vector features of the different user types in the speaker type recognition model may specifically be a cosine distance.
Thus, the user type corresponding to the minimum distance is identified as the user type of the voice to be recognized.
Through this embodiment, the determination of the user type is achieved, thereby achieving the purpose of distinguishing different user types according to speech.
In practical applications, the speaker type recognition device shown in FIG. 4 can be configured in an intelligent electronic device such as a smart TV, a mobile phone, or a tablet computer, to perform user type recognition on the voice input by the user, so that different information can be pushed or displayed for different user types.
An embodiment of the present application further provides a non-transitory computer-readable storage medium storing computer-executable instructions; the computer-executable instructions can execute the speech recognition model training method in any of the above method embodiments, or the speaker type recognition method in any of the above method embodiments.
FIG. 5 is a schematic diagram of the hardware structure of an electronic device for performing the speech recognition model training method and/or the speaker type recognition method according to an embodiment of the present application. As shown in FIG. 5, the device includes:
one or more processors 510 and a memory 520; one processor 510 is taken as an example in FIG. 5.
The device for performing the speech recognition model training method and/or the speaker type recognition method may further include: an input means 530 and an output means 540.
The processor 510, the memory 520, the input means 530, and the output means 540 may be connected by a bus or in other ways; connection by a bus is taken as an example in FIG. 5.
The memory 520, as a non-transitory computer-readable storage medium, can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the speech recognition model training method and/or the speaker type recognition method in the embodiments of the present application (for example, the first extraction module 301, the training module 302, the second extraction module 303, and the model generation module 304 shown in FIG. 3, or the third extraction module 401, the fourth extraction module 402, the matching degree calculation module 403, and the identification module 404 shown in FIG. 4). The processor 510 executes various functional applications and data processing of the electronic device by running the non-volatile software programs, instructions, and modules stored in the memory 520, that is, it implements the speech recognition model training method and/or the speaker type recognition method of the above method embodiments.
The memory 520 may include a program storage area and a data storage area, wherein the program storage area may store an operating system and applications required by at least one function, and the data storage area may store data created through use of the speech recognition model training device (as in FIG. 3) and/or the speaker type recognition device (as in FIG. 4). Furthermore, the memory 520 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the memory 520 may optionally include memory located remotely from the processor 510; such remote memory may be connected to the speech recognition model training device and/or speaker type recognition device via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input means 530 can receive input numeric or character information and generate key signal inputs related to user settings and function control of the speech recognition model training device and/or speaker type recognition device. The output means 540 may include a display device such as a display screen.
The one or more modules are stored in the memory 520 and, when executed by the one or more processors 510, perform the speech recognition model training method and/or speaker type recognition method in any of the above method embodiments.
The above product can execute the methods provided by the embodiments of the present application, and has the functional modules and beneficial effects corresponding to executing those methods. For technical details not described in detail in this embodiment, reference may be made to the methods provided by the embodiments of the present application.
The electronic device of the embodiments of the present invention exists in various forms, including but not limited to:
(1) Mobile communication devices: these devices are characterized by mobile communication functions, with voice and data communication as their main goal. Such terminals include smart phones (such as the iPhone), multimedia phones, feature phones, and low-end phones.
(2) Ultra-mobile personal computer devices: these devices belong to the category of personal computers, have computing and processing functions, and generally also have mobile Internet access. Such terminals include PDA, MID, and UMPC devices, such as the iPad.
(3) Portable entertainment devices: these devices can display and play multimedia content. Such devices include audio and video players (such as the iPod), handheld game consoles, e-book readers, smart toys, and portable car navigation devices.
(4) Servers: devices that provide computing services. A server consists of a processor, hard disk, memory, system bus, and so on; a server is similar in architecture to a general-purpose computer, but because highly reliable services must be provided, the requirements on processing capability, stability, reliability, security, scalability, and manageability are high.
(5) Other electronic devices with data interaction functions.
It should be noted that those of ordinary skill in the art can understand that all or part of the flows of the above method embodiments can be implemented by a computer program instructing the relevant hardware; the program may be stored in a non-transitory computer-readable storage medium, and when executed, the program may include the flows of the embodiments of the above methods. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
The device embodiments described above are merely illustrative; units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, i.e., they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solutions of the embodiments. They can be understood and implemented by those of ordinary skill in the art without creative effort.
From the above description of the implementations, those skilled in the art can clearly understand that the implementations can be realized by software plus a necessary general-purpose hardware platform, or of course by hardware. Based on this understanding, the above technical solutions, in essence or in the part contributing to the prior art, can be embodied in the form of a software product; the computer software product can be stored in a computer-readable storage medium, such as a ROM/RAM, magnetic disk, or optical disc, and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments or in certain parts of the embodiments.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they can still modify the technical solutions described in the foregoing embodiments, or make equivalent substitutions for some of the technical features therein, and such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (17)

  1. A speech recognition model training method, applied to an electronic device, comprising:
    acquiring training speech and extracting acoustic features of the training speech, the training speech including voices of different user types;
    using the acoustic features, training to obtain a feature recognizer for extracting speaker features, wherein different user types correspond to different speaker features;
    using the feature recognizer to extract a speaker feature from the target voice corresponding to each user type, as the speaker feature corresponding to that user type;
    using the speaker features corresponding to the different user types and the feature recognizer as a speaker type recognition model, the speaker type recognition model being configured to extract the speaker feature of a voice to be recognized by combining the feature recognizer with sound features of the voice to be recognized, match the speaker feature of the voice to be recognized against the speaker features corresponding to the different user types, and identify the user type corresponding to the speaker feature with the highest matching degree as the user type of the voice to be recognized.
  2. The method according to claim 1, wherein using the acoustic features to train and obtain a feature recognizer for extracting speaker features comprises:
    using the acoustic features to train a T matrix for computing i-Vector features, the T matrix being the feature recognizer and the i-Vector feature being the speaker feature.
  3. The method according to claim 1, wherein using the feature recognizer to extract a speaker feature from the target voice corresponding to each user type, as the speaker feature corresponding to that user type, comprises:
    using the feature recognizer to extract a speaker feature from each of multiple target voices of each user type, and taking the average of the extracted speaker features as the speaker feature corresponding to that user type.
  4. The method according to claim 1, wherein using the acoustic features to train and obtain a feature recognizer for extracting speaker features comprises:
    using the acoustic features to train and obtain a universal background model;
    using the universal background model to train and obtain the feature recognizer for extracting speaker features.
  5. A speaker type recognition method, applied to an electronic device, comprising:
    acquiring a voice to be recognized, and extracting acoustic features of the voice to be recognized;
    extracting the speaker feature of the voice to be recognized by using the feature recognizer in a speaker type recognition model together with the acoustic features, wherein the speaker type recognition model includes the feature recognizer and speaker features corresponding to different user types, the feature recognizer is obtained by training on the acoustic features of training speech, and the speaker features corresponding to the different user types are extracted, using the feature recognizer, from the target voices of the different user types;
    separately calculating the matching degree between the speaker feature of the voice to be recognized and the speaker features corresponding to the different user types in the speaker type recognition model;
    identifying the user type corresponding to the speaker feature with the highest matching degree as the user type of the voice to be recognized.
  6. The method according to claim 5, wherein the speaker feature is an i-Vector feature;
    separately calculating the matching degree between the speaker feature of the voice to be recognized and the speaker features corresponding to the different user types in the speaker type recognition model comprises:
    separately calculating, as the matching degree, the distance between the i-Vector feature of the voice to be recognized and the i-Vector features corresponding to the different user types in the speaker type recognition model, wherein the smaller the distance, the higher the matching degree.
  7. A speech recognition model training device, comprising:
    a first extraction module configured to acquire training speech and extract acoustic features of the training speech, the training speech including voices of different user types;
    a training module configured to use the acoustic features to train and obtain a feature recognizer for extracting speaker features, wherein different user types correspond to different speaker features;
    a second extraction module configured to use the feature recognizer to extract a speaker feature from the target voice corresponding to each user type, as the speaker feature corresponding to that user type;
    a model generation module configured to use the speaker features corresponding to the different user types and the feature recognizer as a speaker type recognition model, the speaker type recognition model being configured to extract the speaker feature of a voice to be recognized by combining the feature recognizer with sound features of the voice to be recognized, match the speaker feature of the voice to be recognized against the speaker features corresponding to the different user types, and identify the user type corresponding to the speaker feature with the highest matching degree as the user type of the voice to be recognized.
  8. The device according to claim 7, wherein the training module is specifically configured to:
    use the acoustic features to train a T matrix for computing i-Vector features, the T matrix being the feature recognizer and the i-Vector feature being the speaker feature.
  9. The device according to claim 7, wherein the second extraction module is specifically configured to:
    use the feature recognizer to extract a speaker feature from each of multiple target voices of each user type, and take the average of the extracted speaker features as the speaker feature corresponding to that user type.
  10. The device according to claim 7, wherein the training module comprises:
    a first training unit configured to use the acoustic features to train and obtain a universal background model;
    a second training unit configured to use the universal background model to train and obtain the feature recognizer for extracting speaker features.
  11. A speaker type recognition device, comprising:
    a third extraction module configured to acquire a voice to be recognized and extract acoustic features of the voice to be recognized;
    a fourth extraction module configured to extract the speaker feature of the voice to be recognized by using the feature recognizer in a speaker type recognition model together with the acoustic features, wherein the speaker type recognition model includes the feature recognizer and speaker features corresponding to different user types, the feature recognizer is obtained by training on the acoustic features of training speech, and the speaker features corresponding to the different user types are extracted, using the feature recognizer, from the target voices of the different user types;
    a matching degree calculation module configured to separately calculate the matching degree between the speaker feature of the voice to be recognized and the speaker features corresponding to the different user types in the speaker type recognition model;
    an identification module configured to identify the user type corresponding to the speaker feature with the highest matching degree as the user type of the voice to be recognized.
  12. The device according to claim 11, wherein the speaker feature is an i-Vector feature;
    the matching degree calculation module is specifically configured to:
    separately calculate, as the matching degree, the distance between the i-Vector feature of the voice to be recognized and the i-Vector features of the different user types in the speaker type recognition model, wherein the smaller the distance, the higher the matching degree.
  13. A non-transitory computer-readable storage medium storing computer-executable instructions, the computer-executable instructions being configured to:
    acquire training speech and extract acoustic features of the training speech, the training speech including voices of different user types;
    use the acoustic features to train and obtain a feature recognizer for extracting speaker features, wherein different user types correspond to different speaker features;
    use the feature recognizer to extract a speaker feature from the target voice corresponding to each user type, as the speaker feature corresponding to that user type;
    use the speaker features corresponding to the different user types and the feature recognizer as a speaker type recognition model, the speaker type recognition model being configured to extract the speaker feature of a voice to be recognized by combining the feature recognizer with sound features of the voice to be recognized, match the speaker feature of the voice to be recognized against the speaker features corresponding to the different user types, and identify the user type corresponding to the speaker feature with the highest matching degree as the user type of the voice to be recognized.
  14. A non-transitory computer-readable storage medium storing computer-executable instructions, the computer-executable instructions being configured to:
    acquire a voice to be recognized, and extract acoustic features of the voice to be recognized;
    extract the speaker feature of the voice to be recognized by using the feature recognizer in a speaker type recognition model together with the acoustic features, wherein the speaker type recognition model includes the feature recognizer and speaker features corresponding to different user types, the feature recognizer is obtained by training on the acoustic features of training speech, and the speaker features corresponding to the different user types are extracted, using the feature recognizer, from the target voices of the different user types;
    separately calculate the matching degree between the speaker feature of the voice to be recognized and the speaker features corresponding to the different user types in the speaker type recognition model;
    identify the user type corresponding to the speaker feature with the highest matching degree as the user type of the voice to be recognized.
  15. An electronic device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to:
    acquire training speech and extract acoustic features of the training speech, the training speech including voices of different user types;
    use the acoustic features to train and obtain a feature recognizer for extracting speaker features, wherein different user types correspond to different speaker features;
    use the feature recognizer to extract a speaker feature from the target voice corresponding to each user type, as the speaker feature corresponding to that user type;
    use the speaker features corresponding to the different user types and the feature recognizer as a speaker type recognition model, the speaker type recognition model being configured to extract the speaker feature of a voice to be recognized by combining the feature recognizer with sound features of the voice to be recognized, match the speaker feature of the voice to be recognized against the speaker features corresponding to the different user types, and identify the user type corresponding to the speaker feature with the highest matching degree as the user type of the voice to be recognized.
  16. An electronic device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to:
    acquire a voice to be recognized, and extract acoustic features of the voice to be recognized;
    extract the speaker feature of the voice to be recognized by using the feature recognizer in a speaker type recognition model together with the acoustic features, wherein the speaker type recognition model includes the feature recognizer and speaker features corresponding to different user types, the feature recognizer is obtained by training on the acoustic features of training speech, and the speaker features corresponding to the different user types are extracted, using the feature recognizer, from the target voices of the different user types;
    separately calculate the matching degree between the speaker feature of the voice to be recognized and the speaker features corresponding to the different user types in the speaker type recognition model;
    identify the user type corresponding to the speaker feature with the highest matching degree as the user type of the voice to be recognized.
  17. A computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the method according to any one of claims 1 to 6.
PCT/CN2016/096986 2016-03-30 2016-08-26 Speech recognition model training method, speaker type recognition method and apparatus WO2017166651A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610195561.0 2016-03-30
CN201610195561.0A CN105895080A (zh) 2016-03-30 Speech recognition model training method, speaker type recognition method and apparatus

Publications (1)

Publication Number Publication Date
WO2017166651A1 true WO2017166651A1 (zh) 2017-10-05

Family

ID=57014248

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/096986 WO2017166651A1 (zh) 2016-03-30 2016-08-26 Speech recognition model training method, speaker type recognition method and apparatus

Country Status (2)

Country Link
CN (1) CN105895080A (zh)
WO (1) WO2017166651A1 (zh)


Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105895080A (zh) 2016-03-30 2016-08-24 乐视控股(北京)有限公司 Speech recognition model training method, speaker type recognition method and apparatus
CN107610706A (zh) 2017-09-13 2018-01-19 百度在线网络技术(北京)有限公司 Method and device for processing voice search results
CN110288979B (zh) * 2018-10-25 2022-07-05 腾讯科技(深圳)有限公司 Speech recognition method and apparatus
CN110797034A (zh) 2019-09-23 2020-02-14 重庆特斯联智慧科技股份有限公司 Automatic voice and video recognition intercom system for care of the elderly and patients
CN112712792A (zh) 2019-10-25 2021-04-27 Tcl集团股份有限公司 Training method for a dialect recognition model, readable storage medium, and terminal device
CN111462759B (zh) 2020-04-01 2024-02-13 科大讯飞股份有限公司 Speaker labeling method, apparatus, device, and storage medium
CN111739517B (zh) 2020-07-01 2024-01-30 腾讯科技(深圳)有限公司 Speech recognition method and apparatus, computer device, and medium


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104123933A (zh) * 2014-08-01 2014-10-29 中国科学院自动化研究所 Voice conversion method based on adaptive non-parallel training
CN105185372B (zh) * 2015-10-20 2017-03-22 百度在线网络技术(北京)有限公司 Training method for personalized multiple acoustic models, speech synthesis method, and apparatus

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8831942B1 (en) * 2010-03-19 2014-09-09 Narus, Inc. System and method for pitch based gender identification with suspicious speaker detection
CN102737633A (zh) * 2012-06-21 2012-10-17 北京华信恒达软件技术有限公司 Speaker recognition method based on tensor subspace analysis and apparatus therefor
US20140214420A1 (en) * 2013-01-25 2014-07-31 Microsoft Corporation Feature space transformation for personalization using generalized i-vector clustering
CN103310788A (zh) * 2013-05-23 2013-09-18 北京云知声信息技术有限公司 Voice information recognition method and system
CN103413551A (zh) * 2013-07-16 2013-11-27 清华大学 Speaker recognition method based on sparse dimensionality reduction
CN103824557A (zh) * 2014-02-19 2014-05-28 清华大学 Audio detection and classification method with customizable functions
CN105139857A (zh) * 2015-09-02 2015-12-09 广东顺德中山大学卡内基梅隆大学国际联合研究院 Countermeasure method against voice spoofing in automatic speaker recognition
CN105895080A (zh) * 2016-03-30 2016-08-24 乐视控股(北京)有限公司 Speech recognition model training method, speaker type recognition method and apparatus

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112825256A (zh) * 2019-11-20 2021-05-21 百度在线网络技术(北京)有限公司 Method, apparatus, device, and computer storage medium for guiding a voice package recording function
CN111243607A (zh) * 2020-03-26 2020-06-05 北京字节跳动网络技术有限公司 Method, apparatus, electronic device, and medium for generating speaker information
CN113370923A (zh) * 2021-07-23 2021-09-10 深圳市元征科技股份有限公司 Vehicle configuration adjustment method and apparatus, electronic device, and storage medium
CN113370923B (zh) * 2021-07-23 2023-11-03 深圳市元征科技股份有限公司 Vehicle configuration adjustment method and apparatus, electronic device, and storage medium

Also Published As

Publication number Publication date
CN105895080A (zh) 2016-08-24

Similar Documents

Publication Publication Date Title
WO2017166651A1 (zh) Speech recognition model training method, speaker type recognition method and apparatus
US11580993B2 (en) Keyword determinations from conversational data
US11568876B2 (en) Method and device for user registration, and electronic device
US10586541B2 (en) Communicating metadata that identifies a current speaker
CN109643549B (zh) Speech recognition method and apparatus based on speaker recognition
US20210225380A1 (en) Voiceprint recognition method and apparatus
US10270736B2 (en) Account adding method, terminal, server, and computer storage medium
WO2017181611A1 (zh) Method for searching for a video in a specific video library and video terminal thereof
CN103165131A (zh) Speech processing system and speech processing method
CN112653902B (zh) Speaker identification method and apparatus, and electronic device
CN108920128B (zh) Presentation operation method and system
WO2017181609A1 (zh) Interface jump management method and apparatus
JP2015517709A (ja) System for adaptively delivering context-based media
US11721328B2 (en) Method and apparatus for awakening skills by speech
KR101995443B1 (ko) Speaker verification method and speech recognition system
WO2018043137A1 (ja) Information processing device and information processing method
CN109190116B (zh) Semantic parsing method, system, electronic device, and storage medium
WO2019150708A1 (ja) Information processing device, information processing system, information processing method, and program
WO2020154883A1 (zh) Voice information processing method and apparatus, storage medium, and electronic device
CN112905748A (zh) Speech delivery effect evaluation system
JP6571587B2 (ja) Voice input device, method therefor, and program
CN110516043A (zh) Answer generation method and apparatus for a question answering system
KR20190077296A (ko) Speaker verification method and speech recognition system
US10847158B2 (en) Multi-modality presentation and execution engine
CN108096841B (zh) Voice interaction method and apparatus, electronic device, and readable storage medium

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16896421

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 16896421

Country of ref document: EP

Kind code of ref document: A1