WO2021169170A1 - Big-data-based speech generation method, apparatus, device, and medium - Google Patents

Big-data-based speech generation method, apparatus, device, and medium

Info

Publication number
WO2021169170A1
WO2021169170A1 PCT/CN2020/105040 CN2020105040W WO2021169170A1 WO 2021169170 A1 WO2021169170 A1 WO 2021169170A1 CN 2020105040 W CN2020105040 W CN 2020105040W WO 2021169170 A1 WO2021169170 A1 WO 2021169170A1
Authority
WO
WIPO (PCT)
Prior art keywords
speaker
robot
audio
information
emotional
Prior art date
Application number
PCT/CN2020/105040
Other languages
English (en)
French (fr)
Inventor
曹绪文
Original Assignee
深圳壹账通智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳壹账通智能科技有限公司
Publication of WO2021169170A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G10L 13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/033: Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L 13/04: Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L 13/047: Architecture of speech synthesisers
    • G10L 17/00: Speaker identification or verification techniques
    • G10L 17/22: Interactive procedures; Man-machine interfaces
    • G10L 17/26: Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L 25/51: Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L 25/63: Speech or voice analysis techniques specially adapted for estimating an emotional state
    • G10L 2015/223: Execution procedure of a spoken command

Definitions

  • This application relates to the field of information technology, and in particular to a method, device, equipment and medium for generating speech based on big data.
  • Human-machine dialogue is a mode of computer operation in which the computer operator or user and the computer interact conversationally through a console or a terminal display screen.
  • In current human-machine dialogue scenarios, the robot's dialogue voice carries no emotion, and its phrasing is stiff and monotonous.
  • For the user, the dialogue experience is therefore poor, emotion-oriented services cannot be provided, and the technology cannot be applied to scenarios such as psychological counseling or emotional support, so its application scenarios are limited.
  • For this reason, the embodiments of the present application provide a method, apparatus, device, and medium for generating speech based on big data, so as to solve the problem that dialogue speech in existing human-machine dialogue carries no emotion and is stiff and monotonous.
  • A speech generation method based on big data includes: acquiring a speaker audio signal; performing audio analysis on the speaker audio signal to obtain the speaker's audio factors; obtaining the speaker's emotion tag according to the speaker's audio factors; obtaining the robot's emotion tag corresponding to the speaker's emotion tag; obtaining the robot's audio factors according to the robot's emotion tag; and generating the robot audio signal to be output according to the robot's audio factors.
  • Performing audio analysis on the speaker audio signal to obtain the speaker's audio factors includes: establishing, through machine learning, a mapping relationship between frequency and pitch, a mapping relationship between waveform and timbre, and speech-rate rules; acquiring frequency information of the speaker audio signal and querying the frequency-pitch mapping relationship to obtain the speaker's pitch information; acquiring waveform information of the speaker audio signal and querying the waveform-timbre mapping relationship to obtain the speaker's timbre information, the timbre information including emotion information and age information; and acquiring the time interval between two pauses in the speaker audio signal and the number of words spoken, and matching the speech-rate rules according to the time interval and the word count to obtain the speaker's speech-rate information.
  • Obtaining the speaker's emotion tag according to the speaker's audio factors includes: setting a mapping relationship between the speaker's audio factors and emotion tags according to the business scenario; and querying that mapping relationship according to the speaker's audio factors to obtain the speaker's emotion tag.
  • Obtaining the robot's emotion tag corresponding to the speaker's emotion tag includes: setting a dialogue emotion mapping relationship according to the business scenario, the dialogue emotion mapping relationship including the speaker's emotion tag and the corresponding robot's emotion tag; and querying the dialogue emotion mapping relationship according to the speaker's emotion tag to obtain the robot's emotion tag.
  • Obtaining the robot's audio factors according to the robot's emotion tag includes: setting a mapping relationship between the robot's emotion tags and audio factors according to the business scenario; and querying that mapping relationship according to the robot's emotion tag to obtain the robot's audio factors.
  • Generating the robot audio signal to be output according to the robot's audio factors includes: obtaining, through named entity recognition and relation extraction, the robot's text information corresponding to the text information of the speaker audio signal from a big-data dialogue table; and converting the robot's text information into the robot audio signal to be output according to the robot's audio factors.
  • a speech generating device based on big data including:
  • the audio signal acquisition module is used to acquire the speaker's audio signal
  • An audio signal analysis module configured to perform audio analysis on the speaker's audio signal to obtain the speaker's audio factors
  • the first label obtaining module is used to obtain the emotional label of the speaker according to the audio factor of the speaker;
  • the second label acquiring module is used to acquire the emotional label of the robot corresponding to the emotional label of the speaker;
  • An audio factor obtaining module configured to obtain the audio factor of the robot according to the emotion tag of the robot
  • the audio signal generating module is used to generate the robot audio signal to be output according to the audio factors of the robot.
  • the audio signal analysis module includes:
  • the establishment unit is used to establish the mapping relationship between frequency and tone, the mapping relationship between waveform and timbre, and speech rate rules through machine learning;
  • a pitch acquiring unit configured to acquire frequency information of the speaker's audio signal, query the mapping relationship between frequency and pitch according to the frequency information, and obtain the speaker's pitch information
  • a timbre acquiring unit configured to acquire waveform information of the speaker's audio signal, query the mapping relationship between the waveform and the timbre according to the waveform information, to obtain the speaker's timbre information, the timbre information including emotion information and age information;
  • the speech rate acquiring unit is used to acquire the time interval between two pauses and the number of spoken words in the speaker's audio signal, and match the speech rate rule according to the time interval and the number of spoken words to obtain the speaker's speech rate information.
  • A computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, and the processor implements the following steps when executing the computer-readable instructions: acquiring a speaker audio signal; performing audio analysis on the speaker audio signal to obtain the speaker's audio factors; obtaining the speaker's emotion tag according to the speaker's audio factors; obtaining the robot's emotion tag corresponding to the speaker's emotion tag; obtaining the robot's audio factors according to the robot's emotion tag; and generating the robot audio signal to be output according to the robot's audio factors.
  • One or more non-volatile readable storage media store computer-readable instructions; when executed by one or more processors, the computer-readable instructions cause the one or more processors to perform the following steps: acquiring a speaker audio signal; performing audio analysis on the speaker audio signal to obtain the speaker's audio factors; obtaining the speaker's emotion tag according to the speaker's audio factors; obtaining the robot's emotion tag corresponding to the speaker's emotion tag; obtaining the robot's audio factors according to the robot's emotion tag; and generating the robot audio signal to be output according to the robot's audio factors.
  • FIG. 1 is a flowchart of a method for generating speech based on big data in an embodiment of the present application
  • FIG. 2 is a flowchart of step S102 in the big data-based speech generation method in an embodiment of the present application
  • FIG. 3 is a flowchart of step S103 in the big data-based speech generation method in an embodiment of the present application
  • FIG. 4 is a flowchart of step S104 in the big data-based speech generation method in an embodiment of the present application
  • FIG. 5 is a flowchart of step S105 in the method for generating speech based on big data in an embodiment of the present application
  • FIG. 6 is a flowchart of step S106 in the big data-based speech generation method in an embodiment of the present application.
  • FIG. 7 is a functional block diagram of a speech generating device based on big data in an embodiment of the present application.
  • FIG. 8 is a schematic diagram of a computer device in an embodiment of the present application.
  • The big-data-based speech generation method provided in this embodiment is described in detail below.
  • The purpose of the big-data-based speech generation method provided by the embodiments of this application is to overcome the poor dialogue experience and limited application scenarios caused by emotionless, stiff, and monotonous speech in existing human-machine dialogue, and to use artificial intelligence (AI) to adjust the emotion of the robot's dialogue voice.
  • First, a large amount of audio material is used to establish the pitch mapping relationship, the timbre mapping relationship, and the speech-rate rules; the user's dialogue audio is then used to query these mappings and rules to obtain the user's emotion tag. The emotion tag determines the robot's dialogue emotion, the audio information to be output is processed according to that dialogue emotion, the robot's dialogue audio is generated, and finally the dialogue audio is output. Emotional factors are thus integrated into the robot's dialogue, so that the robot can converse with the user in an emotionally rich way, which effectively solves the problem that dialogue speech in existing human-machine dialogue carries no emotion and is stiff and monotonous.
  • In an embodiment, a speech generation method based on big data includes the following steps.
  • In step S101, a speaker audio signal is acquired.
  • Here, the embodiment of the present application performs noise-removal processing on the acquired speaker audio signal to eliminate interference, so that accurate emotion information can be obtained subsequently.
  • In step S102, audio analysis is performed on the speaker audio signal to obtain the speaker's audio factors.
  • the audio factors refer to elements that describe sound characteristics, including but not limited to pitch information, timbre information, and speech rate information.
  • The pitch refers to how high or low the speaker's voice is.
  • The timbre information refers to the quality and character of the speaker's voice.
  • The speech-rate information refers to how fast the speaker speaks.
  • In this embodiment, a pitch analysis module, a timbre analysis module, and a speech-rate analysis module are preset, and the frequency-pitch mapping relationship, the waveform-timbre mapping relationship, and the speech-rate rules are established, so that audio analysis can be performed on the speaker audio signal.
  • FIG. 2 shows a specific implementation process of step S102 provided in an embodiment of the present application.
  • As shown in FIG. 2, performing audio analysis on the speaker audio signal in step S102 to obtain the speaker's audio factors includes the following steps.
  • In step S201, the frequency-pitch mapping relationship, the waveform-timbre mapping relationship, and the speech-rate rules are established through machine learning.
  • the pitch information is related to the frequency of the audio.
  • In this embodiment, big-data audio material annotated with pitch labels is input into the pitch analysis module, and machine learning is performed to identify the relationship between the frequency of a sound and its pitch and to establish the frequency-pitch mapping relationship.
  • The pitch labels include, but are not limited to, high-pitch audio, mid-pitch audio, and low-pitch audio.
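To make the frequency-to-pitch step concrete, the sketch below estimates a rough fundamental frequency by autocorrelation and maps it to a pitch label. The frequency bands and the estimation method are illustrative assumptions, not values or modules from the application; the application instead learns the frequency-pitch mapping from annotated audio material.

```python
import numpy as np

# Illustrative frequency bands (Hz) standing in for the learned
# frequency-to-pitch mapping; the application learns this mapping
# from annotated audio material rather than hard-coding thresholds.
PITCH_BANDS = [
    (0.0, 160.0, "low-pitch audio"),
    (160.0, 255.0, "mid-pitch audio"),
    (255.0, float("inf"), "high-pitch audio"),
]

def estimate_f0(frame: np.ndarray, sample_rate: int,
                f_min: float = 60.0, f_max: float = 400.0) -> float:
    """Rough fundamental-frequency estimate of one voiced frame via autocorrelation."""
    frame = frame - frame.mean()
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sample_rate / f_max)                  # shortest period considered
    lag_max = min(int(sample_rate / f_min), len(corr) - 1)
    lag = lag_min + int(np.argmax(corr[lag_min:lag_max]))
    return sample_rate / lag

def pitch_label(f0_hz: float) -> str:
    """Map an estimated F0 to a pitch label (stand-in for querying the mapping)."""
    for low, high, label in PITCH_BANDS:
        if low <= f0_hz < high:
            return label
    return "mid-pitch audio"

# Usage: a synthetic 220 Hz tone should land in the mid-pitch band.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 220.0 * t)
print(pitch_label(estimate_f0(tone[:2048], sr)))   # -> "mid-pitch audio"
```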
  • the timbre information is related to the waveform of the audio.
  • the timbre information includes age information and emotional information.
  • age information and emotional information are obtained by analyzing audio waveforms.
  • For the age information, big-data audio material annotated with age characteristics is input into the timbre analysis module, and machine learning is performed to identify the relationship between the sound waveform and age information and to establish the waveform-age mapping relationship.
  • An age-characteristic annotation is a role annotation with an age attribute: the age attribute indicates how old the speaker sounds, and the role indicates gender and personality. The age-characteristic annotations therefore include, but are not limited to, boy's voice, girl's voice, young woman's voice, young man's voice, middle-aged man's voice, middle-aged woman's voice, and elderly voice.
  • For the emotion information, big-data audio material annotated with emotional characteristics is input into the timbre analysis module, and machine learning is performed to identify the relationship between the sound waveform and emotion information and to establish the waveform-emotion mapping relationship.
  • The emotional-characteristic annotations are labels with emotional attributes, including but not limited to cheerfulness, happiness, excitement, sadness, surprise, and curiosity.
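As a hedged illustration of the waveform-to-timbre learning step, the sketch below trains two small classifiers, one for age labels and one for emotion labels, on placeholder feature vectors. The feature extraction, the exact label sets, and the model choice are assumptions made for illustration only; the application describes the timbre analysis module abstractly and does not specify a particular learner.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

AGE_LABELS = ["boy", "girl", "young woman", "young man",
              "middle-aged man", "middle-aged woman", "elderly"]
EMOTION_LABELS = ["cheerful", "happy", "excited", "sad", "surprised", "curious"]

rng = np.random.default_rng(0)

# Placeholder training data: in practice each row would be features derived
# from the audio waveform of an annotated clip; random vectors stand in here.
n_samples, n_features = 600, 24
features = rng.normal(size=(n_samples, n_features))
age_targets = rng.choice(AGE_LABELS, size=n_samples)
emotion_targets = rng.choice(EMOTION_LABELS, size=n_samples)

# One model per annotation type, mirroring the separate waveform-age and
# waveform-emotion mapping relationships described above.
age_model = RandomForestClassifier(n_estimators=50, random_state=0).fit(features, age_targets)
emotion_model = RandomForestClassifier(n_estimators=50, random_state=0).fit(features, emotion_targets)

def timbre_info(waveform_features: np.ndarray) -> dict:
    """Query both learned mappings to get the timbre information (age + emotion)."""
    x = waveform_features.reshape(1, -1)
    return {"age": age_model.predict(x)[0], "emotion": emotion_model.predict(x)[0]}

print(timbre_info(rng.normal(size=n_features)))
```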
  • The speech-rate information is related to how fast the speaker talks; the embodiment of the present application sets speech-rate rules based on the normal rate of human speech.
  • First, big-data audio material is input into the speech-rate analysis module and machine learning is performed to obtain the typical number of words people speak per preset unit of time and the normal distribution of that word count; based on this distribution, multiple speech-rate levels are defined around the typical word count to establish the speech-rate rules.
  • Optionally, for each piece of audio material, the speech-rate analysis module identifies two adjacent pauses (for example, two seconds without speech is regarded as a pause), obtains the time between the two pauses, counts the number of words spoken between them, and computes the number of words spoken per preset time unit from the word count and the time information; all input audio material is traversed in this way, and a distribution analysis of the per-unit word counts yields the typical word count and its distribution.
  • Each speech-rate rule is a correspondence between the number of words spoken per preset time unit and a speech-rate level. For example, suppose the typical word count obtained by the speech-rate analysis module is [200-250) words per minute and is assigned speech-rate level 5; then [0-50), [50-100), [100-150), and [150-200) words per minute are assigned levels 1, 2, 3, and 4 respectively, and [250-275), [275-300), [300-325), and [325-350) words per minute are assigned levels 6, 7, 8, and 9 respectively, giving nine speech-rate rules.
  • In step S202, the frequency information of the speaker audio signal is obtained, and the frequency-pitch mapping relationship is queried according to the frequency information to obtain the speaker's pitch information.
  • When analyzing the speaker's audio, the speaker audio signal is input into the trained pitch analysis module, which recognizes the frequency information of the signal and matches the recognized frequency information against the frequency-pitch mapping relationship to obtain the speaker's pitch information.
  • In step S203, the waveform information of the speaker audio signal is obtained, and the waveform-timbre mapping relationship is queried according to the waveform information to obtain the speaker's timbre information.
  • The timbre information includes emotion information and age information.
  • Similarly, for the speaker's timbre information, the speaker audio signal is input into the trained timbre analysis module, which recognizes the waveform information of the signal and matches it against the waveform-timbre mapping relationships, specifically against the waveform-age mapping relationship and the waveform-emotion mapping relationship, to obtain the speaker's age information and emotion information.
  • In step S204, the time interval between two pauses in the speaker audio signal and the number of words spoken are obtained, and the speech-rate rules are matched according to the time interval and the word count to obtain the speaker's speech-rate information.
  • For the speaker's speech-rate information, the speaker audio signal is input into the trained speech-rate analysis module, which identifies two adjacent pauses (for example, two seconds without speech is regarded as a pause), obtains the time between the two pauses, counts the number of words spoken between them, computes the number of words spoken per preset time unit from the word count and the time information, and then matches this per-unit word count against the speech-rate rules to obtain the speech-rate level and thus the speaker's speech-rate information.
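The following sketch shows one way the pause-and-word-count logic could look, using the example bands given earlier (level 5 at [200, 250) words per minute) and the 2-second pause threshold from the description. The input format (per-word end timestamps) and the function names are assumptions for illustration.

```python
# Speech-rate rules from the example above: [min_wpm, max_wpm) -> level.
SPEECH_RATE_RULES = [
    (0, 50, 1), (50, 100, 2), (100, 150, 3), (150, 200, 4),
    (200, 250, 5), (250, 275, 6), (275, 300, 7), (300, 325, 8), (325, 350, 9),
]

def speech_rate_level(word_end_times: list[float], pause_s: float = 2.0) -> int:
    """Estimate words per minute between the first pair of adjacent pauses
    (gaps of at least `pause_s` seconds) and match it to a speech-rate level."""
    # Indices where the gap to the previous word exceeds the pause threshold.
    pauses = [i for i in range(1, len(word_end_times))
              if word_end_times[i] - word_end_times[i - 1] >= pause_s]
    if len(pauses) < 2:                       # not enough pauses: use the whole utterance
        start, end = 0, len(word_end_times) - 1
    else:
        start, end = pauses[0], pauses[1] - 1
    duration_min = (word_end_times[end] - word_end_times[start]) / 60.0
    words = end - start + 1
    wpm = words / duration_min if duration_min > 0 else 0.0
    for low, high, level in SPEECH_RATE_RULES:
        if low <= wpm < high:
            return level
    return 9 if wpm >= 350 else 1

# Usage: the segment between two pauses runs at roughly 220 words/minute -> level 5.
seg1 = [0.3 * (i + 1) for i in range(10)]
seg2 = [seg1[-1] + 2.5 + 0.27 * (i + 1) for i in range(30)]
seg3 = [seg2[-1] + 3.0 + 0.3 * (i + 1) for i in range(5)]
print(speech_rate_level(seg1 + seg2 + seg3))   # -> 5
```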
  • In step S103, the speaker's emotion tag is obtained according to the speaker's audio factors.
  • The speaker's emotion tag refers to the speaker's overall emotional information in the current business scenario, obtained from the speaker's audio factors.
  • In the above embodiment, the audio factors include pitch information, timbre information, and speech-rate information, so the speaker's emotion tag refers to the speaker's overall emotional information obtained from the speaker's pitch, timbre, and speech-rate information.
  • FIG. 3 shows a specific implementation of step S103 provided in an embodiment of the present application. As shown in FIG. 3, obtaining the speaker's emotion tag according to the speaker's audio factors in step S103 includes the following steps.
  • In step S301, a mapping relationship between the speaker's audio factors and emotion tags is set according to the business scenario.
  • the embodiment of the present application sets the mapping relationship between the speaker's audio factor and the emotional tag according to different business scenarios, so as to define the speaker's emotional model.
  • Each business scenario corresponds to one or more mapping relationships between the speaker's audio factors and emotion tags.
  • In different business scenarios, the speaker emotion tags corresponding to the same audio factors are not exactly the same.
  • For example, in an amusement park, when the pitch information is loud-speaking level 3, the timbre information is happy young-girl voice level 4, and the speech-rate information is speech-rate level 6, the corresponding speaker emotion tag is excited young girl, level 8.
  • In an ordinary setting, with the same pitch information (loud-speaking level 3), timbre information (happy young-girl voice level 4), and speech-rate information (speech-rate level 6), the corresponding speaker emotion tag is happy young girl, level 4.
  • In step S302, the mapping relationship between the speaker's audio factors and emotion tags is queried according to the speaker's audio factors to obtain the speaker's emotion tag.
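A minimal sketch of how such a scenario-specific mapping from audio factors to a speaker emotion tag could be represented and queried, using the amusement-park and ordinary-setting example above; the dictionary layout, the string keys, and the fallback value are assumptions, not structures defined in the application.

```python
# (business scenario, pitch, timbre, speech-rate level) -> speaker emotion tag,
# populated here only with the two example rows given above.
SPEAKER_EMOTION_MAP = {
    ("amusement_park", "loud speaking level 3", "happy young-girl voice level 4", 6):
        "excited young girl, level 8",
    ("ordinary_setting", "loud speaking level 3", "happy young-girl voice level 4", 6):
        "happy young girl, level 4",
}

def speaker_emotion_tag(scenario: str, pitch: str, timbre: str, rate_level: int,
                        default: str = "neutral") -> str:
    """Query the scenario's audio-factor-to-emotion-tag mapping (step S302)."""
    return SPEAKER_EMOTION_MAP.get((scenario, pitch, timbre, rate_level), default)

print(speaker_emotion_tag("amusement_park", "loud speaking level 3",
                          "happy young-girl voice level 4", 6))
# -> "excited young girl, level 8"
```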
  • In step S104, the robot's emotion tag corresponding to the speaker's emotion tag is obtained.
  • the speaker's emotional label refers to the speaker's overall emotional information in the current business scenario based on the speaker's audio factors.
  • the emotional label of the robot refers to the overall emotional information that the robot should have when facing the speaker in the current business scenario.
  • FIG. 4 shows a specific implementation of step S104 provided in an embodiment of the present application. As shown in FIG. 4, obtaining the robot's emotion tag corresponding to the speaker's emotion tag in step S104 includes the following steps.
  • In step S401, a dialogue emotion mapping relationship is set according to the business scenario; the dialogue emotion mapping relationship includes the speaker's emotion tag and the corresponding robot's emotion tag.
  • Here, based on common sense about emotions in human dialogue, this embodiment sets the correspondence between the speaker's emotion tag and the robot's emotion tag according to the business scenario, so as to define a human-machine emotion model and select the robot's dialogue emotion based on the speaker's dialogue emotion.
  • The dialogue emotion mapping relationships for different business scenarios are different.
  • For example, in an ordinary conversation scenario, when the speaker's emotion tag is cheerful, the corresponding robot emotion tag is also cheerful, and the robot converses with a cheerful emotion; in a psychological counseling scenario, when the speaker's emotion tag is sad, the corresponding robot emotion tag is empathy, and the robot converses with a comforting emotion; in a social scenario, when the speaker's emotion tag is young-man excitement level 5, the corresponding robot emotion tag is young-woman excitement level 5, and the robot converses with the emotion of young-woman excitement level 5.
  • In step S402, the dialogue emotion mapping relationship is queried according to the speaker's emotion tag to obtain the robot's emotion tag.
  • By configuring the dialogue emotion mapping relationship, the configuration is simplified from three-to-three (the speaker's audio, speech rate, and timbre versus the robot's audio, speech rate, and timbre) to one-to-one (the speaker's emotion tag versus the robot's emotion tag), which greatly simplifies the logic of configuring the robot's overall emotion in practical applications. At the physical level, it is decoupled from the underlying speech processing technology, so that developers or business staff can understand it at a glance and can conveniently configure the human-machine emotion model in different business scenarios.
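The one-to-one configuration described above could be expressed as a small per-scenario table; the sketch below uses the ordinary-conversation, counseling, and social examples from the description, while the scenario keys and the fallback value are added assumptions.

```python
# Dialogue emotion mapping: business scenario -> {speaker emotion tag: robot emotion tag},
# built from the examples in the description.
DIALOGUE_EMOTION_MAP = {
    "ordinary_conversation": {"cheerful": "cheerful"},
    "psychological_counseling": {"sad": "empathy"},
    "social": {"young-man excitement level 5": "young-woman excitement level 5"},
}

def robot_emotion_tag(scenario: str, speaker_tag: str, default: str = "calm") -> str:
    """Query the dialogue emotion mapping for the current scenario (step S402)."""
    return DIALOGUE_EMOTION_MAP.get(scenario, {}).get(speaker_tag, default)

print(robot_emotion_tag("psychological_counseling", "sad"))  # -> "empathy"
```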
  • In step S105, the robot's audio factors are obtained according to the robot's emotion tag.
  • the emotional label of the robot refers to the overall emotional information that the robot should have when facing the speaker in the current business scenario.
  • the audio factors studied in this embodiment include, but are not limited to, pitch information, timbre information, and speech rate information. Therefore, when generating the robot audio signal to be output, the embodiment of the present application determines the pitch information, timbre information, and speech rate information of the robot audio signal to be output based on the emotion tag of the robot.
  • FIG. 5 shows a specific implementation of step S105 provided in an embodiment of the present application. As shown in FIG. 5, obtaining the robot's audio factors according to the robot's emotion tag in step S105 includes the following steps.
  • In step S501, a mapping relationship between the robot's emotion tags and audio factors is set according to the business scenario.
  • Here, the embodiment of the present application sets the mapping relationship between the robot's emotion tags and audio factors for different business scenarios, so as to define the robot's emotion model.
  • Each business scenario corresponds to one or more mapping relationships between the robot's emotion tags and audio factors.
  • In different business scenarios, the audio factors corresponding to the same robot emotion tag are not exactly the same.
  • In step S502, the mapping relationship between the robot's emotion tags and audio factors is queried according to the robot's emotion tag to obtain the robot's audio factors.
  • After the robot's emotion tag is obtained, the mapping relationship between the robot's emotion tags and audio factors for the current business scenario is retrieved, and that mapping relationship is queried with the robot's emotion tag to obtain the robot's audio factors, that is, the pitch information, timbre information, and speech-rate information the robot should have when facing the speaker in the current business scenario.
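A sketch of the robot-side emotion model: a per-scenario table from robot emotion tag to the pitch, timbre, and speech-rate level the synthesized reply should use. All concrete values here are illustrative assumptions; the description only states that such a mapping is configured per business scenario.

```python
from typing import TypedDict

class AudioFactors(TypedDict):
    pitch: str
    timbre: str
    speech_rate_level: int

# business scenario -> {robot emotion tag: audio factors}; values are illustrative.
ROBOT_AUDIO_FACTOR_MAP: dict[str, dict[str, AudioFactors]] = {
    "psychological_counseling": {
        "empathy": {"pitch": "soft, low pitch", "timbre": "warm adult female voice",
                    "speech_rate_level": 3},
    },
    "ordinary_conversation": {
        "cheerful": {"pitch": "bright, mid-high pitch", "timbre": "lively young voice",
                     "speech_rate_level": 5},
    },
}

def robot_audio_factors(scenario: str, robot_tag: str) -> AudioFactors:
    """Query the robot's emotion-tag-to-audio-factor mapping (step S502)."""
    return ROBOT_AUDIO_FACTOR_MAP[scenario][robot_tag]

print(robot_audio_factors("psychological_counseling", "empathy"))
```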
  • In step S106, the robot audio signal to be output is generated according to the robot's audio factors.
  • the robot audio signal refers to the dialogue audio of the robot responding to the speaker.
  • the embodiment of the present application directly generates a robot audio signal according to the pitch information, timbre information, and speech rate information that the robot should have when facing the speaker in the current business scenario.
  • FIG. 6 shows a specific implementation process of step S106 provided in an embodiment of the present application.
  • As shown in FIG. 6, generating the robot audio signal to be output according to the robot's audio factors in step S106 includes the following steps.
  • In step S601, the robot's text information corresponding to the text information of the speaker audio signal is obtained from a big-data dialogue table through named entity recognition and relation extraction.
  • the embodiment of the present application adopts the HMM acoustic model technology to convert the speaker audio signal into corresponding text information. Then, through named entity recognition and relationship extraction technology, the text information of the robot is obtained from the preset big data dialogue table according to the text information of the speaker. It should be understood that the text information of the robot is the text information of the robot responding to the speaker, which corresponds to the text information of the speaker, and is the content of the audio signal of the robot.
  • the big data dialogue table pre-stores the text information of the speaker in the human-machine dialogue and the corresponding text information of the robot.
  • In step S602, the robot's text information is converted into the robot audio signal to be output according to the robot's audio factors.
  • After the robot's text information is obtained, the robot's text information and the robot's audio factors are passed to a signal generator; the signal generator refers to TTS speech synthesis technology.
  • The signal generator generates the corresponding robot audio signal with reference to the robot's audio factors and the robot's text information, realizing emotional dialogue between human and machine.
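To illustrate how the final step might be wired together, the sketch below looks up the robot's reply text in a small dialogue table and then builds SSML-style prosody markup from the robot's audio factors before handing it to a TTS engine. The dialogue-table contents, the exact-match lookup (the application retrieves entries via named entity recognition and relation extraction), the factor-to-prosody conversion, and the idea of returning markup instead of audio are all assumptions for illustration.

```python
# Big-data dialogue table: speaker text -> robot reply text (toy contents).
DIALOGUE_TABLE = {
    "i failed my exam today": "I'm sorry to hear that. Do you want to talk about it?",
}

# Illustrative conversion from speech-rate level to an SSML rate percentage.
RATE_PERCENT = {3: "85%", 5: "100%", 6: "110%"}

def build_ssml(reply_text: str, factors: dict) -> str:
    """Wrap the reply text in SSML prosody derived from the robot's audio factors."""
    rate = RATE_PERCENT.get(factors.get("speech_rate_level", 5), "100%")
    pitch = "low" if "low" in factors.get("pitch", "") else "medium"
    return (f'<speak><prosody rate="{rate}" pitch="{pitch}">'
            f"{reply_text}</prosody></speak>")

def generate_robot_audio(speaker_text: str, factors: dict) -> str:
    """Sketch of steps S601 and S602: table lookup, then synthesis-ready markup."""
    reply = DIALOGUE_TABLE.get(speaker_text.lower(),
                               "I see. Could you tell me a bit more?")
    ssml = build_ssml(reply, factors)
    # A real system would now pass `ssml` (plus a voice/timbre selection) to a
    # TTS engine to produce the robot audio signal; the markup is returned here.
    return ssml

print(generate_robot_audio("I failed my exam today",
                           {"pitch": "soft, low pitch", "speech_rate_level": 3}))
```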
  • In summary, the embodiment of the present application uses a large amount of audio material to establish the pitch mapping relationship, the timbre mapping relationship, and the speech-rate mapping relationship; the user's dialogue audio is then used to query these mappings to obtain the user's emotion tag; the robot's dialogue emotion is determined from the emotion tag, the audio information to be output is processed according to that dialogue emotion to generate the robot's dialogue audio, and finally the dialogue audio is output. Emotional factors are thus integrated into the robot's dialogue, enabling emotionally rich conversation with the user and effectively solving the problem that dialogue speech in existing human-machine dialogue carries no emotion and is stiff and monotonous.
  • a speech generation device based on big data is provided, and the speech generation device based on big data corresponds to the speech generation method based on big data in the above-mentioned embodiment in a one-to-one correspondence.
  • As shown in FIG. 7, the big-data-based speech generation apparatus includes an audio signal acquisition module 71, an audio signal analysis module 72, a first tag acquisition module 73, a second tag acquisition module 74, an audio factor acquisition module 75, and an audio signal generation module 76.
  • Each functional module is described in detail as follows:
  • the audio signal acquisition module 71 is used to acquire the speaker's audio signal
  • the audio signal analysis module 72 is configured to perform audio analysis on the speaker's audio signal to obtain the speaker's audio factors
  • the first label obtaining module 73 is configured to obtain the emotional label of the speaker according to the audio factor of the speaker;
  • the second label obtaining module 74 is configured to obtain the emotional label of the robot corresponding to the emotional label of the speaker;
  • the audio factor obtaining module 75 is configured to obtain the audio factor of the robot according to the emotion tag of the robot;
  • the audio signal generating module 76 is configured to generate a robot audio signal to be output according to the audio factors of the robot.
  • the audio signal analysis module 72 includes:
  • the establishment unit is used to establish the mapping relationship between frequency and tone, the mapping relationship between waveform and timbre, and speech rate rules through machine learning;
  • a pitch acquiring unit configured to acquire frequency information of the speaker's audio signal, query the mapping relationship between frequency and pitch according to the frequency information, and obtain the speaker's pitch information
  • a timbre acquiring unit configured to acquire waveform information of the speaker's audio signal, query the mapping relationship between the waveform and the timbre according to the waveform information, to obtain the speaker's timbre information, the timbre information including emotion information and age information;
  • the speech rate acquiring unit is used to acquire the time interval between two pauses and the number of spoken words in the speaker's audio signal, and match the speech rate rule according to the time interval and the number of spoken words to obtain the speaker's speech rate information.
  • the first label obtaining module 73 includes:
  • the first mapping relationship setting unit is used to set the mapping relationship between the speaker's audio factors and emotional tags according to the business scenario
  • the first tag acquisition unit is configured to query the mapping relationship between the speaker's audio factors and emotional tags according to the speaker's audio factors to obtain the speaker's emotional tags.
  • the second label obtaining module 74 includes:
  • the second mapping relationship setting unit is configured to set the dialogue emotion mapping relationship according to the business scenario, the dialogue emotion mapping relationship including the corresponding relationship between the emotion label of the speaker and the emotion label of the robot;
  • the second tag acquisition unit is configured to query the dialogue emotion mapping relationship according to the emotion tag of the speaker to obtain the emotion tag of the robot.
  • the audio factor obtaining module 75 includes:
  • the third mapping relationship setting unit is used to set the mapping relationship between the emotion label of the robot and the audio factor according to the business scenario
  • the audio factor obtaining unit is configured to query the mapping relationship between the emotion tag of the robot and the audio factor according to the emotion tag of the robot to obtain the audio factor of the robot.
  • the audio signal generating module 76 includes:
  • a text information acquisition unit configured to acquire the text information of the robot corresponding to the text information of the speaker's audio signal from the big data dialogue table by using named entity recognition and relationship extraction technology
  • the audio signal generating unit is configured to convert the text information of the robot into the robot audio signal to be output according to the audio factors of the robot.
  • the various modules in the above-mentioned big data-based speech generating device can be implemented in whole or in part by software, hardware, and a combination thereof.
  • The above modules may be embedded in, or independent of, the processor of the computer device in hardware form, or may be stored in the memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to each module.
  • a computer device is provided.
  • the computer device may be a server, and its internal structure diagram may be as shown in FIG. 8.
  • the computer equipment includes a processor, a memory, a network interface, and a database connected through a system bus.
  • the processor of the computer device is used to provide calculation and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, computer readable instructions, and a database.
  • the internal memory provides an environment for the operation of the operating system and computer-readable instructions in the non-volatile storage medium.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer-readable instructions are executed by the processor to realize a big data-based voice generation method.
  • A computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor; the processor implements the following steps when executing the computer-readable instructions: acquiring a speaker audio signal; performing audio analysis on the speaker audio signal to obtain the speaker's audio factors; obtaining the speaker's emotion tag according to the speaker's audio factors; obtaining the robot's emotion tag corresponding to the speaker's emotion tag; obtaining the robot's audio factors according to the robot's emotion tag; and generating the robot audio signal to be output according to the robot's audio factors.
  • One or more non-volatile readable storage media storing computer-readable instructions are provided; when the computer-readable instructions are executed by one or more processors, the one or more processors perform the following steps: acquiring a speaker audio signal; performing audio analysis on the speaker audio signal to obtain the speaker's audio factors; obtaining the speaker's emotion tag according to the speaker's audio factors; obtaining the robot's emotion tag corresponding to the speaker's emotion tag; obtaining the robot's audio factors according to the robot's emotion tag; and generating the robot audio signal to be output according to the robot's audio factors.
  • Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Toys (AREA)
  • Manipulator (AREA)

Abstract

A speech generation method based on big data, including: acquiring a speaker audio signal (S101); performing audio analysis on the speaker audio signal to obtain the speaker's audio factors (S102); obtaining the speaker's emotion tag according to the speaker's audio factors (S103); obtaining the robot's emotion tag corresponding to the speaker's emotion tag (S104); obtaining the robot's audio factors according to the robot's emotion tag (S105); and generating the robot audio signal to be output according to the robot's audio factors (S106). The method integrates emotional factors into the robot's dialogue, so that the robot can hold emotionally rich conversations with the user, effectively solving the problem that dialogue speech in existing human-machine dialogue carries no emotion and is stiff and monotonous.

Description

Big-data-based speech generation method, apparatus, device, and medium
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on February 28, 2020, with application number 202010127344.4 and the invention title "Big-data-based speech generation method, apparatus, device, and medium", the entire contents of which are incorporated herein by reference.
 
Technical Field
This application relates to the field of information technology, and in particular to a big-data-based speech generation method, apparatus, device, and medium.
 
Background
Human-machine dialogue is a mode of computer operation in which the computer operator or user and the computer interact conversationally through a console or a terminal display screen. In current human-machine dialogue scenarios, the robot's dialogue voice carries no emotion and its phrasing is stiff and monotonous; for the user, the dialogue experience is poor, so emotion-oriented services cannot be provided and the technology cannot be applied to scenarios such as psychological counseling or emotional support, which limits its application scenarios.
Therefore, finding a way to solve the problem that dialogue speech in existing human-machine dialogue carries no emotion and is stiff and monotonous has become an urgent technical problem for those skilled in the art.
 
Summary
The embodiments of the present application provide a big-data-based speech generation method, apparatus, device, and medium, so as to solve the problem that dialogue speech in existing human-machine dialogue carries no emotion and is stiff and monotonous.
A speech generation method based on big data includes:
acquiring a speaker audio signal;
performing audio analysis on the speaker audio signal to obtain the speaker's audio factors;
obtaining the speaker's emotion tag according to the speaker's audio factors;
obtaining the robot's emotion tag corresponding to the speaker's emotion tag;
obtaining the robot's audio factors according to the robot's emotion tag;
generating the robot audio signal to be output according to the robot's audio factors.
Optionally, performing audio analysis on the speaker audio signal to obtain the speaker's audio factors includes:
establishing, through machine learning, a mapping relationship between frequency and pitch, a mapping relationship between waveform and timbre, and speech-rate rules;
acquiring frequency information of the speaker audio signal, and querying the frequency-pitch mapping relationship according to the frequency information to obtain the speaker's pitch information;
acquiring waveform information of the speaker audio signal, and querying the waveform-timbre mapping relationship according to the waveform information to obtain the speaker's timbre information, the timbre information including emotion information and age information;
acquiring the time interval between two pauses in the speaker audio signal and the number of words spoken, and matching the speech-rate rules according to the time interval and the word count to obtain the speaker's speech-rate information.
Optionally, obtaining the speaker's emotion tag according to the speaker's audio factors includes:
setting a mapping relationship between the speaker's audio factors and emotion tags according to the business scenario;
querying the mapping relationship between the speaker's audio factors and emotion tags according to the speaker's audio factors to obtain the speaker's emotion tag.
Optionally, obtaining the robot's emotion tag corresponding to the speaker's emotion tag includes:
setting a dialogue emotion mapping relationship according to the business scenario, the dialogue emotion mapping relationship including the speaker's emotion tag and the corresponding robot's emotion tag;
querying the dialogue emotion mapping relationship according to the speaker's emotion tag to obtain the robot's emotion tag.
Optionally, obtaining the robot's audio factors according to the robot's emotion tag includes:
setting a mapping relationship between the robot's emotion tags and audio factors according to the business scenario;
querying the mapping relationship between the robot's emotion tags and audio factors according to the robot's emotion tag to obtain the robot's audio factors.
Optionally, generating the robot audio signal to be output according to the robot's audio factors includes:
obtaining, through named entity recognition and relation extraction, the robot's text information corresponding to the text information of the speaker audio signal from a big-data dialogue table;
converting the robot's text information into the robot audio signal to be output according to the robot's audio factors.
A speech generation apparatus based on big data includes:
an audio signal acquisition module, configured to acquire a speaker audio signal;
an audio signal analysis module, configured to perform audio analysis on the speaker audio signal to obtain the speaker's audio factors;
a first tag acquisition module, configured to obtain the speaker's emotion tag according to the speaker's audio factors;
a second tag acquisition module, configured to obtain the robot's emotion tag corresponding to the speaker's emotion tag;
an audio factor acquisition module, configured to obtain the robot's audio factors according to the robot's emotion tag;
an audio signal generation module, configured to generate the robot audio signal to be output according to the robot's audio factors.
Optionally, the audio signal analysis module includes:
an establishing unit, configured to establish, through machine learning, the frequency-pitch mapping relationship, the waveform-timbre mapping relationship, and the speech-rate rules;
a pitch acquisition unit, configured to acquire frequency information of the speaker audio signal, and query the frequency-pitch mapping relationship according to the frequency information to obtain the speaker's pitch information;
a timbre acquisition unit, configured to acquire waveform information of the speaker audio signal, and query the waveform-timbre mapping relationship according to the waveform information to obtain the speaker's timbre information, the timbre information including emotion information and age information;
a speech-rate acquisition unit, configured to acquire the time interval between two pauses in the speaker audio signal and the number of words spoken, and match the speech-rate rules according to the time interval and the word count to obtain the speaker's speech-rate information.
A computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, and the processor implements the following steps when executing the computer-readable instructions:
acquiring a speaker audio signal;
performing audio analysis on the speaker audio signal to obtain the speaker's audio factors;
obtaining the speaker's emotion tag according to the speaker's audio factors;
obtaining the robot's emotion tag corresponding to the speaker's emotion tag;
obtaining the robot's audio factors according to the robot's emotion tag;
generating the robot audio signal to be output according to the robot's audio factors.
One or more non-volatile readable storage media store computer-readable instructions; when the computer-readable instructions are executed by one or more processors, the one or more processors perform the following steps:
acquiring a speaker audio signal;
performing audio analysis on the speaker audio signal to obtain the speaker's audio factors;
obtaining the speaker's emotion tag according to the speaker's audio factors;
obtaining the robot's emotion tag corresponding to the speaker's emotion tag;
obtaining the robot's audio factors according to the robot's emotion tag;
generating the robot audio signal to be output according to the robot's audio factors.
Details of one or more embodiments of the present application are set forth in the accompanying drawings and the description below; other features and advantages of the present application will become apparent from the description, the drawings, and the claims.
 
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present application more clearly, the accompanying drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application, and those of ordinary skill in the art may obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a big-data-based speech generation method in an embodiment of the present application;
FIG. 2 is a flowchart of step S102 of the big-data-based speech generation method in an embodiment of the present application;
FIG. 3 is a flowchart of step S103 of the big-data-based speech generation method in an embodiment of the present application;
FIG. 4 is a flowchart of step S104 of the big-data-based speech generation method in an embodiment of the present application;
FIG. 5 is a flowchart of step S105 of the big-data-based speech generation method in an embodiment of the present application;
FIG. 6 is a flowchart of step S106 of the big-data-based speech generation method in an embodiment of the present application;
FIG. 7 is a functional block diagram of a big-data-based speech generation apparatus in an embodiment of the present application;
FIG. 8 is a schematic diagram of a computer device in an embodiment of the present application.
 
Detailed Description of the Embodiments
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the scope of protection of the present application.
The big-data-based speech generation method provided in this embodiment is described in detail below. The purpose of the big-data-based speech generation method provided by the embodiments of the present application is to overcome the poor dialogue experience and limited application scenarios caused by emotionless, stiff, and monotonous dialogue speech in existing human-machine dialogue, and to use artificial intelligence (AI) to adjust the emotion of the robot's dialogue voice. First, a large amount of audio material is used to establish the pitch mapping relationship, the timbre mapping relationship, and the speech-rate rules; then the user's dialogue audio is used to query the pitch mapping relationship, the timbre mapping relationship, and the speech-rate rules to obtain the user's emotion tag; the robot's dialogue emotion is determined based on the emotion tag, the audio information to be output is processed according to the robot's dialogue emotion to generate the robot's dialogue audio, and finally the dialogue audio is output. Emotional factors are thus integrated into the robot's dialogue, so that the robot can converse with the user in an emotionally rich way, effectively solving the problem that dialogue speech in existing human-machine dialogue carries no emotion and is stiff and monotonous.
In an embodiment, as shown in FIG. 1, a speech generation method based on big data includes the following steps.
In step S101, a speaker audio signal is acquired.
Here, the embodiment of the present application performs noise-removal processing on the acquired speaker audio signal to eliminate interference, so that accurate emotion information can be obtained subsequently.
In step S102, audio analysis is performed on the speaker audio signal to obtain the speaker's audio factors.
Here, the audio factors are elements that describe the characteristics of a sound, including but not limited to pitch information, timbre information, and speech-rate information. The pitch refers to how high or low the speaker's voice is, the timbre information refers to the quality and character of the speaker's voice, and the speech-rate information refers to how fast the speaker speaks. The embodiment of the present application presets a pitch analysis module, a timbre analysis module, and a speech-rate analysis module, and establishes the frequency-pitch mapping relationship, the waveform-timbre mapping relationship, and the speech-rate rules, so that audio analysis can be performed on the speaker audio signal.
Optionally, FIG. 2 shows a specific implementation of step S102 provided in an embodiment of the present application. As shown in FIG. 2, performing audio analysis on the speaker audio signal in step S102 to obtain the speaker's audio factors includes the following steps.
In step S201, the frequency-pitch mapping relationship, the waveform-timbre mapping relationship, and the speech-rate rules are established through machine learning.
Here, the pitch information is related to the frequency of the audio. In this embodiment, big-data audio material annotated with pitch labels is input into the pitch analysis module, and machine learning is performed to identify the relationship between the frequency of a sound and its pitch and to establish the frequency-pitch mapping relationship. The pitch labels include, but are not limited to, high-pitch audio, mid-pitch audio, and low-pitch audio.
The timbre information is related to the waveform of the audio. In the embodiment of the present application, the timbre information includes age information and emotion information, which are obtained separately by analyzing the audio waveform. For the age information, this embodiment inputs big-data audio material annotated with age characteristics into the timbre analysis module and performs machine learning to identify the relationship between the sound waveform and age information and to establish the waveform-age mapping relationship. An age-characteristic annotation is a role annotation with an age attribute: the age attribute indicates how old the speaker sounds, and the role indicates gender and personality; the age-characteristic annotations therefore include, but are not limited to, boy's voice, girl's voice, young woman's voice, young man's voice, middle-aged man's voice, middle-aged woman's voice, and elderly voice. For the emotion information, this embodiment inputs big-data audio material annotated with emotional characteristics into the timbre analysis module and performs machine learning to identify the relationship between the sound waveform and emotion information and to establish the waveform-emotion mapping relationship. The emotional-characteristic annotations are labels with emotional attributes, including but not limited to cheerfulness, happiness, excitement, sadness, surprise, and curiosity.
The speech-rate information is related to how fast the speaker talks, and the embodiment of the present application sets the speech-rate rules based on the normal rate of human speech. First, big-data audio material is input into the speech-rate analysis module and machine learning is performed to obtain the typical number of words people speak per preset unit of time and the normal distribution of that word count; based on the normal distribution, multiple speech-rate levels are defined around the typical word count to establish the speech-rate rules. Optionally, for each piece of audio material, the speech-rate analysis module identifies two adjacent pauses (for example, two seconds without speech is regarded as a pause), obtains the time between the two pauses, counts the number of words spoken between them, and computes the number of words spoken per preset time unit from the word count and the time information; all input big-data audio material is traversed in this way to obtain multiple per-unit word counts, and a distribution analysis of these counts yields the typical word count per preset time unit and its normal distribution. Each speech-rate rule is a correspondence between the number of words spoken per preset time unit and a speech-rate level. For example, suppose the typical word count obtained by the speech-rate analysis module is [200-250) words per minute and is assigned speech-rate level 5; then [0-50), [50-100), [100-150), and [150-200) words per minute are assigned speech-rate levels 1, 2, 3, and 4 respectively, and [250-275), [275-300), [300-325), and [325-350) words per minute are assigned speech-rate levels 6, 7, 8, and 9 respectively, giving nine speech-rate rules.
In step S202, the frequency information of the speaker audio signal is obtained, and the frequency-pitch mapping relationship is queried according to the frequency information to obtain the speaker's pitch information.
When analyzing the speaker's audio, the speaker audio signal is input into the trained pitch analysis module, which recognizes the frequency information of the speaker audio signal and matches the recognized frequency information against the frequency-pitch mapping relationship to obtain the speaker's pitch information.
In step S203, the waveform information of the speaker audio signal is obtained, and the waveform-timbre mapping relationship is queried according to the waveform information to obtain the speaker's timbre information, the timbre information including emotion information and age information.
Similarly, for the speaker's timbre information, the speaker audio signal is input into the trained timbre analysis module, which recognizes the waveform information of the speaker audio signal and matches it against the waveform-timbre mapping relationships, specifically against the waveform-age mapping relationship and the waveform-emotion mapping relationship, to obtain the speaker's age information and emotion information.
In step S204, the time interval between two pauses in the speaker audio signal and the number of words spoken are obtained, and the speech-rate rules are matched according to the time interval and the word count to obtain the speaker's speech-rate information.
For the speaker's speech-rate information, the speaker audio signal is input into the trained speech-rate analysis module, which identifies two adjacent pauses (for example, two seconds without speech is regarded as a pause), obtains the time between the two pauses, counts the number of words spoken between them, computes the number of words spoken per preset time unit from the word count and the time information, and then matches this per-unit word count against the speech-rate rules to obtain the speech-rate level and thus the speaker's speech-rate information.
In step S103, the speaker's emotion tag is obtained according to the speaker's audio factors.
The speaker's emotion tag refers to the speaker's overall emotional information in the current business scenario, obtained from the speaker's audio factors. In the above embodiment, the audio factors include pitch information, timbre information, and speech-rate information, so the speaker's emotion tag refers to the speaker's overall emotional information obtained from the speaker's pitch information, timbre information, and speech-rate information. Optionally, FIG. 3 shows a specific implementation of step S103 provided in an embodiment of the present application. As shown in FIG. 3, obtaining the speaker's emotion tag according to the speaker's audio factors in step S103 includes the following steps.
In step S301, a mapping relationship between the speaker's audio factors and emotion tags is set according to the business scenario.
Here, the embodiment of the present application sets the mapping relationship between the speaker's audio factors and emotion tags for different business scenarios, so as to define the speaker's emotion model. Each business scenario corresponds to one or more mapping relationships between the speaker's audio factors and emotion tags, and in different business scenarios the speaker emotion tags corresponding to the same audio factors are not exactly the same. For example, in an amusement park, when the pitch information is loud-speaking level 3, the timbre information is happy young-girl voice level 4, and the speech-rate information is speech-rate level 6, the corresponding speaker emotion tag is excited young girl, level 8; in an ordinary setting, when the pitch information is loud-speaking level 3, the timbre information is happy young-girl voice level 4, and the speech-rate information is speech-rate level 6, the corresponding speaker emotion tag is happy young girl, level 4.
In step S302, the mapping relationship between the speaker's audio factors and emotion tags is queried according to the speaker's audio factors to obtain the speaker's emotion tag.
After the speaker's audio factors are obtained, the mapping relationship between the speaker's audio factors and emotion tags for the current business scenario is retrieved, and that mapping relationship is queried with the speaker's audio factors to obtain the speaker's emotion tag, that is, the user's overall emotional information in the current business scenario.
In step S104, the robot's emotion tag corresponding to the speaker's emotion tag is obtained.
As mentioned above, the speaker's emotion tag refers to the speaker's overall emotional information in the current business scenario, obtained from the speaker's audio factors. Correspondingly, the robot's emotion tag refers to the overall emotional information that the robot should have when facing the speaker in the current business scenario. Optionally, FIG. 4 shows a specific implementation of step S104 provided in an embodiment of the present application. As shown in FIG. 4, obtaining the robot's emotion tag corresponding to the speaker's emotion tag in step S104 includes the following steps.
In step S401, a dialogue emotion mapping relationship is set according to the business scenario, the dialogue emotion mapping relationship including the speaker's emotion tag and the corresponding robot's emotion tag.
Here, based on common sense about emotions in human dialogue, this embodiment sets the correspondence between the speaker's emotion tag and the robot's emotion tag according to the business scenario, so as to define a human-machine emotion model and select the robot's dialogue emotion based on the speaker's dialogue emotion. The dialogue mapping relationships for different business scenarios are different. For example, in an ordinary conversation scenario, when the speaker's emotion tag is cheerful, the corresponding robot emotion tag is also cheerful, and the robot converses with a cheerful emotion; in a psychological counseling scenario, when the speaker's emotion tag is sad, the corresponding robot emotion tag is empathy, and the robot converses with a comforting emotion; in a social scenario, when the speaker's emotion tag is young-man excitement level 5, the corresponding robot emotion tag is young-woman excitement level 5, and the robot converses with the emotion of young-woman excitement level 5.
In step S402, the dialogue emotion mapping relationship is queried according to the speaker's emotion tag to obtain the robot's emotion tag.
After the speaker's emotion tag is obtained, the dialogue emotion mapping relationship for the current business scenario is retrieved, and that dialogue emotion mapping relationship is queried with the speaker's emotion tag to obtain the robot's emotion tag, that is, the overall emotional information the robot should have when facing the user in the current business scenario.
By configuring the dialogue emotion mapping relationship, the configuration is simplified from three-to-three (the speaker's audio, speech rate, and timbre versus the robot's audio, speech rate, and timbre) to one-to-one (the speaker's emotion tag versus the robot's emotion tag), which greatly simplifies the logic of configuring the robot's overall emotion in practical applications. At the physical level, it is decoupled from the underlying speech processing technology, so that developers or business staff can understand it at a glance, making it convenient for them to configure the human-machine emotion model in different business scenarios.
In step S105, the robot's audio factors are obtained according to the robot's emotion tag.
As mentioned above, the robot's emotion tag refers to the overall emotional information the robot should have when facing the speaker in the current business scenario. The audio factors considered in this embodiment include, but are not limited to, pitch information, timbre information, and speech-rate information. Therefore, when generating the robot audio signal to be output, the embodiment of the present application determines the pitch information, timbre information, and speech-rate information of the robot audio signal to be output based on the robot's emotion tag. Optionally, FIG. 5 shows a specific implementation of step S105 provided in an embodiment of the present application. As shown in FIG. 5, obtaining the robot's audio factors according to the robot's emotion tag in step S105 includes the following steps.
In step S501, a mapping relationship between the robot's emotion tags and audio factors is set according to the business scenario.
Here, the embodiment of the present application sets the mapping relationship between the robot's emotion tags and audio factors for different business scenarios, so as to define the robot's emotion model. Each business scenario corresponds to one or more mapping relationships between the robot's emotion tags and audio factors, and in different business scenarios the audio factors corresponding to the same robot emotion tag are not exactly the same.
In step S502, the mapping relationship between the robot's emotion tags and audio factors is queried according to the robot's emotion tag to obtain the robot's audio factors.
After the robot's emotion tag is obtained, the mapping relationship between the robot's emotion tags and audio factors for the current business scenario is retrieved, and that mapping relationship is queried with the robot's emotion tag to obtain the robot's audio factors, that is, the pitch information, timbre information, and speech-rate information the robot should have when facing the speaker in the current business scenario.
In step S106, the robot audio signal to be output is generated according to the robot's audio factors.
Here, the robot audio signal refers to the dialogue audio with which the robot responds to the speaker. The embodiment of the present application directly generates the robot audio signal according to the pitch information, timbre information, and speech-rate information the robot should have when facing the speaker in the current business scenario. Optionally, FIG. 6 shows a specific implementation of step S106 provided in an embodiment of the present application. As shown in FIG. 6, generating the robot audio signal to be output according to the robot's audio factors in step S106 includes the following steps.
In step S601, the robot's text information corresponding to the text information of the speaker audio signal is obtained from a big-data dialogue table through named entity recognition and relation extraction.
Here, the embodiment of the present application uses HMM acoustic-model technology to convert the speaker audio signal into corresponding text information. Then, through named entity recognition and relation extraction, the robot's text information is obtained from the preset big-data dialogue table according to the speaker's text information. It should be understood that the robot's text information is the text with which the robot responds to the speaker; it corresponds to the speaker's text information and is the content of the robot audio signal. The big-data dialogue table pre-stores the speaker text information that occurs in human-machine dialogue and the corresponding robot text information.
In step S602, the robot's text information is converted into the robot audio signal to be output according to the robot's audio factors.
After the robot's text information is obtained, the robot's text information and the robot's audio factors are passed to a signal generator. The signal generator refers to TTS speech synthesis technology; the signal generator generates the corresponding robot audio signal with reference to the robot's audio factors and the robot's text information, realizing emotional dialogue between human and machine.
In summary, the embodiment of the present application uses a large amount of audio material to establish the pitch mapping relationship, the timbre mapping relationship, and the speech-rate mapping relationship; then the user's dialogue audio is used to query the pitch mapping relationship, the timbre mapping relationship, and the speech-rate mapping relationship to obtain the user's emotion tag; the robot's dialogue emotion is determined based on the emotion tag, the audio information to be output is processed according to the robot's dialogue emotion to generate the robot's dialogue audio, and finally the dialogue audio is output. Emotional factors are thus integrated into the robot's dialogue, so that the robot can hold emotionally rich conversations with the user, effectively solving the problem that dialogue speech in existing human-machine dialogue carries no emotion and is stiff and monotonous.
 
It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution; the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
 
In an embodiment, a speech generation apparatus based on big data is provided, and the apparatus corresponds one-to-one to the big-data-based speech generation method in the above embodiment. As shown in FIG. 7, the big-data-based speech generation apparatus includes an audio signal acquisition module 71, an audio signal analysis module 72, a first tag acquisition module 73, a second tag acquisition module 74, an audio factor acquisition module 75, and an audio signal generation module 76. Each functional module is described in detail as follows:
the audio signal acquisition module 71 is configured to acquire a speaker audio signal;
the audio signal analysis module 72 is configured to perform audio analysis on the speaker audio signal to obtain the speaker's audio factors;
the first tag acquisition module 73 is configured to obtain the speaker's emotion tag according to the speaker's audio factors;
the second tag acquisition module 74 is configured to obtain the robot's emotion tag corresponding to the speaker's emotion tag;
the audio factor acquisition module 75 is configured to obtain the robot's audio factors according to the robot's emotion tag;
the audio signal generation module 76 is configured to generate the robot audio signal to be output according to the robot's audio factors.
Optionally, the audio signal analysis module 72 includes:
an establishing unit, configured to establish, through machine learning, the frequency-pitch mapping relationship, the waveform-timbre mapping relationship, and the speech-rate rules;
a pitch acquisition unit, configured to acquire frequency information of the speaker audio signal, and query the frequency-pitch mapping relationship according to the frequency information to obtain the speaker's pitch information;
a timbre acquisition unit, configured to acquire waveform information of the speaker audio signal, and query the waveform-timbre mapping relationship according to the waveform information to obtain the speaker's timbre information, the timbre information including emotion information and age information;
a speech-rate acquisition unit, configured to acquire the time interval between two pauses in the speaker audio signal and the number of words spoken, and match the speech-rate rules according to the time interval and the word count to obtain the speaker's speech-rate information.
Optionally, the first tag acquisition module 73 includes:
a first mapping relationship setting unit, configured to set the mapping relationship between the speaker's audio factors and emotion tags according to the business scenario;
a first tag acquisition unit, configured to query the mapping relationship between the speaker's audio factors and emotion tags according to the speaker's audio factors to obtain the speaker's emotion tag.
Optionally, the second tag acquisition module 74 includes:
a second mapping relationship setting unit, configured to set the dialogue emotion mapping relationship according to the business scenario, the dialogue emotion mapping relationship including the correspondence between the speaker's emotion tags and the robot's emotion tags;
a second tag acquisition unit, configured to query the dialogue emotion mapping relationship according to the speaker's emotion tag to obtain the robot's emotion tag.
Optionally, the audio factor acquisition module 75 includes:
a third mapping relationship setting unit, configured to set the mapping relationship between the robot's emotion tags and audio factors according to the business scenario;
an audio factor acquisition unit, configured to query the mapping relationship between the robot's emotion tags and audio factors according to the robot's emotion tag to obtain the robot's audio factors.
Optionally, the audio signal generation module 76 includes:
a text information acquisition unit, configured to obtain, through named entity recognition and relation extraction, the robot's text information corresponding to the text information of the speaker audio signal from the big-data dialogue table;
an audio signal generation unit, configured to convert the robot's text information into the robot audio signal to be output according to the robot's audio factors.
For specific limitations on the big-data-based speech generation apparatus, reference may be made to the above limitations on the big-data-based speech generation method, which are not repeated here. Each module in the above big-data-based speech generation apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in, or independent of, the processor of the computer device in hardware form, or may be stored in the memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to each module.
 
In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in FIG. 8. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer-readable instructions, and a database. The internal memory provides an environment for running the operating system and the computer-readable instructions in the non-volatile storage medium. The network interface of the computer device is used to communicate with external terminals through a network connection. When executed by the processor, the computer-readable instructions implement a big-data-based speech generation method.
In one embodiment, a computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor; the processor implements the following steps when executing the computer-readable instructions:
acquiring a speaker audio signal;
performing audio analysis on the speaker audio signal to obtain the speaker's audio factors;
obtaining the speaker's emotion tag according to the speaker's audio factors;
obtaining the robot's emotion tag corresponding to the speaker's emotion tag;
obtaining the robot's audio factors according to the robot's emotion tag;
generating the robot audio signal to be output according to the robot's audio factors.
In one embodiment, one or more non-volatile readable storage media storing computer-readable instructions are provided; when the computer-readable instructions are executed by one or more processors, the one or more processors perform the following steps:
acquiring a speaker audio signal;
performing audio analysis on the speaker audio signal to obtain the speaker's audio factors;
obtaining the speaker's emotion tag according to the speaker's audio factors;
obtaining the robot's emotion tag corresponding to the speaker's emotion tag;
obtaining the robot's audio factors according to the robot's emotion tag;
generating the robot audio signal to be output according to the robot's audio factors.
Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by instructing the relevant hardware through computer-readable instructions, which may be stored in a non-volatile computer-readable storage medium; when executed, the computer-readable instructions may include the processes of the embodiments of the above methods. Any reference to memory, storage, database, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
Those skilled in the art can clearly understand that, for convenience and brevity of description, only the division into the above functional units and modules is used as an example; in practical applications, the above functions may be allocated to different functional units or modules as needed, that is, the internal structure of the apparatus may be divided into different functional units or modules to complete all or part of the functions described above.
The above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements of some of the technical features therein; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and shall all fall within the protection scope of the present application.

Claims (20)

  1. A speech generation method based on big data, wherein the method comprises:
    acquiring a speaker audio signal;
    performing audio analysis on the speaker audio signal to obtain the speaker's audio factors;
    obtaining the speaker's emotion tag according to the speaker's audio factors;
    obtaining the robot's emotion tag corresponding to the speaker's emotion tag;
    obtaining the robot's audio factors according to the robot's emotion tag;
    generating the robot audio signal to be output according to the robot's audio factors.
  2. The big-data-based speech generation method according to claim 1, wherein performing audio analysis on the speaker audio signal to obtain the speaker's audio factors comprises:
    establishing, through machine learning, a mapping relationship between frequency and pitch, a mapping relationship between waveform and timbre, and speech-rate rules;
    acquiring frequency information of the speaker audio signal, and querying the frequency-pitch mapping relationship according to the frequency information to obtain the speaker's pitch information;
    acquiring waveform information of the speaker audio signal, and querying the waveform-timbre mapping relationship according to the waveform information to obtain the speaker's timbre information, the timbre information comprising emotion information and age information;
    acquiring the time interval between two pauses in the speaker audio signal and the number of words spoken, and matching the speech-rate rules according to the time interval and the word count to obtain the speaker's speech-rate information.
  3. The big-data-based speech generation method according to claim 1 or 2, wherein obtaining the speaker's emotion tag according to the speaker's audio factors comprises:
    setting a mapping relationship between the speaker's audio factors and emotion tags according to the business scenario;
    querying the mapping relationship between the speaker's audio factors and emotion tags according to the speaker's audio factors to obtain the speaker's emotion tag.
  4. The big-data-based speech generation method according to claim 1 or 2, wherein obtaining the robot's emotion tag corresponding to the speaker's emotion tag comprises:
    setting a dialogue emotion mapping relationship according to the business scenario, the dialogue emotion mapping relationship comprising the speaker's emotion tag and the corresponding robot's emotion tag;
    querying the dialogue emotion mapping relationship according to the speaker's emotion tag to obtain the robot's emotion tag.
  5. The big-data-based speech generation method according to claim 1 or 2, wherein obtaining the robot's audio factors according to the robot's emotion tag comprises:
    setting a mapping relationship between the robot's emotion tags and audio factors according to the business scenario;
    querying the mapping relationship between the robot's emotion tags and audio factors according to the robot's emotion tag to obtain the robot's audio factors.
  6. The big-data-based speech generation method according to claim 1 or 2, wherein generating the robot audio signal to be output according to the robot's audio factors comprises:
    obtaining, through named entity recognition and relation extraction, the robot's text information corresponding to the text information of the speaker audio signal from a big-data dialogue table;
    converting the robot's text information into the robot audio signal to be output according to the robot's audio factors.
  7. A speech generation apparatus based on big data, wherein the apparatus comprises:
    an audio signal acquisition module, configured to acquire a speaker audio signal;
    an audio signal analysis module, configured to perform audio analysis on the speaker audio signal to obtain the speaker's audio factors;
    a first tag acquisition module, configured to obtain the speaker's emotion tag according to the speaker's audio factors;
    a second tag acquisition module, configured to obtain the robot's emotion tag corresponding to the speaker's emotion tag;
    an audio factor acquisition module, configured to obtain the robot's audio factors according to the robot's emotion tag;
    an audio signal generation module, configured to generate the robot audio signal to be output according to the robot's audio factors.
  8. The big-data-based speech generation apparatus according to claim 7, wherein the audio signal analysis module comprises:
    an establishing unit, configured to establish, through machine learning, the frequency-pitch mapping relationship, the waveform-timbre mapping relationship, and the speech-rate rules;
    a pitch acquisition unit, configured to acquire frequency information of the speaker audio signal, and query the frequency-pitch mapping relationship according to the frequency information to obtain the speaker's pitch information;
    a timbre acquisition unit, configured to acquire waveform information of the speaker audio signal, and query the waveform-timbre mapping relationship according to the waveform information to obtain the speaker's timbre information, the timbre information comprising emotion information and age information;
    a speech-rate acquisition unit, configured to acquire the time interval between two pauses in the speaker audio signal and the number of words spoken, and match the speech-rate rules according to the time interval and the word count to obtain the speaker's speech-rate information.
  9. A computer device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer-readable instructions:
    acquiring a speaker audio signal;
    performing audio analysis on the speaker audio signal to obtain the speaker's audio factors;
    obtaining the speaker's emotion tag according to the speaker's audio factors;
    obtaining the robot's emotion tag corresponding to the speaker's emotion tag;
    obtaining the robot's audio factors according to the robot's emotion tag;
    generating the robot audio signal to be output according to the robot's audio factors.
  10. The computer device according to claim 9, wherein performing audio analysis on the speaker audio signal to obtain the speaker's audio factors comprises:
    establishing, through machine learning, a mapping relationship between frequency and pitch, a mapping relationship between waveform and timbre, and speech-rate rules;
    acquiring frequency information of the speaker audio signal, and querying the frequency-pitch mapping relationship according to the frequency information to obtain the speaker's pitch information;
    acquiring waveform information of the speaker audio signal, and querying the waveform-timbre mapping relationship according to the waveform information to obtain the speaker's timbre information, the timbre information comprising emotion information and age information;
    acquiring the time interval between two pauses in the speaker audio signal and the number of words spoken, and matching the speech-rate rules according to the time interval and the word count to obtain the speaker's speech-rate information.
  11. The computer device according to claim 9 or 10, wherein obtaining the speaker's emotion tag according to the speaker's audio factors comprises:
    setting a mapping relationship between the speaker's audio factors and emotion tags according to the business scenario;
    querying the mapping relationship between the speaker's audio factors and emotion tags according to the speaker's audio factors to obtain the speaker's emotion tag.
  12. The computer device according to claim 9 or 10, wherein obtaining the robot's emotion tag corresponding to the speaker's emotion tag comprises:
    setting a dialogue emotion mapping relationship according to the business scenario, the dialogue emotion mapping relationship comprising the speaker's emotion tag and the corresponding robot's emotion tag;
    querying the dialogue emotion mapping relationship according to the speaker's emotion tag to obtain the robot's emotion tag.
  13. The computer device according to claim 9 or 10, wherein obtaining the robot's audio factors according to the robot's emotion tag comprises:
    setting a mapping relationship between the robot's emotion tags and audio factors according to the business scenario;
    querying the mapping relationship between the robot's emotion tags and audio factors according to the robot's emotion tag to obtain the robot's audio factors.
  14. The computer device according to claim 9 or 10, wherein generating the robot audio signal to be output according to the robot's audio factors comprises:
    obtaining, through named entity recognition and relation extraction, the robot's text information corresponding to the text information of the speaker audio signal from a big-data dialogue table;
    converting the robot's text information into the robot audio signal to be output according to the robot's audio factors.
  15. One or more non-volatile readable storage media storing computer-readable instructions, wherein, when the computer-readable instructions are executed by one or more processors, the one or more processors perform the following steps:
    acquiring a speaker audio signal;
    performing audio analysis on the speaker audio signal to obtain the speaker's audio factors;
    obtaining the speaker's emotion tag according to the speaker's audio factors;
    obtaining the robot's emotion tag corresponding to the speaker's emotion tag;
    obtaining the robot's audio factors according to the robot's emotion tag;
    generating the robot audio signal to be output according to the robot's audio factors.
  16. The non-volatile readable storage medium according to claim 15, wherein performing audio analysis on the speaker audio signal to obtain the speaker's audio factors comprises:
    establishing, through machine learning, a mapping relationship between frequency and pitch, a mapping relationship between waveform and timbre, and speech-rate rules;
    acquiring frequency information of the speaker audio signal, and querying the frequency-pitch mapping relationship according to the frequency information to obtain the speaker's pitch information;
    acquiring waveform information of the speaker audio signal, and querying the waveform-timbre mapping relationship according to the waveform information to obtain the speaker's timbre information, the timbre information comprising emotion information and age information;
    acquiring the time interval between two pauses in the speaker audio signal and the number of words spoken, and matching the speech-rate rules according to the time interval and the word count to obtain the speaker's speech-rate information.
  17. The non-volatile readable storage medium according to claim 15 or 16, wherein obtaining the speaker's emotion tag according to the speaker's audio factors comprises:
    setting a mapping relationship between the speaker's audio factors and emotion tags according to the business scenario;
    querying the mapping relationship between the speaker's audio factors and emotion tags according to the speaker's audio factors to obtain the speaker's emotion tag.
  18. The non-volatile readable storage medium according to claim 15 or 16, wherein obtaining the robot's emotion tag corresponding to the speaker's emotion tag comprises:
    setting a dialogue emotion mapping relationship according to the business scenario, the dialogue emotion mapping relationship comprising the speaker's emotion tag and the corresponding robot's emotion tag;
    querying the dialogue emotion mapping relationship according to the speaker's emotion tag to obtain the robot's emotion tag.
  19. The non-volatile readable storage medium according to claim 15 or 16, wherein obtaining the robot's audio factors according to the robot's emotion tag comprises:
    setting a mapping relationship between the robot's emotion tags and audio factors according to the business scenario;
    querying the mapping relationship between the robot's emotion tags and audio factors according to the robot's emotion tag to obtain the robot's audio factors.
  20. The non-volatile readable storage medium according to claim 15 or 16, wherein generating the robot audio signal to be output according to the robot's audio factors comprises:
    obtaining, through named entity recognition and relation extraction, the robot's text information corresponding to the text information of the speaker audio signal from a big-data dialogue table;
    converting the robot's text information into the robot audio signal to be output according to the robot's audio factors.
     
PCT/CN2020/105040 2020-02-28 2020-07-28 Big-data-based speech generation method, apparatus, device, and medium WO2021169170A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010127344.4A CN111445906A (zh) 2020-02-28 2020-02-28 Big-data-based speech generation method, apparatus, device, and medium
CN202010127344.4 2020-02-28

Publications (1)

Publication Number Publication Date
WO2021169170A1 true WO2021169170A1 (zh) 2021-09-02

Family

ID=71650673

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/105040 WO2021169170A1 (zh) 2020-02-28 2020-07-28 Big-data-based speech generation method, apparatus, device, and medium

Country Status (2)

Country Link
CN (1) CN111445906A (zh)
WO (1) WO2021169170A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114299964A (zh) * 2021-12-23 2022-04-08 北京达佳互联信息技术有限公司 Method and apparatus for training a voice timbre recognition model, and voice timbre recognition method and apparatus

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111445906A (zh) * 2020-02-28 2020-07-24 深圳壹账通智能科技有限公司 Big-data-based speech generation method, apparatus, device, and medium
CN112423106A (zh) * 2020-11-06 2021-02-26 四川长虹电器股份有限公司 A method and system for automatically translating accompanying audio
DK180951B1 (en) 2020-11-27 2022-08-10 Gn Audio As System with post-conversation representation, electronic device, and related methods

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107393529A (zh) * 2017-07-13 2017-11-24 珠海市魅族科技有限公司 Speech recognition method, apparatus, terminal, and computer-readable storage medium
US20170365277A1 (en) * 2016-06-16 2017-12-21 The George Washington University Emotional interaction apparatus
JP2018132624A (ja) * 2017-02-15 2018-08-23 トヨタ自動車株式会社 Voice dialogue apparatus
CN109215679A (zh) * 2018-08-06 2019-01-15 百度在线网络技术(北京)有限公司 Dialogue method and apparatus based on user emotion
CN109274819A (zh) * 2018-09-13 2019-01-25 广东小天才科技有限公司 Method and apparatus for adjusting user emotion during a call, mobile terminal, and storage medium
CN109346076A (zh) * 2018-10-25 2019-02-15 三星电子(中国)研发中心 Voice interaction and voice processing method, apparatus, and system
CN110211563A (zh) * 2019-06-19 2019-09-06 平安科技(深圳)有限公司 Scenario- and emotion-oriented Chinese speech synthesis method, apparatus, and storage medium
CN111445906A (zh) * 2020-02-28 2020-07-24 深圳壹账通智能科技有限公司 Big-data-based speech generation method, apparatus, device, and medium


Also Published As

Publication number Publication date
CN111445906A (zh) 2020-07-24

Similar Documents

Publication Publication Date Title
WO2021169170A1 (zh) 基于大数据的语音生成方法、装置、设备及介质
US11380327B2 (en) Speech communication system and method with human-machine coordination
US11115541B2 (en) Post-teleconference playback using non-destructive audio transport
US11361751B2 (en) Speech synthesis method and device
US10057707B2 (en) Optimized virtual scene layout for spatial meeting playback
US20200127865A1 (en) Post-conference playback system having higher perceived quality than originally heard in the conference
US10516782B2 (en) Conference searching and playback of search results
US10522151B2 (en) Conference segmentation based on conversational dynamics
US10334384B2 (en) Scheduling playback of audio in a virtual acoustic space
CN107818798A (zh) 客服服务质量评价方法、装置、设备及存储介质
US20150348538A1 (en) Speech summary and action item generation
US20180190266A1 (en) Conference word cloud
WO2019242414A1 (zh) 语音处理方法、装置、存储介质及电子设备
EP3254455A2 (en) Selective conference digest
US11545136B2 (en) System and method using parameterized speech synthesis to train acoustic models
US10854182B1 (en) Singing assisting system, singing assisting method, and non-transitory computer-readable medium comprising instructions for executing the same
US10199035B2 (en) Multi-channel speech recognition
CN109616116B (zh) 通话系统及其通话方法
Ward et al. Interactional and pragmatics-related prosodic patterns in Mandarin dialog
CN113192484A (zh) 基于文本生成音频的方法、设备和存储介质
WO2021134592A1 (zh) 语音处理方法、装置、设备以及存储介质
US20220270503A1 (en) Pronunciation assessment with dynamic feedback
KR102378895B1 (ko) 음성 인식을 위한 호출어 학습 방법 및 이를 실행하기 위하여 기록매체에 기록된 컴퓨터 프로그램
WO2022254809A1 (ja) 情報処理装置、信号処理装置、情報処理方法、及びプログラム
Maciel et al. Multiplatform instantiation speech engines produced with five

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20920949

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 19.01.2023)

122 Ep: pct application non-entry in european phase

Ref document number: 20920949

Country of ref document: EP

Kind code of ref document: A1