WO2021155662A1 - Text information processing method and apparatus, computer device, and readable storage medium - Google Patents

Text information processing method and apparatus, computer device, and readable storage medium

Info

Publication number
WO2021155662A1
WO2021155662A1, PCT/CN2020/115007, CN2020115007W
Authority
WO
WIPO (PCT)
Prior art keywords
sentence
target
category
text
initial
Prior art date
Application number
PCT/CN2020/115007
Other languages
English (en)
French (fr)
Inventor
邓利群
魏建生
张旸
王雅圣
孙文华
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to EP20917614.8A priority Critical patent/EP4102397A4/en
Publication of WO2021155662A1 publication Critical patent/WO2021155662A1/zh

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 - Clustering; Classification
    • G06F 16/355 - Class or cluster creation or modification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 - Details of database functions independent of the retrieved data types
    • G06F 16/903 - Querying
    • G06F 16/9032 - Query formulation
    • G06F 16/90332 - Natural language query formulation or dialogue systems
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G10L 13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/033 - Voice editing, e.g. manipulating the voice of the synthesiser
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 13/10 - Prosody rules derived from text; Stress or intonation

Definitions

  • This application relates to the field of information processing, and in particular to a method and device for processing text information, computer equipment and a readable storage medium.
  • Key speech technologies include automatic speech recognition (ASR), speech synthesis (text-to-speech, TTS) and voiceprint recognition. Enabling computers to listen, see, speak and feel is the future direction of human-computer interaction, and voice has become one of the most promising human-computer interaction methods.
  • Speech synthesis technology has made great progress, and machine voice broadcasting has been widely used in devices such as smart mobile terminals, smart homes, and in-car audio. People's requirements for synthesized speech are no longer merely that it be "intelligible"; it is now expected to be "highly realistic and emotional". The quality of synthesized speech has become an important measure of the competitiveness of intelligent speech products.
  • The embodiments of the present application provide a method and device for processing text information, a computer device, and a readable storage medium, which help generate, for a text, voice information that conforms to human emotional expression habits and improve the degree of personification of intelligent voice devices.
  • In a first aspect, an embodiment of the present application provides a method for processing text information, which includes: dividing a target text into sentences to obtain a sentence sequence; determining the emotion category of the target text; separately determining the initial emotion category of each sentence in the sentence sequence; determining a first key sentence from the sentence sequence based on the emotion category of the target text and the initial emotion category of each sentence in the sentence sequence, where the initial emotion category of the first key sentence is the same as the emotion category of the target text; obtaining a modified emotion category of a target sentence according to the initial emotion category of the first key sentence and the text feature of the target sentence, where the target sentence is a sentence adjacent to the first key sentence in the sentence sequence and the initial emotion category of the target sentence is different from the emotion category of the target text; and generating voice information of the target sentence based on the modified emotion category of the target sentence.
  • When determining the emotion category of a sentence, the embodiment of the application considers not only the emotion category predicted for the individual sentence but also the overall emotion category of the text in which the sentence is located. Generating voice information for the sentences in the text according to the method of the embodiment is therefore beneficial to producing speech that better conforms to people's emotional expression habits, and improves the degree of personification of the intelligent voice device.
  • the text feature is generally a multi-dimensional vector.
  • the text feature of the target text may be obtained according to the text feature of each sentence in the sentence sequence.
  • said dividing the target text into sentences includes: dividing the target text into sentences according to intonation phrase dividing rules.
  • The sentence division of the target text includes: predicting the prosodic information of the target text; and dividing the target text into sentences in units of intonation phrases to obtain the sentence sequence, where each sentence in the sentence sequence is an intonation phrase.
  • the prosodic information of the text can be used to indicate the prosodic words, prosodic phrases, and intonation phrases in the target text.
  • Prosodic words are groups of syllables that are closely related in the actual speech flow and are often pronounced together. Generally, the prosodic words in the target text are predicted first.
  • Prosodic phrases are intermediate rhythmic blocks between prosodic words and intonation phrases, and may be smaller than syntactic phrases.
  • A prosodic phrase generally includes one or more prosodic words. Prosodic boundaries may appear between the prosodic words within a prosodic phrase, and a prosodic phrase has a relatively stable phrase intonation pattern and phrase stress pattern.
  • In other words, the prosodic words that make up a prosodic phrase sound as if they share one rhythm group. After the prosodic words of the target text are predicted, the prosodic phrases in the target text can be predicted based on them.
  • An intonation phrase connects several prosodic phrases according to a certain intonation pattern.
  • An intonation phrase generally includes one or more prosodic phrases. After the prosodic phrases of the target text are predicted, the intonation phrases in the target text can be predicted based on them.
  • the text feature of the first sentence in the sentence sequence may be obtained according to the text characteristics of each prosodic word in the first sentence, and the first sentence may be any sentence in the sentence sequence.
  • the text feature of the prosodic word may be generated according to the word vector of the prosodic word and/or the location feature of the prosodic word.
  • the word vector of the prosodic word may be obtained through a neural network, and the neural network may be obtained by training the Word2Vec model, the GloVe model, or the Bert model.
  • the location feature of a prosodic word can be used to indicate the position of the prosodic word in the intonation phrase.
  • The location feature of a prosodic word can be represented by a 25-dimensional vector.
  • The first to tenth dimensions of the vector are used to indicate the order of the prosodic word in the intonation phrase.
  • The eleventh to twentieth dimensions of the vector are used to indicate the number of prosodic words in the intonation phrase.
  • The twenty-first to twenty-fifth dimensions of the vector are used to indicate the prosodic result of the prosodic word, where the prosodic result indicates whether the prosodic word is at the end of a prosodic phrase or an intonation phrase.
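  • As an illustrative, non-limiting sketch of how the 25-dimensional location feature described above could be assembled (all names and the exact label set are assumptions, not part of the disclosure):

```python
import numpy as np

def location_feature(word_index, num_words, prosodic_result, num_result_labels=5):
    """Hypothetical 25-dim location feature of a prosodic word.

    word_index      -- 0-based order of the prosodic word in its intonation phrase
    num_words       -- number of prosodic words in the intonation phrase
    prosodic_result -- integer in [0, num_result_labels) encoding whether the word
                       ends a prosodic phrase, an intonation phrase, etc.
                       (the exact label set is an assumption)
    """
    feat = np.zeros(25, dtype=np.float32)
    feat[min(word_index, 9)] = 1.0                        # dims 1-10: order in the phrase
    feat[10 + min(num_words - 1, 9)] = 1.0                # dims 11-20: phrase length
    feat[20 + min(prosodic_result, num_result_labels - 1)] = 1.0  # dims 21-25: prosodic result
    return feat
```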
  • the emotional category of the target text is preset, which is convenient for the user to set the emotional tone of the voice information according to preferences.
  • the emotion category of the target text may be obtained based on the text characteristics of the target text.
  • the separately determining the initial emotional category of each sentence in the sentence sequence is specifically: determining the initial emotional category of the sentence to be determined based on the text feature of the sentence to be determined in the sentence sequence .
  • Obtaining the modified emotion category of the target sentence according to the initial emotion category of the first key sentence and the text feature of the target sentence is specifically: when the initial emotion category of the other sentence adjacent to the target sentence (other than the first key sentence) is the same as the emotion category of the target text, obtaining the modified emotion category of the target sentence according to the initial emotion category of the first key sentence, the initial emotion category of the other adjacent sentence, and the text feature of the target sentence.
  • The method further includes: when the initial emotion category of the other sentence adjacent to the target sentence (other than the first key sentence) is different from the emotion category of the target text, obtaining the modified emotion category of that other adjacent sentence according to the modified emotion category of the target sentence and the text feature of the other adjacent sentence; and generating the voice information of the other adjacent sentence based on its modified emotion category.
  • an embodiment of the present application provides an apparatus for processing text information.
  • the apparatus includes one or more functional units for executing the foregoing first aspect or any one of the possible implementation methods of the first aspect. These functional units can be implemented by hardware, or can be implemented by hardware executing corresponding software, or by software combined with necessary hardware.
  • The device for processing text information may include: a sentence division module, configured to divide the target text into sentences to obtain a sentence sequence; a determination module, configured to perform the following steps: determining the emotion category of the target text; separately determining the initial emotion category of each sentence in the sentence sequence; determining a first key sentence from the sentence sequence based on the emotion category of the target text and the initial emotion category of each sentence in the sentence sequence, where the initial emotion category of the first key sentence is the same as the emotion category of the target text; and obtaining the modified emotion category of the target sentence according to the initial emotion category of the first key sentence and the text feature of the target sentence, where the target sentence is a sentence adjacent to the first key sentence in the sentence sequence and the initial emotion category of the target sentence is different from the emotion category of the target text; and a speech generation module, configured to generate the voice information of the target sentence based on the modified emotion category of the target sentence determined by the determination module.
  • the sentence division module is configured to divide the target text into sentences according to intonation phrase division rules.
  • the emotion category of the target text is preset, or is obtained based on the text characteristics of the target text.
  • the determining module is configured to determine the initial emotion category of the sentence to be determined based on the text feature of the sentence to be determined in the sentence sequence.
  • The determining module is configured to, when the initial emotion category of the other sentence adjacent to the target sentence (other than the first key sentence) is the same as the emotion category of the target text, obtain the modified emotion category of the target sentence according to the initial emotion category of the first key sentence, the initial emotion category of the other adjacent sentence, and the text feature of the target sentence.
  • The determining module is further configured to, after obtaining the modified emotion category of the target sentence and when the initial emotion category of the other sentence adjacent to the target sentence (other than the first key sentence) is different from the emotion category of the target text, obtain the modified emotion category of that other adjacent sentence according to the modified emotion category of the target sentence and the text feature of the other adjacent sentence; and the speech generation module is further used to generate the voice information of the other adjacent sentence based on its modified emotion category.
  • An embodiment of the present application provides a computer device, including a processor and a memory; the memory is used to store computer-executable instructions, and when the computer device runs, the processor executes the computer-executable instructions stored in the memory, so that the computer device performs the method of the above first aspect or any one of its possible implementations.
  • An embodiment of the present application provides a computer-readable storage medium that stores instructions which, when run on a computer, enable the computer to perform the method of the first aspect or any one of its possible implementations.
  • the embodiments of the present application provide a computer program product containing instructions, which when run on a computer, enable the computer to execute the method of the first aspect or any one of the possible implementation manners of the first aspect.
  • an embodiment of the present application provides a chip system, which includes a processor, and is configured to support a computer device to implement the functions involved in the first aspect or any one of the possible implementation manners of the first aspect.
  • the chip system also includes a memory, and the memory is used to store the necessary program instructions and data of the computer equipment.
  • the chip system can be composed of chips, or include chips and other discrete devices.
  • FIG. 1A is a schematic diagram of a possible application scenario of an embodiment of the present application.
  • FIG. 1B is a schematic diagram of a possible structure of an intelligent voice device according to an embodiment of the present application.
  • FIG. 1C is a schematic diagram of another possible application scenario of an embodiment of the present application.
  • FIG. 1D is a schematic diagram of a possible structure of a server according to an embodiment of the present application.
  • FIG. 2 is a schematic diagram of a possible embodiment of a method for processing text information of the present application
  • FIG. 3 is a schematic diagram of sentence division of the target text in this application.
  • FIG. 4 is a schematic diagram of a possible refinement process of step 201;
  • Figure 5-1A is a schematic diagram of an embodiment of the first stage process of the text information processing method of the present application.
  • Figure 5-1B is a schematic diagram of a possible embodiment of the method for predicting prosody information based on the CRF model of the present application
  • Fig. 5-2A is a schematic diagram of an embodiment of the second stage process of the method for processing text information of the present application
  • Figure 5-2B is a schematic diagram of a possible structure of the emotional category classification model of this application.
  • Figure 5-2C shows a possible prediction result of the proposed method for the initial emotion category of each intonation phrase in the "Monkey Wears Shoes" story and the modified emotion categories under different global emotions;
  • Figure 5-3 is a schematic diagram of an embodiment of the third stage process of the text information processing method of this application.
  • Figure 5-4A is a schematic diagram of an embodiment of the fourth stage process of the method for processing text information of the present application.
  • Figure 5-4B is a schematic diagram of a possible structure of the emotional acoustic model of the present application.
  • Fig. 6 is a schematic diagram of a possible embodiment of a text information processing apparatus of the present application.
  • the present invention provides a method for processing text information, which can be used in computer equipment to realize emotional speech synthesis.
  • FIG. 1A is a schematic diagram of a possible application scenario of an embodiment of this application.
  • the computer device may be an entity (referred to as an intelligent voice device 1) that has the ability to analyze and process emotions and the ability to synthesize and output speech.
  • the smart voice device 1 may be a smart phone, or a smart voice assistant on a wearable terminal that can speak, or a smart speaker, or a robot that can talk to people, etc.
  • In FIG. 1A, the smart voice device 1 is a smart phone, which is taken as an example.
  • the smart phone 1 can convert the text obtained through the Internet (shown as a dotted arrow in FIG. 1A) or locally stored into emotional voice information, and output the emotional voice information to the user (shown as a fluctuating curve in FIG. 1A).
  • FIG. 1B is a schematic diagram of an embodiment of the intelligent voice device 1 provided by the present application.
  • the intelligent voice device may include a processor 11, a memory 12, and a voice output module 13.
  • The memory 12 is used to store computer programs; the processor 11 is used to execute the computer programs in the memory 12 to perform the text information processing method provided in this application; and the voice output module 13 is used to output emotional voice information to the user (a human or another robot). For example, the voice output module 13 may be a speaker.
  • the smart voice device 1 may also include an input module 14.
  • the input module 14 may include one or more of a touch screen, a camera, and a microphone array.
  • The touch screen is used to receive user touch instructions, the camera is used to detect image information, and the microphone array is used to detect audio data.
  • the smart voice device 1 further includes a communication interface 15 for communicating with other devices (for example, a server).
  • the various modules in the intelligent voice device may be connected to each other through the bus 16.
  • FIG. 1C is a schematic diagram of another possible application scenario of an embodiment of this application.
  • the computer device may be the server 2, and the server 2 may be communicatively connected with the intelligent voice device 1.
  • the intelligent voice device 1 is a robot as an example.
  • The server 2 can convert text obtained through the Internet or sent by the robot 1 into emotional voice information and send the obtained voice information to the robot 1, and the robot 1 outputs the emotional voice information to the user (shown as a fluctuating curve in FIG. 1C).
  • the computer device may include an intelligent voice device 1 and a server 2 that are communicatively connected.
  • the intelligent robot 1 and the server 2 can cooperate with each other to realize the functions of emotion analysis and processing and speech synthesis.
  • For example, the server 2 implements the emotion analysis and processing functions, and the intelligent robot 1 performs speech synthesis and speech output according to the emotion processing result of the server 2.
  • an embodiment of the present application also provides a server 2.
  • the server 2 may include a processor 21 and a memory 22.
  • the memory 22 is used to store a computer program; the processor 21 is used to execute the computer program in the memory 22, and execute the text information processing method provided in this application.
  • the processor 21 and the memory 22 may be connected to each other through a bus 24.
  • the server 2 may further include a communication interface 23 for communicating with other devices (for example, the smart voice device 1).
  • The processor in FIG. 1B and/or FIG. 1D may be a central processing unit (CPU), a network processor (NP), a combination of a CPU and an NP, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • the steps in the method disclosed in this application can be directly embodied as being executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in the decoding processor.
  • the software module can be located in a mature storage medium in the field, such as random access memory, flash memory, read-only memory, programmable read-only memory, or electrically erasable programmable memory, registers.
  • the storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware.
  • the apparatus may include multiple processors or the processors may include multiple processing units.
  • the processor may be a single-core processor, or a multi-core or many-core processor.
  • the processor may be an ARM architecture processor.
  • the memory in FIG. 1B and/or FIG. 1D is used to store computer instructions executed by the processor.
  • the memory can be a storage circuit or a memory.
  • the memory may be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory.
  • The non-volatile memory can be read-only memory (ROM), programmable read-only memory (programmable ROM, PROM), erasable programmable read-only memory (erasable PROM, EPROM), electrically erasable programmable read-only memory (electrically EPROM, EEPROM) or flash memory.
  • the volatile memory may be random access memory (RAM), which is used as an external cache.
  • the memory may be independent of the processor.
  • the processor and the memory may be connected to each other through a bus.
  • the bus may be a peripheral component interconnect standard (PCI) bus or an extended industry standard architecture (EISA) bus, etc.
  • the bus can be divided into an address bus, a data bus, a control bus, and so on.
  • the memory may also be a storage unit in the processor, which is directly attached to the processor, which is not limited here. Although only one memory is shown in the figure, the device may also include multiple memories or the memory may include multiple storage units.
  • an embodiment of the method for processing text information of the present application may include:
  • The computer device can obtain the text to be converted into voice information, which is called the target text. Text that is too long is not convenient for subsequent speech generation, so the target text can be divided into sentences to obtain a sentence sequence, that is, divided into shorter sentence units for subsequent speech generation.
  • step 201 is described below with an example.
  • multiple cross symbols represent the content of the target text.
  • Each cross symbol can represent one or more characters.
  • the characters here can be Chinese characters or non-Chinese characters (such as Arabic numerals or English symbols, etc.) .
  • Figure 3 shows 8 horizontal lines. Cross symbols on the same horizontal line belong to the same sentence, so the 8 horizontal lines represent dividing the target text into 8 sentences.
  • The number below each horizontal line identifies the corresponding sentence, so the sentences in the sentence sequence are sentence 1, sentence 2, ..., sentence 8.
  • the sentiment category of the target text can be determined.
  • the emotional category of the target text refers to an emotional category corresponding to the target text.
  • the emotion category of the target text may be set at the factory, or may be set by the user as required.
  • the sentiment category of the target text can be predicted according to a certain algorithm.
  • the initial sentiment category of the target text can be predicted based on the text characteristics of the target text.
  • the text features of the target text can be input into a trained neural network, and the neural network is used to predict the emotional category of the text, so that the emotional category of the target text can be obtained.
  • the text feature is generally a multi-dimensional vector.
  • the emotion category of each sentence (called the initial emotion category) can be predicted according to a certain algorithm.
  • the initial sentiment category of the sentence to be determined may be determined based on the text characteristics of the sentence to be determined.
  • the text characteristics of the sentence to be determined may be input into a trained neural network, and the neural network is used to classify the emotional category of the sentence, or to predict the emotional category of the sentence, so that the emotional category of the sentence can be obtained.
  • the neural network may be constructed by a classification model such as a deep neural network, a support vector machine, or a hidden Markov model, and trained using training corpus in advance.
  • A sentence whose initial emotion category is the same as the emotion category of the target text can be determined from the sentence sequence; such a sentence is called a key sentence.
  • In other words, the initial emotion category of a key sentence is the same as the emotion category of the target text.
  • Any one key sentence of the target text is called the first key sentence.
  • the emotional category of sentences other than the key sentence in the sentence sequence can be corrected.
  • a sentence (called a target sentence) that meets the following conditions can be determined from the sentence sequence: it is adjacent to the first key sentence, and the initial emotion category is different from the emotion category of the target text.
  • the modified emotion category of the target sentence can be obtained according to the initial emotion category of the first key sentence (that is, the emotion category of the target text) and the text characteristics of the target sentence.
  • Taking the sentence sequence of FIG. 3 as an example, suppose the emotion category of the target text is A, the initial emotion categories of sentences 1, 5 and 7 are A, and the initial emotion categories of sentence 2, sentence 3, sentence 4, sentence 6 and sentence 8 are B, C, D, B and C respectively. If sentence 1 is the first key sentence, then sentence 2 is a target sentence; if sentence 5 is the first key sentence, then sentence 4 and sentence 6 are both target sentences; if sentence 7 is the first key sentence, then sentences 6 and 8 are both target sentences. Taking sentence 1 as the first key sentence and sentence 2 as the target sentence as an example, the modified emotion category of sentence 2 can be obtained according to emotion category A and the text feature of sentence 2.
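  • The selection of key sentences and target sentences in this example can be sketched as follows (a minimal illustration assuming sentences 1, 5 and 7 carry the text-level category A; the function name is hypothetical):

```python
def find_key_and_target_sentences(initial_categories, text_category):
    """Mark key sentences (initial category equals the text-level category) and the
    target sentences adjacent to them whose initial category differs from it."""
    key_idx = [i for i, c in enumerate(initial_categories) if c == text_category]
    targets = set()
    for i in key_idx:
        for j in (i - 1, i + 1):
            if 0 <= j < len(initial_categories) and initial_categories[j] != text_category:
                targets.add(j)
    return key_idx, sorted(targets)

# Sentences 1..8 of Figure 3 (0-based indices), text-level category assumed to be A:
cats = ["A", "B", "C", "D", "A", "B", "A", "C"]
print(find_key_and_target_sentences(cats, "A"))   # ([0, 4, 6], [1, 3, 5, 7])
```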
  • the voice information of the target sentence can be generated based on the modified emotion category of the target sentence. After that, the voice information is played through the voice playback module, which can output the text content of the target sentence while expressing the emotion corresponding to the modified emotion category.
  • the embodiment of the present application can determine the emotional category of the sentence in the text, and generate voice information capable of expressing the emotion of the emotional category for the corresponding short sentence.
  • Generating voice information for the sentences in the text according to the method provided in the embodiment of the application is conducive to producing voice information that is more in line with people's emotional expression habits, and improves the degree of personification of intelligent voice devices.
  • The execution order of the numbered steps in the embodiment of this application is only one possible execution order. The embodiment of this application does not limit the time-sequence relationship between step 202 and steps 201 and 203, as long as step 202 is executed before step 204.
  • the voice information of the first key sentence may be generated based on the initial emotion category of the first key sentence. After that, when the voice information is played through the voice playing module, the text content of the first key sentence can be output, and the emotion corresponding to the initial emotion category of the first key sentence can be expressed.
  • The voice of each sentence can be generated according to the sentence and its corresponding emotion category, and the voices of the sentences can then be spliced into the voice of the target text according to the order of the sentences in the sentence sequence.
  • When it is determined that the other sentence adjacent to the target sentence (besides the first key sentence) is also a key sentence (that is, its initial emotion category is the same as the emotion category of the target text), step 205 may specifically be: obtaining the modified emotion category of the target sentence according to the initial emotion category of the first key sentence, the initial emotion category of the other adjacent sentence, and the text feature of the target sentence.
  • In this way, the modified emotion category of the target sentence is closer to the emotion category of the target text.
  • a common sentence division method is to divide the target text according to punctuation marks (such as commas, periods, exclamation marks, etc.) in the target text.
  • The granularity of the sentence division determines how finely the emotion of the text's speech can be expressed. The larger the granularity, the coarser the expression; for example, if the whole target text is treated as one sentence, its voice information can only express one kind of emotion. A sentence obtained by dividing at punctuation marks may still contain a lot of content; using such sentences as the smallest granularity to generate the voice information of the target text cannot reflect the fluctuation of emotion within the sentence, which is not conducive to improving the delicacy of the emotion that the speech of the text can express.
  • step 201 may include: dividing the target text into sentences according to intonation phrase dividing rules to obtain a sentence sequence.
  • step 201 may specifically include the following steps:
  • the prosodic information of the text can be used to indicate the prosodic words, prosodic phrases, and intonation phrases in the target text.
  • Prosodic words are groups of syllables that are closely related in the actual speech flow and are often pronounced together. Generally, the prosodic words in the target text are predicted first.
  • Prosodic phrases are intermediate rhythmic blocks between prosodic words and intonation phrases, and may be smaller than syntactic phrases.
  • A prosodic phrase generally includes one or more prosodic words. Prosodic boundaries may appear between the prosodic words within a prosodic phrase, and a prosodic phrase has a relatively stable phrase intonation pattern and phrase stress pattern.
  • In other words, the prosodic words that make up a prosodic phrase sound as if they share one rhythm group. After the prosodic words of the target text are predicted, the prosodic phrases in the target text can be predicted based on them.
  • An intonation phrase connects several prosodic phrases according to a certain intonation pattern.
  • An intonation phrase generally includes one or more prosodic phrases. After the prosodic phrases of the target text are predicted, the intonation phrases in the target text can be predicted based on them.
  • The target text is divided into sentences in units of intonation phrases to obtain the sentence sequence; that is, the intonation phrases of the target text are used as the units of division.
  • Each sentence in the sentence sequence is an intonation phrase.
  • The sentence granularity obtained by dividing according to intonation phrases is smaller, which helps reflect the emotional fluctuation within the text between two punctuation marks and helps improve the delicacy of the emotion that the text's speech can express.
  • Experiments show that by dividing sentences in units of intonation phrases and predicting the emotion category per intonation phrase, the emotion prediction can be made more controllable without negatively affecting the prosody of the synthesized speech.
  • the text feature of the target text may be obtained according to the text feature of each sentence in the sentence sequence.
  • the text feature of the first sentence in the sentence sequence may be obtained according to the text characteristics of each prosodic word in the first sentence, and the first sentence may be any sentence in the sentence sequence.
  • the text feature of the prosodic word may be generated according to the word vector of the prosodic word and/or the location feature of the prosodic word.
  • the word vector of the prosodic word may be obtained through a neural network, and the neural network may be obtained by training the Word2Vec model, the GloVe model, or the Bert model.
  • the location feature of a prosodic word can be used to indicate the position of the prosodic word in the intonation phrase.
  • The location feature of a prosodic word can be represented by a 25-dimensional vector.
  • The first to tenth dimensions of the vector are used to indicate the order of the prosodic word in the intonation phrase.
  • The eleventh to twentieth dimensions of the vector are used to indicate the number of prosodic words in the intonation phrase.
  • The twenty-first to twenty-fifth dimensions of the vector are used to indicate the prosodic result of the prosodic word, where the prosodic result indicates whether the prosodic word is at the end of a prosodic phrase or an intonation phrase.
  • The initial emotion intensity control vector of the first sentence can be predicted according to the text feature and emotion category of the first sentence; the global emotion intensity of the target text is determined; and afterwards, the global emotion intensity level and the initial emotion intensity control vector of the first sentence are used to determine the modified emotion intensity of the first sentence.
  • The first intensity difference of the target sentence is greater than the second intensity difference of the target sentence, where the first intensity difference is the difference between the initial emotion intensity of the target sentence and the global emotion intensity of the target text, and the second intensity difference is the difference between the modified emotion intensity of the target sentence and the global emotion intensity of the target text.
  • the voice information of the target sentence can be generated according to the modified emotion category of the target sentence and the modified emotion strength of the target sentence.
  • The following takes the emotional speech synthesis of the text "Monkey Wears Shoes" as an example to introduce a possible embodiment of the text information processing method of this application.
  • This embodiment is based on the speech synthesis framework of an end-to-end acoustic model (such as Tacotron). It is used to synthesize emotional speech for large sections of text.
  • Another possible embodiment of the method for processing text information in this application may include the following stages of steps:
  • the first stage S1 may include the following steps:
  • The non-Chinese characters in the text to be synthesized, such as Arabic numerals, English letters and other symbols, are converted into corresponding Chinese characters according to their contextual semantics. This embodiment uses a rule-based method: a set of rules is collected and defined, and when text to be normalized is encountered, these rules are matched one by one to obtain the corresponding normalization action.
  • The text of "Monkey Wears Shoes" used in this embodiment is already normalized Chinese text.
  • the prosodic information is used to indicate the prosodic structure in the target text.
  • the prosodic structure includes prosodic words, prosodic phrases, and intonation phrases.
  • Predict the prosody information of different levels of prosodic structures in sequence such as prosodic words, prosodic phrases, and intonation phrases.
  • the end of different prosodic structures is reflected in the synthesized speech as different pause durations.
  • Figure 5-1B shows an example of prosody prediction based on a conditional random field (CRF) model used in this example.
  • the input text is the third sentence in Table 1.
  • the flowchart in Figure 5-1B is used to represent a possible refinement of S1-2, and the text on the right side of each step in the flowchart is used to exemplify the result of the corresponding step.
  • step S1-2 includes the following steps:
  • In the part-of-speech tagging result, "a" can be used for adjectives, "d" for adverbs, "f" for localizers, "m" for numerals, "n" for nouns, "q" for quantifiers, "r" for pronouns, "u" for auxiliary words, "v" for verbs, and "w" for punctuation.
  • the third sentence in Table 1 includes two intonation phrases, namely "A fierce tiger came at this time” and "Monkeys climbed up the tree one after another.”
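  • A minimal sketch of such CRF-based prosody boundary prediction is given below; it assumes the sklearn-crfsuite toolkit and simple word/POS features, whereas the embodiment only requires some CRF model:

```python
import sklearn_crfsuite  # assumed toolkit; the embodiment only specifies "a CRF model"

def token_features(tokens, i):
    """Simple features for the i-th (word, POS) pair of a segmented sentence."""
    word, pos = tokens[i]
    feats = {"word": word, "pos": pos, "len": len(word)}
    if i > 0:
        feats["prev_pos"] = tokens[i - 1][1]
    if i + 1 < len(tokens):
        feats["next_pos"] = tokens[i + 1][1]
    return feats

def to_features(sentence):
    return [token_features(sentence, i) for i in range(len(sentence))]

# X: POS-tagged training sentences, y: per-word boundary labels such as
# "#1" (prosodic word), "#2" (prosodic phrase), "#3" (intonation phrase), "O" (none).
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
# crf.fit([to_features(s) for s in train_sentences], train_labels)
# boundary_labels = crf.predict([to_features(pos_tagged_sentence)])
```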
  • Sentences that are too long are not convenient for subsequent speech generation, so this step divides the input large section of text into shorter sentence units for subsequent speech generation.
  • a common way of dividing is to divide a large section of text into sub-sentences according to punctuation marks (such as period, exclamation mark, etc.).
  • In this embodiment, the smaller-granularity intonation phrase is used as the unit of the divided short sentences, and each sentence in the sentence sequence is an intonation phrase.
  • The next step uses the intonation phrase as the synthesis unit to perform emotion feature prediction and speech generation, because experiments show that using the intonation phrase as the unit makes the emotion feature conversion more controllable without negatively affecting the prosody of the synthesized speech. That is, in the example shown in Figure 5-1B, the corresponding input sentence is divided into two intonation phrases for the subsequent synthesis steps, as sketched below.
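  • A minimal sketch of this division step, assuming the prosody prediction has already annotated intonation-phrase boundaries with an end symbol such as the "#3" marker described further below (the input string and function name are hypothetical):

```python
def split_into_intonation_phrases(annotated_text, boundary="#3"):
    """Cut prosody-annotated text at intonation-phrase end markers so that each
    resulting unit fed to the later stages is one intonation phrase."""
    return [p.strip() for p in annotated_text.split(boundary) if p.strip()]

# Hypothetical annotated input with two intonation phrases:
# split_into_intonation_phrases("phrase one #3 phrase two #3")
# -> ['phrase one', 'phrase two']
```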
  • The numbers after the pinyin indicate the tones of the Chinese characters, for example, "1" for the first tone, "2" for the second tone, "3" for the third tone, "4" for the fourth tone, and "5" for other tones such as the neutral tone.
  • This step combines the above features into a phoneme-level text feature (called text feature A).
  • When generating the text feature A of each intonation phrase, the prosodic words and prosodic phrases of the corresponding intonation phrase are used.
  • the results of the phoneme-based text feature A are as follows:
  • In text feature A, a start symbol represents the beginning of the sentence and "$" represents the end of the sentence; #0, #1, #2 and #3 respectively represent the end-position symbols of syllables, prosodic words, prosodic phrases and intonation phrases.
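  • As a hypothetical sketch of assembling such an annotated phoneme sequence (the syllables, boundary labels and helper name are illustrative only; the exact sentence-start symbol is left unspecified here):

```python
def build_text_feature_a(syllables, boundaries, end_symbol="$"):
    """Interleave tone-annotated pinyin syllables with the boundary symbols
    #0/#1/#2/#3 and append the end-of-sentence symbol.  boundaries[i] is the
    boundary symbol emitted after syllable i (all values here are hypothetical)."""
    parts = []
    for syllable, boundary in zip(syllables, boundaries):
        parts.append(syllable)
        parts.append(boundary)
    parts.append(end_symbol)
    return " ".join(parts)

print(build_text_feature_a(["lao3", "hu3"], ["#0", "#3"]))   # lao3 #0 hu3 #3 $
```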
  • Word vector generation: taking the intonation phrase as the unit, each prosodic word in each intonation phrase is converted into a corresponding word vector by a pre-trained word-vector model (this example uses the Word2Vec model; other models such as GloVe or Bert may also be used).
  • For example, the prosodic words "this time", "come", "one", "fierce" and "tiger" are each converted by the Word2Vec model into a 200-dimensional word vector.
  • S1-7 Generate the text feature B of the corresponding intonation phrase from these word vectors;
  • For each intonation phrase, the word vectors and context features of the words in it are combined to generate an intonation-phrase-level text feature.
  • the combination operation may specifically refer to a splicing operation.
  • the feature corresponding to each word finally includes a 200-dimensional word feature vector and a 25-dimensional context feature.
  • the context feature can use one-hot encoding to represent the position of the current word in the intonation phrase, the number of prosodic words in the current intonation phrase, and the prosodic result of the current word.
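  • A minimal sketch of S1-7, assuming gensim's Word2Vec for the 200-dimensional word vectors and the 25-dimensional location feature from the earlier sketch (all names and the toy corpus are hypothetical):

```python
import numpy as np
from gensim.models import Word2Vec  # assumed toolkit; the embodiment names Word2Vec

# Stand-in prosodic-word sequences; in practice the model is pre-trained on a large corpus.
toy_corpus = [["hello", "world"], ["hello", "there"]]
w2v = Word2Vec(toy_corpus, vector_size=200, min_count=1)

def text_feature_b(prosodic_words, w2v_model, location_features):
    """Splice each prosodic word's 200-dim word vector with its 25-dim context /
    location feature; the stacked rows form the intonation-phrase-level feature B."""
    rows = [np.concatenate([w2v_model.wv[w], loc])
            for w, loc in zip(prosodic_words, location_features)]
    return np.stack(rows)             # shape: (num_prosodic_words, 225)

feat_b = text_feature_b(["hello", "world"], w2v, [np.zeros(25), np.zeros(25)])
```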
  • the steps of the second stage S2 can be executed.
  • the second stage can specifically include the following steps:
  • the text feature B of each intonation phrase output by S1-7 is used as the input of the classification model, and the initial emotion category corresponding to each intonation phrase is determined respectively.
  • the text sentiment classification model can be constructed by deep neural networks, support vector machines, hidden Markov models and other classification models and trained with training corpus in advance.
  • This example uses a recurrent neural network model, namely the 2-layer long short-term memory (LSTM) network shown in Figure 5-2B, as the emotion category classification model.
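  • A minimal PyTorch sketch of such a 2-layer LSTM classifier; the hidden size, input size and number of emotion classes are assumptions, not values from the disclosure:

```python
import torch
import torch.nn as nn

class EmotionClassifier(nn.Module):
    """2-layer LSTM over the word-level rows of text feature B, followed by a
    linear layer that outputs emotion-class logits for the intonation phrase."""
    def __init__(self, input_dim=225, hidden_dim=256, num_classes=6):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_classes)

    def forward(self, text_feature_b):        # (batch, num_prosodic_words, 225)
        _, (h_n, _) = self.lstm(text_feature_b)
        return self.out(h_n[-1])               # (batch, num_classes)

# logits = EmotionClassifier()(torch.randn(1, 5, 225))
# initial_category = logits.argmax(dim=-1)
```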
  • The emotion category of the current large segment of text (the global emotion) can be specified by the user in advance, or can be automatically recognized by the emotion classification model; in the latter case, the text feature B of all intonation phrases is used as the input feature and the pre-trained text emotion classification model of S2-1 is used for recognition.
  • The "user" here can include the developer of the smart terminal program or the user of the smart terminal, and the global emotion preference can be set to a positive emotion, such as happiness.
  • the intonation phrases whose emotion categories obtained in S2-1 are consistent with the emotion categories obtained in S2-2 are marked as key intonation phrases.
  • the emotional category of the key intonation phrase will not be changed in the subsequent steps.
  • the specific method is to use emotional text training data to train a context-based emotional category correction model in advance.
  • The model is a two-layer LSTM neural network model, which can be trained on a large amount of emotional speech data used for recognition tasks; its input is the splicing of the emotion categories of the left and right intonation phrases adjacent to the current intonation phrase to be corrected, the global emotion category, and the text feature A of the current intonation phrase, and its output is the emotion category of the current intonation phrase.
  • Using this emotion category correction model and centering on the key intonation phrases, the emotion categories of the non-key intonation phrases to their left and right are modified in sequence until the emotion categories of all non-key intonation phrases have been corrected.
  • Figure 5-2C exemplarily shows the initial emotion category of each intonation phrase in the "Monkey Wears Shoes" story and the modified emotion categories under different global emotions.
  • In the figure, a dedicated marker symbol indicates the key intonation phrases determined by S2-3.
  • the initial sentiment category of some non-key intonation phrases will be revised, that is, the revised sentiment category is different from the initial sentiment category.
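  • The outward correction sweep of S2-4 can be sketched as follows, with the trained correction model abstracted behind a placeholder callable (all names here are hypothetical):

```python
def correct_non_key_phrases(categories, is_key, global_category, phrase_features, correct_fn):
    """Walk left and right from every key intonation phrase and re-predict the
    category of each not-yet-decided non-key phrase from its neighbours' (already
    decided) categories, the global category and its own text feature."""
    corrected = list(categories)
    decided = list(is_key)
    for i, key in enumerate(is_key):
        if not key:
            continue
        for step in (-1, 1):                                  # sweep left, then right
            j = i + step
            while 0 <= j < len(categories) and not is_key[j]:
                if not decided[j]:
                    left = corrected[j - 1] if j > 0 else None
                    right = corrected[j + 1] if j + 1 < len(categories) else None
                    corrected[j] = correct_fn(left, right, global_category, phrase_features[j])
                    decided[j] = True
                j += step
    return corrected
```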
  • the steps of the third phase S3 can be executed.
  • the third phase S3 can specifically include the following steps:
  • the pre-trained emotional intensity acoustic feature prediction model is used, the text feature B of the intonation phrase obtained by S1 and the emotional category obtained by S2 are used as input, and the emotional acoustic feature vector is output.
  • The emotional intensity acoustic feature prediction model is constructed from a two-layer bidirectional long short-term memory (BLSTM) network and a two-layer deep neural network (DNN), and is trained in advance on the emotional training corpus.
  • Its input is composed of the text feature B of the intonation phrase (represented by word vectors) and the emotion category of the intonation phrase, and its output is the seven-dimensional emotional intensity acoustic feature vector of the intonation phrase shown in the table below.
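  • A minimal PyTorch sketch of such a BLSTM+DNN predictor; all layer sizes and the embedding used to inject the emotion category are assumptions:

```python
import torch
import torch.nn as nn

class IntensityAcousticPredictor(nn.Module):
    """Two BLSTM layers followed by a two-layer DNN mapping an intonation phrase
    (text feature B plus its emotion category) to a 7-dim acoustic feature vector."""
    def __init__(self, input_dim=225, num_emotions=6, hidden_dim=128):
        super().__init__()
        self.emo_emb = nn.Embedding(num_emotions, 16)          # assumed category encoding
        self.blstm = nn.LSTM(input_dim + 16, hidden_dim, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.dnn = nn.Sequential(nn.Linear(2 * hidden_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 7))            # 7-dim intensity acoustics

    def forward(self, text_feature_b, emotion_id):
        # text_feature_b: (batch, num_words, 225); emotion_id: (batch,)
        emo = self.emo_emb(emotion_id).unsqueeze(1).expand(-1, text_feature_b.size(1), -1)
        _, (h_n, _) = self.blstm(torch.cat([text_feature_b, emo], dim=-1))
        h = torch.cat([h_n[-2], h_n[-1]], dim=-1)              # last layer, both directions
        return self.dnn(h)
```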
  • this step performs mapping processing on the obtained emotional intensity feature vector to convert it into a low-dimensional emotional intensity control vector (for example, the target dimension is 3 dimensions).
  • a multidimensional scaling (MDS) algorithm is used to perform the mapping.
  • The emotional intensity is positively correlated with the first and second dimensions of the three-dimensional control vector (i.e., MDS1 and MDS2), and negatively correlated with the third dimension (MDS3). Therefore, to increase the emotional intensity, increase the values of MDS1 and MDS2 or decrease the value of MDS3; conversely, to reduce the emotional intensity, decrease MDS1 and MDS2 or increase MDS3.
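  • A minimal sketch of the MDS mapping using scikit-learn (an assumed implementation); note that classical MDS as sketched here embeds a whole corpus at once, so the embodiment's mapping of new per-phrase vectors would additionally need an out-of-sample extension that is not shown:

```python
import numpy as np
from sklearn.manifold import MDS

# Hypothetical corpus of 7-dim emotional-intensity acoustic feature vectors:
acoustic_feats = np.random.rand(500, 7)

mds = MDS(n_components=3, random_state=0)
control_vectors = mds.fit_transform(acoustic_feats)   # (500, 3): MDS1, MDS2, MDS3 per phrase
```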
  • The emotional intensity can also be set according to the user's setting value; suppose the user can set the emotional intensity to one of three classes, "strong", "medium" and "weak"; the emotion intensity control vector can then be initialized with the central value of the "strong" region, the central value of the entire space, or the central value of the "weak" region of the emotion control vector space, respectively.
  • Step S2-4 is used to modify the emotional category of non-key intonation phrases.
  • step S3-4 is used to modify the emotional strength of non-key intonation phrases, so that the emotional speech of adjacent intonation phrases can transition naturally and coherently in emotional strength.
  • This embodiment adopts an implementation method based on the emotional intensity level prediction model.
  • The model is a two-layer LSTM neural network model, which can be trained on a large amount of emotional speech data used for recognition tasks. Its input is the splicing of the emotion category and emotion intensity of the left and right intonation phrases, the global emotion category and emotion intensity (that is, the emotion category and emotion intensity of the key intonation phrase), the emotion category of the current intonation phrase, and the emotional acoustic feature vector obtained from S3-1; its output is the emotion intensity of the current intonation phrase (one of the three classes "strong", "medium" and "weak").
  • the specific correction method can be:
  • In one case, the emotion intensity control vector of the current non-key intonation phrase obtained in S3-2 is not adjusted;
  • in another case, the emotion intensity control vector of the current non-key intonation phrase obtained in S3-2 is adjusted so that the corresponding emotion intensity is increased by a certain proportion (for example, the increase ratio is denoted as a, 0 < a < 1);
  • in a further case, the emotion intensity control vector of the current non-key intonation phrase obtained in S3-2 is adjusted so that the corresponding emotion intensity is reduced by a certain proportion (for example, the reduction ratio is denoted as b, 0 < b < 1).
  • As mentioned above, the emotion intensity control vector is a three-dimensional vector in which the first two dimensions are positively correlated with intensity and the last dimension is negatively correlated. Therefore, as an example, the operation of increasing the intensity by a can be to multiply the values of MDS1 and MDS2 by (1+a) and multiply the value of MDS3 by (1-a); similarly, for reducing by b, the operation is reversed.
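  • A minimal sketch of that scaling rule (the function name and the example ratios are illustrative only):

```python
import numpy as np

def adjust_intensity(control_vec, ratio):
    """Scale MDS1 and MDS2 by (1 + ratio) and MDS3 by (1 - ratio); a positive ratio
    raises the perceived intensity, a negative ratio lowers it (the reversed case)."""
    mds1, mds2, mds3 = control_vec
    return np.array([mds1 * (1 + ratio), mds2 * (1 + ratio), mds3 * (1 - ratio)])

# stronger = adjust_intensity(vec, 0.2)    # increase by a = 0.2
# weaker   = adjust_intensity(vec, -0.1)   # decrease by b = 0.1
```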
  • the steps of the fourth stage S4 can be executed to synthesize the voice information of the target text.
  • The text feature A output by the first stage S1, the emotion category of each intonation phrase determined in the second stage S2, and the emotion intensity control vector of each intonation phrase output by the third stage S3 are fed into a pre-trained end-to-end acoustic model based on a deep neural network to predict the corresponding emotional acoustic features, and emotional speech is finally generated through a vocoder.
  • The emotional speech corresponding to all the intonation phrases is sequentially spliced into the emotional speech corresponding to the large section of text.
  • the fourth stage S4 may specifically include the following steps:
  • The emotional acoustic model is constructed from a spectrogram prediction network (Tacotron).
  • The Tacotron model includes an encoder, a decoder, and an attention mechanism that serves as a bridge between the encoder and the decoder, as shown in Figure 5-4B.
  • Its input is the phoneme-based text feature A of each intonation phrase obtained by S1, the emotion category of each intonation phrase determined by S2, and the emotion intensity control vector of each intonation phrase output by S3; its output is the frame-level (for example, one frame every 12.5 milliseconds) linear-spectrum acoustic features (1025 dimensions) of each intonation phrase.
  • This step uses a vocoder (such as the Griffin-Lim vocoder) to process the emotional acoustic features generated by S4-1 and synthesize the voice information (audio) of each intonation phrase; the voice information of these intonation phrases is then spliced in sequence to obtain the final synthesized speech corresponding to the large segment of target text.
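  • A minimal sketch of S4-2 using librosa's Griffin-Lim implementation (an assumed toolkit); hop_length=200 corresponds to the 12.5 ms frame shift only under an assumed 16 kHz sampling rate:

```python
import numpy as np
import librosa  # assumed toolkit providing a Griffin-Lim implementation

def synthesize_and_splice(linear_spectrograms, hop_length=200):
    """Run Griffin-Lim on each phrase's predicted linear magnitude spectrogram
    (frequency bins x frames) and concatenate the waveforms in phrase order."""
    waves = [librosa.griffinlim(spec, hop_length=hop_length) for spec in linear_spectrograms]
    return np.concatenate(waves)
```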
  • Intonation phrases are used as the unit for text feature processing, emotion feature prediction and speech generation. Compared with the original sentence-level unit, this smaller synthesis unit offers greater operational flexibility, making the predicted emotion results richer and the emotional performance between units more controllable.
  • The approach of "predicting each phrase unit independently and then revising it based on the global emotion" maximizes the diversity of local emotion features while ensuring that the global emotional tone of the large section of text remains controllable, so the synthesized emotional speech is more emotionally expressive.
  • the transition from one voice emotion to another voice emotion involves not only the conversion of emotion categories, but also the gradual transition of emotion intensity.
  • Adopting the emotion intensity correction method of the present invention makes the changes of emotion intensity more coherent.
  • any of the above method embodiments can be executed by the smart voice device 1.
  • In this case, the memory 12 is used to store computer instructions for executing the solution of this application, and the processor 11 is used to execute the computer instructions in the memory 12 to perform the method provided by this application.
  • the output module 13 is used to output synthesized emotional voice information.
  • Any of the foregoing method embodiments can also be executed by the server 2.
  • In this case, the memory 22 is used to store computer instructions for executing the solution of the present application, and the processor 21 is used to execute the computer instructions in the memory 22 to perform any of the method embodiments provided in the present application.
  • any of the above method embodiments can be executed by the server 2 and the intelligent voice device 1.
  • For example, the intelligent voice device 1 is used to send the target text to the server 2; the server 2 is used to determine the emotion category of each sentence in the target text and send the emotion categories to the smart voice device 1; and the smart voice device 1 is further used to generate the voice information of the target text according to the emotion categories sent by the server 2 and output the voice information.
  • In the embodiments of the present application, functional modules may be divided corresponding to the functions, or two or more functions may be integrated in one functional module.
  • the above-mentioned integrated functional modules can be implemented either in the form of hardware or in the form of software functional units.
  • FIG. 6 shows a schematic structural diagram of a text information processing device.
  • an embodiment of the apparatus 600 for processing text information of the present application may include a sentence division module 601, a determination module 602, and a speech generation module 603.
  • The sentence division module 601 is configured to divide the target text into sentences to obtain a sentence sequence.
  • The determination module 602 is configured to perform the following steps: determine the emotion category of the target text; separately determine the initial emotion category of each sentence in the sentence sequence; and determine a first key sentence from the sentence sequence based on the emotion category of the target text and the initial emotion categories of the sentences in the sentence sequence.
  • The initial emotion category of the first key sentence is the same as the emotion category of the target text. The revised emotion category of the target sentence is obtained according to the initial emotion category of the first key sentence and the text features of the target sentence, where the target sentence is a sentence in the sentence sequence adjacent to the first key sentence whose initial emotion category is different from the emotion category of the target text.
  • The speech generation module 603 is configured to generate the voice information of the target sentence based on the revised emotion category of the target sentence determined by the determination module.
  • the sentence division module 601 is used to divide the target text into sentences according to intonation phrase division rules.
  • the emotion category of the target text is preset, or is obtained based on the text characteristics of the target text.
  • the determining module 602 is configured to determine the initial emotion category of the sentence to be determined based on the text feature of the sentence to be determined in the sentence sequence.
  • The determination module 602 is configured to: when the initial emotion category of the other sentence adjacent to the target sentence (besides the first key sentence) is the same as the emotion category of the target text, obtain the revised emotion category of the target sentence according to the initial emotion category of the first key sentence, the initial emotion category of the other adjacent sentence, and the text features of the target sentence.
  • The determination module 602 is further configured to: after the revised emotion category of the target sentence is obtained, when the initial emotion category of the other sentence adjacent to the target sentence (besides the first key sentence) is different from the emotion category of the target text, obtain the revised emotion category of that other adjacent sentence according to the revised emotion category of the target sentence and the text features of the other adjacent sentence; the speech generation module 603 is further configured to generate the voice information of the other adjacent sentence based on its revised emotion category.
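  • As a rough sketch of how the three modules of apparatus 600 compose, the code below replaces the per-sentence classifier and the revision model with placeholder callables and uses punctuation-based splitting in place of the intonation-phrase division rule; none of these placeholders are the actual components.

```python
import re
from typing import Callable, List, Tuple

class TextProcessingApparatus:
    """Sketch of apparatus 600: sentence division module 601, determination
    module 602, and speech generation module 603."""

    def __init__(self, classify: Callable[[str], str], revise: Callable[[str, str], str]):
        self.classify = classify        # placeholder emotion classifier
        self.revise = revise            # placeholder context-based revision model

    def divide(self, target_text: str) -> List[str]:                  # module 601
        return [s for s in re.split(r"[。！？!?]", target_text) if s.strip()]

    def determine(self, target_text: str) -> List[Tuple[str, str]]:   # module 602
        sentences = self.divide(target_text)
        text_emotion = self.classify(target_text)                     # emotion category of the text
        categories = [self.classify(s) for s in sentences]            # initial per-sentence categories
        for i, c in enumerate(categories):
            if c == text_emotion:                                     # first key sentence
                for j in (i - 1, i + 1):                              # adjacent target sentences
                    if 0 <= j < len(sentences) and categories[j] != text_emotion:
                        categories[j] = self.revise(text_emotion, sentences[j])
                break
        return list(zip(sentences, categories))

    def generate(self, sentence: str, category: str) -> str:          # module 603 (placeholder TTS)
        return f"[{category}] {sentence}"

apparatus = TextProcessingApparatus(classify=lambda text: "happy" if "好玩" in text else "neutral",
                                    revise=lambda key_cat, _sentence: key_cat)
for sentence, category in apparatus.determine("小猴跑到山下去。感到很好玩。"):
    print(apparatus.generate(sentence, category))
```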
  • The computer-executable instructions or computer instructions in the embodiments of this application may also be referred to as application program code, which is not specifically limited in the embodiments of this application.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium.
  • The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless manner (such as infrared, radio, or microwave).
  • The computer-readable storage medium may be any usable medium that a computer can store, or a data storage device such as a server or a data center that integrates one or more usable media.
  • The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)).
  • Words such as "exemplary" or "for example" are used to represent giving an example, an illustration, or a description. Any embodiment or design solution described as "exemplary" or "for example" in the embodiments of this application should not be construed as being more preferred or more advantageous than other embodiments or design solutions. Rather, words such as "exemplary" or "for example" are intended to present related concepts in a specific manner.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Machine Translation (AREA)

Abstract

A text information processing method and apparatus, a computer device, and a readable storage medium, applicable to emotional speech synthesis in the field of artificial intelligence. In the process of determining an emotion category for a sentence in a text, the processing method considers not only the emotion category predicted for that individual sentence but also the overall emotion category of the text in which the sentence is located. Generating voice information for the sentences of a text according to the method helps to produce voice information that better conforms to human habits of emotional expression, improving the degree of personification of intelligent voice devices.

Description

文本信息的处理方法及装置、计算机设备和可读存储介质
本申请要求于2020年2月3日提交中国专利局、申请号为“202010078977.0”、申请名称为“文本信息的处理方法及装置、计算机设备和可读存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及信息处理领域,尤其涉及一种文本信息的处理方法及装置、计算机设备和可读存储介质。
背景技术
语音技术(Speech Technology)的关键技术有自动语音识别技术(ASR)和语音合成技术(TTS)以及声纹识别技术。让计算机能听、能看、能说、能感觉,是未来人机交互的发展方向,其中语音成为未来最被看好的人机交互方式之一。近年来,语音合成技术取得了极大进步,机器语音播报在智能移动终端、智能家居、车载音响等设备上得以广泛应用。人们对语音合成的要求也不再仅仅是“能听清”,而是转变成“高度逼真,富有情感”,合成语音的质量成为衡量智能语音产品竞争力的一大重要因素。
但是,目前缺少为文本信息生成符合人的情感表达习惯的语音信息的研究,这制约了拟人机器人的发展。
发明内容
本申请实施例提供了一种文本信息的处理方法及装置、计算机设备和可读存储介质,有利于为文本生成符合人的情感表达习惯的语音信息,提高智能语音设备的拟人程度。
第一方面,本申请实施例提供一种文本信息的处理方法,包括:对目标文本进行语句划分,得到语句序列;确定所述目标文本的情感类别;分别确定所述语句序列中各语句的初始情感类别;基于所述目标文本的情感类别和所述语句序列中各语句的初始情感类别,从所述语句序列中确定出第一关键语句,所述第一关键语句的初始情感类别与所述目标文本的情感类别相同;根据所述第一关键语句的初始情感类别和所述目标语句的文本特征得到所述目标语句的修正情感类别,所述目标语句为所述语句序列中与所述第一关键语句相邻的语句,且所述目标语句的初始情感类别与所述目标文本的情感类别不同;基于所述目标语句的修正情感类别生成所述目标语句的语音信息。
人对文本中任一语句的含义的理解,通常不会孤立的进行,而是需要结合文本中的上下文来辅助理解该语句的含义。类似的,人对以语音形式表达的文本中任一语句的情感的理解和表达,同样如此。本申请实施例在为文本中的语句确定情感类别的过程中,不仅考虑了为该语句个体所预测的情感类别,还考虑了语句所在文本的整体情感类别,按照本申请实施例提供的方法为文本中语句生成语音信息,有利于为文本生成更加符合人的情感表 达习惯的语音信息,提高智能语音设备的拟人程度。
由于文本是非结构化的数据,为了便于计算机从文本中挖掘有用的信息,就需要将文本转化为计算机可处理的结构化形式的信息,称作文本特征,该文本特征一般为多维的向量。
在一种可能的实现方式中,目标文本的文本特征可以是根据语句序列中各语句的文本特征得到的。
在一种可能的实现方式中,所述对目标文本进行语句划分,包括:将所述目标文本按照语调短语划分规则进行语句划分。
在一种可能的实现方式中,所述对目标文本进行语句划分,包括:预测目标文本的韵律信息;以语调短语为单位对目标文本进行语句划分,得到语句序列。语句序列中的每个语句为一个语调短语。
在一种可能的实现方式中,文本的韵律信息可以用于指示目标文本中的韵律词、韵律短语和语调短语。
韵律词是一组在实际语流中联系密切的、经常联在一起发音的音节。一般,可以先行预测目标文本中的韵律词。
韵律短语是介于韵律词和语调短语之间的中等节奏组块。韵律短语可能小于句法上的短语,一个韵律短语一般包括一个或多个韵律词,韵律短语内部各个韵律词之间可能出现韵律上的节奏边界,具有相对稳定的短语语调模式和短语重音配置模式。韵律短语是指组成韵律短语的几个韵律词听起来是共用一个节奏群。预测得到目标文本的韵律词之后,可以根据预测得到的韵律词预测目标文本中的韵律短语。
语调短语就是将几个韵律短语按照一定的语调模式连接起来,一个语调短语一般包括一个或多个韵律短语。预测得到目标文本的韵律短语之后,可以根据预测得到的韵律短语预测目标文本中的语调短语。
在一种可能的实现方式中,语句序列中第一语句的文本特征可以为根据第一语句中各韵律词的文本特征得到的,第一语句可以为语句序列中的任意一个语句。
在一种可能的实现方式中,韵律词的文本特征可以为根据韵律词的词向量和/或韵律词的位置特征生成的。
在一种可能的实现方式中,韵律词的词向量可以是通过神经网络得到的,该神经网络可以是对Word2Vec模型或GloVe模型或Bert模型进行训练得到的。
在一种可能的实现方式中,韵律词的位置特征可以用于表示该韵律词在所在语调短语中的位置。例如,一个韵律词的位置特征可以用一个25维的向量表示,该向量的第一至第十维用于表示该韵律词在语调短语中的次序,该向量的第十一至第二十维用于表示该语调短语中韵律词的个数,该向量第二十一至二十五维用于该韵律词的韵律结果,例如,韵律结果可以用于表示该韵律词是否位于韵律短语或语调短语的结尾。
在一种可能的实现方式中,所述目标文本的情感类别为预先设定的,便于用户根据喜好设置语音信息的情感基调。
或者,目标文本的情感类别可以为基于所述目标文本的文本特征获得的。
在一种可能的实现方式中,所述分别确定所述语句序列中各语句的初始情感类别,具 体为:基于所述语句序列中待确定语句的文本特征确定所述待确定语句的初始情感类别。
在一种可能的实现方式中,所述根据所述第一关键语句的初始情感类别、目标语句的初始情感类别和所述目标语句的文本特征得到所述目标语句的修正情感类别,具体为:基于所述目标语句除所述第一关键语句外的另一相邻语句的初始情感类别与所述目标文本的情感类别相同,根据所述第一关键语句的初始情感类别、所述另一相邻语句的初始情感类别和所述目标语句的文本特征得到所述目标语句的修正情感类别。
在一种可能的实现方式中,在一种可能的实现方式中,在得到所述目标语句的修正情感类别之后,所述方法还包括:基于所述目标语句除所述第一关键语句外的另一相邻语句的初始情感类别与所述目标文本的情感类别不同,根据所述目标语句的修正情感类别和所述另一相邻语句的文本特征得到所述另一相邻语句的修正情感类别;基于所述另一相邻语句的修正情感类别生成所述另一相邻语句的语音信息。
以关键语句为中心,依次修正其左右非关键语调短语的情感类别,有利于保持目标文本中相邻语句间情感变化的连贯性。
第二方面,本申请实施例提供一种文本信息的处理装置,该装置包括用于执行上述第一方面或第一方面任意一种可能实现方式的方法的一个或多个功能单元。这些功能单元可以通过硬件实现,或者可以通过硬件执行相应的软件实现,或者由软件结合必要的硬件实现。
在一种可能的实现方式中,文本信息的处理装置可以包括:语句划分模块,用于对目标文本进行语句划分,得到语句序列;确定模块,用于执行如下步骤:确定所述目标文本的情感类别;分别确定所述语句序列中各语句的初始情感类别;基于所述目标文本的情感类别和所述语句序列中各语句的初始情感类别,从所述语句序列中确定出第一关键语句,所述第一关键语句的初始情感类别与所述目标文本的情感类别相同;根据所述第一关键语句的初始情感类别和所述目标语句的文本特征得到所述目标语句的修正情感类别,所述目标语句为所述语句序列中与所述第一关键语句相邻的语句,且所述目标语句的初始情感类别与所述目标文本的情感类别不同;语音生成模块,用于基于所述确定模块确定的所述目标语句的修正情感类别生成所述目标语句的语音信息。
在一种可能的实现方式中,所述语句划分模块用于,将所述目标文本按照语调短语划分规则进行语句划分。
在一种可能的实现方式中,所述目标文本的情感类别为预先设定的,或者,为基于所述目标文本的文本特征获得的。
在一种可能的实现方式中,所述确定模块用于,基于所述语句序列中待确定语句的文本特征确定所述待确定语句的初始情感类别。
在一种可能的实现方式中,所述确定模块用于,基于所述目标语句除所述第一关键语句外的另一相邻语句的初始情感类别与所述目标文本的情感类别相同,根据所述第一关键语句的初始情感类别、所述另一相邻语句的初始情感类别和所述目标语句的文本特征得到所述目标语句的修正情感类别所述。
在一种可能的实现方式中,所述确定模块还用于,在得到所述目标语句的修正情感类 别之后,基于所述目标语句除所述第一关键语句外的另一相邻语句的初始情感类别与所述目标文本的情感类别不同,根据所述目标语句的修正情感类别和所述另一相邻语句的文本特征得到所述另一相邻语句的修正情感类别;所述语音生成模块还用于,基于所述另一相邻语句的修正情感类别生成所述另一相邻语句的语音信息。
第三方面,本申请实施例提供一种计算机设备,包括:处理器和存储器;该存储器用于存储计算机执行指令,当该计算机设备运行时,该处理器执行该存储器存储的该计算机执行指令,以使该计算机设备执行如上述第一方面或第一方面任意一种可能实现方式的方法。
第四方面,本申请实施例提供一种计算机可读存储介质,该计算机可读存储介质中存储有指令,当其在计算机上运行时,使得计算机可以执行上述第一方面或第一方面任意一种可能实现方式的方法。
第五方面,本申请实施例提供一种包含指令的计算机程序产品,当其在计算机上运行时,使得计算机可以执行上述第一方面或第一方面任意一种可能实现方式的方法。
第六方面,本申请实施例提供一种芯片系统,该芯片系统包括处理器,用于支持计算机设备实现上述第一方面或第一方面任意一种可能的实现方式中所涉及的功能。在一种可能的设计中,芯片系统还包括存储器,存储器,用于保存计算机设备必要的程序指令和数据。该芯片系统,可以由芯片构成,也可以包含芯片和其他分立器件。
其中,第二方面、第三方面、第四方面、第五方面、第六方面中任一种实现方式所带来的技术效果可参见第一方面中相应实现方式所带来的技术效果,此处不再赘述。
附图说明
图1A是本申请实施例一个可能的应用场景的示意图;
图1B是本申请实施例智能语音设备一种可能的结构示意图;
图1C是本申请实施例另一个可能的应用场景的示意图;
图1D是本申请实施例服务器一种可能的结构示意图;
图2是本申请文本信息的处理方法一种可能的实施例示意图;
图3是本申请对目标文本进行语句划分的一个示意图;
图4是步骤201一种可能的细化流程示意图;
图5-1A是本申请文本信息的处理方法第一阶段过程的实施例示意图;
图5-1B是本申请基于CRF模型预测韵律信息的方法一种可能实施例示意图;
图5-2A是本申请文本信息的处理方法第二阶段过程的实施例示意图;
图5-2B是本申请情感类别分类模型一种可能的结构示意图;
图5-2C是本申请方法对《猴子穿鞋》故事中各语调短语的初始情感类别和不同全局情感下的修正情感类别的一种可能的预测结果;
图5-3是本申请文本信息的处理方法第三阶段过程的实施例示意图;
图5-4A是本申请文本信息的处理方法第四阶段过程的实施例示意图;
图5-4B是本申请情感声学模型的一种可能的结构示意图;
图6是本申请文本信息的处理装置一种可能的实施例示意图。
具体实施方式
下面结合附图,对本申请的实施例进行描述。
语音技术(Speech Technology)的关键技术有自动语音识别技术(ASR)和语音合成技术(TTS)以及声纹识别技术。让计算机能听、能看、能说、能感觉,是未来人机交互的发展方向,其中语音成为未来最被看好的人机交互方式之一。
本发明提出一种文本信息的处理方法,可以用于计算机设备,实现有情感的语音合成。
首先,对本申请实施例的应用场景进行介绍。
示例性的,图1A为本申请实施例一个可能的应用场景的示意图。在图1A对应的应用场景中,计算机设备可以是具备情感分析和处理的能力以及语音合成和输出的能力的实体(称作智能语音设备1)。示例性的,智能语音设备1可以是智能手机、或可发声的穿戴式终端上的智能语音助手、或智能音箱、或可与人对话的机器人等,图1A以智能语音设备1为智能手机为例。智能手机1可以将通过互联网获取(图1A以虚线箭头表示)的或本地存储的文本转化为有情感的语音信息,并向用户输出该有情感的语音信息(图1A以波动的曲线表示)。
图1B是本申请提供的智能语音设备1的一个实施例示意图。智能语音设备可以包括处理器11、存储器12和语音输出模块13。存储器12用于存储计算机程序;处理器11用于执行存储器12中的计算机程序,执行本申请提供的文本信息的处理方法;语音输出模块13用于向用户(人或其他机器人)输出有情感的语音信息,例如,输出模块13可以为扬声器。
在一种可能的实现方式中,智能语音设备1还可以包括输入模块14,输入模块14可以包括触摸屏、摄像头和麦克风阵列等中的一种或多种,触摸屏用于接收用户的触摸指令,摄像头用于检测图像信息,麦克风阵列用于检测音频数据。
在一种可能的实现方式中,智能语音设备1还包括通信接口15,用于与其他设备(例如服务器)进行通信。
在一种可能的实现方式中,智能语音设备中的各个模块可以通过总线16相互连接。
示例性的,图1C为本申请实施例另一个可能的应用场景的示意图。在图1C对应的一种应用场景中,计算机设备可以是服务器2,服务器2可以与智能语音设备1通信连接,图1C中以智能语音设备1为机器人为例。机器人1在与用户交流的过程中,服务器2可以将通过互联网获取的或机器人1发送的文本转化为有情感的语音信息,并将得到的语音信息发送给机器人1,由机器人1向用户输出该有情感的语音信息(图1C以波动的曲线表示)。
或者,继续参考图1C,在图1C对应的另一种应用场景中,计算机设备可以包括通信相连的智能语音设备1和服务器2。智能机器人1和服务器2可以相互配合,共同实现情感分析和处理以及语音合成的功能,例如,服务器2实现情感分析和处理的功能,智能机器人1根据服务器2的情感处理结果,实现语音合成和语音输出。
参考图1D,本申请实施例还提供一种服务器2。服务器2可以包括处理器21和存储器 22。存储器22用于存储计算机程序;处理器21用于执行存储器22中的计算机程序,执行本申请提供的文本信息的处理方法。
在一种可能的实现方式中,处理器21和存储器22可以通过总线24相互连接。
在一种可能的实现方式中,服务器2还可以包括通信接口23,用于与其他设备(例如智能语音设备1)进行通信。
图1B和/或图1D中的处理器可以是中央处理器(central processing unit,CPU),网络处理器(network processor,NP)或者CPU和NP的组合、数字信号处理器(digital signal processor,DSP)、专用集成电路(application specific integrated circuit,ASIC)、现成可编程门阵列(field programmable gate array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。可以实现或者执行本申请中的公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。结合本申请所公开的方法的步骤可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器,闪存、只读存储器,可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器,处理器读取存储器中的信息,结合其硬件完成上述方法的步骤。虽然图中仅仅示出了一个处理器,该装置可以包括多个处理器或者处理器包括多个处理单元。具体的,处理器可以是一个单核处理器,也可以是一个多核或众核处理器。该处理器可以是ARM架构处理器。
图1B和/或图1D中的存储器用于存储处理器执行的计算机指令。存储器可以是存储电路也可以是存储器。存储器可以是易失性存储器或非易失性存储器,或可包括易失性和非易失性存储器两者。其中,非易失性存储器可以是只读存储器(read-only memory,ROM)、可编程只读存储器(programmable ROM,PROM)、可擦除可编程只读存储器(erasable PROM,EPROM)、电可擦除可编程只读存储器(electrically EPROM,EEPROM)或闪存。易失性存储器可以是随机存取存储器(random access memory,RAM),其用作外部高速缓存。存储器可以独立于处理器,一种可能的实现方式中,处理器和存储器可以通过总线相互连接。总线可以是外设部件互连标准(peripheral component interconnect,PCI)总线或扩展工业标准结构(extended industry standard architecture,EISA)总线等。所述总线可以分为地址总线、数据总线、控制总线等。或者,存储器也可以是处理器中的存储单元,与处理器直接相连(attach),在此不做限定。虽然图中仅仅示出了一个存储器,该装置也可以包括多个存储器或者存储器包括多个存储单元。
下面对本申请实施例方法进行介绍。
参考图2,本申请文本信息的处理方法一个实施例可以包括:
201、对目标文本进行语句划分,得到语句序列;
计算机设备可以获取待转化为语音信息的文本,称作目标文本。过长的句子不便于后续语音生成,因此可以对目标文本进行语句划分,得到语句序列,通过将目标文本划分成更短的句子单位,以便后续语音生成。
为了便于理解,下面对步骤201进行举例说明。图3中,以多个交叉符号代表目标文 本的内容,每个交叉符号可以代表一个或多个字符,这里的字符可以为汉字字符,也可以为非汉字字符(例如阿拉伯数字或英文符号等)。图3示出了8条横线,同一横线上的交叉符号代表被划分为同一语句,8条横线代表将目标文本进行语句划分得到8个语句,为了便于描述,以横线下方的数字代表相应的语句,语句序列中的语句依次为:语句1、语句2、……、语句8句序列。
202、确定目标文本的情感类别;
获取目标文本后,可以确定目标文本的情感类别。目标文本的情感类别是指目标文本对应的一种情感类别。
在一种可能的实现方式中,目标文本的情感类别可以为出厂时设置的,也可以由用户根据需要设置。或者,在一种可能的实现方式中,目标文本的情感类别可以按照某种算法预测得到。在一种可能的实现方式中,可以基于目标文本的文本特征预测目标文本的初始情感类别。示例性的,可以将目标文本的文本特征输入训练好的神经网络,该神经网络用于预测文本的情感类别,从而可以得到目标文本的情感类别。
由于文本是非结构化的数据,为了便于计算机从文本中挖掘有用的信息,就需要将文本转化为计算机可处理的结构化形式的信息,称作文本特征,该文本特征一般为多维的向量。
203、分别确定语句序列中各语句的初始情感类别;
得到语句序列后,可以按照某种算法预测各语句的情感类别(称作初始情感类别)。
在一种可能的实现方式中,对于语句序列中任一待确定初始情感类别的语句(简称待确定语句),可以基于待确定语句的文本特征确定该待确定语句的初始情感类别。示例性的,可以将待确定语句的文本特征输入训练好的神经网络,该神经网络用于对语句的情感类别进行分类,或者说,预测语句的情感类别,从而可以得到该语句的情感类别。
示例性的,该神经网络可以通过深度神经网络或支持向量机或隐马尔科夫模型等分类模型构建并预先使用训练语料训练得到。
204、基于目标文本的情感类别和语句序列中各语句的初始情感类别,从语句序列中确定出第一关键语句;
确定目标文本的情感类别和语句序列中各语句的初始情感类别后,可以从语句序列中确定初始情感类别与目标文本的情感类别相同的语句,称作关键语句,关键语句的初始情感类别与目标文本的情感类别相同。
示例性的,继续参考图3,假设目标文本的情感类别为A,语句1、语句5和语句7的初始情感类别为A,可以确定语句1、语句5和语句7为语句序列中的关键语句,图3中以虚线框标识关键语句。
为了便于描述,将目标文本的一个关键语句称作第一关键语句。
205、根据第一关键语句的初始情感类别和目标语句的文本特征得到目标语句的修正情感类别;
确定语句序列中的第一关键语句,并且确定语句序列中各语句的初始情感类别之后,可以对语句序列中关键语句以外的语句的情感类别进行修正。
具体的,可以从语句序列中确定满足以下条件的语句(称作目标语句):与第一关键语 句相邻,并且初始情感类别与目标文本的情感类别不同。之后,可以根据第一关键语句的初始情感类别(即目标文本的情感类别)和目标语句的文本特征得到目标语句的修正情感类别。
继续以图3的语句序列为例,假设语句序列中语句2、语句3、语句4、语句6、语句8的初始情感类别依次为:B、C、D、B、C。若语句1为第一关键语句,那么语句2为目标语句;若语句5为第一关键语句,那么语句4和语句6均为目标语句;若语句7为第一关键语句,那么语句6和8均为目标语句。以语句1为第一关键语句,语句2为目标语句为例,可以根据情感类别A和语句2的文本特征,得到语句2的修正情感类别。
206、基于目标语句的修正情感类别生成目标语句的语音信息;
得到目标语句的修正情感类别之后,可以基于目标语句的修正情感类别生成目标语句的语音信息。之后,通过语音播放模块播放该语音信息,可以在输出目标语句的文本内容的同时,表达其修正情感类别对应的情感。
人对文本中任一语句的含义的理解,通常不会孤立的进行,而是需要结合文本中的上下文来辅助理解该语句的含义。类似的,人对以语音形式表达的文本中任一语句的情感的理解和表达,同样如此。本申请实施例可以为文本中的语句确定其情感类别,并为相应短句生成能够表达该情感类别的情感的语音信息。本申请实施例在为语句确定情感类别的过程中,不仅考虑了为该语句个体所预测的情感类别,还考虑了语句所在文本的整体情感类别,按照本申请实施例提供的方法为文本中语句生成语音信息,有利于为文本生成更加符合人的情感表达习惯的语音信息,提高智能语音设备的拟人程度。
本申请实施例中步骤号对应的步骤执行顺序仅作为一种可能的执行顺序,例如,本申请实施例不限定步骤202与步骤201和步骤203之间的时序关系,只要步骤202在步骤204之前执行即可。
在一种可能的实现方式中,可以基于第一关键语句的初始情感类别生成第一关键语句的语音信息。之后,通过语音播放模块播放该语音信息时,可以在输出第一关键语句的文本内容的同时,表达第一关键语句的初始情感类别对应的情感。
在一种可能的实现方式中,在确定关键语句的初始情感类别、非关键语句的修正情感类别之后,可以分别根据各语句和与各语句相应的情感类别生成各语句的语音,之后,可以按照各语句在短句序列中的次序将各语句的语音拼接为目标文本的语音。
在一种可能的实现方式中,当确定目标语句除所属第一关键语句外的另一相邻语句为关键语句(即另一相邻语句的初始情感类别与目标文本的情感类别相同)时,步骤205可以具体为:
根据第一关键语句的初始情感类别、另一相邻语句的初始情感类别和目标语句的文本特征得到目标语句的修正情感类别。
继续参考图3,假设第一关键语句为语句5,目标语句为语句6,由于语句7为关键语句,因此,可以根据情感类别A、情感类别A和目标语句的文本特征,得到语句6的修正情感类别。
在一种可能的实现方式中,若语句序列中的第一语句和第二语句的初始情感类别相同, 且与目标文本的情感类别不同,第一语句仅有一个相邻语句为关键语句,第二语句的两个相邻语句都是关键语句,那么,和第一语句的修正情感类别相比,第二语句的修正情感类别更接近目标文本的情感类别。
关于步骤201,一种常用语句划分方式是按目标文本中的标点符号(如逗号,句号,叹号等)对目标文本进行语句划分。对文本进行语句划分的粒度大小决定着文本语音所能表达的情感的细腻程度,粒度越大,例如以目标文本为一个语句,那么,该目标文本的语音信息只能表达一种情感类别;按照标点符号划分得到语句,其包含的内容可能较多,以这样的语句为最小粒度来生成目标文本的语音信息,无法体现语句内情感的波动,不利于提高文本语音所能表达的情感的细腻程度。
在一种可能的实现方式中,步骤201可以包括:将目标文本按照语调短语划分规则进行语句划分,得到语句序列。
在一种可能的实现方式中,参考图4,步骤201可以具体包括如下步骤:
2011、预测目标文本的韵律信息;
文本的韵律信息可以用于指示目标文本中的韵律词、韵律短语和语调短语。
韵律词是一组在实际语流中联系密切的、经常联在一起发音的音节。一般,可以先行预测目标文本中的韵律词。
韵律短语是介于韵律词和语调短语之间的中等节奏组块。韵律短语可能小于句法上的短语,一个韵律短语一般包括一个或多个韵律词,韵律短语内部各个韵律词之间可能出现韵律上的节奏边界,具有相对稳定的短语语调模式和短语重音配置模式。韵律短语是指组成韵律短语的几个韵律词听起来是共用一个节奏群。预测得到目标文本的韵律词之后,可以根据预测得到的韵律词预测目标文本中的韵律短语。
语调短语就是将几个韵律短语按照一定的语调模式连接起来,一个语调短语一般包括一个或多个韵律短语。预测得到目标文本的韵律短语之后,可以根据预测得到的韵律短语预测目标文本中的语调短语。
2012、以语调短语为单位对目标文本进行语句划分,得到语句序列;
预测得到目标文本的语调短句后,可以以语调短语为单位,对目标文本进行语句划分,得到语句序列。语句序列中的每个语句为一个语调短语。
和按标点符号对目标文本进行语句划分相比,按照语调短语进行语句划分得到的语句粒度更小,有利于体现两个标点符号之间语句的情感波动,有利于提高文本语音所能表达的情感的细腻程度。并且,实验结果显示,以语调短语为单位进行语句划分,预测语句的情感类别,可以使得情感预测更为可控,而又不会对合成的语音的韵律带来负面影响。
在一种可能的实现方式中,目标文本的文本特征可以是根据语句序列中各语句的文本特征得到的。
在一种可能的实现方式中,语句序列中第一语句的文本特征可以为根据第一语句中各韵律词的文本特征得到的,第一语句可以为语句序列中的任意一个语句。
在一种可能的实现方式中,韵律词的文本特征可以为根据韵律词的词向量和/或韵律词 的位置特征生成的。
示例性的,韵律词的词向量可以是通过神经网络得到的,该神经网络可以是对Word2Vec模型或GloVe模型或Bert模型进行训练得到的。
在一种可能的实现方式中,韵律词的位置特征可以用于表示该韵律词在所在语调短语中的位置。例如,一个韵律词的位置特征可以用一个25维的向量表示,该向量的第一至第十维用于表示该韵律词在语调短语中的次序,该向量的第十一至第二十维用于表示该语调短语中韵律词的个数,该向量第二十一至二十五维用于该韵律词的韵律结果,例如,韵律结果可以用于表示该韵律词是否位于韵律短语或语调短语的结尾。
在一种可能的实现方式中,在得到各语句的文本特征和情感类别(关键语句的初始情感类别和非关键语句的修正情感类别)后,可以分别根据各语句的文本特征和情感类别,预测各语句的情感强度。具体的,假设第一语句为语句序列中的任一语句,在一种可能的实现方式中,可以根据第一语句的文本特征及情感类别,预测第一语句的初始情感强度控制向量;可以确定该目标文本的全局情感强度;之后,利用全局情感强度等级和第一语句的初始情感强度控制向量确定第一语句的修正情感强度。
在一种可能的实现方式中,目标语句的第一强度差异大于目标语句的第二强度差异,目标语句的第一强度差异为目标语句的初始情感强度与目标文本的全局情感强度之间的差异,目标语句的第二强度差异为目标语句的修正情感强度与目标文本的全局情感强度之间的差异。
在一种可能的实现方式中,可以根据目标语句的修正情感类别和目标语句的修正情感强度生成目标语句的语音信息。
下面以《猴子穿鞋》这段文本的情感语音合成为例,介绍本申请文本信息的处理方法一种可能的实施例,该实施例基于端到端声学模型(如Tacotron)的语音合成框架,用于对大段文本进行情感语音的合成处理。
儿童故事《猴子穿鞋》的内容如下:
“一只小猴跑到山下去,看见人们都穿着鞋走路,感到很好玩。他偷偷地溜到一户人家,拿了一双鞋跑回山上,很得意地走来走去。这时来了一只凶猛的老虎,猴子们纷纷爬上了树。小猴穿着鞋怎么也爬不上树。猴妈妈大叫赶紧把鞋扔掉。小猴扔掉鞋很快爬上了树,从此再也不乱模仿别人了。”
以上述《猴子穿鞋》的内容为目标文本,下面描述本申请对目标文本的处理方法一种可能的具体实施例,本申请文本信息的处理方法另一种可能的实施例可以包括如下几个阶段的步骤:
参考图5-1A,第一阶段S1可以包括如下步骤:
S1-1、对目标文本进行规范化处理;
将待合成的文本中的非汉字字符,如阿拉伯数字、英文符号、各种符号等根据其上下文语义转化成对应的汉字字符;本实施例使用基于规则的方法,即收集和定义一个规则集合,遇到待规范化的文本时则同这些规则一一匹配,以得到对应的规范措施。本实施例所 用的《猴子穿鞋》文本已经为规范了的中文文本。可以另举例如:对于句子“遇到困难请拨打110”,其将会匹配到规则“(拨|打|呼叫|按|联系|call)(110|120|119|112|911)”,从而根据该规则判断得出数字应该按电报读法规范化。
S1-2、预测目标文本中的韵律信息;
预测目标文本中的韵律信息,韵律信息用于指示目标文本中的韵律结构,韵律结构包括韵律词、韵律短语和语调短语。依次预测不同层级的韵律结构的韵律信息,如韵律词、韵律短语、语调短语,不同的韵律结构的结尾在合成的语音中体现为不同的停顿时长。为了准确预测这些韵律信息,一般需要对文本事先进行分词和词性标注预测,继而依次按韵律词->韵律短语->语调短语的层级顺序进行预测。
图5-1B给出了本实例所用的基于条件随机森林(condition random forest,CRF)模型韵律预测例子,其输入文本为表1中的第三个句子。图5-1B中的流程图用于表示S1-2的一种可能的细化流程,流程图中每个步骤右侧的文字用于示例性表示相应步骤的结果。
参考图5-1B,步骤S1-2包括如下步骤:
S1-21、基于CRF模型的分词、词性标注;
以“/”代表分词的结尾,以字母代表前面分词的词性。
例如,可以“a”代表形容词,以“d”代表副词,以“f”代表方位词,以“m”代表数词,以“n”代表名词,以“q”代表量词,以“r”代表代词,以“u”代表助词,以“v”代表动词,以“w”代表标点符号。
S1-22、基于CRF模型的韵律词预测;
S1-23、基于CRF模型的韵律短语预测;
S1-24、基于CRF模型的语调短语预测;
表1中的第三个句子包括两个语调短语,分别为“这时来了一只凶猛的老虎”和“猴子们纷纷爬上了树”。
S1-3、以语调短语为单位对目标文本进行语句划分,得到语句序列;
过长的句子不便于后续语音生成,从而该步骤将输入的大段文本划分成更短的句子单位,以便后续语音生成。一种常用的划分方式是按标点符号(如句号,叹号等)将大段文本划分成各个子句子。而本实施例采用更小粒度的语调短语作为划分的短句子结果,语句序列中的任一语句为一个语调短语。后续步骤也即以语调短语为合成单位进行情感特征预测和语音生成,这是因为实验显示以语调短语为单位可以使得情感特征转换更为可控,而又不会对合成语音的韵律带来负面影响。即如图5-1B中所示的例子中,对应的输入句子将被划分成两个语调短语用于后续合成步骤。
S1-4、预测目标文本中汉字的注音;
对语调短语中的汉字预测其对应的拼音。如图5-1B中的两个语调短语的文字注音结果分别为:
zhe4 shi2 lai2 le5 yi4 zhi1 xiong1 meng3 de5 lao2 hu3;
hou2 zi5 men5 fen1 fen5 pa2 shang4 le5 shu4。
其中,拼音后的数字用于表示汉字的声调,例如,以“1”代表一声,以“2”代表二 声,以“3”代表三声,以“4”代表四声,以“5”代表其他声调,例如轻声。
S1-5、根据语句序列中各语调短语的韵律信息和注音生成各语调短语的基于音素的文本特征A;
经过步骤S1-1~S1-4的处理后,该步骤可以将以上特征组合成包括包含以上特征的音素级的文本特征(称作文本特征A),这样,每个语调短语的文本特征A用于指示相应语调短语的拼音、韵律词和韵律短语等。对于图5-1B中得到的两个语调短语,其基于音素的文本特征A结果分别如下:
^ #0 zhe4 #0 shi2 #2 lai2 #0 le5 #yi4 #0 zhi1 #1 xiong1 #0 meng3 #0 de5#1 lao2 #0 hu3#3 $;
^ #0 hou2 #0 zi5 #0 men5 #2 fen1 #0 fen5 #1 pa2 #0 shang4 #0 le5 #1 shu4 #3 $。
其中,^表示句首开始符,$为句尾结束符,#0,#1,#2,#3分别表示音节、韵律词、韵律短语和语调短语结束位置符号。
S1-6、为各语调短语中各韵律词生成词向量;
词向量生成,以语调短语为单位,将每个语调短语中的每个韵律词通过预训练好的词向量模型(本实施例使用的是Word2Vec模型,也可以使用其他模型,如GloVe,Bert等)转换成对应的词向量。比如对于“这时来了一只凶猛的老虎”这句语调短语,其中的韵律词“这时”、“来了”、“一只”、“凶猛的”、“老虎”分别通过Word2Vec模型转换为一个200维的词向量。
S1-7:根据各词向量生成相应语调短语的基于词向量的文本特征B;
对于每个语调短语,组合其中各个词的词向量以及上下文特征,生成一个语调短语级的文本特征。示例性的,该组合操作可以具体指拼接操作。示例性的,最终每个词对应的特征包括200维的词特征向量和25维的上下文特征。上下文特征可以采用独热(one-hot)编码的方式,表征当前词在语调短语中的位置、当前语调短语中韵律词的个数、当前词的韵律结果等。
在第一阶段S1完成之后,可以执行第二阶段S2的步骤,参考图5-2A,第二阶段可以具体包括如下步骤:
S2-1、识别各语调短语的初始情感类别;
利用预训练好的文本情感分类模型,以S1-7输出的各语调短语的文本特征B作为该分类模型的输入,分别确定各语调短语对应的初始情感类别。
该文本情感分类模型可通过深度神经网络、支持向量机、隐马尔科夫模型等分类模型构建并预先使用训练语料训练,本实例采用循环神经网络模型,即图5-2B所示的2层长短期记忆(long short-term memory,LSTM)网络作为情感类别分类模型。
S2-2、确定目标文本对应的全局情感类别;
当前大段文本的情感类别,或称全局情感类别,可以由用户事先指定,也可以通过情感分类模型自动识别;若为后者,则将所有的语调短语的文本特征B作为输入特征,使用S2-1所述的预先训练好的文本情感分类模型进行识别。
此处的“用户”可以包括智能终端程序的开发者,也可以智能终端的使用者,可以将全局情感偏好设置为积极(或正向)的情绪,例如高兴。
S2-3、确定关键语调短语;
根据S2-1和S2-2的识别结果,将S2-1得到的情感类别同S2-2所得的情感类别一致的语调短语标记为关键语调短语。关键语调短语的情感类别在后续步骤中将不会被改变。
S2-4、对语句序列中非关键语调短语的情感类别进行修正,得到修正情感类别;
其具体做法是事先使用情感文本训练数据训练一个基于上下文的情感类别修正模型。如同图5-2B所示,该模型是一个两层LSTM的神经网络模型,其可根据大量的用于识别任务的情感语音数据训练而成,其输入是由待修正的当前语调短语的左右语调短语的情感类别、全局情感类别、以及当前语调短语的文本特征A等拼接而成,而输出则为当前语调短语的情感类别。利用该情感类别修正模型,以关键语调短语为中心,依次修正其左右非关键语调短语的情感类别,直至所有非关键语调短语的类别特征被修正完毕为止。
图5-2C示例性示出了《猴子穿鞋》故事中各语调短语的初始情感类别和不同全局情感下的修正情感类别。在修正情感类别对应的列中,以符号“★”代表S2-3确定的关键语调短语。如图5-2C所示,在全局情感类别影响下,部分非关键语调短语的初始情感类别会得到修正,即修正情感类别与初始情感类别不同。
第二阶段完成之后,可以执行第三阶段S3的步骤,参考图5-3,第三阶段S3可以具体包括如下步骤:
S3-1、根据各语调短语的文本特征B和修正后的情感类别,预测各语调短语的情感强度声学特征向量;
该步骤会利用事先训练好的情感强度声学特征预测模型,将S1得到语调短语的文本特征B和S2得到的情感类别作为输入,输出情感声学特征向量。该情感强度声学特征预测模型使用2层双向长短时记忆网络(bidirectional long short-term memory,BLSTM)和2层深度神经网络(deep neural networks,DNN)层的神经网络模型构建,并事先使用准备好的情感训练语料训练而成,其输入是以词向量表征的语调短语的文本特征B和语调短语的情感类别拼接而成,而输出则为如下表所示的该语调短语的七维情感强度声学特征向量。
表1
Figure PCTCN2020115007-appb-000001
从而,对于每个语调短语,便可得到一个七维的情感强度声学特征向量。
S3-2、将各语调短语的情感强度声学特征向量映射成情感强度控制向量;
由于高维度的情感强度特征向量存在不易控制的缺点,所以该步骤对所获的情感强度特征向量进行映射处理,使其转换成低维度的情感强度控制向量(比如目标维度为3维)。
本实施例便采用多维缩放(multidimensional scaling,MDS)算法来进行该映射。对于 每种情感类别,我们事先利用该类别的训练语料训练一个多维标度法(multidimensional scaling,MDS)的特征矩阵M(M∈R3*7)。假设当前语调短语的情感强度声学特征向量为x(x∈R7*1),则最终该语调短语的情感控制向量为y=M'*x,得到一个三维的向量结果(其中M’为当前语调短语的情感类别对应的特征矩阵)。可以发现,情感强度同该三维控制向量的第一维特征和第二维特征(即MDS1和MDS2)正相关,而同第三维特征(MDS3)负相关。因而,若要增加情感强度,则可通过调大MDS1与MDS2的值,或调小MDS3值而获得;反之,若要减弱情感强度,则需调小MDS1与MDS2,或增大MDS3。
S3-3、确定关键语调短语的情感强度;
对于步骤S2-3所确定的关键语调短语,若用户事先设定偏好的情感强度,则以用户的设置值来设定情感强度;假设用户可设置的情感强度为“强”、“中”、“弱”三等,则可以分别以情感控制向量空间中的表示“强”的区域的中心值,整个空间的中心值,以及表示“弱”的区域的中心值来初始化情感控制向量,即情感强度。
这里仅以三个等级的情感强度为例,在实际使用中,也可以提供更多等级的情感强度。
S3-4、对语句序列中非关键语调短语的情感强度进行修正;
步骤S2-4用以修正非关键语调短语的情感类别,类似的,步骤S3-4用以修正非关键语调短语的情感强度,使得相邻语调短语的情感语音在情感强度上能过渡自然连贯。本实施例采用一种基于情感强度等级预测模型的实现方法。如同图5-2B所示,该模型是一个两层LSTM的神经网络模型,其可根据大量的用于识别任务的情感语音数据训练而成,其输入是由左右语调短语的情感类别和情感强度、全局情感类别和情感强度(即关键语调短语的情感类别和情感强度)、以及当前语调短语的情感类别和S3-1所得的情感声学特征向量等拼接而成,而输出则为当前语调短语的情感强度(“强”,“中”,“弱”三个类别)。利用该情感强度等级预测模型,以关键语调短语为中心,依次按以下步骤修正其左右非关键语调短语的情感强度,直至所有非关键语调短语的强度特征被修正完毕为止。
具体修正方式可以为:
1)参考S3-4,利用情感强度等级预测模型,预测得到当前非关键语调短语的初始情感强度。
2)确定S3-2所得情感强度控制向量对应的等级,与1)中的结果进行比较,根据比较结果对当前非关键语调短语的情感强度控制向量进行修正;
具体的,若二者相同,则可以不对S3-2得到的当前非关键语调短语的情感强度控制向量进行调整;
若S3-2所得情感强度对应的等级低于1)的结果,则对S3-2得到的当前非关键语调短语的情感强度控制向量进行调整,使得其对应的情感强度增加一定比例(例如,增加的比例记为a,0<a<1);
若S3-2所得情感强度对应的等级高于1)的结果,则对S3-2得到的当前非关键语调短语的情感强度控制向量进行调整,使得对应的情感强度减小一定比例(例如,减小的比例记为b,0<b<1)。
如S3-2所述,情感强度控制向量是一个三维的向量,其中前两维正相关,后一维负相 关,因此,作为举例,增加a的具体操作可以为将MDS1和MDS2的值分别乘以(1+a),而MDS3得值则乘以(1-a);同理,对于降低b,则其操作相反。
待以上三个阶段的步骤完成后,可以执行第四阶段S4的步骤,合成目标文本的语音信息。具体的,可以将以第一阶段S1输出的文本特征A、第二阶段S2确定的各语调短语的情感类别、第三阶段S3输出的各语调短语的情感强度控制向量,通过预训练好的基于深度神经网络的端到端声学模型,预测出对应的情感声学特征,最终通过声码器生成情感语音。所有的语料短语对应的情感语音按序拼接成大段文本对应的情感语音。
更为具体的,参考图5-4A,第四阶段S4可以具体包括如下步骤:
S4-1、根据各语调短语的文本特征A、情感类别和情感强度控制向量预测各语调短语的情感声学特征;
情感声学模型通过声谱预测网络(Tacotron)构建,Tacotron型包括编码器(encoder)、解码器(decoder)和用作编码器和解码器的桥接的注意力(attention),如图5-4B所示。其输入是S1所得的各语调短语的基于音素的文本特征A、S2确定的各语调短语的情感类别、S3输出的各语调短语的情感强度控制向量,输出特征则为各语调短语的帧级(如每12.5毫秒为一帧)的线性谱声学特征(维度1025维)。
S4-2、将各语调短语的情感声学特征通过声码器合成相应的语音信息;
该步骤使用声码器(例如Griffin-Lim声码器)将S4-1所生成的情感声学特征进行计算,合成各语调短语的语音信息(或称音频信息),这些语调短语的语音信息按序拼接后,即为大段目标文本对应的最终合成的语音信息。
以语调短语为单位进行文本特征处理,情感特征预测和语音生成,更小的合成单位较之原始的句子级单位,具有更大的操作灵活性,使得预测的情感结果更为丰富,单元间的情感表现更为可控。
对于情感特征预测,“先各个短语单元独立预测后基于全局情感进行修正”的方式,可使得既最大化局部情感特征的多样性,又可保证大段文本的全局情感基调可控,合成的情感语音更具情感表现力。
从一种语音情感过渡到另一种语音情感,其中不仅涉及情感类别的转换,也是需要考虑情感强度的逐渐过渡转变。采用本发明的情感强度修正方法,使得情感强度的变化衔接更连贯。
结合图1B,上述任一方法实施例可以由智能语音设备1执行,存储器12用于存储执行本申请方案的计算机指令,处理器11用于执行存储器12中的计算机指令时,执行本申请提供的任意一个方法实施例,输出模块13用于输出合成的有感情的语音信息。
结合图1D,上述任一方法实施例可以由服务器2,存储器22用于存储执行本申请方案的计算机指令,处理器21用于执行存储器22中的计算机指令时,执行本申请实施例提供的任意一个方法实施例。
结合图1D,上述任一方法实施例可以由服务器2和智能语音设备1共同执行,例如,智能语音设备1用于将目标文本发送给服务器2,服务器2用于确定目标文本中各语句的情感类别,并将情感类别发送给智能语音设备1,智能语音设备1还用于根据服务器2发 送的情感类别生成目标文本的语音信息,并输出该语音信息。
上面从方法和实体设备的角度对本申请实施例进行了介绍。下面,从功能模块的角度,介绍本申请实施例提供的文本信息的处理装置。
从功能模块的角度,本申请可以根据上述方法实施例对执行文本信息的处理方法的装置进行功能模块的划分,例如,可以对应各个功能划分各个功能模块,也可以将两个或两个以上的功能集成在一个功能模块中。上述集成的功能模块既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
比如,以采用集成的方式划分各个功能单元的情况下,图6示出了一种文本信息的处理装置的结构示意图。如图6所示,本申请文本信息的处理装置600的一个实施例可以包括语句划分模块601、确定模块602和语音生成模块603,其中,语句划分601用于对目标文本进行语句划分,得到语句序列。确定模块602,用于执行如下步骤:确定目标文本的情感类别;分别确定语句序列中各语句的初始情感类别;基于目标文本的情感类别和语句序列中各语句的初始情感类别,从语句序列中确定出第一关键语句,第一关键语句的初始情感类别与目标文本的情感类别相同;根据第一关键语句的初始情感类别和目标语句的文本特征得到目标语句的修正情感类别,目标语句为语句序列中与第一关键语句相邻的语句,且目标语句的初始情感类别与目标文本的情感类别不同。语音生成模块603,用于基于确定模块确定的目标语句的修正情感类别生成目标语句的语音信息。
在一种可能的实现方式中,语句划分模块601用于,将目标文本按照语调短语划分规则进行语句划分。
在一种可能的实现方式中,目标文本的情感类别为预先设定的,或者,为基于目标文本的文本特征获得的。
在一种可能的实现方式中,确定模块602用于,基于语句序列中待确定语句的文本特征确定待确定语句的初始情感类别。
在一种可能的实现方式中,确定模块602用于,基于目标语句除第一关键语句外的另一相邻语句的初始情感类别与目标文本的情感类别相同,根据第一关键语句的初始情感类别、另一相邻语句的初始情感类别和目标语句的文本特征得到目标语句的修正情感类别。
在一种可能的实现方式中,确定模块602还用于,在得到目标语句的修正情感类别之后,基于目标语句除第一关键语句外的另一相邻语句的初始情感类别与目标文本的情感类别不同,根据目标语句的修正情感类别和另一相邻语句的文本特征得到另一相邻语句的修正情感类别;语音生成模块603还用于,基于另一相邻语句的修正情感类别生成另一相邻语句的语音信息。
一种可能的实现方式,本申请实施例中的计算机执行指令或计算机指令也可以称之为应用程序代码,本申请实施例对此不作具体限定。
上述实施例,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现,当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。
所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机执行指令时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以是 通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、数字用户线(DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存储的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质,(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质(例如固态硬盘Solid State Disk(SSD))等。
本申请的说明书和权利要求书及上述附图中的术语“第一”、“第二”等是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。应该理解这样使用的术语在适当情况下可以互换,这仅仅是描述本申请的实施例中对相同属性的对象在描述时所采用的区分方式。此外,术语“包括”和“具有”以及他们的任何变形,意图在于覆盖不排他的包含,以便包含一系列单元的过程、方法、系统、产品或设备不必限于那些单元,而是可包括没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它单元。在本申请实施例中,“多个”指两个或两个以上。
本申请实施例中,“示例性的”或者“例如”等词用于表示作例子、例证或说明。本申请实施例中被描述为“示例性的”或者“例如”的任何实施例或设计方案不应被解释为比其他实施例或设计方案更优选或更具优势。确切而言,使用“示例性的”或者“例如”等词旨在以具体方式呈现相关概念。
在本申请的各实施例中,为了方面理解,进行了多种举例说明。然而,这些例子仅仅是一些举例,并不意味着是实现本申请的最佳实现方式。
以上对本申请所提供的技术方案进行了详细介绍,本申请中应用了具体个例对本申请的原理及实施方式进行了阐述,以上实施例的说明只是用于帮助理解本申请的方法及其核心思想;同时,对于本领域的一般技术人员,依据本申请的思想,在具体实施方式及应用范围上均会有改变之处,综上所述,本说明书内容不应理解为对本申请的限制。

Claims (15)

  1. 一种文本信息的处理方法,其特征在于,包括:
    对目标文本进行语句划分,得到语句序列;
    确定所述目标文本的情感类别;
    分别确定所述语句序列中各语句的初始情感类别;
    基于所述目标文本的情感类别和所述语句序列中各语句的初始情感类别,从所述语句序列中确定出第一关键语句,所述第一关键语句的初始情感类别与所述目标文本的情感类别相同;
    根据所述第一关键语句的初始情感类别和所述目标语句的文本特征得到所述目标语句的修正情感类别,所述目标语句为所述语句序列中与所述第一关键语句相邻的语句,且所述目标语句的初始情感类别与所述目标文本的情感类别不同;
    基于所述目标语句的修正情感类别生成所述目标语句的语音信息。
  2. 根据权利要求1所述的方法,其特征在于,所述对目标文本进行语句划分,包括:
    将所述目标文本按照语调短语划分规则进行语句划分。
  3. 根据权利要求1所述的方法,其特征在于,所述目标文本的情感类别为预先设定的,或者,为基于所述目标文本的文本特征获得的。
  4. 根据权利要求1所述的方法,其特征在于,所述分别确定所述语句序列中各语句的初始情感类别,具体为:
    基于所述语句序列中待确定语句的文本特征确定所述待确定语句的初始情感类别。
  5. 根据权利要求1至4中任一项所述的方法,其特征在于,所述根据所述第一关键语句的初始情感类别、目标语句的初始情感类别和所述目标语句的文本特征得到所述目标语句的修正情感类别,具体为:
    基于所述目标语句除所述第一关键语句外的另一相邻语句的初始情感类别与所述目标文本的情感类别相同,根据所述第一关键语句的初始情感类别、所述另一相邻语句的初始情感类别和所述目标语句的文本特征得到所述目标语句的修正情感类别。
  6. 根据权利要求1至5中任一项所述的方法,其特征在于,在得到所述目标语句的修正情感类别之后,所述方法还包括:
    基于所述目标语句除所述第一关键语句外的另一相邻语句的初始情感类别与所述目标文本的情感类别不同,根据所述目标语句的修正情感类别和所述另一相邻语句的文本特征得到所述另一相邻语句的修正情感类别;
    基于所述另一相邻语句的修正情感类别生成所述另一相邻语句的语音信息。
  7. 一种文本信息的处理装置,其特征在于,包括:
    语句划分模块,用于对目标文本进行语句划分,得到语句序列;
    确定模块,用于执行如下步骤:
    确定所述目标文本的情感类别;
    分别确定所述语句序列中各语句的初始情感类别;
    基于所述目标文本的情感类别和所述语句序列中各语句的初始情感类别,从所述语句 序列中确定出第一关键语句,所述第一关键语句的初始情感类别与所述目标文本的情感类别相同;
    根据所述第一关键语句的初始情感类别和所述目标语句的文本特征得到所述目标语句的修正情感类别,所述目标语句为所述语句序列中与所述第一关键语句相邻的语句,且所述目标语句的初始情感类别与所述目标文本的情感类别不同;
    语音生成模块,用于基于所述确定模块确定的所述目标语句的修正情感类别生成所述目标语句的语音信息。
  8. 根据权利要求7所述的装置,其特征在于,所述语句划分模块用于,将所述目标文本按照语调短语划分规则进行语句划分。
  9. 根据权利要求7所述的装置,其特征在于,所述目标文本的情感类别为预先设定的,或者,为基于所述目标文本的文本特征获得的。
  10. 根据权利要求7所述的装置,其特征在于,所述确定模块用于,基于所述语句序列中待确定语句的文本特征确定所述待确定语句的初始情感类别。
  11. 根据权利要求7至10中任一项所述的装置,其特征在于,所述确定模块用于,基于所述目标语句除所述第一关键语句外的另一相邻语句的初始情感类别与所述目标文本的情感类别相同,根据所述第一关键语句的初始情感类别、所述另一相邻语句的初始情感类别和所述目标语句的文本特征得到所述目标语句的修正情感类别所述。
  12. 根据权利要求7至11中任一项所述的装置,其特征在于,所述确定模块还用于,在得到所述目标语句的修正情感类别之后,基于所述目标语句除所述第一关键语句外的另一相邻语句的初始情感类别与所述目标文本的情感类别不同,根据所述目标语句的修正情感类别和所述另一相邻语句的文本特征得到所述另一相邻语句的修正情感类别;
    所述语音生成模块还用于,基于所述另一相邻语句的修正情感类别生成所述另一相邻语句的语音信息。
  13. 一种计算机设备,其特征在于,包括处理器和存储器,所述处理器在运行所述存储器存储的计算机指令时,执行如权利要求1至6中任一项所述的方法。
  14. 一种计算机可读存储介质,其特征在于,包括指令,当所述指令在计算机上运行时,使得计算机执行如权利要求1至6中任一项所述的方法。
  15. 一种计算机程序产品,其特征在于,包括指令,当所述指令在计算机上运行时,使得计算机执行如权利要求1至6中任一项所述的方法。
PCT/CN2020/115007 2020-02-03 2020-09-14 文本信息的处理方法及装置、计算机设备和可读存储介质 WO2021155662A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP20917614.8A EP4102397A4 (en) 2020-02-03 2020-09-14 Text information processing method and apparatus, computer device, and readable storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010078977.0 2020-02-03
CN202010078977.0A CN111274807B (zh) 2020-02-03 2020-02-03 文本信息的处理方法及装置、计算机设备和可读存储介质

Publications (1)

Publication Number Publication Date
WO2021155662A1 true WO2021155662A1 (zh) 2021-08-12

Family

ID=71002027

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/115007 WO2021155662A1 (zh) 2020-02-03 2020-09-14 文本信息的处理方法及装置、计算机设备和可读存储介质

Country Status (3)

Country Link
EP (1) EP4102397A4 (zh)
CN (1) CN111274807B (zh)
WO (1) WO2021155662A1 (zh)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111274807B (zh) * 2020-02-03 2022-05-10 华为技术有限公司 文本信息的处理方法及装置、计算机设备和可读存储介质
CN111724765B (zh) * 2020-06-30 2023-07-25 度小满科技(北京)有限公司 一种文本转语音的方法、装置及计算机设备
CN111899575A (zh) * 2020-07-21 2020-11-06 北京字节跳动网络技术有限公司 听写内容发布方法、装置、设备和存储介质
CN115862584A (zh) * 2021-09-24 2023-03-28 华为云计算技术有限公司 一种韵律信息标注方法以及相关设备
CN113990286A (zh) * 2021-10-29 2022-01-28 北京大学深圳研究院 语音合成方法、装置、设备及存储介质
WO2023102931A1 (zh) * 2021-12-10 2023-06-15 广州虎牙科技有限公司 韵律结构的预测方法、电子设备、程序产品及存储介质

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1787074A (zh) * 2005-12-13 2006-06-14 浙江大学 基于情感迁移规则及语音修正的说话人识别方法
US20090258333A1 (en) * 2008-03-17 2009-10-15 Kai Yu Spoken language learning systems
CN103198827A (zh) * 2013-03-26 2013-07-10 合肥工业大学 基于韵律特征参数和情感参数关联性的语音情感修正方法
CN103366731A (zh) * 2012-03-31 2013-10-23 盛乐信息技术(上海)有限公司 语音合成方法及系统
CN103810994A (zh) * 2013-09-05 2014-05-21 江苏大学 基于情感上下文的语音情感推理方法及系统
CN107039033A (zh) * 2017-04-17 2017-08-11 海南职业技术学院 一种语音合成装置
CN108364632A (zh) * 2017-12-22 2018-08-03 东南大学 一种具备情感的中文文本人声合成方法
CN108962217A (zh) * 2018-07-28 2018-12-07 华为技术有限公司 语音合成方法及相关设备
US20190138606A1 (en) * 2016-07-12 2019-05-09 Huawei Technologies Co., Ltd. Neural network-based translation method and apparatus
CN111274807A (zh) * 2020-02-03 2020-06-12 华为技术有限公司 文本信息的处理方法及装置、计算机设备和可读存储介质

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102033865A (zh) * 2009-09-25 2011-04-27 日电(中国)有限公司 基于子句关联的文本情感分类系统和方法
US8682649B2 (en) * 2009-11-12 2014-03-25 Apple Inc. Sentiment prediction from textual data
CN103455562A (zh) * 2013-08-13 2013-12-18 西安建筑科技大学 一种文本倾向性分析方法及基于该方法的商品评论倾向判别器
CN106708789B (zh) * 2015-11-16 2020-07-14 重庆邮电大学 一种文本处理方法及装置
CN106815192B (zh) * 2015-11-27 2020-04-21 北京国双科技有限公司 模型训练方法及装置和语句情感识别方法及装置
US20170213542A1 (en) * 2016-01-26 2017-07-27 James Spencer System and method for the generation of emotion in the output of a text to speech system
CN107967258B (zh) * 2017-11-23 2021-09-17 广州艾媒数聚信息咨询股份有限公司 文本信息的情感分析方法和系统
CN108664469B (zh) * 2018-05-07 2021-11-19 首都师范大学 一种情感类别确定方法、装置及服务器
CN110110323B (zh) * 2019-04-10 2022-11-11 北京明略软件系统有限公司 一种文本情感分类方法和装置、计算机可读存储介质
CN110222178B (zh) * 2019-05-24 2021-11-09 新华三大数据技术有限公司 文本情感分类方法、装置、电子设备及可读存储介质
CN110297907B (zh) * 2019-06-28 2022-03-08 谭浩 生成访谈报告的方法、计算机可读存储介质和终端设备
KR20190104941A (ko) * 2019-08-22 2019-09-11 엘지전자 주식회사 감정 정보 기반의 음성 합성 방법 및 장치

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1787074A (zh) * 2005-12-13 2006-06-14 浙江大学 基于情感迁移规则及语音修正的说话人识别方法
US20090258333A1 (en) * 2008-03-17 2009-10-15 Kai Yu Spoken language learning systems
CN103366731A (zh) * 2012-03-31 2013-10-23 盛乐信息技术(上海)有限公司 语音合成方法及系统
CN103198827A (zh) * 2013-03-26 2013-07-10 合肥工业大学 基于韵律特征参数和情感参数关联性的语音情感修正方法
CN103810994A (zh) * 2013-09-05 2014-05-21 江苏大学 基于情感上下文的语音情感推理方法及系统
US20190138606A1 (en) * 2016-07-12 2019-05-09 Huawei Technologies Co., Ltd. Neural network-based translation method and apparatus
CN107039033A (zh) * 2017-04-17 2017-08-11 海南职业技术学院 一种语音合成装置
CN108364632A (zh) * 2017-12-22 2018-08-03 东南大学 一种具备情感的中文文本人声合成方法
CN108962217A (zh) * 2018-07-28 2018-12-07 华为技术有限公司 语音合成方法及相关设备
CN111274807A (zh) * 2020-02-03 2020-06-12 华为技术有限公司 文本信息的处理方法及装置、计算机设备和可读存储介质

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4102397A4

Also Published As

Publication number Publication date
CN111274807B (zh) 2022-05-10
EP4102397A4 (en) 2023-06-28
CN111274807A (zh) 2020-06-12
EP4102397A1 (en) 2022-12-14

Similar Documents

Publication Publication Date Title
WO2021155662A1 (zh) 文本信息的处理方法及装置、计算机设备和可读存储介质
CN110782870B (zh) 语音合成方法、装置、电子设备及存储介质
JP2022534764A (ja) 多言語音声合成およびクロスランゲージボイスクローニング
JP2021196598A (ja) モデルトレーニング方法、音声合成方法、装置、電子機器、記憶媒体およびコンピュータプログラム
KR20220035180A (ko) E2E(End-to-end) 음성 합성 시스템에서 표현력 제어
Zhou et al. Emotion intensity and its control for emotional voice conversion
WO2020098269A1 (zh) 一种语音合成方法及语音合成装置
Kaur et al. Conventional and contemporary approaches used in text to speech synthesis: A review
Chen et al. Learning multi-scale features for speech emotion recognition with connection attention mechanism
JP7379756B2 (ja) 韻律的特徴からのパラメトリックボコーダパラメータの予測
CN115620699B (zh) 语音合成方法、语音合成系统、语音合成设备及存储介质
WO2021134591A1 (zh) 语音合成方法、装置、终端及存储介质
Zhang et al. Extracting and predicting word-level style variations for speech synthesis
CN115827854A (zh) 语音摘要生成模型训练方法、语音摘要生成方法及装置
CN113593520B (zh) 歌声合成方法及装置、电子设备及存储介质
Li et al. Inferring speaking styles from multi-modal conversational context by multi-scale relational graph convolutional networks
Zee et al. Paradigmatic relations interact during the production of complex words: Evidence from variable plurals in Dutch
CN116312463A (zh) 语音合成方法、语音合成装置、电子设备及存储介质
CN113191140B (zh) 文本处理方法、装置、电子设备及存储介质
Tsunematsu et al. Neural Speech Completion.
CN113823259A (zh) 将文本数据转换为音素序列的方法及设备
CN114254649A (zh) 一种语言模型的训练方法、装置、存储介质及设备
CN115206281A (zh) 一种语音合成模型训练方法、装置、电子设备及介质
Ronanki Prosody generation for text-to-speech synthesis
Wu et al. Predicting tonal realizations in one Chinese dialect from another

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20917614

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2020917614

Country of ref document: EP

Effective date: 20220905