CN110782875A - Voice rhythm processing method and device based on artificial intelligence - Google Patents

Voice rhythm processing method and device based on artificial intelligence

Info

Publication number
CN110782875A
Authority
CN
China
Prior art keywords: voice data, detected, tree model, speaker, detection result
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910984463.9A
Other languages
Chinese (zh)
Other versions
CN110782875B (en)
Inventor
林炳怀
王丽园
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201910984463.9A
Publication of CN110782875A
Application granted
Publication of CN110782875B
Legal status: Active

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/14 - Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Abstract

The invention provides a speech prosody processing method, apparatus, electronic device, and storage medium based on artificial intelligence. The method comprises the following steps: receiving voice data to be detected and text data corresponding to the voice data to be detected; aligning the voice data to be detected with the text data to obtain an alignment result; based on the alignment result, performing prosody detection on the voice data to be detected through a second-language speaker tree model to obtain a first detection result, and performing prosody detection on the voice data to be detected through a native speaker tree model to obtain a second detection result; and fusing the first detection result and the second detection result, and determining the fused detection result as the final prosody detection result of the voice data to be detected. With the method and the apparatus, the pronunciation prosody of the voice data to be detected can be accurately detected.

Description

Voice rhythm processing method and device based on artificial intelligence
Technical Field
The present invention relates to artificial intelligence voice processing technologies, and in particular, to an artificial intelligence based voice prosody processing method and apparatus, an electronic device, and a storage medium.
Background
Artificial Intelligence (AI) is a comprehensive branch of computer science that studies the design principles and implementation methods of intelligent machines, so that machines acquire the capabilities of perception, reasoning, and decision making. Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, such as natural language processing and machine learning/deep learning; as the technology develops, it will be applied in more fields and play an increasingly important role.
Speech prosody detection is an important application field of artificial intelligence technology. It is mainly used to perform prosody detection on a user's speech data, detect prosodic errors in the speech, and provide real-time feedback and correction to help the user improve their language proficiency.
However, the related art lacks a scheme that can accurately detect the prosody of a user's pronunciation.
Disclosure of Invention
The embodiment of the invention provides a voice prosody processing method and device based on artificial intelligence, electronic equipment and a storage medium, which can accurately detect the pronunciation prosody of voice data to be detected.
The technical scheme of the embodiment of the invention is realized as follows:
the embodiment of the invention provides a voice prosody processing method based on artificial intelligence, which comprises the following steps:
receiving voice data to be detected and text data corresponding to the voice data to be detected;
aligning the voice data to be detected and the text data to obtain an alignment result;
performing prosody detection on the voice data to be detected through a second-language speaker tree model based on the alignment result to obtain a first detection result, and
performing prosody detection on the voice data to be detected through a native speaker tree model to obtain a second detection result;
and fusing the first detection result and the second detection result, and determining the fused detection result as the final prosody detection result of the voice data to be detected.
The embodiment of the invention provides a voice prosody processing device based on artificial intelligence, which comprises:
the receiving module is used for receiving voice data to be detected and text data corresponding to the voice data to be detected;
the alignment module is used for aligning the voice data to be detected and the text data to obtain an alignment result;
the first detection module is used for performing prosody detection on the voice data to be detected through a second-language speaker tree model based on the alignment result to obtain a first detection result;
the second detection module is used for performing prosody detection on the voice data to be detected through a native speaker tree model based on the alignment result to obtain a second detection result;
and the fusion module is used for fusing the first detection result and the second detection result and determining the fused detection result as the final prosody detection result of the voice data to be detected.
In the above scheme, the alignment module is further configured to divide the voice data to be detected into N frames, extract the pitch and intensity of each frame of the voice data to be detected, and perform smoothing processing on the extracted pitch and intensity, where N is a positive integer;
perform voice recognition on each phoneme of each frame of the voice data to be detected to obtain the pronunciation start and end times corresponding to each phoneme; and
obtain the pitch, intensity, and pronunciation duration corresponding to each phoneme according to the correspondence between frame numbers and time.
In the foregoing solution, the first detection module includes: a first stress detection submodule, a first pause detection submodule, and a first boundary tone detection submodule;
the first stress detection submodule is used for detecting the stress position of the voice data to be detected through a stress second-language speaker tree model to obtain a first stress position;
the first pause detection submodule is used for detecting the pause position of the voice data to be detected through a pause second-language speaker tree model to obtain a first pause position;
and the first boundary tone detection submodule is used for detecting the boundary tone type of the voice data to be detected through a boundary tone second-language speaker tree model to obtain a first boundary tone type.
In the above scheme, the first stress detection submodule is further configured to obtain second-language speaker voice data samples and corresponding stress positions, and perform prosody detection processing on the second-language speaker voice data samples to obtain syllable pitch and intensity features, normalized pitch and intensity, and syllable pitch and intensity variation trend features;
select features with classification capability from the syllable pitch and intensity features, the normalized pitch and intensity, and the syllable pitch and intensity variation trend features as nodes to construct an initial stress second-language speaker tree model;
and prune the constructed initial stress second-language speaker tree model to obtain the stress second-language speaker tree model for detecting the first stress position.
In the above scheme, the first pause detection submodule is further configured to obtain second-language speaker voice data samples and corresponding pause positions, and perform prosody detection processing on the second-language speaker voice data samples to obtain word pitch and intensity features, normalized silence duration, and pitch and intensity variation trend features;
select features with classification capability from the word pitch and intensity features, the normalized silence duration, and the pitch and intensity variation trend features as nodes to construct an initial pause second-language speaker tree model;
and prune the constructed initial pause second-language speaker tree model to obtain the pause second-language speaker tree model for detecting the first pause position.
In the above scheme, the first boundary tone detection submodule is further configured to obtain second-language speaker voice data samples and corresponding boundary tone types, and perform prosody detection processing on the second-language speaker voice data samples to obtain pronunciation features of different granularities and pronunciation variation trend features of different granularities;
select features with classification capability from the pronunciation features of different granularities and the pronunciation variation trend features of different granularities as nodes to construct an initial boundary tone second-language speaker tree model;
and prune the constructed initial boundary tone second-language speaker tree model to obtain the boundary tone second-language speaker tree model for detecting the first boundary tone type.
In the foregoing solution, the second detection module includes: a second stress detection submodule, a second pause detection submodule, and a second boundary tone detection submodule;
the second stress detection submodule is used for detecting the stress position of the voice data to be detected through a stress native speaker tree model to obtain a second stress position;
the second pause detection submodule is used for detecting the pause position of the voice data to be detected through a pause native speaker tree model to obtain a second pause position;
and the second boundary tone detection submodule is used for detecting the boundary tone type of the voice data to be detected through a boundary tone native speaker tree model to obtain a second boundary tone type.
In the above scheme, the second stress detection submodule is further configured to obtain native speaker voice data samples and corresponding stress positions, and perform prosody detection processing on the native speaker voice data samples to obtain syllable pitch and intensity features, normalized pitch and intensity, and syllable pitch and intensity variation trend features;
select features with classification capability from the syllable pitch and intensity features, the normalized pitch and intensity, and the syllable pitch and intensity variation trend features as nodes to construct an initial stress native speaker tree model;
and prune the constructed initial stress native speaker tree model to obtain the stress native speaker tree model for detecting the second stress position.
In the above scheme, the second pause detection submodule is further configured to obtain native speaker voice data samples and corresponding pause positions, and perform prosody detection processing on the native speaker voice data samples to obtain word pitch and intensity features, normalized silence duration, and pitch and intensity variation trend features;
select features with classification capability from the word pitch and intensity features, the normalized silence duration, and the pitch and intensity variation trend features as nodes to construct an initial pause native speaker tree model;
and prune the constructed initial pause native speaker tree model to obtain the pause native speaker tree model for detecting the second pause position.
In the above scheme, the second boundary tone detection submodule is further configured to obtain native speaker voice data samples and corresponding boundary tone types, and perform prosody detection processing on the native speaker voice data samples to obtain pronunciation features of different granularities and pronunciation variation trend features of different granularities;
select features with classification capability from the pronunciation features of different granularities and the pronunciation variation trend features of different granularities as nodes to construct an initial boundary tone native speaker tree model;
and prune the constructed initial boundary tone native speaker tree model to obtain the boundary tone native speaker tree model for detecting the second boundary tone type.
In the above scheme, the fusion module is further configured to vote on the first detection result and the second detection result, and determine the detection result with the largest number of votes as the final prosody detection result of the voice data to be detected.
In the above scheme, the fusion module is further configured to weight the first detection result and the second detection result, and determine the weighted detection result as the final prosody detection result of the voice data to be detected.
The embodiment of the invention provides a voice prosody processing device based on artificial intelligence, which comprises:
a memory for storing executable instructions;
and the processor is used for realizing the artificial intelligence-based voice prosody processing method provided by the embodiment of the invention when the executable instructions stored in the memory are executed.
The embodiment of the invention provides a storage medium, which stores executable instructions and is used for causing a processor to execute so as to realize the artificial intelligence-based voice prosody processing method provided by the embodiment of the invention.
The embodiment of the invention has the following beneficial effects:
The embodiment of the invention provides a voice prosody processing method based on artificial intelligence. Considering the difference between the pronunciation prosody of second-language speakers and that of native speakers, a second-language speaker tree model and a native speaker tree model are constructed based on second-language speaker voice data samples and native speaker voice data samples, respectively; prosody detection is performed on the voice data to be detected through the constructed second-language speaker tree model and native speaker tree model, and the detection results are fused, so that the pronunciation prosody of the voice data to be detected can be accurately detected.
Drawings
FIG. 1 is a schematic diagram of an alternative architecture of an artificial intelligence based speech prosody processing system according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an alternative structure of an artificial intelligence based speech prosody processing device according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart of an alternative method for processing artificial intelligence based speech prosody provided by the embodiment of the invention;
FIG. 4 is a schematic flow chart of an alternative method for processing artificial intelligence based speech prosody provided by the embodiment of the invention;
FIG. 5 is a schematic structural diagram of a tree model for speech prosody detection according to an embodiment of the present invention;
FIG. 6 is a diagram illustrating an exemplary application scenario of a speech prosody processing method based on artificial intelligence according to an embodiment of the present invention;
FIG. 7 is a diagram illustrating the prosody detection results provided by an embodiment of the invention;
FIG. 8 is a schematic flow chart of an alternative method for processing artificial intelligence based speech prosody provided by the embodiment of the invention;
FIG. 9 is a schematic structural diagram of a word pause determination model according to an embodiment of the present invention;
FIG. 10 is a schematic structural diagram of a word stress determination model according to an embodiment of the present invention;
FIG. 11 is a schematic structural diagram of a sentence boundary tone judgment model according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present invention clearer, the present invention will be further described in detail with reference to the accompanying drawings. The described embodiments should not be construed as limiting the present invention, and all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
In the description that follows, the terms "first", "second", and the like are intended only to distinguish between similar objects and do not indicate a particular ordering of the objects. It should be understood that "first", "second", and the like may be interchanged in a specific order or sequence where permitted, so that the embodiments of the invention described herein can be practiced in an order other than that illustrated or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein is for the purpose of describing embodiments of the invention only and is not intended to be limiting of the invention.
Before describing the embodiments of the present invention in further detail, the terms and expressions used in the embodiments are explained; they have the meanings given below.
1) Sentence stress: the words that are stressed (read heavily) in a sentence, as opposed to the words that are read lightly.
2) Sentence pause: the pauses between intonation phrases in a sentence.
3) Boundary tone: the pitch contour at the end of a sentence, i.e., the pitch variation trend from the last stressed syllable to the end of the sentence, classified as rising, falling, and so on.
In the process of implementing the embodiments of the present invention, the inventors found that, when performing speech prosody detection, the related art generally extracts effective acoustic features and inputs the extracted acoustic features into a specific model for prosody detection. For example, stress positions and pause positions are detected by extracting effective acoustic features and inputting them into a Conditional Random Field (CRF) model; or syllable-level pause and stress are detected by combining acoustic features, the stressed-phoneme features of words in a dictionary, and grammatical features. Alternatively, based on an N-gram statistical model (a statistical language model that slides a window of size N over the text, byte by byte, to form a sequence of byte fragments of length N), prosodic features are used to model the uncertainty of word stress in a sentence: the greater the uncertainty, i.e., the smaller the product of feature probabilities at a given moment, the more likely the word is to be stressed. Finally, the correlation between the stress perceived by human listeners and the stress output by the model is calculated, and the correlation can reach about 0.55. Although this demonstrates that the uncertainty of sentence prosodic features has some correlation with sentence stress, the correlation is not strong. In addition, the related art also considers that sentence stress is related to features such as pronunciation energy, pitch, pronunciation duration, and phoneme attributes (vowel, consonant, etc.), and establishes a Hidden Markov Model (HMM) combining these features and the feature differences between adjacent syllables to detect the stress positions in a sentence.
In this regard, considering that speech prosody detection is mainly realized by two processes, extracting effective features and constructing an effective prosody detection model, both processes can be optimized. First, more effective prosodic pronunciation features can be extracted. Second, considering the difference between the pronunciation prosody of second-language speakers and that of native speakers, a second-language speaker prosody detection tree model and a native speaker prosody detection tree model can be constructed separately; the voice data to be detected is detected by each of the two models, and the detection results are fused to achieve a better detection of the pronunciation prosody of the voice data to be detected. Accordingly, the voice data to be detected and the text data corresponding to the voice data to be detected can be received; the voice data to be detected and the text data are aligned to obtain an alignment result; based on the alignment result, prosody detection is performed on the voice data to be detected through the second-language speaker tree model to obtain a first detection result, and prosody detection is performed on the voice data to be detected through the native speaker tree model to obtain a second detection result; and the first detection result and the second detection result are fused, and the fused detection result is determined as the final prosody detection result of the voice data to be detected.
In view of this, embodiments of the present invention provide a method and an apparatus for processing speech prosody based on artificial intelligence, an electronic device, and a storage medium, which can accurately detect the pronunciation prosody of the speech data to be detected.
The following describes an exemplary application of the artificial intelligence based speech prosody processing device according to the embodiment of the present invention, and the artificial intelligence based speech prosody processing device according to the embodiment of the present invention may be implemented as various types of user terminals such as a notebook computer, a tablet computer, a desktop computer, a set-top box, a mobile device (e.g., a mobile phone, a portable music player, a personal digital assistant, a dedicated messaging device, and a portable game device), may also be implemented as a server or a server cluster, and may also be implemented in a manner that the user terminal and the server cooperate with each other. In the following, an exemplary application will be explained when the electronic device is implemented as a server.
Referring to fig. 1, fig. 1 is an alternative architecture diagram of an artificial intelligence based speech prosody processing system 100 according to an embodiment of the present invention, in which user terminals 400 (user terminal 400-1 and user terminal 400-2 are shown as examples) are connected to a server 200 through a network 300, and the network 300 may be a wide area network, a local area network, or a combination of the two.
As shown in fig. 1, a user opens a speech prosody detection client 410 on a user terminal 400, inputs a sentence or a piece of text to be read aloud, and reads aloud according to the input text. The speech prosody detection client 410 then transmits the text data input by the user and the collected voice data corresponding to the text data to the server 200 through the network 300. After receiving the text data and the voice data reported by the speech prosody detection client 410, the server 200 aligns the received text data and voice data using automatic speech recognition technology to obtain an alignment result. Then, based on the alignment result, prosody detection is performed on the received voice data through the second-language speaker tree model to obtain a first detection result, and through the native speaker tree model to obtain a second detection result. After obtaining the first detection result and the second detection result, the server 200 fuses them, uses the fused detection result as the final prosody detection result of the voice data, and returns the final prosody detection result to the speech prosody detection client 410 to help the user find and correct problems in the pronunciation prosody.
Referring to fig. 2, fig. 2 is a schematic structural diagram of a server 200 according to an embodiment of the present invention, taking an example in which an artificial intelligence-based speech prosody processing device is implemented as the server 200, where the server 200 shown in fig. 2 includes: at least one processor 210, memory 250, at least one network interface 220, and a user interface 230. The various components in server 200 are coupled together by a bus system 240. It is understood that the bus system 240 is used to enable communications among the components. The bus system 240 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 240 in fig. 2.
The Processor 210 may be an integrated circuit chip having Signal processing capabilities, such as a general purpose Processor, a Digital Signal Processor (DSP), or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like, wherein the general purpose Processor may be a microprocessor or any conventional Processor, or the like.
The user interface 230 includes one or more output devices 231, including one or more speakers and/or one or more visual display screens, that enable the presentation of media content. The user interface 230 also includes one or more input devices 232, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
The memory 250 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 250 optionally includes one or more storage devices physically located remotely from processor 210.
The memory 250 includes volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. The nonvolatile memory may be a Read Only Memory (ROM), and the volatile memory may be a Random Access Memory (RAM). The memory 250 described in embodiments of the invention is intended to comprise any suitable type of memory.
In some embodiments, memory 250 is capable of storing data, examples of which include programs, modules, and data structures, or a subset or superset thereof, to support various operations, as exemplified below.
An operating system 251 including system programs for processing various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and processing hardware-based tasks;
a network communication module 252 for communicating with other computing devices via one or more (wired or wireless) network interfaces 220, exemplary network interfaces 220 including: Bluetooth, Wireless Fidelity (WiFi), Universal Serial Bus (USB), etc.;
a presentation module 253 to enable presentation of information (e.g., a user interface for operating peripherals and displaying content and information) via one or more output devices 231 (e.g., a display screen, speakers, etc.) associated with the user interface 230;
an input processing module 254 for detecting one or more user inputs or interactions from one of the one or more input devices 232 and translating the detected inputs or interactions.
In some embodiments, the artificial intelligence based speech prosody processing device provided by the embodiment of the present invention can be implemented in software, and fig. 2 shows an artificial intelligence based speech prosody processing device 255 stored in a memory 250, which can be software in the form of programs and plug-ins, and the like, and includes the following software modules: a receiving module 2551, an alignment module 2552, a first detection module 2553, a second detection module 2554 and a fusion module 2555, which are logical and thus can be arbitrarily combined or further split depending on the functions implemented. The functions of the respective modules will be explained below.
In other embodiments, the artificial intelligence based voice prosody processing Device provided by the embodiments of the present invention may be implemented in hardware, and for example, the artificial intelligence based voice prosody processing Device provided by the embodiments of the present invention may be a processor in the form of a hardware decoding processor, which is programmed to execute the artificial intelligence based voice prosody processing method provided by the embodiments of the present invention, for example, the processor in the form of the hardware decoding processor may employ one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), or other electronic elements.
The following describes an artificial intelligence based speech prosody processing method provided by an embodiment of the present invention in connection with an exemplary application of the artificial intelligence based speech prosody processing apparatus provided by an embodiment of the present invention when implemented as a server.
Referring to fig. 3, fig. 3 is an alternative flow chart of the artificial intelligence based speech prosody processing method according to the embodiment of the present invention, which will be described with reference to the steps shown in fig. 3.
In step S301, the client acquires voice data to be detected and receives text data corresponding to the voice data to be detected.
In step S302, the server receives the to-be-tested voice data and the text data corresponding to the to-be-tested voice data sent by the client.
In some embodiments, the voice data to be tested may be voice data of a user speaking freely, or may be voice data read by the user with respect to standard reference audio.
For example, the user may input an arbitrary sentence or a piece of text to be read aloud in the speech prosody detection application interface, such as "I knock the fact, do you knock?", then click the start-reading button and read aloud according to the input text, and click the end-reading button after finishing, thereby completing the reading of the sentence. The speech prosody detection application then sends the input text data "I knock the fact, do you knock?" and the collected voice data to the server.
For example, the speech prosody detection application may also provide reference text data; the user reads aloud according to the reference text data provided by the application, and the reference text data and the voice data collected during the reading are then sent to the server.
In step S303, the server aligns the to-be-detected speech data with the text data to obtain an alignment result.
In some embodiments, the server may perform alignment processing on the voice data to be detected and the text data by using automatic speech recognition technology, so as to obtain an alignment result.
For example, after receiving the voice data to be detected reported by the client, the server divides the received voice data to be detected into a plurality of frames, taking 10 milliseconds as one frame, and extracts the pitch and intensity of each frame of the voice data to be detected. Since the extracted pitch and intensity parameters are discrete points, they are smoothed with a sliding window. Then, each phoneme contained in each frame of the divided voice data to be detected is identified by automatic speech recognition technology to obtain the pronunciation start and end times corresponding to each phoneme, and features such as the pitch, intensity, and pronunciation duration corresponding to each phoneme are obtained through the correspondence between frame numbers and time.
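A minimal sketch of this framing and alignment step is given below, assuming the librosa library for pitch/energy extraction and an external forced aligner that supplies phoneme start/end times; the frame length, function names, and input layout are illustrative assumptions, not taken from the patent.

```python
import numpy as np
import librosa

FRAME_MS = 10  # assumed frame hop of 10 ms (illustrative)

def frame_level_features(wav_path, smooth_win=5):
    """Split audio into frames and extract smoothed pitch and intensity per frame."""
    y, sr = librosa.load(wav_path, sr=16000)
    hop = int(sr * FRAME_MS / 1000)
    # Pitch (F0) per frame; unvoiced frames come back as NaN and are zeroed here.
    f0, _, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr, hop_length=hop)
    f0 = np.nan_to_num(f0)
    # Intensity approximated by per-frame RMS energy.
    intensity = librosa.feature.rms(y=y, frame_length=2 * hop, hop_length=hop)[0]
    n = min(len(f0), len(intensity))
    # Moving-average smoothing, since the raw pitch/intensity values are discrete points.
    kernel = np.ones(smooth_win) / smooth_win
    f0 = np.convolve(f0[:n], kernel, mode="same")
    intensity = np.convolve(intensity[:n], kernel, mode="same")
    return f0, intensity

def phoneme_features(f0, intensity, phoneme_times):
    """Map frame-level pitch/intensity onto phonemes via their start/end times.

    phoneme_times: list of (phoneme, start_sec, end_sec) tuples produced by a
    forced aligner (an assumed upstream component, not shown here).
    """
    feats = []
    for ph, start, end in phoneme_times:
        lo = min(int(start * 1000 / FRAME_MS), len(f0) - 1)
        hi = min(max(lo + 1, int(end * 1000 / FRAME_MS)), len(f0))
        feats.append({
            "phoneme": ph,
            "pitch": float(np.mean(f0[lo:hi])),
            "intensity": float(np.mean(intensity[lo:hi])),
            "duration": end - start,
        })
    return feats
```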
In step S304, based on the alignment result, the server performs prosody detection on the voice data to be detected through the second-language speaker tree model to obtain a first detection result, and performs prosody detection on the voice data to be detected through the native speaker tree model to obtain a second detection result.
Here, the second-language speaker tree model is a prosody detection tree model constructed based on second-language speaker voice data samples, and the native speaker tree model is a prosody detection tree model constructed based on native speaker voice data samples, where second-language speakers and native speakers are two groups distinguished by whether the target language is their native language.
For example, if the artificial intelligence based speech prosody processing method provided by the embodiment of the present invention detects the prosody of a user's English pronunciation, the native speakers are people whose native language is English, such as British and American speakers, and the second-language speakers are people whose native language is not English, such as Chinese speakers whose native language is Chinese and Japanese speakers whose native language is Japanese. Constructing the native speaker tree model based on native speaker voice data samples means collecting English voice data of British, American, and other native English speakers as native speaker voice data samples to construct the native speaker tree model; constructing the second-language speaker tree model based on second-language speaker voice data samples means collecting English voice data of Chinese, Japanese, and other non-native speakers as second-language speaker voice data samples to construct the second-language speaker tree model.
For another example, if the speech prosody processing method based on artificial intelligence provided by the embodiment of the present invention detects the prosody of a user's Chinese pronunciation, the native speakers are people whose native language is Chinese, and the second-language speakers are people whose native language is not Chinese, such as Americans whose native language is English and Japanese whose native language is Japanese. Constructing the native speaker tree model based on native speaker voice data samples means collecting Chinese voice data of Chinese speakers as native speaker voice data samples to construct the native speaker tree model; constructing the second-language speaker tree model based on second-language speaker voice data samples means collecting Chinese voice data of Japanese, American, and other non-native speakers as second-language speaker voice data samples to construct the second-language speaker tree model.
Referring to fig. 4, fig. 4 is an optional flowchart of the artificial intelligence based speech prosody processing method according to the embodiment of the present invention, and in some embodiments, step S304 shown in fig. 3 may be implemented by steps S3041 to S3046 shown in fig. 4, which will be described in conjunction with each step.
In step S3041, the server detects the stress position of the voice data to be detected through the stress second-language speaker tree model to obtain a first stress position.
Here, the first stress position is the stress position detected by the stress second-language speaker tree model constructed based on second-language speaker voice data samples.
In some embodiments, during stress position detection, feature extraction may be performed at the syllable level, and the extracted syllable features are input into the stress second-language speaker tree model to obtain the first stress position of the voice data to be detected.
For example, based on the alignment result obtained in step S303, the server may extract relevant features of each syllable, such as: maximum pitch, minimum pitch, maximum intensity, minimum intensity, average intensity, rise and fall amplitude of intensity, rise and fall amplitude of pitch, and syllable duration. Meanwhile, considering that the pitch and intensity of different users do not lie in the same range, these features are normalized. In addition, because whether a syllable is stressed is also related to the other syllables in the word where the syllable is located, the features of the other syllables of the word are compared with the features of the current syllable and used as stress features. Based on the extracted syllable features, multidimensional features are generated by combining the features of the previous word and the next word of the current word, and these are used together as the input of the stress second-language speaker tree model, so that the stress second-language speaker tree model obtains the first stress position of the voice data to be detected based on the input multidimensional features.
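The following sketch illustrates how such syllable-level stress features might be assembled, under the assumption that frame-level pitch and intensity values have already been grouped per syllable; the field names and the normalization scheme are illustrative, not the patent's exact feature set.

```python
import numpy as np

def syllable_stress_features(syllables):
    """Build per-syllable stress features from aligned frame statistics.

    syllables: list of dicts with keys 'pitch' (array of frame pitches),
    'intensity' (array), 'duration' (sec), 'word_id' - an assumed input layout.
    """
    # Speaker-level normalization: different users have different pitch/intensity ranges.
    all_pitch = np.concatenate([s["pitch"] for s in syllables])
    all_inten = np.concatenate([s["intensity"] for s in syllables])
    p_mu, p_sd = all_pitch.mean(), all_pitch.std() + 1e-6
    i_mu, i_sd = all_inten.mean(), all_inten.std() + 1e-6

    feats = []
    for s in syllables:
        p, i = s["pitch"], s["intensity"]
        feats.append({
            "max_pitch": (p.max() - p_mu) / p_sd,
            "min_pitch": (p.min() - p_mu) / p_sd,
            "pitch_range": (p.max() - p.min()) / p_sd,
            "max_intensity": (i.max() - i_mu) / i_sd,
            "mean_intensity": (i.mean() - i_mu) / i_sd,
            "intensity_range": (i.max() - i.min()) / i_sd,
            "duration": s["duration"],
        })

    # A syllable is stressed relative to the other syllables of the same word,
    # so add within-word comparison features.
    for idx, s in enumerate(syllables):
        same_word = [f for j, f in enumerate(feats) if syllables[j]["word_id"] == s["word_id"]]
        word_max_pitch = max(f["max_pitch"] for f in same_word)
        feats[idx]["pitch_vs_word_max"] = feats[idx]["max_pitch"] - word_max_pitch
    return feats
```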
In step S3042, the server detects the pause position of the voice data to be detected through the pause second-language speaker tree model to obtain a first pause position.
Here, the first pause position is the pause position detected by the pause second-language speaker tree model constructed based on second-language speaker voice data samples.
In some embodiments, during pause position detection, feature extraction may be performed at the word level, and the extracted word features are input into the pause second-language speaker tree model to obtain the first pause position of the voice data to be detected.
For example, because pauses in a sentence are mainly related to the silence duration after a word, and the speech rate of different users differs, the silence duration is first normalized by speech rate. In addition, since a word with high energy is often followed by a pause, the pitch and intensity features of the word and their statistics are combined as features for detecting sentence pauses. Furthermore, a sudden change in pitch or intensity from one word to the next is also a mark of a pause, so the pitch and intensity change features between adjacent words can be calculated as features for detecting sentence pauses. Based on the alignment result obtained in step S303, the server extracts the above word features, generates multidimensional features by combining the features of the previous word and the next word of the current word, and uses them together as the input of the pause second-language speaker tree model, so that the pause second-language speaker tree model obtains the first pause position of the voice data to be detected based on the input multidimensional features.
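A possible shape of these word-level pause features is sketched below; the input layout and the speech-rate normalization by mean word duration are assumptions for illustration only.

```python
def word_pause_features(words):
    """Pause-detection features for each word boundary.

    words: list of dicts with 'pitch_mean', 'intensity_mean', 'duration',
    'silence_after' (seconds of silence following the word) - assumed layout.
    """
    # Normalize silence by the speaker's own speech rate (mean word duration),
    # since slow speakers naturally leave longer gaps everywhere.
    mean_dur = sum(w["duration"] for w in words) / len(words)
    feats = []
    for k, w in enumerate(words):
        prev_w = words[k - 1] if k > 0 else w
        next_w = words[k + 1] if k + 1 < len(words) else w
        feats.append({
            "norm_silence": w["silence_after"] / mean_dur,
            "pitch_mean": w["pitch_mean"],
            "intensity_mean": w["intensity_mean"],
            # Sudden pitch/intensity changes between adjacent words also mark pauses.
            "pitch_delta_next": next_w["pitch_mean"] - w["pitch_mean"],
            "intensity_delta_next": next_w["intensity_mean"] - w["intensity_mean"],
            "pitch_delta_prev": w["pitch_mean"] - prev_w["pitch_mean"],
        })
    return feats
```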
In step S3043, the server detects the boundary tone type of the voice data to be detected through the boundary tone second-language speaker tree model to obtain a first boundary tone type.
Here, the first boundary tone type is the boundary tone type detected by the boundary tone second-language speaker tree model constructed based on second-language speaker voice data samples.
In some embodiments, during boundary tone type detection, feature extraction may be performed at different granularities, and the extracted multi-granularity pronunciation features are input into the boundary tone second-language speaker tree model to obtain the first boundary tone type of the voice data to be detected.
For example, since different users judge the boundary tone inconsistently (some based on the last word, some based on the last syllable of the word, and some based on the stressed syllable of the last word), the boundary tone type can be detected by combining features at different levels, such as phonemes, syllables, and words. For example, based on the alignment result obtained in step S303, the server may extract the pronunciation features of the stressed syllable, the pronunciation features of the non-stressed syllables, and the pronunciation features at the phoneme and word levels, and generate multidimensional features by combining the features of the previous word and the next word of the current word, which are used as the input of the boundary tone second-language speaker tree model, so that the boundary tone second-language speaker tree model obtains the first boundary tone type of the voice data to be detected based on the input multidimensional features.
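The sketch below shows one way to combine pronunciation features at word, syllable, and phoneme/frame granularity for boundary tone detection; the simple end-minus-start slope is an illustrative stand-in for the patent's trend features.

```python
def boundary_tone_features(last_word):
    """Multi-granularity boundary-tone features for the sentence-final region.

    last_word: dict with 'pitch' (frame pitches of the word), 'last_syllable_pitch'
    and 'stressed_syllable_pitch' (frame pitch arrays) - assumed layout.
    """
    def slope(seq):
        # Simple pitch trend: difference between the ends of the segment.
        return float(seq[-1] - seq[0]) if len(seq) > 1 else 0.0

    return {
        "word_pitch_slope": slope(last_word["pitch"]),                   # word level
        "last_syllable_slope": slope(last_word["last_syllable_pitch"]),  # syllable level
        "stressed_syllable_slope": slope(last_word["stressed_syllable_pitch"]),
        "final_pitch": float(last_word["pitch"][-1]),                    # frame/phoneme level
    }
```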
In step S3044, the server detects the stress position of the voice data to be detected through the stress native speaker tree model to obtain a second stress position.
Here, the second stress position is the stress position detected by the stress native speaker tree model constructed based on native speaker voice data samples.
The process of detecting the second stress position of the voice data to be detected through the stress native speaker tree model may refer to the specific process of detecting the first stress position of the voice data to be detected through the stress second-language speaker tree model, which is not repeated here in the embodiment of the present invention.
In step S3045, the server detects the pause position of the voice data to be detected through the pause native speaker tree model to obtain a second pause position.
Here, the second pause position is the pause position detected by the pause native speaker tree model constructed based on native speaker voice data samples.
The process of detecting the second pause position of the voice data to be detected through the pause native speaker tree model may refer to the specific process of detecting the first pause position of the voice data to be detected through the pause second-language speaker tree model, which is not repeated here in the embodiment of the present invention.
In step S3046, the server detects the boundary tone type of the voice data to be detected through the boundary tone native speaker tree model to obtain a second boundary tone type.
Here, the second boundary tone type is the boundary tone type detected by the boundary tone native speaker tree model constructed based on native speaker voice data samples.
The process of detecting the second boundary tone type of the voice data to be detected through the boundary tone native speaker tree model may refer to the specific process of detecting the first boundary tone type of the voice data to be detected through the boundary tone second-language speaker tree model, which is not repeated here in the embodiment of the present invention.
It should be noted that steps S3041 to S3046 have no fixed order; they may be executed sequentially in any order or executed simultaneously. Further, before performing the prosody detection of step S304, the second-language speaker tree models and the native speaker tree models need to be constructed based on second-language speaker voice data samples and native speaker voice data samples, respectively.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a tree model for speech prosody detection according to an embodiment of the present invention, taking a single decision tree as an example. As shown in fig. 5, the decision tree involves the following node types: a root node (Root Node), which represents the entire set of voice data samples and can be divided into two or more subsets; decision nodes (Decision Nodes), i.e., sub-nodes that are further split into a plurality of sub-nodes; and leaf nodes (Leaf/Terminal Nodes), i.e., nodes that cannot be split any further.
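For illustration, a decision tree with these three node types could be represented by a structure like the following; this is a generic sketch, not the patent's implementation, and all feature names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TreeNode:
    """One node of a prosody-detection decision tree (illustrative structure only)."""
    feature: Optional[str] = None      # feature tested at this node, e.g. "norm_silence"
    threshold: Optional[float] = None  # split threshold for that feature
    label: Optional[int] = None        # class label if this is a leaf (e.g. 1 = pause)
    children: List["TreeNode"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        # Leaf/terminal nodes cannot be split any further.
        return not self.children

# The root node splits the whole sample set; its split children are decision nodes
# until the branches end in leaves.
root = TreeNode(feature="norm_silence", threshold=0.8, children=[
    TreeNode(label=0),                                 # short silence: no pause
    TreeNode(feature="pitch_delta_next", threshold=-20.0, children=[
        TreeNode(label=1), TreeNode(label=0)]),        # long silence: check pitch drop
])
print(root.is_leaf(), root.children[0].is_leaf())  # False True
```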
In some embodiments, constructing (training) a decision tree includes three stages: feature selection, decision tree generation, and decision tree pruning, which are described separately below.
In the feature selection stage, one feature is selected from the many features related to the voice data sample set as the criterion for splitting the current node. Different quantitative evaluation methods for selecting the feature lead to different decision trees: for example, the Iterative Dichotomiser 3 (ID3) algorithm selects features by information gain, the C4.5 algorithm selects features by information gain ratio, and the Classification And Regression Tree (CART) algorithm selects features by the Gini index. After the data set is partitioned using a feature, each data subset is purer (i.e., less uncertain) than the data set D before the partition.
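The splitting criteria mentioned above can be computed as follows; this is a generic sketch of information gain and the Gini index, not code from the patent.

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label set (used by ID3/C4.5 via information gain)."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def gini(labels):
    """Gini index of a label set (used by CART)."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(1.0 - (p ** 2).sum())

def information_gain(labels, subsets):
    """Information gain of splitting `labels` into `subsets` by some feature."""
    n = len(labels)
    conditional = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(labels) - conditional

# Toy example: 1 = frame contains a stress, 0 = it does not.
parent = [1, 1, 0, 0, 0, 1]
left, right = [1, 1, 1], [0, 0, 0]   # a perfect split by some acoustic feature
print(information_gain(parent, [left, right]))  # 1.0: maximal gain
print(gini(parent))                              # 0.5 before the split
```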
In the decision tree generation stage, child nodes are generated recursively from top to bottom according to the selected feature splitting criterion, and the growth of the decision tree stops when the data set can no longer be split. This process continually partitions the data set, using features that satisfy the splitting criterion, into subsets that are purer and less uncertain; for each partition of the current data set, the subsets obtained by splitting on a given feature should have higher purity and lower uncertainty.
In the decision tree pruning stage, to address the tendency of decision trees to overfit, redundant nodes in the tree structure are removed by pruning, so as to reduce the scale of the decision tree and alleviate overfitting.
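A hedged example of the generate-then-prune workflow, using scikit-learn's CART-style classifier and cost-complexity pruning as a stand-in for the pruning procedure described above; the synthetic data merely stands in for prosodic feature vectors and stress labels.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: rows are prosodic feature vectors, labels mark stressed frames.
X, y = make_classification(n_samples=500, n_features=12, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Grow the tree with the Gini criterion (CART-style feature selection at each node).
full_tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X_tr, y_tr)

# Cost-complexity pruning: pick the alpha that keeps held-out accuracy highest,
# which removes redundant nodes and mitigates overfitting.
path = full_tree.cost_complexity_pruning_path(X_tr, y_tr)
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_tr, y_tr) for a in path.ccp_alphas),
    key=lambda t: t.score(X_te, y_te),
)
print(best.get_n_leaves(), best.score(X_te, y_te))
```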
In some embodiments, prosody detection is performed using an ensemble learning model that includes a plurality of decision trees, such as a random forest model or a Gradient Boosting Decision Tree (GBDT). Each decision tree in a random forest model performs speech prosody detection independently, and the detection results of the multiple decision trees are combined by voting to determine the final detection result, for example by relative majority voting, in which each decision tree has the same weight and the detection results of the decision trees are summed according to their weights to obtain the final detection result.
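For example, off-the-shelf tree ensembles could be trained as follows (again on synthetic stand-in features); note that scikit-learn's random forest averages per-tree probabilities, which with equal weights behaves like the equal-weight voting described above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=12, random_state=0)

# Random forest: each tree detects prosody independently; their outputs are merged
# across trees (probability averaging, equivalent in spirit to equal-weight voting).
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# GBDT: trees are added sequentially, each one correcting the previous trees' errors.
gbdt = GradientBoostingClassifier(n_estimators=100, random_state=0).fit(X, y)

print(rf.predict(X[:3]), gbdt.predict(X[:3]))
```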
In step S305, the first detection result and the second detection result are fused, and the fused detection result is determined as the final prosody detection result of the voice data to be detected.
In some embodiments, the first detection result and the second detection result may be put to a vote, and the detection result with the largest number of votes is determined as the final prosody detection result of the voice data to be detected.
For example, taking sentence stress detection as an example, assume that the tree models are classification trees or ensemble learning models based on classification trees, which only output 0 or 1 to indicate whether the current frame of the voice data to be detected contains a stress. For any frame of the voice data to be detected, the stress second-language speaker tree model and the stress native speaker tree model each output a number of detection results; a vote is taken over the output detection results, and the detection result with the largest number of votes is used as the judgment of whether the current frame of the voice data to be detected contains a stress.
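A minimal sketch of this frame-by-frame majority vote over the two classification-tree models' outputs, with illustrative inputs:

```python
from collections import Counter

def vote_fusion(l2_results, native_results):
    """Fuse per-frame 0/1 stress decisions from the second-language speaker model
    and the native speaker model by majority vote, frame by frame."""
    fused = []
    for frame_votes in zip(l2_results, native_results):
        # Each element may itself hold several votes (e.g. one per tree in a forest).
        votes = [v for model_votes in frame_votes for v in model_votes]
        fused.append(Counter(votes).most_common(1)[0][0])
    return fused

# Example: two frames; each model contributes three tree votes per frame.
print(vote_fusion([[1, 1, 0], [0, 0, 1]], [[1, 0, 1], [0, 0, 0]]))  # [1, 0]
```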
In other embodiments, the first detection result and the second detection result may be weighted, and the weighted detection result is determined as the final prosody detection result of the voice data to be detected.
For example, taking sentence stress detection as an example, assume that the tree models are regression trees or ensemble learning models based on regression trees, which output a value between 0 and 1 representing the probability that the current frame of the voice data to be detected contains a stress. For any frame of the voice data to be detected, the stress second-language speaker tree model and the stress native speaker tree model each output at least one detection result; the outputs are weighted according to the corresponding weights to obtain a final value, and it is judged whether the final value exceeds a specified threshold. When the final value exceeds the specified threshold, it is determined that the current frame of the voice data to be detected contains a stress.
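A minimal sketch of the weighted fusion with a decision threshold; the weights and threshold values are illustrative assumptions, not values from the patent:

```python
def weighted_fusion(l2_scores, native_scores, w_l2=0.4, w_native=0.6, threshold=0.5):
    """Fuse per-frame stress probabilities from the two regression-tree models."""
    fused = []
    for p_l2, p_native in zip(l2_scores, native_scores):
        score = w_l2 * p_l2 + w_native * p_native
        # The frame is marked as stressed only when the weighted score exceeds the threshold.
        fused.append(1 if score > threshold else 0)
    return fused

print(weighted_fusion([0.9, 0.2], [0.7, 0.4]))  # [1, 0]
```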
In step S306, the server issues the final prosody detection result to the client for display.
In some embodiments, the server determines the standard prosody corresponding to the received text data, compares the determined standard prosody with the final prosody detection result obtained in step S305, labels the corresponding positions of the text data according to the comparison result, and returns the labeled text data to the client for display.
Continuing with the exemplary structure in which the artificial intelligence based speech prosody processing device 255 provided by the embodiment of the present invention is implemented as a software module, in some embodiments, as shown in fig. 2, the software module stored in the artificial intelligence based speech prosody processing device 255 of the memory 250 may include: a receiving module 2551, an alignment module 2552, a first detection module 2553, a second detection module 2554 and a fusion module 2555.
The receiving module 2551 is configured to receive voice data to be detected and text data corresponding to the voice data to be detected;
the alignment module 2552 is configured to perform alignment processing on the to-be-detected voice data and the text data to obtain an alignment result;
the first detecting module 2553 is configured to perform prosody detection on the to-be-detected voice data through a bilingual tree model based on the alignment result, so as to obtain a first detection result;
the second detecting module 2554 is configured to perform prosody detection on the to-be-detected voice data through a parent speaker tree model based on the alignment result, so as to obtain a second detection result;
the fusion module 2555 is configured to perform fusion processing on the first detection result and the second detection result, and determine the fused detection result as a final prosody detection result of the to-be-detected voice data.
In some embodiments, the alignment module 2552 is further configured to divide the voice data to be detected into N frames, extract the pitch and intensity of each frame of the voice data to be detected, and smooth the extracted pitch and intensity, where N is a positive integer;
perform voice recognition on each phoneme of each frame of the voice data to be detected to obtain the pronunciation start and end times corresponding to each phoneme; and
obtain the pitch, intensity, and pronunciation duration corresponding to each phoneme according to the correspondence between frame numbers and time.
In some embodiments, the first detection module 2553 comprises: a first stress detection submodule 25531, a first pause detection submodule 25532, and a first boundary tone detection submodule 25533;
the first accent detection submodule 25531 is configured to detect, through an accent two-speaker tree model, an accent position of the to-be-detected speech data, to obtain a first accent position;
the first pause detection submodule 25532 is configured to detect a pause position of the to-be-detected speech data through a pause bilingual tree model, so as to obtain a first pause position;
the first boundary tone detection submodule 25533 is configured to detect the boundary tone type of the to-be-detected speech data through the boundary tone bilingual tree model, so as to obtain a first boundary tone type.
In some embodiments, the first stress detection submodule 25531 is further configured to obtain second-language speaker voice data samples and corresponding stress positions, and perform prosody detection processing on the second-language speaker voice data samples to obtain syllable pitch and intensity features, normalized pitch and intensity, and syllable pitch and intensity variation trend features;
select features with classification capability from the syllable pitch and intensity features, the normalized pitch and intensity, and the syllable pitch and intensity variation trend features as nodes to construct an initial stress second-language speaker tree model;
and prune the constructed initial stress second-language speaker tree model to obtain the stress second-language speaker tree model for detecting the first stress position.
In some embodiments, the first pause detection submodule 25532 is further configured to obtain second-language speaker voice data samples and corresponding pause positions, and perform prosody detection processing on the second-language speaker voice data samples to obtain word pitch and intensity features, normalized silence duration, and pitch and intensity variation trend features;
select features with classification capability from the word pitch and intensity features, the normalized silence duration, and the pitch and intensity variation trend features as nodes to construct an initial pause second-language speaker tree model;
and prune the constructed initial pause second-language speaker tree model to obtain the pause second-language speaker tree model for detecting the first pause position.
In some embodiments, the first boundary tone detection submodule 25533 is further configured to obtain second-language speaker voice data samples and corresponding boundary tone types, and perform prosody detection processing on the second-language speaker voice data samples to obtain pronunciation features of different granularities and pronunciation variation trend features of different granularities;
select features with classification capability from the pronunciation features of different granularities and the pronunciation variation trend features of different granularities as nodes to construct an initial boundary tone second-language speaker tree model;
and prune the constructed initial boundary tone second-language speaker tree model to obtain the boundary tone second-language speaker tree model for detecting the first boundary tone type.
In some embodiments, the second detection module 2554 comprises: a second stress detection submodule 25541, a second pause detection submodule 25542, and a second boundary tone detection submodule 25543;
the second accent detection submodule 25541 is configured to detect the accent position of the voice data to be detected through an accent native speaker tree model to obtain a second accent position;
the second pause detection submodule 25542 is configured to detect the pause position of the voice data to be detected through a pause native speaker tree model to obtain a second pause position;
the second boundary tone detection submodule 25543 is configured to detect the boundary tone type of the voice data to be detected through a boundary tone native speaker tree model to obtain a second boundary tone type.
In some embodiments, the second accent detection sub-module 25541 is further configured to obtain native speaker voice data samples and corresponding accent positions, and perform prosody detection processing on the native speaker voice data samples to obtain syllable pitch and sound intensity features, normalized pitch and sound intensity, and syllable pitch and sound intensity variation trend features;
select features with classification capability from the syllable pitch and sound intensity features, the normalized pitch and sound intensity, and the syllable pitch and sound intensity variation trend features as nodes to construct an initial accent native speaker tree model; and
prune the constructed initial accent native speaker tree model to obtain the accent native speaker tree model for detecting the second accent position.
In some embodiments, the second pause detection submodule 25542 is further configured to obtain native speaker voice data samples and corresponding pause positions, and perform prosody detection processing on the native speaker voice data samples to obtain word pitch and sound intensity features, normalized silence duration, and pitch and sound intensity variation trend features;
select features with classification capability from the word pitch and sound intensity features, the normalized silence duration, and the pitch and sound intensity variation trend features as nodes to construct an initial pause native speaker tree model; and
prune the constructed initial pause native speaker tree model to obtain the pause native speaker tree model for detecting the second pause position.
In some embodiments, the second boundary tone detection sub-module 25543 is further configured to obtain native speaker voice data samples and corresponding boundary tone types, and perform prosody detection processing on the native speaker voice data samples to obtain pronunciation features of different granularities and pronunciation variation trend features of different granularities;
select features with classification capability from the pronunciation features of different granularities and the pronunciation variation trend features of different granularities as nodes to construct an initial boundary tone native speaker tree model; and
prune the constructed initial boundary tone native speaker tree model to obtain the boundary tone native speaker tree model for detecting the second boundary tone type.
In some embodiments, the fusion module 2555 is further configured to perform voting on the first detection result and the second detection result, and determine the detection result with the largest number of votes as the final detection result of the to-be-detected voice data.
In some embodiments, the fusion module 2555 is further configured to perform weighting processing on the first detection result and the second detection result, and determine the weighted detection result as a final prosody detection result of the to-be-detected voice data.
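As a purely illustrative sketch of the two fusion strategies (majority voting and weighted combination), the following assumes each tree model returns per-position labels or positive-class probabilities; the function names, the equal 0.5/0.5 weights, and the tie-breaking rule are assumptions for illustration and not the claimed implementation.

```python
# Minimal sketch of the two fusion strategies described above; names and weights are illustrative.
from collections import Counter
from typing import Sequence

def fuse_by_voting(labels_l2: Sequence[str], labels_l1: Sequence[str]) -> list[str]:
    """Per position, keep the label produced by the most models (ties fall back to the first, i.e. L2, label)."""
    return [Counter([a, b]).most_common(1)[0][0] for a, b in zip(labels_l2, labels_l1)]

def fuse_by_weighting(probs_l2: Sequence[float], probs_l1: Sequence[float],
                      w_l2: float = 0.5, w_l1: float = 0.5, threshold: float = 0.5) -> list[int]:
    """Weighted average of the two models' positive-class probabilities, then thresholded."""
    return [int(w_l2 * p2 + w_l1 * p1 >= threshold) for p2, p1 in zip(probs_l2, probs_l1)]
```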
It should be noted that the description of the apparatus according to the embodiment of the present invention is similar to the description of the method embodiments and has similar beneficial effects, so it is not repeated here. Technical details not exhausted in the description of the artificial intelligence based speech prosody processing apparatus provided by the embodiment of the present invention can be understood from the description of any one of the drawings of fig. 3-5 and 8-11.
In the following, an exemplary application of the embodiments of the present invention in a practical application scenario will be described.
When voice prosody detection is performed in the related art, effective acoustic features are extracted and input into a specific detection model for prosody detection. However, in the related art the number of effective acoustic features extracted for prosody detection is small, and the factors considered when constructing the detection model are not comprehensive enough, so the final pronunciation prosody detection result is poor.
The embodiment of the invention optimizes prosody detection from these two aspects. First, more effective prosodic pronunciation features, such as stress, pause, and boundary tone type features, are extracted based on bilingual voice data samples and native speaker voice data samples. Second, considering the difference between the pronunciation prosody of bilingual speakers and that of native speakers, a bilingual prosody detection tree model and a native speaker prosody detection tree model are constructed respectively, and the detection results obtained from the two tree models are fused, thereby achieving better detection of the pronunciation prosody of the voice data to be detected.
Referring to fig. 6, fig. 6 is a schematic view of a specific application scenario of the artificial intelligence based speech prosody processing method according to the embodiment of the present invention. As shown in fig. 6, the user inputs the text to be read in the interface of the speech prosody detection application, for example "I knock the fact, do you knock?", clicks the start reading button to begin reading the sentence aloud, and clicks the end reading button after finishing reading.
Fig. 7 is a schematic diagram of the prosody detection result provided by the embodiment of the invention. By way of example, different colors may be used to reflect different prosody correction details. For example, a red mark is used to mark points in the text where the pronunciation prosody is wrong, a green mark is used to mark points where the pronunciation prosody is correct, and an orange mark is used to mark prosody corrections, and the result is fed back to the user. As shown in FIG. 7, the words knock, fact, and knock are correctly stressed, while the stress on "the" is incorrect; the pause after "the" is an error; the falling boundary tone on "fact" is correct, while the boundary tone on "knock" is wrong, the correct form being a rising boundary tone.
Referring to fig. 8, fig. 8 is a schematic flow chart of an alternative artificial intelligence based speech prosody processing method according to an embodiment of the present invention. As shown in fig. 8, the method comprises the steps of:
1) A user opens an application (APP) and enters a sentence or a passage of English to be read aloud;
2) the user clicks record in the APP;
3) the APP sends the text and the audio to a server;
4) the server sends the text to a prosody standard module, and the prosody standard module generates the prosody standard of the corresponding text, where the prosody standard includes: a standard stress position, a standard pause position, and a standard boundary tone type;
5) the server sends the audio and the text to an automatic speech recognition module to generate a phoneme-level alignment result between the audio and the text, obtaining the start and stop time of each pronounced phoneme;
6) the server sends the audio to a prosody detection module, and the automatic speech recognition module sends the obtained start and stop time of each pronounced phoneme to the prosody detection module; the prosody detection module generates the detection result of the actual pronunciation prosody, which includes: an actual stress position, an actual pause position, and an actual boundary tone type;
7) the server receives the prosody standard returned by the prosody standard module and the prosody detection result returned by the prosody detection module, returns them to the APP, and displays them to the user.
In some embodiments, the prosody detection module is mainly composed of three parts: sentence stress detection, sentence pause detection, and sentence boundary tone detection. It finally outputs three prosody detection results: the sentence stress position, the sentence pause position, and the sentence boundary tone type.
Before voice prosody detection is performed, the voice data to be detected needs to be preprocessed. For example, taking 10 seconds as one frame, the voice data to be detected is divided into a plurality of frames, and the pitch and the sound intensity of each frame of audio are extracted. Since the pitch and sound intensity parameters are discrete points, the extracted pitch and sound intensity parameters need to be smoothed through a sliding window. Then, the pronunciation start and stop time corresponding to each phoneme is obtained through automatic speech recognition, and features such as the pitch, sound intensity, and pronunciation duration corresponding to each phoneme are obtained through the correspondence between frame numbers and time.
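A minimal preprocessing sketch along these lines is shown below, assuming the librosa library for the pitch (pYIN) and intensity (RMS) tracks; the frame/hop sizes, pitch range, smoothing window, and the phoneme-alignment input format are illustrative assumptions rather than the parameters of the embodiment.

```python
# Minimal preprocessing sketch, assuming librosa is available; frame/hop sizes,
# the pYIN pitch range, and the phoneme-alignment input format are illustrative assumptions.
import numpy as np
import librosa

def smooth(x: np.ndarray, win: int = 5) -> np.ndarray:
    """Sliding-window (moving-average) smoothing of a discrete pitch/intensity track."""
    kernel = np.ones(win) / win
    return np.convolve(np.nan_to_num(x), kernel, mode="same")

def phoneme_features(wav_path: str, phonemes: list[tuple[str, float, float]],
                     sr: int = 16000, hop: int = 160):
    """phonemes: (label, start_sec, end_sec) triples taken from the ASR alignment step."""
    y, sr = librosa.load(wav_path, sr=sr)
    f0, _, _ = librosa.pyin(y, fmin=65, fmax=400, sr=sr, hop_length=hop)
    intensity = librosa.feature.rms(y=y, hop_length=hop)[0]
    f0, intensity = smooth(f0), smooth(intensity)
    feats = []
    for label, start, end in phonemes:
        # Map pronunciation start/stop times to frame indices via the frame/time correspondence.
        a = int(start * sr / hop)
        b = max(int(end * sr / hop), a + 1)
        feats.append({"phoneme": label, "duration": end - start,
                      "pitch": float(np.mean(f0[a:b])),
                      "intensity": float(np.mean(intensity[a:b]))})
    return feats
```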
1) Sentence pause detection
A pause in a sentence is mainly related to the silence duration after a word, and considering that different users speak at different speeds, speech-rate normalization is first performed on the silence durations of the collected voice data. In addition, when a word carries high energy, it is usually followed by a pause, so the pitch and sound intensity features of the word, together with statistics such as their maximum, minimum, and average values, can also be used as features for detecting sentence pauses. Likewise, when the pitch or sound intensity changes abruptly from one word to the next, this is also often a marker of a pause, so the pitch and sound intensity variation trend between adjacent words can be used as a feature for judging sentence pauses. Based on the above factors, and combining the features of the previous word and the next word of the current word, multi-dimensional features are finally generated and jointly used as the feature input of the sentence pause tree model.
Referring to fig. 9, fig. 9 is a schematic structural diagram of a word pause judgment model according to an embodiment of the present invention. As shown in fig. 9, the word pitch and sound intensity features, the normalized silence duration, and the pitch and sound intensity variation trend of the voice data to be detected are input together into the tree model for word pause judgment to obtain the actual pause position of the voice data to be detected.
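For illustration, the following sketch assembles such a feature vector for one word from per-word statistics; the dictionary keys, the simple multiply-by-speech-rate normalization, and the neighboring-word handling at sentence edges are assumptions, not the embodiment's exact feature set.

```python
# Illustrative sketch of assembling the pause-detection feature vector for one word;
# the key names and the speech-rate normalization are assumptions, not the patented feature set.
def pause_features(words: list[dict], i: int, speech_rate: float) -> dict:
    """words[i] holds per-word stats such as 'pitch_mean', 'intensity_mean', 'silence_after'."""
    cur = words[i]
    prev = words[max(i - 1, 0)]
    nxt = words[min(i + 1, len(words) - 1)]
    return {
        "norm_silence": cur["silence_after"] * speech_rate,          # speech-rate normalized silence
        "pitch_mean": cur["pitch_mean"],
        "intensity_mean": cur["intensity_mean"],
        "pitch_delta_next": nxt["pitch_mean"] - cur["pitch_mean"],    # variation trend to the next word
        "intensity_delta_prev": cur["intensity_mean"] - prev["intensity_mean"],
        "prev_pitch_mean": prev["pitch_mean"],                        # previous-word context feature
        "next_intensity_mean": nxt["intensity_mean"],                 # next-word context feature
    }
```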
2) Sentence stress detection
English is a stress-timed language, in which the intervals between stressed syllables are approximately equal, whereas Chinese is a syllable-timed language, that is, the pronunciation duration of each syllable is almost the same and the pronunciation strength of each syllable is almost the same. The difference between the two is mainly reflected in syllable pronunciation, so feature extraction can be performed from the syllable perspective. Whether a syllable is stressed is mainly related to the syllable pitch, sound intensity, pitch variation, sound intensity variation, syllable duration, and the like. Therefore, the relevant features of each syllable can be extracted as: the maximum pitch, minimum pitch, maximum sound intensity, minimum sound intensity, average pitch, amplitude of pitch rise or fall, syllable duration, and so on. Meanwhile, considering that the pitch and sound intensity of different users are not in the same range, the above features need to be normalized first. In addition, whether a syllable is stressed is also related to the other syllables of the word it belongs to, so the features of the other syllables of the word can be compared with the features of the current syllable, and the comparison result used as a feature of whether the syllable is stressed. Based on the above factors, and combining the features of the previous word and the next word of the current word, multi-dimensional features are finally generated and jointly used as the feature input of the sentence stress tree model.
Referring to fig. 10, fig. 10 is a schematic structural diagram of a word stress judgment model according to an embodiment of the present invention. As shown in fig. 10, the syllable pitch and sound intensity features, the normalized pitch and sound intensity, and the pitch and sound intensity variation trend of the voice data to be detected are input together into the tree model for word stress judgment to obtain the actual stress position of the voice data to be detected.
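As an illustration of the syllable-level features described above, the following sketch computes a few of them for one syllable; the key names, the z-score style speaker normalization, and the within-word contrast are assumptions made for illustration.

```python
# Illustrative sketch of syllable-level stress features; key names and the normalization are assumptions.
import numpy as np

def syllable_stress_features(syllables: list[dict], i: int) -> dict:
    """syllables[j] holds 'pitch' (frame track), 'intensity' (frame track), 'duration', 'word_id'."""
    pitches = np.array([np.mean(s["pitch"]) for s in syllables])
    cur = syllables[i]
    same_word = [s for s in syllables if s["word_id"] == cur["word_id"]]
    word_mean_pitch = np.mean([np.mean(s["pitch"]) for s in same_word])
    return {
        "pitch_max": float(np.max(cur["pitch"])),
        "pitch_min": float(np.min(cur["pitch"])),
        "intensity_max": float(np.max(cur["intensity"])),
        "duration": cur["duration"],
        # speaker-level normalization so different speakers' pitch ranges are comparable
        "norm_pitch": float((np.mean(cur["pitch"]) - pitches.mean()) / (pitches.std() + 1e-6)),
        # contrast against the other syllables of the same word
        "pitch_vs_word": float(np.mean(cur["pitch"]) - word_mean_pitch),
        # rise/fall amplitude within the syllable
        "pitch_rise": float(cur["pitch"][-1] - cur["pitch"][0]),
    }
```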
3) Sentence boundary tone detection
Because different users realize the boundary tone differently, some based on the last word, some based on the last syllable of the word, and some based on the stressed syllable of the last word, the features for the sentence boundary tone are constructed by combining features at the different levels of phoneme, syllable, and word. For the stressed syllable, features such as the maximum pitch, average pitch, maximum sound intensity, average sound intensity, pitch rise duration, pitch fall duration, pitch rise amplitude, and pitch fall amplitude are extracted. Meanwhile, the pronunciation features of the unstressed syllables are extracted, and pronunciation features at the phoneme and word levels are extracted as well. Based on the above factors, and combining the features of the previous word and the next word of the current word, multi-dimensional features are finally generated and jointly used as the feature input of the sentence boundary tone tree model.
Referring to fig. 11, fig. 11 is a schematic structural diagram of a sentence boundary tone judgment model according to an embodiment of the present invention. As shown in fig. 11, the multi-granularity pitch and sound intensity features and the multi-granularity pitch and sound intensity rise and fall features of the voice data to be detected are input together into the tree model for sentence boundary tone judgment to obtain the actual boundary tone type of the voice data to be detected.
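The multi-granularity construction can be illustrated with the following sketch, which gathers pitch statistics and rise/fall amplitudes at the word, stressed-syllable, and final-phoneme levels of the sentence-final word; the input keys and the rise/fall definition (relative to the pitch peak) are assumptions for illustration.

```python
# Illustrative sketch of multi-granularity boundary-tone features for the sentence-final word;
# the granularity keys and rise/fall measures are assumptions for illustration.
import numpy as np

def boundary_tone_features(last_word: dict) -> dict:
    """last_word holds pitch tracks: 'pitch' (word), 'stressed_syllable_pitch', 'last_phoneme_pitch'."""
    def rise_fall(track):
        track = np.asarray(track, dtype=float)
        peak = int(np.argmax(track))
        return float(track[peak] - track[0]), float(track[peak] - track[-1])
    feats = {}
    for name, track in [("word", last_word["pitch"]),
                        ("stressed_syllable", last_word["stressed_syllable_pitch"]),
                        ("phoneme", last_word["last_phoneme_pitch"])]:
        rise, fall = rise_fall(track)
        feats.update({f"{name}_pitch_max": float(np.max(track)),
                      f"{name}_pitch_mean": float(np.mean(track)),
                      f"{name}_rise": rise,
                      f"{name}_fall": fall})
    return feats
```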
In some embodiments, the three prosody detection models can all be constructed using a tree model structure. At each split, the tree model selects the feature with the largest information gain, stops splitting when the information gain falls below a certain threshold, and uses the leaf nodes as the final classification results. Several native speaker tree models are constructed based on the native speaker voice data samples, several bilingual tree models are constructed based on the bilingual voice data samples, and the detection results of the native speaker tree models and the detection results of the bilingual tree models are combined with weighting to obtain the final classification result.
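The splitting rule can be illustrated as follows: a sketch that, at one node, evaluates a single-threshold split per feature, keeps the split with the largest information gain, and reports no split when the best gain is below a threshold. Binary labels, median thresholds, and the 0.01 stop threshold are simplifying assumptions, not the embodiment's settings.

```python
# Minimal sketch of information-gain splitting with a stop threshold; thresholds are assumptions.
import numpy as np

def entropy(y: np.ndarray) -> float:
    p = np.bincount(y, minlength=2) / len(y)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def best_split(X: np.ndarray, y: np.ndarray, min_gain: float = 0.01):
    """Return (feature_index, threshold, gain) or None if no split gains enough."""
    best = None
    for j in range(X.shape[1]):
        thr = np.median(X[:, j])
        left, right = y[X[:, j] <= thr], y[X[:, j] > thr]
        if len(left) == 0 or len(right) == 0:
            continue
        gain = entropy(y) - (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        if best is None or gain > best[2]:
            best = (j, float(thr), gain)
    # Stop splitting when the largest information gain falls below the threshold.
    return best if best and best[2] >= min_gain else None
```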
In other embodiments, since the three prosody detection models are constructed from similar acoustic features, such as sound intensity, pitch, and duration parameters, they can also be fused into one model and trained collaboratively through multi-task learning.
In some embodiments, the prosody standard module is a module for generating the corresponding prosody standard based on the text data. The module may implement logic written for specific scenario requirements, may provide a fixed prosody standard according to external service requirements, may provide a general prosody standard based on a model, and so on.
The embodiment of the invention uses 1000 bilingual voice data samples as a test set. The bilingual voice data samples come from expert annotation: each sample is labeled by three experts, who annotate the actual stress position, the actual pause position, and the actual boundary tone type based on the bilingual audio features. The internal consistency among the experts is about 0.7.
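The embodiment does not specify how the consistency figure is computed; one common way to obtain such a figure, shown below purely as an assumption, is the average pairwise Cohen's kappa over the three experts' labels.

```python
# Illustrative sketch only: average pairwise Cohen's kappa across three annotators.
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

def mean_pairwise_kappa(labels_by_expert: list[list[int]]) -> float:
    """labels_by_expert: one label sequence per expert, aligned over the same items."""
    pairs = list(combinations(labels_by_expert, 2))
    return sum(cohen_kappa_score(a, b) for a, b in pairs) / len(pairs)
```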
After testing on the 1000 bilingual voice data samples, the results of the artificial intelligence based speech prosody processing method provided by the embodiment of the invention are as follows: Table 1 shows the sentence stress classification results, Table 2 shows the sentence pause classification results, and Table 3 shows the sentence boundary tone classification results.
TABLE 1 (sentence stress classification results; presented as an image in the original publication)
TABLE 2 (sentence pause classification results; presented as an image in the original publication)
TABLE 3 (sentence boundary tone classification results; presented as an image in the original publication)
Embodiments of the present invention provide a storage medium having stored therein executable instructions that, when executed by a processor, cause the processor to perform the artificial intelligence based speech prosody processing method provided by the embodiments of the present invention, for example, the methods shown in figs. 3-4 and 8.
In some embodiments, the storage medium may be a memory such as an FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disc, or CD-ROM, or may be any of various devices including one of the above memories or any combination thereof.
In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, executable instructions may, but need not, correspond to files in a file system, and may be stored in a portion of a file that holds other programs or data, for example, in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, subprograms, or portions of code).
By way of example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network.
In summary, the embodiment of the invention has the following beneficial effects:
the embodiment of the invention provides an artificial intelligence based speech prosody processing method in which a bilingual tree model and a native speaker tree model are constructed based on bilingual voice data samples and native speaker voice data samples respectively, prosody detection is performed on the voice data to be detected through the constructed bilingual tree model and native speaker tree model, and the detection results are fused, so that the pronunciation prosody of the voice data to be detected can be accurately detected.
The above description is only an example of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present invention are included in the protection scope of the present invention.

Claims (10)

1. A speech prosody processing method based on artificial intelligence is characterized by comprising the following steps:
receiving voice data to be detected and text data corresponding to the voice data to be detected;
aligning the voice data to be detected and the text data to obtain an alignment result;
performing prosody detection on the voice data to be detected through a bilingual tree model based on the alignment result to obtain a first detection result, and
performing prosody detection on the voice data to be detected through a native speaker tree model to obtain a second detection result;
and fusing the first detection result and the second detection result, and determining the fused detection result as the final prosody detection result of the voice data to be detected.
2. The method according to claim 1, wherein the aligning the speech data to be tested with the text data to obtain an alignment result comprises:
dividing the voice data to be detected into N frames, extracting the pitch and the tone intensity of each frame of voice data to be detected, and smoothing the extracted pitch and tone intensity, wherein N is a positive integer;
performing voice recognition on each phoneme of each frame of voice data to be detected to obtain pronunciation starting and stopping time corresponding to each phoneme, and
and obtaining the pitch, the sound intensity and the pronunciation duration corresponding to each phoneme according to the corresponding relation between the frame number and the time.
3. The method of claim 1,
the bilingual tree model includes: an accent bilingual tree model, a pause bilingual tree model, and a boundary tone bilingual tree model;
performing prosody detection on the voice data to be detected through the bilingual tree model to obtain a first detection result, including:
detecting the accent position of the voice data to be detected through the accent bilingual tree model to obtain a first accent position;
detecting the pause position of the voice data to be detected through the pause bilingual tree model to obtain a first pause position;
and detecting the boundary tone type of the voice data to be detected through the boundary tone bilingual tree model to obtain a first boundary tone type.
4. The method according to claim 3, wherein before the accent position of the voice data to be detected is detected through the accent bilingual tree model, the method further comprises:
acquiring bilingual voice data samples and corresponding accent positions, and performing prosody detection processing on the bilingual voice data samples to obtain syllable pitch and sound intensity features, normalized pitch and sound intensity, and syllable pitch and sound intensity variation trend features;
selecting features with classification capability from the syllable pitch and sound intensity features, the normalized pitch and sound intensity, and the syllable pitch and sound intensity variation trend features as nodes to construct an initial accent bilingual tree model; and
pruning the constructed initial accent bilingual tree model to obtain the accent bilingual tree model for detecting the first accent position.
5. The method according to claim 3, wherein before detecting the pause position of the speech data to be detected through the pause bilingual tree model, the method further comprises:
acquiring bilingual voice data samples and corresponding pause positions, and performing prosody detection processing on the bilingual voice data samples to obtain word pitch and sound intensity features, normalized silence duration, and pitch and sound intensity variation trend features;
selecting features with classification capability from the word pitch and sound intensity features, the normalized silence duration, and the pitch and sound intensity variation trend features as nodes to construct an initial pause bilingual tree model; and
pruning the constructed initial pause bilingual tree model to obtain the pause bilingual tree model for detecting the first pause position.
6. The method according to claim 3, wherein before the boundary tone type of the voice data to be detected is detected through the boundary tone bilingual tree model, the method further comprises:
acquiring bilingual voice data samples and corresponding boundary tone types, and performing prosody detection processing on the bilingual voice data samples to obtain pronunciation features of different granularities and pronunciation variation trend features of different granularities;
selecting features with classification capability from the pronunciation features of different granularities and the pronunciation variation trend features of different granularities as nodes to construct an initial boundary tone bilingual tree model; and
pruning the constructed initial boundary tone bilingual tree model to obtain the boundary tone bilingual tree model for detecting the first boundary tone type.
7. The method of claim 1,
the native speaker tree model includes: an accent native speaker tree model, a pause native speaker tree model, and a boundary tone native speaker tree model;
performing prosody detection on the voice data to be detected through the native speaker tree model to obtain a second detection result, including:
detecting the accent position of the voice data to be detected through the accent native speaker tree model to obtain a second accent position;
detecting the pause position of the voice data to be detected through the pause native speaker tree model to obtain a second pause position;
and detecting the boundary tone type of the voice data to be detected through the boundary tone native speaker tree model to obtain a second boundary tone type.
8. The method according to claim 1, wherein the fusing the first detection result and the second detection result comprises:
and voting the first detection result and the second detection result, and determining the detection result with the largest number of votes as the final detection result of the voice data to be detected.
9. The method according to claim 1, wherein the fusing the first detection result and the second detection result comprises:
and weighting the first detection result and the second detection result, and determining the weighted detection result as the final prosody detection result of the voice data to be detected.
10. An artificial intelligence based speech prosody processing apparatus, characterized in that the apparatus comprises:
the receiving module is used for receiving voice data to be detected and text data corresponding to the voice data to be detected;
the alignment module is used for aligning the voice data to be detected and the text data to obtain an alignment result;
the first detection module is used for carrying out prosody detection on the voice data to be detected through a bilingual tree model based on the alignment result to obtain a first detection result;
the second detection module is used for performing prosody detection on the voice data to be detected through a native speaker tree model based on the alignment result to obtain a second detection result;
and the fusion module is used for fusing the first detection result and the second detection result and determining the fused detection result as the final prosody detection result of the voice data to be detected.
CN201910984463.9A 2019-10-16 2019-10-16 Voice rhythm processing method and device based on artificial intelligence Active CN110782875B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910984463.9A CN110782875B (en) 2019-10-16 2019-10-16 Voice rhythm processing method and device based on artificial intelligence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910984463.9A CN110782875B (en) 2019-10-16 2019-10-16 Voice rhythm processing method and device based on artificial intelligence

Publications (2)

Publication Number Publication Date
CN110782875A true CN110782875A (en) 2020-02-11
CN110782875B CN110782875B (en) 2021-12-10

Family

ID=69385762

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910984463.9A Active CN110782875B (en) 2019-10-16 2019-10-16 Voice rhythm processing method and device based on artificial intelligence

Country Status (1)

Country Link
CN (1) CN110782875B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111312231A (en) * 2020-05-14 2020-06-19 腾讯科技(深圳)有限公司 Audio detection method and device, electronic equipment and readable storage medium
CN111489737A (en) * 2020-04-13 2020-08-04 深圳市友杰智新科技有限公司 Voice command recognition method and device, storage medium and computer equipment
CN112257407A (en) * 2020-10-20 2021-01-22 网易(杭州)网络有限公司 Method and device for aligning text in audio, electronic equipment and readable storage medium
CN112908308A (en) * 2021-02-02 2021-06-04 腾讯音乐娱乐科技(深圳)有限公司 Audio processing method, device, equipment and medium
CN115116427A (en) * 2022-06-22 2022-09-27 马上消费金融股份有限公司 Labeling method, voice synthesis method, training method and device

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120089402A1 (en) * 2009-04-15 2012-04-12 Kabushiki Kaisha Toshiba Speech synthesizer, speech synthesizing method and program product
WO2011135001A1 (en) * 2010-04-30 2011-11-03 International Business Machines Corporation Assessing speech prosody
US20120245942A1 (en) * 2011-03-25 2012-09-27 Klaus Zechner Computer-Implemented Systems and Methods for Evaluating Prosodic Features of Speech
US20180033416A1 (en) * 2012-12-21 2018-02-01 The Nielsen Company (Us), Llc Audio Processing Techniques for Semantic Audio Recognition and Report Generation
WO2014190496A1 (en) * 2013-05-28 2014-12-04 Thomson Licensing Method and system for identifying location associated with voice command to control home appliance
US20180330715A1 (en) * 2015-11-11 2018-11-15 Mglish Inc. Foreign language reading and displaying device and a method thereof, motion learning device based on foreign language rhythm detection sensor and motion learning method, electronic recording medium, and learning material
TWI595478B (en) * 2016-04-21 2017-08-11 國立臺北大學 Speaking-rate normalized prosodic parameter builder, speaking-rate dependent prosodic model builder, speaking-rate controlled prosodic-information generating device and method for being able to learn different languages and mimic various speakers' speaki
CN106920547A (en) * 2017-02-21 2017-07-04 腾讯科技(上海)有限公司 Phonetics transfer method and device
CN110010159A (en) * 2019-04-02 2019-07-12 广州酷狗计算机科技有限公司 Sound similarity determines method and device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
SASSAKI: "Expression Pattern and Promoter Analysis of a Eucalyptus grandis Germin-like Gene", 《PLANT MOLECULAR BIOLOGY REPORTER》 *
YALI: "语音迁移研究:理论模型与实证研究", 《第二语言学习研究》 *
李勇: "基于依存信息融合特征的汉语韵律预测", 《计算机工程》 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111489737A (en) * 2020-04-13 2020-08-04 深圳市友杰智新科技有限公司 Voice command recognition method and device, storage medium and computer equipment
CN111489737B (en) * 2020-04-13 2020-11-10 深圳市友杰智新科技有限公司 Voice command recognition method and device, storage medium and computer equipment
CN111312231A (en) * 2020-05-14 2020-06-19 腾讯科技(深圳)有限公司 Audio detection method and device, electronic equipment and readable storage medium
CN111312231B (en) * 2020-05-14 2020-09-04 腾讯科技(深圳)有限公司 Audio detection method and device, electronic equipment and readable storage medium
CN112257407A (en) * 2020-10-20 2021-01-22 网易(杭州)网络有限公司 Method and device for aligning text in audio, electronic equipment and readable storage medium
CN112908308A (en) * 2021-02-02 2021-06-04 腾讯音乐娱乐科技(深圳)有限公司 Audio processing method, device, equipment and medium
CN115116427A (en) * 2022-06-22 2022-09-27 马上消费金融股份有限公司 Labeling method, voice synthesis method, training method and device
CN115116427B (en) * 2022-06-22 2023-11-14 马上消费金融股份有限公司 Labeling method, voice synthesis method, training method and training device

Also Published As

Publication number Publication date
CN110782875B (en) 2021-12-10

Similar Documents

Publication Publication Date Title
CN110782875B (en) Voice rhythm processing method and device based on artificial intelligence
US10789290B2 (en) Audio data processing method and apparatus, and computer storage medium
CN108806656B (en) Automatic generation of songs
US8036894B2 (en) Multi-unit approach to text-to-speech synthesis
US20190266998A1 (en) Speech recognition method and device, computer device and storage medium
CN102779508B (en) Sound bank generates Apparatus for () and method therefor, speech synthesis system and method thereof
US8024179B2 (en) System and method for improving interaction with a user through a dynamically alterable spoken dialog system
CN101271688B (en) Prosody modification device, prosody modification method
CN110782918B (en) Speech prosody assessment method and device based on artificial intelligence
Patil et al. A syllable-based framework for unit selection synthesis in 13 Indian languages
CN110782880B (en) Training method and device for prosody generation model
JP2004523004A (en) Hierarchical language model
CN111370024B (en) Audio adjustment method, device and computer readable storage medium
CN112837401B (en) Information processing method, device, computer equipment and storage medium
CN104008752A (en) Speech recognition device and method, and semiconductor integrated circuit device
CN112466279B (en) Automatic correction method and device for spoken English pronunciation
CN115116428B (en) Prosodic boundary labeling method, device, equipment, medium and program product
CN112750187A (en) Animation generation method, device and equipment and computer readable storage medium
US11158308B1 (en) Configuring natural language system
JP2004094257A (en) Method and apparatus for generating question of decision tree for speech processing
CN110853669B (en) Audio identification method, device and equipment
CN111968646A (en) Voice recognition method and device
WO2023221345A1 (en) Emotional speech synthesis method and apparatus
Kepuska et al. Speech corpus generation from DVDs of movies and tv series
Schuller et al. Incremental acoustic valence recognition: an inter-corpus perspective on features, matching, and performance in a gating paradigm

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40021545

Country of ref document: HK

GR01 Patent grant
GR01 Patent grant