GB2559766A - Method and system for defining text content for speech segmentation - Google Patents

Method and system for defining text content for speech segmentation Download PDF

Info

Publication number
GB2559766A
GB2559766A GB1702580.0A GB201702580A
Authority
GB
United Kingdom
Prior art keywords
word
speech
words
speech unit
text content
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB1702580.0A
Other versions
GB2559766A8 (en)
GB201702580D0 (en)
Inventor
Dinkar Apte Shaila
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Pastel Dreams
Original Assignee
Pastel Dreams
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Pastel Dreams filed Critical Pastel Dreams
Priority to GB1702580.0A priority Critical patent/GB2559766A/en
Publication of GB201702580D0 publication Critical patent/GB201702580D0/en
Publication of GB2559766A publication Critical patent/GB2559766A/en
Publication of GB2559766A8 publication Critical patent/GB2559766A8/en
Withdrawn legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/06: Elementary speech units used in speech synthesisers; Concatenation rules

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

Disclosed is a method of defining text content for speech segmentation. The method comprises defining a set of speech units (which may comprise vowel and/or consonant phonemes) for a given natural language. Further, for each speech unit of the set of speech units, determining one or more words in the given natural language that comprise the speech unit, wherein the one or more words comprise at least one of at least one word in which the speech unit is located substantially at a beginning portion, a middle portion and an end portion of the at least one word, and for each speech unit of the set of speech units, selecting, from amongst the determined one or more words, at least one word comprising the speech unit to form a word-set to be employed in the text content, wherein the at least one word is selected based upon at least one predefined selection criterion. This reduces the number of words which need to be recorded to enable a text-to-speech system to operate.

Description

(54) Title of the Invention: Method and system for defining text content for speech segmentation
Abstract Title: Defining text content for speech segmentation
(57) Disclosed is a method of defining text content for speech segmentation. The method comprises defining a set of speech units (which may comprise vowel and/or consonant phonemes) for a given natural language. Further, for each speech unit of the set of speech units, determining one or more words in the given natural language that comprise the speech unit, wherein the one or more words comprise at least one of at least one word in which the speech unit is located substantially at a beginning portion, a middle portion and an end portion of the at least one word, and for each speech unit of the set of speech units, selecting, from amongst the determined one or more words, at least one word comprising the speech unit to form a word-set to be employed in the text content, wherein the at least one word is selected based upon at least one predefined selection criterion. This reduces the number of words which need to be recorded to enable a text-to-speech system to operate.
[Drawings: FIG. 1 is a block diagram of the system for defining text content for speech segmentation. FIG. 2 is a schematic illustration of the stages of the method, showing the sets of phonemes (P1, P2, ...), chunks (C1, C2, ...) and consonant blends (B1, B2, ...), the words determined for each speech unit, and the word-sets formed from them. FIG. 3 is a flowchart of the method 300: at step 310, a set of speech units for a given natural language is defined; at step 320, for each speech unit, one or more words are determined that comprise the speech unit located substantially at a beginning portion, a middle portion or an end portion of the word; at step 330, for each speech unit, at least one word is selected, based upon at least one predefined selection criterion, to form a word-set to be employed in the text content.]
METHOD AND SYSTEM FOR DEFINING TEXT CONTENT FOR SPEECH SEGMENTATION
TECHNICAL FIELD
The present disclosure relates generally to text-to-speech synthesis; and more specifically, to defining text content for speech segmentation.
BACKGROUND
Text-to-speech synthesis systems enable users to convert text to artificial speech. Conventionally, such systems enable users to 'listen' to text content from computer documents, from scanned documents comprising recognisable text (for example, scanned newspapers, books and magazines), from direct input via hardware (such as keyboards or keypads), and so forth. Further, the speech synthesised from the text is played to the user in a standard voice, such as a machine voice or that of a voice actor.
Generally, conventional text-to-speech synthesis systems utilise a database comprising pre-recorded voice segments associated with every word in a natural language (such as English). It may be evident that such a database may comprise voice segments associated with thousands of words in the language. Additionally, multiple voice segments may need to be stored for a single word, differing according to the usage of the word (such as in a different dialect). For example, the second edition of the Oxford English Dictionary comprises 171,476 currently used words, 47,156 obsolete words and around 9,500 derivative words. Therefore, a text-to-speech synthesis system configured to convert English text to speech will comprise a database having thousands of voice segments.
Usually, a speaker (such as a voice actor) having expertise in a specific language is required to speak the words clearly for recording of the voice segments. Such a speaker is trained to enunciate the words to obtain the highest clarity of speech for artificially synthesising speech from text. Further, such speakers may be capable of recording all the required voice segments without fatigue. In such a system, it may be evident that the recorded speech will only be capable of converting an input text to speech in the voice of that speaker.
Further, a conventional text-to-speech synthesis system is generally incapable of synthesising the input text to speech in the voice of a user of the system. Also, such systems may require the user to record thousands of words in his/her voice, which may be cumbersome for the user. Additionally, as an average person is not trained to speak clearly, the obtained voice segments may lack clarity, thereby complicating subsequent segmentation of the recorded voice.
Therefore, in light of the foregoing discussion, there exists a need to overcome the aforementioned drawbacks associated with defining text content for speech segmentation.
SUMMARY
The present disclosure seeks to provide a method of defining text content for speech segmentation. The present disclosure also seeks to provide a system for defining text content for speech segmentation. The present disclosure seeks to provide a solution to the existing problem associated with the requirement of a large database of spoken words for speech segmentation. An aim of the present disclosure is to provide a solution that overcomes at least partially the problems encountered in the prior art, and provides a method and system for defining text content for speech segmentation that is simple, efficient and easy to implement.
In one aspect, an embodiment of the present disclosure provides a method of defining text content for speech segmentation, the method comprising:
- defining a set of speech units for a given natural language;
- for each speech unit of the set of speech units, determining one or more words in the given natural language that comprise the speech unit, wherein the one or more words comprise at least one of:
(i) at least one word in which the speech unit is located substantially at a beginning portion of the at least one word, (ii) at least one word in which the speech unit is located substantially at a middle portion of the at least one word, (iii) at least one word in which the speech unit is located substantially at an end portion of the at least one word; and
- for each speech unit of the set of speech units, selecting, from amongst the determined one or more words, at least one word comprising the speech unit to form a word-set to be employed in the text content, wherein the at least one word is selected based upon at least one predefined selection criterion.
In another aspect, an embodiment of the present disclosure provides a system for defining text content for speech segmentation, the system comprising a server arrangement that is configured to:
- define a set of speech units for a given natural language;
- for each speech unit of the set of speech units, determine one or more words in the given natural language that comprise the speech unit, wherein the one or more words comprise at least one of:
(i) at least one word in which the speech unit is located substantially at a beginning portion of the at least one word, (ii) at least one word in which the speech unit is located substantially at a middle portion of the at least one word, (iii) at least one word in which the speech unit is located substantially at an end portion of the at least one word; and
- for each speech unit of the set of speech units, select, from amongst the determined one or more words, at least one word comprising the speech unit to form a word-set to be employed in the text content, wherein the at least one word is selected based upon at least one predefined selection criterion.
Embodiments of the present disclosure substantially eliminate or at least partially address the aforementioned problems in the prior art, and enable defining text content for speech segmentation.
Additional aspects, advantages, features and objects of the present disclosure would be made apparent from the drawings and the detailed description of the illustrative embodiments construed in conjunction with the appended claims that follow.
It will be appreciated that features of the present disclosure are susceptible to being combined in various combinations without departing from the scope of the present disclosure as defined by the appended claims.
BRIEF DESCRIPTION OF THE DRAWINGS
The summary above, as well as the following detailed description of illustrative embodiments, is better understood when read in conjunction with the appended drawings. For the purpose of illustrating the present disclosure, exemplary constructions of the disclosure are shown in the drawings. However, the present disclosure is not limited to specific methods and instrumentalities disclosed herein. Moreover, those skilled in the art will understand that the drawings are not to scale. Wherever possible, like elements have been indicated by identical numbers.
Embodiments of the present disclosure will now be described, by way of example only, with reference to the following diagrams wherein:
FIG. 1 is a block diagram of a system for defining text content for speech segmentation, in accordance with an embodiment of the present disclosure;
FIG. 2 is a schematic illustration of stages of a method of defining text content for speech segmentation, in accordance with an embodiment of the present disclosure; and
FIG. 3 is an illustration of steps of a method of defining text content for speech segmentation, in accordance with an embodiment of the present disclosure.
In the accompanying drawings, an underlined number is employed to represent an item over which the underlined number is positioned or an item to which the underlined number is adjacent. A non-underlined number relates to an item identified by a line linking the non-underlined number to the item. When a number is non-underlined and accompanied by an associated arrow, the non-underlined number is used to identify a general item at which the arrow is pointing.
DETAILED DESCRIPTION OF EMBODIMENTS
The following detailed description illustrates embodiments of the present disclosure and ways in which they can be implemented. Although some modes of carrying out the present disclosure have been disclosed, those skilled in the art would recognize that other embodiments for carrying out or practicing the present disclosure are also possible.
In one aspect, an embodiment of the present disclosure provides a method of defining text content for speech segmentation, the method comprising:
- defining a set of speech units for a given natural language;
- for each speech unit of the set of speech units, determining one or more words in the given natural language that comprise the speech unit, wherein the one or more words comprise at least one of:
(i) at least one word in which the speech unit is located substantially at a beginning portion of the at least one word, (ii) at least one word in which the speech unit is located substantially at a middle portion of the at least one word, (iii) at least one word in which the speech unit is located substantially at an end portion of the at least one word; and
- for each speech unit of the set of speech units, selecting, from amongst the determined one or more words, at least one word comprising the speech unit to form a word-set to be employed in the text content, wherein the at least one word is selected based upon at least one predefined selection criterion.
In another aspect, an embodiment of the present disclosure provides a system for defining text content for speech segmentation, the system comprising a server arrangement that is configured to:
- define a set of speech units for a given natural language;
- for each speech unit of the set of speech units, determine one or more words in the given natural language that comprise the speech unit, wherein the one or more words comprise at least one of:
(i) at least one word in which the speech unit is located substantially at a beginning portion of the at least one word, (ii) at least one word in which the speech unit is located substantially at a middle portion of the at least one word, (iii) at least one word in which the speech unit is located substantially at an end portion of the at least one word; and
- for each speech unit of the set of speech units, select, from amongst the determined one or more words, at least one word comprising the speech unit to form a word-set to be employed in the text content, wherein the at least one word is selected based upon at least one predefined selection criterion.
The present disclosure provides a method and system for defining text content for speech segmentation. The system enables speech to be synthesised from text in a user's voice, thereby eliminating the requirement for trained speakers of a specific language to record the input speech. Also, usage of text content comprising words selected based on the acoustic quality of speech units enables easy segmentation of input speech and improved quality of synthesised speech. Moreover, usage of multiple words associated with a speech unit, to address changes in the properties of the speech unit depending on the position of the speech unit in the word, enables improved accuracy in extraction of speech units from the words. Additionally, the system enables synthesis of speech in the user's voice, thereby eliminating monotonous speech synthesised in a machine voice. Further, the system enables synthesis of speech without requiring thousands of words to be recorded by the user and subsequently stored in a database. Therefore, the system of the present disclosure enables easy synthesis of speech using a limited number of words to be uttered by the user and also eliminates the necessity of a large database for storage of the recorded speech.
The system for defining text content for speech segmentation comprises a server arrangement that is configured to define a set of speech units for a given natural language. The speech units are fundamental units of sound that enable differentiation of various words in a natural language during oral communication. As an example, the natural language could be any language that is commonly spoken around the world, such as English, Spanish, French and Russian. In another example, the natural language may comprise any one of the 6,000-7,000 languages spoken around the world.
In an embodiment, the speech units comprise phonemes. Phonemes are units of sound that enable differentiation of different words in a language. For example, in the words 'kit' and 'bit', the phonemes /k/ and /b/ enable differentiation of the words, to enable a listener to understand the context and meaning of speech comprising the words. In one embodiment, the phonemes comprise at least one of vowel phonemes and/or consonant phonemes. For example, the phonemes may comprise vowel phonemes such as /e/, /u/ and /ɪ/ and consonant phonemes such as /b/, /g/ and /m/. In another example, the phonemes may include 20 vowel phonemes and 24 consonant phonemes of the English language. Further, the vowel phonemes may comprise short vowels and long vowels. It will be appreciated that short vowels are vowels that have a shorter length of sound as compared to long vowels. For example, the vowel sound /a/ in the word 'cap' is a short vowel whereas the vowel sound /a:/ in the word 'car' is a long vowel. Additionally, the vowel phonemes may comprise nasal vowels, i.e. vowel phonemes comprising a nasal sound during pronunciation thereof, and diphthongs, which are a combination of two vowels such that the sound begins with the sound of the first vowel (onset) and ends with the sound of the second vowel (rime). For example, in the word 'noodle', the /nu:/ sound may be a nasal vowel and in the word 'appear', the /ɪə/ sound may be a diphthong. Also, the consonant phonemes may comprise voiced consonants (such as the /ð/ sound in the word 'then') and unvoiced consonants (such as the /θ/ sound in the word 'thing').
In one embodiment, the speech units comprise chunks having a collection of multiple phonemes. The chunks may be a collection of consonant and vowel phonemes that are included in frequently used words in a natural language. For example, the collection of consonant and vowel phonemes /ɪ/ and /ŋ/ in 'ing' may be a chunk, since the 'ing' sound is included in frequently used words such as 'thing', 'morning', 'doing' and so forth. In another example, the collection of phonemes /eɪdʒ/ comprising the word 'age' may be a chunk that is included in commonly used words such as 'cage', 'luggage' and 'cabbage'. In another embodiment, the chunk may be a syllable. For example, the syllable 'and' may be a chunk.
In an embodiment, the speech units comprise consonant blends. The consonant blends may be a consonant sound comprising a group of multiple consonant phonemes. For example, 'br' sound in the word 'brush', 'spl' sound in the word 'splash' and 'dr' sound in the word 'drum' are consonant blends.
In one embodiment, the set of speech units may comprise ordered subsets of vowel phonemes, consonant phonemes, chunks and consonant blends. In another embodiment, the set of speech units may be a randomly arranged set.
In an embodiment, the set of speech units comprises all such phonemes in a given natural language. For example, the set of speech units may comprise all phonemes in the English language. In another example, the set of speech units may comprise 44 phonemes in the English language such as /a/, /b/ and /th/. In yet another example, the set of speech units may comprise phonemes in Spanish language.
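By way of illustration only, the following minimal Python sketch shows one way such a set of speech units could be represented in software. The SpeechUnit structure, the function name and the particular symbols listed are assumptions made for this example; they are neither an exhaustive inventory nor a prescribed implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SpeechUnit:
    """A fundamental unit of sound used for speech segmentation."""
    symbol: str  # an IPA symbol, or the spelling of a chunk/consonant blend
    kind: str    # 'vowel', 'consonant', 'chunk' or 'blend'


def define_speech_units() -> List[SpeechUnit]:
    """Return an illustrative (non-exhaustive) set of speech units for English."""
    vowels = [SpeechUnit(s, "vowel") for s in ("æ", "ε", "ɪ", "i:", "u:", "ɔɪ")]
    consonants = [SpeechUnit(s, "consonant") for s in ("b", "k", "n", "t", "θ", "ð", "w")]
    chunks = [SpeechUnit(s, "chunk") for s in ("ing", "age", "and")]
    blends = [SpeechUnit(s, "blend") for s in ("br", "spl", "dr")]
    return vowels + consonants + chunks + blends


if __name__ == "__main__":
    units = define_speech_units()
    print(f"Defined {len(units)} speech units")
```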
The server arrangement is further configured to, for each speech unit of the set of speech units, determine one or more words in the given natural language that comprise the speech unit, wherein the one or more words comprise at least one of:
(i) at least one word in which the speech unit is located substantially at a beginning portion of the at least one word, (ii) at least one word in which the speech unit is located substantially at a middle portion of the at least one word, (iii) at least one word in which the speech unit is located substantially at an end portion of the at least one word.
For example, the server arrangement may be operable to select the speech unit at a first position in the set of speech units and determine one or more words in the given natural language (such as English) that comprise the speech unit at least once. The remaining speech units in the set may subsequently be selected in turn. In an example, the one or more determined words comprising the speech unit (such as a phoneme) may be a collection of 100 words. In yet another example, the speech unit from the set of speech units may be the phoneme /b/ and the one or more words that are determined to comprise the phoneme may be 'but', 'label' and 'tub'. It may be evident that the phoneme /b/ is located substantially at a beginning portion of the word 'but', at a middle portion of the word 'label' and at an end portion of the word 'tub'.
In some examples, the speech unit from the set of speech units may be a consonant phoneme and may be located substantially at a beginning portion of the at least one word. In such instance, the consonant phoneme located at the beginning portion of the word may be followed by a vowel phoneme. For example, in the word 'kit', the phoneme /k/ is positioned at the beginning portion of the word and is followed by the vowel phoneme /ɪ/. In another example, in the word 'thing', the phoneme /θ/ is positioned at the beginning portion of the word and is followed by the vowel phoneme /ɪ/. In another embodiment, the speech unit from the set of speech units may be a vowel phoneme that is located substantially at a beginning portion of the at least one word. In such instance, the vowel phoneme located at the beginning portion of the word may be followed by a consonant phoneme. For example, in the word 'echo', the phoneme /ε/ is located at the beginning portion of the word and is followed by the consonant phoneme /k/. In another example, in the word 'end', the phoneme /ε/ is located at the beginning portion of the word and is followed by the consonant phoneme /n/.
In other examples, the speech unit from the set of speech units may be a consonant phoneme that is located substantially at a middle portion of the at least one word. In such instance, the consonant phoneme located at the middle portion of the word may be followed by a vowel phoneme. Further, the word may comprise a vowel phoneme or a consonant phoneme located at the beginning portion of the word. For example, the word 'leisure' comprises the phoneme /ʒ/ located substantially at the middle portion of the word. In another example, in the word 'leather', the phoneme /ð/ is located substantially at the middle portion of the word. In another embodiment, the speech unit from the set of speech units may be a vowel phoneme that is located substantially at a middle portion of the at least one word. In such instance, the vowel phoneme located at the middle portion of the word may follow a consonant phoneme located at the beginning portion of the word, and may be followed by a consonant phoneme located at the end portion of the word. For example, in the word 'bet', the phoneme /ε/ is located at the middle portion of the word and is preceded by the consonant phoneme /b/ and followed by the consonant phoneme /t/. Optionally, a word may comprise a combination of two vowel phonemes located substantially at a middle portion of the at least one word. In such instance, the vowel phonemes may be considered a diphthong and further, the diphthong may be considered as a single phoneme. For example, the word 'coin' comprises the vowel phonemes /ɔ/ and /ɪ/ located at the middle portion of the word. In such instance, the phonemes /ɔ/ and /ɪ/ may be combined to obtain the diphthong /ɔɪ/ and may be considered as a single vowel phoneme.
In yet other examples, the speech unit of the set of speech units may be a consonant phoneme and may be located substantially at an end portion of the at least one word. In such instance, the consonant phoneme may be preceded by consonant and/or vowel phonemes. For example, in the word 'run', the consonant phoneme /n/ is located at the end portion of the word. In another example, the word 'rich' comprises the consonant phoneme /tʃ/ located at the end portion of the word. In another embodiment, the speech unit of the set of speech units may be a vowel phoneme that is located substantially at an end portion of the at least one word. Further, in such instance, the vowel phoneme may be preceded by consonant and/or vowel phonemes. In an example, the word 'be' comprises the vowel phoneme /i:/ located at the end portion of the word and is preceded by the consonant phoneme /b/. In another example, in the word 'to', the vowel phoneme /u:/ is located at the end portion of the word and is preceded by the consonant phoneme /t/.
In one embodiment, the server arrangement is configured to determine the one or more words based upon a location of at least one grapheme corresponding to the speech unit in the one or more words. In an example, the one or more words may comprise 'nut', 'label' and 'tub'.
The word 'nut' comprises the phonemes /n/, /ʌ/ and /t/. In an instance where the speech unit of the set of speech units is the phoneme /n/, it is evident from the location of the grapheme 'n' that the phoneme is located at the beginning portion of the word 'nut'. Therefore, the word 'nut' may be included in the determined one or more words. However, the words 'label' and 'tub' do not comprise the grapheme associated with the phoneme /n/ and are therefore not included in the determined one or more words for the phoneme /n/. In an embodiment, the server arrangement is operable to automatically determine the one or more words based on whether the speech unit is located in each word.
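A minimal sketch of such a grapheme-based determination, assuming a simple three-way split of each word into a beginning, middle and end portion, might look as follows; the helper names and the toy vocabulary are illustrative only and do not form part of the claimed method.

```python
from typing import Dict, List, Optional


def locate_grapheme(word: str, grapheme: str) -> Optional[str]:
    """Return 'beginning', 'middle' or 'end' for the first occurrence of the
    grapheme in the word, or None if the word does not contain the grapheme."""
    index = word.find(grapheme)
    if index < 0:
        return None
    if index == 0:
        return "beginning"
    if index + len(grapheme) >= len(word):
        return "end"
    return "middle"


def determine_words(grapheme: str, vocabulary: List[str]) -> Dict[str, List[str]]:
    """Group vocabulary words by the portion of the word in which the grapheme occurs."""
    positions: Dict[str, List[str]] = {"beginning": [], "middle": [], "end": []}
    for word in vocabulary:
        position = locate_grapheme(word, grapheme)
        if position is not None:
            positions[position].append(word)
    return positions


# The grapheme 'n' occurs at the beginning of 'nut' but not in 'label' or 'tub'.
print(determine_words("n", ["nut", "label", "tub", "run", "banana"]))
```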
The server arrangement is further configured to, for each speech unit of the set of speech units, select, from amongst the determined one or more words, at least one word comprising the speech unit to form a word-set to be employed in the text content, wherein the at least one word is selected based upon at least one predefined selection criterion.
In an embodiment, the word-set comprises three words that are selected from amongst the determined one or more words, such that the word-set comprises one word with the speech unit (such as a phoneme) located substantially at a beginning portion, one word with the speech unit located substantially at a middle portion and one word with the phoneme located substantially at an end portion.
In an embodiment, the server arrangement is configured to select the at least one word based upon acoustic quality of the speech unit in the at least one word, as per the at least one predefined selection criterion. In an example, the acoustic quality of the speech unit may relate to the sound energy level of the speech unit (such as a phoneme). It is well known that vowel phonemes have a higher sound energy level than consonant phonemes. The predefined selection criterion may relate to a relative sound energy level of phonemes in the word, such as a difference in sound energy of neighbouring phonemes in the word.
In such instance, the selected at least one word may exhibit a higher sound energy difference among neighbouring phonemes in the word. In another example, the acoustic quality of the speech unit may relate to the clarity of sound of the speech unit during pronunciation of the word comprising the speech unit. For example, in the word 'queen', the phoneme /w/ lacks clarity during pronunciation of the word. However, in the word 'power', the /w/ sound has better acoustic quality when pronounced, and the word 'power' may therefore be selected for the consonant phoneme /w/.
In one embodiment, the one or more words determined based upon a location of at least one grapheme corresponding to the speech unit in the one or more words are filtered based upon acoustic quality of the speech unit in the at least one word. For example, one or more words determined based upon a location of at least one grapheme may comprise 100 words that may be subsequently filtered to comprise 30 words based on the acoustic quality of the selected speech unit in the words. The 30 words that are selected based upon acoustic quality of the speech unit in the at least one word may be further filtered to obtain the word-set comprising 3 words of the highest acoustic quality among the plurality of words. In another example, the word-set may comprise words that are easier to pronounce for an average person.
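The filter-then-select step described above could be organised roughly as in the following sketch. It assumes a hypothetical acoustic_quality scoring function (for instance, one returning the mean energy difference between neighbouring phonemes in a reference recording of the word); the scorer itself and the shortlist size of 30 are assumptions made for illustration.

```python
from typing import Callable, Dict, List


def select_word_set(
    candidates: Dict[str, List[str]],          # words grouped by 'beginning'/'middle'/'end'
    acoustic_quality: Callable[[str], float],  # hypothetical scorer: higher is better
    shortlist_size: int = 30,
) -> List[str]:
    """Form a word-set of up to three words, one per position of the speech unit.

    For each position, the candidate words are first shortlisted by descending
    acoustic quality and the single best-scoring word is then kept.
    """
    word_set: List[str] = []
    for position in ("beginning", "middle", "end"):
        words = candidates.get(position, [])
        if not words:
            continue
        shortlist = sorted(words, key=acoustic_quality, reverse=True)[:shortlist_size]
        word_set.append(shortlist[0])  # best-scoring word for this position
    return word_set


# A dummy scorer (here simply preferring shorter words) can stand in for a real
# acoustic model when exercising the sketch.
example = {"beginning": ["but", "butter"], "middle": ["label"], "end": ["tub"]}
print(select_word_set(example, acoustic_quality=lambda w: -len(w)))
```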
In an embodiment, the defined text content comprises 150 to 300 words. In an example, the text content comprises 175 words. In another example, the text content comprises 250 words. In yet another example, the text content may be a short story comprising the word-sets for each speech unit of the set of speech units.
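Continuing the same illustrative sketch, the word-sets formed for all speech units could then be concatenated into the text content and checked against the 150 to 300 word target; the function name and the warning behaviour below are assumptions made for illustration.

```python
from typing import Dict, List


def build_text_content(word_sets: Dict[str, List[str]],
                       min_words: int = 150, max_words: int = 300) -> str:
    """Join the word-sets of all speech units into a single text content and
    warn if the total number of words falls outside the target range."""
    words: List[str] = []
    for word_set in word_sets.values():
        words.extend(word_set)
    if not (min_words <= len(words) <= max_words):
        print(f"warning: text content has {len(words)} words "
              f"(target {min_words} to {max_words})")
    return " ".join(words)
```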
In one embodiment, the server arrangement comprises one or more servers. For example, the server may be a cloud server that is associated with a service-provider offering a service, such as speech segmentation, text-to-speech synthesis and so forth.
In another embodiment, the server arrangement is configured to provide to a mobile communication device the defined text content for speech segmentation purposes. In an example, the mobile communication device may be a smartphone, tablet, laptop, and so forth, and may comprise software (such as a software application) and hardware (such as a processor, display, battery, memory, microphone, and other electronic components) for speech segmentation purposes. Further, the mobile communication device may be communicably coupled to the server arrangement. For example, the server may be a cloud server that is communicably coupled to the mobile communication device via a cloud network. In such instance, the mobile communication device may be associated with a user (for example, a customer seeking services of a service provider) and may enable the user to obtain information such as the set of speech units, the one or more words in the natural language comprising the speech unit, the defined text content, and so forth, from the server arrangement.
In an embodiment, the mobile communication device is operable to execute some of the steps associated with the method of defining text content for speech segmentation (using on-board electronic components such as a processor), and the remaining steps may be executed by the server arrangement. For example, the processor may be operable to define the set of speech units and determine one or more words in the given natural language that comprise the speech unit. Subsequently, the step of selecting at least one word comprising the speech unit to form a word-set to be employed in the text content may be executed by the server.
According to an embodiment, the system further comprises a database arrangement that is coupled in communication with the server arrangement, wherein the server arrangement is configured to store the defined text content at the database arrangement. In one embodiment, the information (such as the set of speech units, the one or more words in the natural language comprising the speech unit, the defined text content, and so forth) may be partially stored on the mobile communication device (such as, on the on-board memory associated with the mobile communication device), and the remainder of the information may be stored on the database arrangement. For example, the set of speech units and the determined one or more words may be stored in the memory unit, and the word-sets for all speech units and the text content may be stored in the database.
According to an embodiment, the database arrangement may be configured to store the defined text content for future use.
DETAILED DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram of a system 100 for defining text content for speech segmentation, in accordance with an embodiment of the present disclosure. The system 100 comprises a server arrangement 130 that is communicably coupled to a mobile communication device 120 of a given user 110, wherein the server arrangement 130 is configured to provide to the mobile communication device 120 the defined text content for speech segmentation purposes. The system also comprises a database arrangement 140 that is coupled in communication with the server arrangement 130, wherein the server arrangement is configured to store the defined text content at the database arrangement 140.
FIG. 2 is a schematic illustration of stages of a method of defining text content for speech segmentation, in accordance with an embodiment of the present disclosure. At stage 200, a set of speech units for a given natural language is defined. As shown, the set of speech units comprises phonemes, chunks and consonant blends. Further, the set of phonemes comprises phonemes such as P1, P2, P3, and P4. The set of chunks comprises chunks such as C1, C2, C3, and C4. Also, the set of consonant blends comprises the consonant blends B1, B2, B3, and B4. Initially, the phoneme P1 is selected. It may be evident that the remaining phonemes such as P2, P3 and so forth, the chunks and the consonant blends are selected subsequent to the phoneme P1. At stage 210, one or more words in the natural language are determined that comprise the speech unit located substantially at a beginning portion, a middle portion and/or an end portion of the at least one word. As shown, the words 'Word 1', 'Word 3' and 'Word 6' to 'Word i' are determined to comprise the phoneme P1 located substantially at a beginning portion, a middle portion and/or an end portion. Similarly, the words 'Word i+1' to 'Word j' are determined to comprise the chunk C1 located substantially at a beginning portion, a middle portion and/or an end portion, and the words 'Word j+1' to 'Word k' are determined to comprise the consonant blend B1 located substantially at a beginning portion, a middle portion and/or an end portion. At stage 220, the word-set corresponding to the phoneme P1 is formed comprising the words 'Word 1', 'Word 3' and 'Word 6' from the determined one or more words comprising the phoneme (words 'Word 1' to 'Word i'). Similarly, the word-set corresponding to the chunk C1 is formed comprising the words 'Word i+1', 'Word i+6' and 'Word i+9', and the word-set corresponding to the consonant blend B2 is formed comprising the words 'Word j+21', 'Word j+14' and 'Word j+16' from the determined one or more words for the chunks and consonant blends respectively. Further, the word-sets for each speech unit of the set of speech units, such as the phonemes, chunks and consonant blends, are employed in the text content.
FIG. 3 is an illustration of steps of a method 300 of defining text content for speech segmentation, in accordance with an embodiment of the present disclosure. At step 310, a set of speech units for a given natural language is defined. At step 320, for each speech unit of the set of speech units, one or more words in the given natural language are determined that comprise the speech unit, wherein the one or more words comprise at least one of at least one word in which the speech unit is located substantially at a beginning portion of the at least one word, at least one word in which the speech unit is located substantially at a middle portion of the at least one word, at least one word in which the speech unit is located substantially at an end portion of the at least one word. At step 330, for each speech unit of the set of speech units, from amongst the determined one or more words, at least one word comprising the speech unit is selected to form a word-set to be employed in the text content, wherein the at least one word is selected based upon at least one predefined selection criterion.
The steps 310 to 330 are only illustrative and other alternatives can also be provided where one or more steps are added, one or more steps are removed, or one or more steps are provided in a different sequence without departing from the scope of the claims herein. For example, the speech units may comprise phonemes, chunks having a collection of multiple phonemes and consonant blends, wherein the phonemes may further comprise at least one of vowel phonemes and/or consonant phonemes. In another example, the one or more words are determined based upon a location of at least one grapheme corresponding to the speech unit in the one or more words. In yet another example, the at least one word is selected based upon acoustic quality of the speech unit in the at least one word, as per the at least one predefined selection criterion. In one example, the text content comprises 150 to 300 words. In another example, the defined text content may be stored for future use or may be provided to a mobile communication device for speech segmentation purposes.
Modifications to embodiments of the present disclosure described in the foregoing are possible without departing from the scope of the present disclosure as defined by the accompanying claims. Expressions such as "including", "comprising", "incorporating", "have" and "is", used to describe and claim the present disclosure, are intended to be construed in a non-exclusive manner, namely allowing for items, components or elements not explicitly described also to be present. Reference to the singular is also to be construed to relate to the plural.

Claims (15)

1. A method of defining text content for speech segmentation, the method comprising:
- defining a set of speech units for a given natural language;
- for each speech unit of the set of speech units, determining one or more words in the given natural language that comprise the speech unit, wherein the one or more words comprise at least one of:
(i) at least one word in which the speech unit is located substantially at a beginning portion of the at least one word,
(ii) at least one word in which the speech unit is located substantially at a middle portion of the at least one word, (iii) at least one word in which the speech unit is located substantially at an end portion of the at least one word; and
- for each speech unit of the set of speech units, selecting, from amongst the determined one or more words, at least one word comprising the speech unit to form a word-set to be employed in the text content, wherein the at least one word is selected based upon at least one predefined selection criterion.
2. A method according to claim 1, wherein the speech units comprise phonemes.
3. A method according to claim 2, wherein the phonemes comprise at least one of: vowel phonemes, consonant phonemes.
4. A method according to claim 1, wherein the speech units comprise chunks having a collection of multiple phonemes.
5. A method according to claim 1, wherein the speech units comprise consonant blends.
6. A method according to claim 1, wherein the one or more words are determined based upon a location of at least one grapheme corresponding to the speech unit in the one or more words.
7. A method according to claim 1, wherein the at least one word is selected based upon acoustic quality of the speech unit in the at least one word, as per the at least one predefined selection criterion.
8. A method according to claim 1, wherein the defined text content comprises 150 to 300 words.
9. A method according to claim 1, further comprising storing the defined text content for future use.
10. A method according to claim 1, further comprising providing to a mobile communication device the defined text content for speech segmentation purposes.
11. A system for defining text content for speech segmentation, the system comprising a server arrangement that is configured to:
- define a set of speech units for a given natural language;
- for each speech unit of the set of speech units, determine one or more words in the given natural language that comprise the speech unit, wherein the one or more words comprise at least one of:
(i) at least one word in which the speech unit is located substantially at a beginning portion of the at least one word, (ii) at least one word in which the speech unit is located substantially at a middle portion of the at least one word, (iii) at least one word in which the speech unit is located substantially at an end portion of the at least one word; and
- for each speech unit of the set of speech units, select, from amongst the determined one or more words, at least one word comprising the speech unit to form a word-set to be employed in the text content, wherein the at least one word is selected based upon at least one predefined selection criterion.
12. A system according to claim 11, wherein the server arrangement is configured to determine the one or more words based upon a location of at least one grapheme corresponding to the speech unit in the one or more words.
13. A system according to claim 11, wherein the server arrangement is configured to select the at least one word based upon acoustic quality of the speech unit in the at least one word, as per the at least one predefined selection criterion.
14. A system according to claim 11, further comprising a database arrangement that is coupled in communication with the server arrangement, wherein the server arrangement is configured to store the defined text content at the database arrangement.
15. A system according to claim 11, wherein the server arrangement is communicably coupled to a mobile communication device of a given user, and wherein the server arrangement is configured to provide to the mobile communication device the defined text content for speech segmentation purposes.
Intellectual Property Office. Application No: GB1702580.0. Examiner: Mr Steve Evans
GB1702580.0A 2017-02-17 2017-02-17 Method and system for defining text content for speech segmentation Withdrawn GB2559766A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB1702580.0A GB2559766A (en) 2017-02-17 2017-02-17 Method and system for defining text content for speech segmentation


Publications (3)

Publication Number Publication Date
GB201702580D0 GB201702580D0 (en) 2017-04-05
GB2559766A true GB2559766A (en) 2018-08-22
GB2559766A8 GB2559766A8 (en) 2018-10-10

Family

ID=58486892

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1702580.0A Withdrawn GB2559766A (en) 2017-02-17 2017-02-17 Method and system for defining text content for speech segmentation

Country Status (1)

Country Link
GB (1) GB2559766A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2218602A (en) * 1988-05-10 1989-11-15 Seiko Epson Corp Voice synthesizer
WO2004032112A1 (en) * 2002-10-04 2004-04-15 Koninklijke Philips Electronics N.V. Speech synthesis apparatus with personalized speech segments
EP1777697A2 (en) * 2000-12-04 2007-04-25 Microsoft Corporation Method and apparatus for speech synthesis without prosody modification


Also Published As

Publication number Publication date
GB2559766A8 (en) 2018-10-10
GB201702580D0 (en) 2017-04-05


Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)