CN110136688B - Text-to-speech method based on speech synthesis and related equipment - Google Patents
- Publication number
- CN110136688B (application CN201910298456.3A)
- Authority
- CN
- China
- Prior art keywords
- voice
- user
- text
- speech
- broadcasting
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/027—Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L13/10—Prosody rules derived from text; Stress or intonation
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N1/00—Scanning, transmission or reproduction of documents or the like, e.g. facsimile transmission; Details thereof
- H04N1/00127—Connection or combination of a still picture apparatus with another apparatus, e.g. for storage, processing or transmission of still picture signals or of information associated with a still picture
- H04N1/00204—Connection or combination of a still picture apparatus with another apparatus, e.g. for storage, processing or transmission of still picture signals or of information associated with a still picture with a digital computer or a digital computer system, e.g. an internet server
- H04N1/00209—Transmitting or receiving image data, e.g. facsimile data, via a computer, e.g. using e-mail, a computer network, the internet, I-fax
- H04N1/00222—Transmitting or receiving image data, e.g. facsimile data, via a computer, e.g. using e-mail, a computer network, the internet, I-fax details of image data generation or reproduction, e.g. scan-to-email or network printing
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N1/00—Scanning, transmission or reproduction of documents or the like, e.g. facsimile transmission; Details thereof
- H04N1/04—Scanning arrangements, i.e. arrangements for the displacement of active reading or reproducing elements relative to the original or reproducing medium, or vice versa
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L2013/083—Special characters, e.g. punctuation marks
Abstract
The application relates to voice synthesis technology in the field of voice semantics, and in particular to a text-to-voice method based on voice synthesis and related equipment. The method comprises the following steps: receiving a scanning request from a user, and scanning the characters to be identified selected by the user into an electronic text; converting the electronic text into a voice text through a text-to-voice system, and prompting the user by voice that the conversion succeeded; and acquiring a voice broadcasting request from the user, and broadcasting the voice text by voice. By iteratively calculating the weight of each word with the TextRank algorithm, the method allows the whole text to be quickly converted into voice.
Description
Technical Field
The application relates to the field of voice semantics, and in particular to a method for converting text into voice based on voice synthesis and related equipment.
Background
With the arrival of the big-data age, knowing what is happening around the world at any time and in any place has become an important part of people's daily lives. Big data brings huge amounts of information that change everyone's work and life, yet not everyone can keep track of the information they care about at any time and place; among those who cannot, the blind, children and the elderly account for a large proportion.
Existing reading-assistance equipment can read aloud a great deal of information, but it is concentrated on a limited range of matched reading material, is expensive, and reads only targeted information, so the content obtained is limited and cannot satisfy the listening needs of the blind, children and the elderly. At present, ordinary printed matter still dominates the market, and most blind people, children and the elderly depend on others as their source of information and cannot read independently, which greatly inconveniences improvement of their work and lives.
At present, the process of converting text into voice suffers from the problem that a whole text cannot be quickly converted into voice, so converting text materials such as books into voice consumes a great deal of time, and the efficiency of text-to-voice conversion is low.
Disclosure of Invention
Based on this, it is necessary to provide a text-to-speech method based on speech synthesis and related equipment for the problem that the whole text cannot be quickly converted into speech in the text-to-speech process.
A text-to-speech method based on speech synthesis includes:
receiving a scanning request of a user, calling a character scanning system, and scanning characters to be identified selected by the user into electronic texts;
reading punctuation marks in the electronic text through a regular expression, defining texts between two adjacent punctuation marks as single independent sentences, and segmenting the electronic text into a plurality of independent sentences;
determining keywords in the plurality of independent sentences respectively through a TextRank algorithm, and inserting a blank character between each keyword and the adjacent words using the string-splitting function split;
converting words in the independent sentences into speech waveforms through a preset word-to-speech system, and converting empty characters in the independent sentences into silence to form a single voice file packet;
and performing voice synthesis on each voice file packet to obtain a voice text corresponding to the electronic text.
In one possible design, the determining, by the TextRank algorithm, keywords in the plurality of independent sentences respectively includes:
performing word segmentation on the independent sentence and tagging the part of speech of each word; after tagging, retaining nouns, verbs, adjectives and adverbs, and constructing a word network of the independent sentence, wherein the word network is a relation network formed by the interactions between the words, and each word in the independent sentence serves as a node in the word network;
and iteratively calculating a weight ordering result of each word by using the TextRank algorithm, wherein the TextRank iterative calculation formula is:

WS(V_i) = (1 − d) + d × Σ_{V_j ∈ In(V_i)} [ w_{ji} / Σ_{V_k ∈ Out(V_j)} w_{jk} ] × WS(V_j)

wherein WS(V_i) is the weight of node V_i in the independent sentence; d is the damping coefficient, a preset constant; In(V_i) is the set of nodes pointing to node V_i; w_{ji} is the edge weight between node V_j and node V_i; Out(V_j) is the set of nodes pointed to by node V_j; node V_k is a node pointed to by node V_j; w_{jk} is the edge weight between node V_k and node V_j; and WS(V_j) is the weight of node V_j in the independent sentence;
dividing the weights of all nodes by the maximum weight among them to obtain the normalized weight of each node, and defining the words corresponding to nodes whose normalized weight is greater than a preset weight threshold as keywords.
In one possible design, after the voice synthesis is performed on each voice file package to obtain the voice text corresponding to the electronic text, the method further includes:
prompting the user to submit a voice parameter setting request through a voice interaction system, and automatically setting voice parameters according to the voice parameter setting request submitted by the user.
In one possible design, prompting the user through the voice interaction system to submit a voice parameter setting request, and automatically setting the voice parameters according to the request submitted by the user, includes:
after converting the electronic text into a voice text, inquiring through a preset voice interaction system whether the user wishes to set voice parameters, and prompting the user by voice to reply "yes" or "no", wherein the voice parameters include the broadcasting speed and the broadcasting voice;
when the user's spoken reply is "yes", prompting the user to select a broadcasting grade, wherein the broadcasting grades comprise 0.8× slow speed, normal speed, 1.5× fast speed, 2× fast speed and 3× fast speed; obtaining the broadcasting grade selected by the user, and automatically setting the broadcasting speed according to that grade;
then prompting the user to select a special voice, wherein the special voices comprise the original system voice, trending voices, celebrity voices and sound-effect voices; obtaining the special voice selected by the user, and automatically setting the broadcasting voice accordingly;
and when the user's spoken reply is "no", defaulting the broadcasting speed to normal speed and the broadcasting voice to the system voice.
In one possible design, the obtaining the voice broadcast request of the user, broadcasting the voice text through voice includes:
inquiring through the voice interaction system whether the user wishes to broadcast the voice text, and prompting the user by voice to reply "yes" or "no";
when the user's spoken reply is "yes", obtaining the voice parameters preset in the voice interaction system, and broadcasting the voice text through the voice interaction system according to those parameters;
and when the user's spoken reply is "no", prompting the user that the voice will not be broadcast for now and that the text can be rescanned if necessary.
Based on the same technical conception, the application also provides a text-to-speech device based on speech synthesis, which comprises:
the character scanning module is used for receiving a scanning request of a user, calling a character scanning system and scanning characters to be identified selected by the user into electronic texts;
the text conversion module is used for reading the punctuation marks in the electronic text through a regular expression, defining the text between two adjacent punctuation marks as a single independent sentence, and segmenting the electronic text into a plurality of independent sentences; determining keywords in the plurality of independent sentences respectively through a TextRank algorithm, and inserting a blank character between each keyword and the adjacent words using the string-splitting function split; converting the words in the independent sentences into speech waveforms through a preset text-to-speech system, and converting the blank characters in the independent sentences into silence to form single voice file packets; and performing voice synthesis on each voice file packet to obtain a voice text corresponding to the electronic text.
In one possible design, the text conversion module is further configured to:
prompting the user to submit a voice parameter setting request through a voice interaction system, and automatically setting voice parameters according to the voice parameter setting request submitted by the user.
In one possible design, the text conversion module is further configured to:
after converting the electronic text into a voice text, inquiring through a preset voice interaction system whether the user wishes to set voice parameters, and prompting the user by voice to reply "yes" or "no", wherein the voice parameters include the broadcasting speed and the broadcasting voice;
when the user's spoken reply is "yes", prompting the user to select a broadcasting grade, wherein the broadcasting grades comprise 0.8× slow speed, normal speed, 1.5× fast speed, 2× fast speed and 3× fast speed; obtaining the broadcasting grade selected by the user, and automatically setting the broadcasting speed according to that grade;
then prompting the user to select a special voice, wherein the special voices comprise the original system voice, trending voices, celebrity voices and sound-effect voices; obtaining the special voice selected by the user, and automatically setting the broadcasting voice accordingly;
and when the user's spoken reply is "no", defaulting the broadcasting speed to normal speed and the broadcasting voice to the system voice.
Based on the same conception, the application proposes a computer device comprising a memory and one or more processors, the memory storing computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to perform the steps of the above text-to-speech method based on speech synthesis.
Based on the same conception, the present application proposes a storage medium storing computer readable instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of the above-described text-to-speech method based on speech synthesis.
According to the text-to-speech method based on speech synthesis and the related equipment, a character scanning system is called upon receiving a scanning request from the user, and the characters to be identified selected by the user are scanned into an electronic text; the electronic text is converted into a voice text through a text-to-voice system, and the user is reminded by voice that the conversion succeeded; a voice broadcasting request from the user is then acquired, and the voice text is broadcast by voice. By combining character conversion and voice synthesis, the application converts text into voice and provides great help for the reading of the blind, children and the elderly; at the same time, keywords are accurately identified based on the TextRank algorithm, further improving the accuracy of the voice synthesis.
Drawings
FIG. 1 is a flow chart of a text-to-speech method based on speech synthesis according to an embodiment of the present application;
FIG. 2 is a diagram of a word network in accordance with one embodiment of the present application;
FIG. 3 is a flowchart of the voice broadcast of step S3 according to an embodiment of the present application;
fig. 4 is a schematic diagram of a text-to-speech apparatus based on speech synthesis according to an embodiment of the application.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of stated features, integers, steps, operations, elements and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof.
Fig. 1 is a flowchart of a text-to-speech method based on speech synthesis according to an embodiment of the present application, as shown in fig. 1, a text-to-speech method based on speech synthesis, comprising the steps of:
step S1, character scanning: and receiving a scanning request of a user, calling a character scanning system, and scanning characters to be identified selected by the user into electronic texts.
This step converts paper characters into a computer-readable electronic text through character scanning, effectively bridging paper text and computer text and improving the efficiency of converting characters into voice.
In one embodiment, step S1 includes the following specific steps:
step S101, scanning imaging: and receiving a scanning request of a user, calling a preset character scanning system, and scanning characters to be identified selected by the user into images through the character scanning system.
Various paper texts are read quickly through the preset character scanning system and serve as the basis of the text conversion; by scanning the paper text into an image, the switch between paper text and computer text is achieved quickly. In particular, scanning applications such as CamScanner, cloud-note tools or WPS text recognition may be selected.
For example: while reading in a library, user A finds a poetry collection of great literary value, but the collection is no longer in print and cannot be purchased. User A can photograph the content of interest in the collection with a mobile phone camera and save it as pictures, saving a great deal of transcription time and effort.
Step S102, text conversion: recognizing characters in the image through image-to-character software preset in the character scanning system, and forming the characters into corresponding electronic texts.
The scanned image is quickly subjected to character recognition: optical character recognition (OCR) technology in the image-to-character software converts the characters in the image into text format for further editing and processing. The recognition rate is high and the speed is fast, laying a good foundation for converting the text into voice.
For example: the user A wants to convert the shot picture into a text collection as a local electronic document, at the moment, the characters in the stored picture can be identified through OCR technology, and the identified characters can be copied and used in various computer texts.
This embodiment refines the character recognition technique: advanced OCR technology quickly ingests and processes paper characters to obtain an electronic text, reducing the error rate and labor cost of manual input, facilitating management of the electronic text, and enabling its rapid transmission and storage.
Step S2, independent sentence segmentation: and reading punctuation marks in the electronic text through a regular expression, defining texts between two adjacent punctuation marks as single independent sentences, and segmenting the electronic text into a plurality of independent sentences.
A regular expression is a logical formula for operating on character strings: a "regular string" is formed from preset specific characters and combinations thereof, and expresses filtering logic to be applied to the strings.
For example: in step S2, suppose the poetry collection identified by user A is the "Ten Admonishments" by Tsangyang Gyatso, and user A wants to share its graceful text with blind user B, so the recognized electronic text needs to be converted into voice. One section of the electronic text reads "The first, it is best not to meet, so as not to fall in love. The second, it is best not to know each other, so as not to miss each other. The third, it is best not to accompany each other, so as not to owe each other." The punctuation marks "，" and "。" in the electronic text can be identified by regular expressions, and the piece of electronic text is split into six independent sentences in the style of "The first, it is best not to meet".
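The sentence segmentation of step S2 can be illustrated with a minimal Python sketch. Python and the exact delimiter set are assumptions for illustration only; the patent does not specify an implementation language.

```python
import re

# Assumed set of sentence-ending punctuation marks (Chinese and Western).
PUNCT = re.compile(r"[，。！？；,.!?;]")

def split_sentences(text):
    """Treat the text between two adjacent punctuation marks as one
    independent sentence; drop empty fragments."""
    return [s.strip() for s in PUNCT.split(text) if s.strip()]

sentences = split_sentences(
    "第一最好不相见，如此便可不相恋。第二最好不相知，如此便可不相思。"
)
# Each comma- or period-delimited clause becomes one independent sentence.
```

On the two-sentence fragment above this yields four independent sentences, mirroring the six-sentence split of the full three-sentence example in the description.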
Step S3, keyword determination: determining keywords in the plurality of independent sentences respectively through a TextRank algorithm, and inserting a blank character between each keyword and the adjacent words using the string-splitting function split.
By identifying the keywords and adding blank characters between the keywords and the other words, this step creates short pauses during voice broadcasting, which on the one hand highlights the keywords and on the other adjusts the cadence of the broadcast, making it closer to natural human speech.
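The blank-character insertion around keywords can be sketched as follows. The helper name is hypothetical, and the sketch assumes word segmentation and keyword determination have already been performed; the claim's split-based mechanics are not specified in detail.

```python
def mark_keywords(words, keywords):
    """Insert a blank character (here a single space, later rendered as
    silence) before and after each keyword in the word sequence."""
    out = []
    for w in words:
        if w in keywords:
            out.append(" ")   # pause before the keyword
            out.append(w)
            out.append(" ")   # pause after the keyword
        else:
            out.append(w)
    # Collapse doubled blanks produced by adjacent keywords.
    return "".join(out).replace("  ", " ").strip()

line = mark_keywords(["第一", "最好", "不相见"], {"最好"})
```

The resulting string carries one blank character on each side of the keyword "最好", exactly the two pause positions discussed in the example of step S4 below.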
In one embodiment, step S3 may include the following steps: performing word segmentation on the independent sentence and tagging the part of speech of each word; after tagging, retaining nouns, verbs, adjectives and adverbs, and constructing a word network of the independent sentence, wherein the word network is a relation network formed by the interactions between the words, and each word in the independent sentence serves as a node in the word network; and iteratively calculating a weight ordering result of each word by using the TextRank algorithm, wherein the TextRank iterative calculation formula is:

WS(V_i) = (1 − d) + d × Σ_{V_j ∈ In(V_i)} [ w_{ji} / Σ_{V_k ∈ Out(V_j)} w_{jk} ] × WS(V_j)

wherein WS(V_i) is the weight of node V_i in the independent sentence; d is the damping coefficient, a preset constant; In(V_i) is the set of nodes pointing to node V_i; w_{ji} is the edge weight between node V_j and node V_i; Out(V_j) is the set of nodes pointed to by node V_j; node V_k is a node pointed to by node V_j; w_{jk} is the edge weight between node V_k and node V_j; and WS(V_j) is the weight of node V_j in the independent sentence. The weights of all nodes are then divided by the maximum weight among them to obtain the normalized weight of each node, and the words corresponding to nodes whose normalized weight is greater than a preset weight threshold are defined as keywords.
The damping coefficient in this step is generally preset to 0.85. By constructing the word network of each independent sentence, the weight ordering result of each word is iterated with the TextRank algorithm. For example, fig. 2 is a schematic diagram of the word network of "The first, it is best not to meet" in step S3; after the keywords in the independent sentence are identified from the relationships between the words in the word network, blank characters are added to the independent sentence to create the feel of human speech.
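The weight iteration can be sketched in Python on a toy word network. Unit edge weights, an undirected graph and the 0.9 normalization threshold are simplifying assumptions for illustration; d = 0.85 as stated in the description.

```python
def textrank(graph, d=0.85, iterations=50):
    """Iterate WS(Vi) = (1-d) + d * sum over neighbours Vj of
    (w_ji / sum_k w_jk) * WS(Vj); with unit edge weights the inner
    ratio reduces to 1 / degree(Vj)."""
    ws = {v: 1.0 for v in graph}
    for _ in range(iterations):
        ws = {
            v: (1 - d) + d * sum(ws[j] / len(graph[j]) for j in graph[v])
            for v in graph
        }
    return ws

# Toy word network for one independent sentence: "最好" interacts
# with both of the other words.
graph = {"第一": ["最好"], "最好": ["第一", "不相见"], "不相见": ["最好"]}
ws = textrank(graph)

# Normalize by the maximum weight and keep words above the threshold.
max_w = max(ws.values())
keywords = {v for v, w in ws.items() if w / max_w > 0.9}
```

The central word "最好" accumulates the largest weight and survives normalization and thresholding as the sentence's keyword, matching the example in the description.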
Step S4, voice file package formation: and converting the words in the independent sentences into speech waveforms through a preset word-to-speech system, and converting the empty characters in the independent sentences into silence to form a single voice file packet.
In the step, independent sentences are converted into speech waveforms with silence, so that a voice file packet for generating voice texts is formed, and a foundation is provided for voice generation.
For example: in step S3, "the first", "it is best" and "not to meet" form speech waveforms, and the keyword is "it is best", so blank characters are generated between "the first" and "it is best" and between "it is best" and "not to meet". These two blank characters are converted into silence, and the whole independent sentence forms a speech waveform with two short pauses; blind user B perceives a sense of rhythm while listening, and the voice broadcasting process is more natural.
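The conversion of blank characters into silence in step S4 can be sketched in pure Python, treating each word's audio as a list of samples and a pause as a run of zeros. The sample rate, pause length and dummy word audio below are assumed values, not taken from the patent.

```python
SAMPLE_RATE = 16000        # assumed sample rate in Hz
PAUSE_SECONDS = 0.2        # assumed length of one short pause

def synthesize_sentence(word_waveforms):
    """Concatenate word waveforms into one voice file packet; a None
    entry stands for a blank character and is rendered as a run of
    zero samples (silence)."""
    silence = [0.0] * int(SAMPLE_RATE * PAUSE_SECONDS)
    samples = []
    for w in word_waveforms:
        samples.extend(silence if w is None else w)
    return samples

# "word | pause | keyword | pause | word" with dummy word audio:
packet = synthesize_sentence(
    [[0.1] * 100, None, [0.2] * 100, None, [0.3] * 100]
)
```

The packet is one continuous sample stream containing two silent gaps, which is what gives the broadcast its two short pauses around the keyword.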
Step S5, generating the voice text: performing voice synthesis on each voice file packet to obtain a voice text corresponding to the electronic text.
This step uses voice synthesis to form, for a given section of electronic text, a complete voice text for voice broadcasting, avoiding the choppiness caused by overly short speech segments during broadcasting.
For example: the "first" in step S2 is preferably invisible, so that it is not loved. The second is preferably unknown, and thus can be ignored. The third is preferably not accompanied, and thus may not be sufficient. "will form phonetic text comprising six independent sentences.
This embodiment expands the process of converting an electronic text into a voice text and provides detailed technical support; at the same time, the weight of each word is calculated iteratively with the TextRank algorithm, making the calculation more accurate while quickly converting the whole text into voice.
In one embodiment, after step S5, the method further comprises:
prompting the user to submit a voice parameter setting request through a voice interaction system, and automatically setting voice parameters according to the voice parameter setting request submitted by the user.
This step satisfies different users' voice requirements through personalized settings, adding functions for personalizing the broadcasting speed and characteristic voices and improving the user experience.
Optionally, after the electronic text is converted into a voice text, the preset voice interaction system inquires whether the user wishes to set voice parameters and prompts the user by voice to reply "yes" or "no", the voice parameters including the broadcasting speed and the broadcasting voice. When the user's spoken reply is "yes", the user is prompted to select a broadcasting grade from 0.8× slow speed, normal speed, 1.5× fast speed, 2× fast speed and 3× fast speed; the selected grade is obtained and the broadcasting speed is set automatically according to it. The user is then prompted to select a special voice from the original system voice, trending voices, celebrity voices and sound-effect voices; the selected special voice is obtained and the broadcasting voice is set automatically according to it. When the user's spoken reply is "no", the broadcasting speed defaults to normal speed and the broadcasting voice defaults to the system voice.
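The parameter-setting dialogue reduces to a mapping from broadcasting grade to playback rate plus defaults for the "no" reply. A minimal sketch; the function name, grade labels and the wants_custom flag are hypothetical, while the rates and defaults come from the description.

```python
# Broadcasting grades named in the description, mapped to playback rates.
SPEED_LEVELS = {
    "slow": 0.8, "normal": 1.0, "fast": 1.5, "faster": 2.0, "fastest": 3.0,
}

def set_voice_parameters(wants_custom, level="normal", special_voice="system"):
    """Return (playback_rate, voice). When the user replies 'no' to the
    parameter-setting prompt, the defaults apply: normal speed and the
    system voice."""
    if not wants_custom:
        return 1.0, "system"
    return SPEED_LEVELS[level], special_voice

rate, voice = set_voice_parameters(True, "slow", "star")
```

Here a "yes" reply with the 0.8× grade and a celebrity voice yields the configuration used in the blind user B example below; a "no" reply falls back to the defaults.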
For example: the blind person B can set the broadcasting speed and characteristic voice of the voice text of the 'poetry with a caveat', and can select the voice of the news simulcast host at 0.8 times of the slow speed on the premise that the blind person B is a faithful listener of the news simulcast and also likes the broadcasting effect of a little slower.
The application adds voice parameter settings to meet the needs of different users. On the one hand, the broadcasting-speed setting satisfies the speed requirements of different groups in different environments; on the other hand, characteristic voices cater to the preferences of people of different ages, professions and backgrounds, improving user experience and customer satisfaction, enlarging market share and seizing market resources.
In one embodiment, after the step S5, the method further includes a step S6 of voice broadcasting, as shown in fig. 3, including the following specific steps:
step S601, a voice broadcast request: inquiring whether the user broadcasts the voice text or not through a voice interaction system, and prompting whether the user replies yes or no through voice.
Adding a voice inquiry function through the voice interaction system meets the needs of the blind, children and the elderly; it is especially helpful, and of great significance, for people who have difficulty reading.
Step S602, voice broadcast feedback: and when the voice replied by the user is obtained to be yes, obtaining voice parameters preset in a voice interaction system, and broadcasting the voice text through the voice interaction system according to the voice parameters.
In this step, voice broadcasting is triggered by a spoken reply: when the user cannot issue commands through other input means, broadcasting can be started directly by voice, providing a convenient mode of operation and improving broadcasting efficiency.
Step S603, voice pause feedback: and when the voice replied by the user is obtained to be negative, prompting the user to temporarily not broadcast the voice, and rescanning the text if necessary.
In order to prevent misoperation of a user, the voice broadcasting can be directly paused, character recognition can be carried out again under the condition of pausing by mouth error, and the response is more intelligent and convenient.
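As an illustrative sketch only, steps S601 to S603 above can be outlined as a small dialogue loop; the `listen`, `speak`, `broadcast` and `rescan` callables are assumed interfaces standing in for the voice interaction system, not details taken from this application:

```python
# Hypothetical sketch of steps S601-S603; all four callables are assumed
# interfaces standing in for the voice interaction system.
def broadcast_flow(listen, speak, broadcast, rescan, voice_text, params):
    # S601: ask the user whether to broadcast, expecting a yes/no voice reply
    speak("Broadcast the voice text? Please reply yes or no.")
    if listen() == "yes":
        # S602: broadcast with the preset voice parameters
        broadcast(voice_text, params)
        return "broadcast"
    # S603: pause, then optionally rescan the text
    speak("The voice text will not be broadcast for now.")
    return "rescan" if rescan() else "paused"
```

A caller would wire these callables to the actual microphone, speaker and scanner interfaces of the device.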
This embodiment distinguishes the different broadcasting demands of users, meets the requirements of various broadcasting environments, improves user experience, and makes the text-to-voice process more humanized.
Based on text scanning and speech synthesis, the embodiments of the application combine currently mature technologies to convert paper text into voice text, which greatly helps the blind, children and the elderly to read. The TextRank algorithm is adopted to accurately identify keywords, so that electronic text is converted into voice text with high accuracy, raising speech synthesis technology to a new level.
In one embodiment, a text-to-speech device based on speech synthesis is provided, as shown in fig. 4, comprising:
the character scanning module is used for receiving a scanning request of a user, calling a character scanning system, and scanning the characters to be recognized selected by the user into an electronic text;
the text conversion module is used for reading the punctuation marks in the electronic text through a regular expression, defining the text between two adjacent punctuation marks as a single independent sentence, and thereby segmenting the electronic text into a plurality of independent sentences; determining the keywords in the independent sentences through a TextRank algorithm, and using the string segmentation function split to add an empty character between each keyword and the adjacent words; converting the words in each independent sentence into speech waveforms through a preset word-to-speech system, and converting the empty characters in the independent sentence into silence to form a single voice file packet; and performing voice synthesis on each voice file packet to obtain a voice text corresponding to the electronic text.
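The splitting and empty-character steps performed by the text conversion module can be sketched as follows; the punctuation set, the choice of a full-width space as the "empty character", and the function names are illustrative assumptions of this sketch rather than details from the application:

```python
import re

# Illustrative punctuation set; a real system would also cover
# full-width Chinese punctuation marks.
PUNCT = re.compile(r"[,.;:!?]")
EMPTY_CHAR = "\u3000"  # assumed stand-in for the 'empty character'

def split_sentences(electronic_text):
    # Text between two adjacent punctuation marks is one independent sentence.
    return [p.strip() for p in PUNCT.split(electronic_text) if p.strip()]

def mark_keywords(sentence, keywords):
    # Append the empty character after each keyword; a later stage turns
    # that character into silence when synthesizing speech.
    return " ".join(w + EMPTY_CHAR if w in keywords else w
                    for w in sentence.split())
```

The word-to-speech stage would then map ordinary words to waveforms and each `EMPTY_CHAR` to a short silence before the per-sentence packets are synthesized.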
In one embodiment, the text conversion module is further configured to:
prompting the user to submit a voice parameter setting request through a voice interaction system, and automatically setting voice parameters according to the voice parameter setting request submitted by the user.
In one embodiment, the text conversion module is further configured to:
after converting the electronic text into a voice text, inquiring through a preset voice interaction system whether the user wants to set voice parameters, and prompting the user to reply yes or no by voice, wherein the voice parameters include broadcasting speed and broadcasting voice;
when the user's voice reply is yes, prompting the user to select a broadcasting level, wherein the broadcasting levels include 0.8-times slow speed, normal speed, 1.5-times fast speed, 2-times fast speed and 3-times fast speed, obtaining the broadcasting level selected by the user, and automatically setting the broadcasting speed according to that level;
prompting the user to select a special voice, wherein the special voices include the system's original voice, trending voices, star voices and sound-effect voices, obtaining the special voice selected by the user, and automatically setting the broadcasting voice accordingly;
and when the user's voice reply is no, defaulting the broadcasting speed to the normal speech speed and the broadcasting voice to the system sound.
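For illustration only, the broadcasting levels and defaults described in this embodiment could be modelled as below; the English level names and the `VoiceParams` structure are assumptions of this sketch, not part of the application:

```python
from dataclasses import dataclass

# Broadcasting levels named in the embodiment, mapped to speed multipliers.
SPEED_LEVELS = {
    "0.8x slow": 0.8,
    "normal": 1.0,
    "1.5x fast": 1.5,
    "2x fast": 2.0,
    "3x fast": 3.0,
}

@dataclass
class VoiceParams:
    speed: float = 1.0                 # default: normal speech speed
    voice: str = "system original"     # default: system sound

def set_voice_params(user_said_yes, level=None, special_voice=None):
    # When the user replies no, fall back to the defaults.
    if not user_said_yes:
        return VoiceParams()
    return VoiceParams(speed=SPEED_LEVELS[level], voice=special_voice)
```

The returned parameters would then be handed to the broadcasting step described later.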
In one embodiment, a computer device is provided, including a memory and a processor, where the memory stores computer readable instructions that, when executed by the processor, cause the processor to perform steps in a text-to-speech method based on speech synthesis in the above embodiments.
In one embodiment, a storage medium is provided, storing computer readable instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of the speech-synthesis-based text-to-speech method in the above embodiments.
Those skilled in the art will appreciate that all or part of the above-described methods may be implemented by a computer program stored in a computer-readable storage medium; when executed, the program may include the steps of the method embodiments described above. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, a read-only memory (ROM), or a random access memory (RAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, any combination that contains no contradiction should be considered within the scope of this description.
The foregoing examples illustrate only a few embodiments of the application and are described in relative detail, but they should not be construed as limiting the scope of the application. It should be noted that those skilled in the art can make several variations and modifications without departing from the spirit of the application, all of which fall within its scope of protection. Accordingly, the scope of protection of the present application is to be determined by the appended claims.
Claims (9)
1. A text-to-speech method based on speech synthesis, comprising:
receiving a scanning request of a user, calling a character scanning system, and scanning the characters to be recognized selected by the user into an electronic text;
reading the punctuation marks in the electronic text through a regular expression, defining the text between two adjacent punctuation marks as a single independent sentence, and thereby segmenting the electronic text into a plurality of independent sentences;
determining keywords in the independent sentences respectively through a TextRank algorithm, and using the string segmentation function split to add an empty character between each keyword and the adjacent words;
converting the words in each independent sentence into speech waveforms through a preset word-to-speech system, and converting the empty characters in the independent sentence into silence to form a single voice file packet;
performing voice synthesis on each voice file packet to obtain a voice text corresponding to the electronic text;
the determining the keywords in the independent sentences through the TextRank algorithm comprises the following steps:
performing word segmentation and part-of-speech tagging on the independent sentences, retaining the nouns, verbs, adjectives and adverbs after tagging, and constructing a word network of each independent sentence, wherein the word network is a relation network formed by the interaction between words, and each word in the independent sentence serves as a node in the word network;
and iteratively calculating a weight ordering result of each word by using the TextRank algorithm, wherein the TextRank iterative calculation formula is as follows:
WS(V_i) = (1 - d) + d × Σ_{V_j ∈ In(V_i)} [ w_ji / Σ_{V_k ∈ Out(V_j)} w_jk ] × WS(V_j)
wherein WS(V_i) is the weight of node V_i in the independent sentence, d is the damping coefficient (a preset constant), In(V_i) is the set of nodes pointing to node V_i, w_ji is the weight between node V_i and node V_j, Out(V_j) is the set of nodes pointed to by node V_j, node V_k is a node pointed to by node V_j, w_jk is the weight between node V_k and node V_j, and WS(V_j) is the weight of node V_j in the independent sentence;
dividing the weight of each node by the maximum weight among all the nodes to obtain the normalized weight of each node, and defining the words corresponding to the nodes whose normalized weight is greater than a preset weight threshold as keywords.
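A minimal sketch of the iterative weighting in claim 1 follows; the undirected co-occurrence graph, damping value and fixed iteration count are standard TextRank assumptions rather than details taken from this application:

```python
def textrank_weights(edges, d=0.85, iterations=50):
    # edges: {(word_i, word_j): weight} over an undirected word network,
    # so In(V) and Out(V) coincide and the out-weight sums are symmetric.
    nodes = {n for pair in edges for n in pair}
    out_sum = {n: 0.0 for n in nodes}
    for (i, j), w in edges.items():
        out_sum[i] += w
        out_sum[j] += w
    ws = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        new = {}
        for i in nodes:
            s = 0.0
            for (a, b), w in edges.items():
                if i in (a, b):
                    j = b if a == i else a
                    s += w / out_sum[j] * ws[j]  # w_ji / sum_k w_jk * WS(V_j)
            new[i] = (1 - d) + d * s
        ws = new
    top = max(ws.values())
    # Normalize by the maximum weight, as in the claim.
    return {n: v / top for n, v in ws.items()}
```

Words whose normalized weight exceeds the preset threshold would then be kept as keywords.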
2. The method for converting text to speech based on speech synthesis according to claim 1, wherein after speech synthesis is performed on each speech file packet to obtain a speech text corresponding to the electronic text, the method further comprises:
prompting the user to submit a voice parameter setting request through a voice interaction system, and automatically setting voice parameters according to the voice parameter setting request submitted by the user.
3. The method for converting text to speech based on speech synthesis according to claim 2, wherein said prompting said user to submit a speech parameter setting request via a speech interaction system, automatically setting speech parameters in accordance with said user-submitted speech parameter setting request, comprises:
after converting the electronic text into a voice text, inquiring through a preset voice interaction system whether the user wants to set voice parameters, and prompting the user to reply yes or no by voice, wherein the voice parameters include broadcasting speed and broadcasting voice;
when the user's voice reply is yes, prompting the user to select a broadcasting level, wherein the broadcasting levels include 0.8-times slow speed, normal speed, 1.5-times fast speed, 2-times fast speed and 3-times fast speed, obtaining the broadcasting level selected by the user, and automatically setting the broadcasting speed according to that level;
prompting the user to select a special voice, wherein the special voices include the system's original voice, trending voices, star voices and sound-effect voices, obtaining the special voice selected by the user, and automatically setting the broadcasting voice accordingly;
and when the user's voice reply is no, defaulting the broadcasting speed to the normal speech speed and the broadcasting voice to the system sound.
4. The method for converting text to speech based on speech synthesis according to claim 1, wherein after speech synthesis is performed on each speech file packet to obtain a speech text corresponding to the electronic text, the method further comprises a step of speech broadcasting, and specifically comprises:
inquiring through a voice interaction system whether the user wants the voice text broadcast, and prompting the user to reply yes or no by voice;
when the user's voice reply is yes, obtaining the voice parameters preset in the voice interaction system, and broadcasting the voice text through the voice interaction system according to the voice parameters;
and when the user's voice reply is no, prompting the user that the voice will not be broadcast for now, and that the text can be rescanned if necessary.
5. A text-to-speech apparatus based on speech synthesis, comprising:
the character scanning module is used for receiving a scanning request of a user, calling a character scanning system, and scanning the characters to be recognized selected by the user into an electronic text;
the text conversion module is used for reading the punctuation marks in the electronic text through a regular expression, defining the text between two adjacent punctuation marks as a single independent sentence, and thereby segmenting the electronic text into a plurality of independent sentences; determining the keywords in the independent sentences through a TextRank algorithm, and using the string segmentation function split to add an empty character between each keyword and the adjacent words; converting the words in each independent sentence into speech waveforms through a preset word-to-speech system, and converting the empty characters in the independent sentence into silence to form a single voice file packet; and performing voice synthesis on each voice file packet to obtain a voice text corresponding to the electronic text;
the text conversion module is further configured to perform word segmentation and part-of-speech tagging on the independent sentences, retain the nouns, verbs, adjectives and adverbs after tagging, and construct a word network of each independent sentence, wherein the word network is a relation network formed by the interaction between words, and each word in the independent sentence serves as a node in the word network; to iteratively calculate a weight ordering result for each word by using the TextRank algorithm, wherein the TextRank iterative calculation formula is as follows: WS(V_i) = (1 - d) + d × Σ_{V_j ∈ In(V_i)} [ w_ji / Σ_{V_k ∈ Out(V_j)} w_jk ] × WS(V_j), wherein WS(V_i) is the weight of node V_i in the independent sentence, d is the damping coefficient (a preset constant), In(V_i) is the set of nodes pointing to node V_i, w_ji is the weight between node V_i and node V_j, Out(V_j) is the set of nodes pointed to by node V_j, node V_k is a node pointed to by node V_j, w_jk is the weight between node V_k and node V_j, and WS(V_j) is the weight of node V_j in the independent sentence; and to divide the weight of each node by the maximum weight among all the nodes to obtain the normalized weight of each node, and define the words corresponding to the nodes whose normalized weight is greater than a preset weight threshold as keywords.
6. The speech synthesis-based text-to-speech apparatus of claim 5, wherein the text conversion module is further configured to:
prompting the user to submit a voice parameter setting request through a voice interaction system, and automatically setting voice parameters according to the voice parameter setting request submitted by the user.
7. The speech synthesis-based text-to-speech apparatus of claim 5, wherein the text conversion module is further configured to:
after converting the electronic text into a voice text, inquiring through a preset voice interaction system whether the user wants to set voice parameters, and prompting the user to reply yes or no by voice, wherein the voice parameters include broadcasting speed and broadcasting voice;
when the user's voice reply is yes, prompting the user to select a broadcasting level, wherein the broadcasting levels include 0.8-times slow speed, normal speed, 1.5-times fast speed, 2-times fast speed and 3-times fast speed, obtaining the broadcasting level selected by the user, and automatically setting the broadcasting speed according to that level;
prompting the user to select a special voice, wherein the special voices include the system's original voice, trending voices, star voices and sound-effect voices, obtaining the special voice selected by the user, and automatically setting the broadcasting voice accordingly;
and when the user's voice reply is no, defaulting the broadcasting speed to the normal speech speed and the broadcasting voice to the system sound.
8. A computer device comprising a memory and a processor, the memory having stored therein computer readable instructions which, when executed by one or more of the processors, cause the one or more processors to perform the steps of a speech synthesis based text-to-speech method as claimed in any one of claims 1 to 4.
9. A computer readable storage medium readable and writable by a processor, the storage medium storing computer readable instructions which when executed by one or more processors cause the one or more processors to perform the steps of a speech synthesis based text-to-speech method according to any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910298456.3A CN110136688B (en) | 2019-04-15 | 2019-04-15 | Text-to-speech method based on speech synthesis and related equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910298456.3A CN110136688B (en) | 2019-04-15 | 2019-04-15 | Text-to-speech method based on speech synthesis and related equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110136688A CN110136688A (en) | 2019-08-16 |
CN110136688B true CN110136688B (en) | 2023-09-29 |
Family
ID=67569915
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910298456.3A Active CN110136688B (en) | 2019-04-15 | 2019-04-15 | Text-to-speech method based on speech synthesis and related equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110136688B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP7467999B2 (en) * | 2020-03-10 | 2024-04-16 | セイコーエプソン株式会社 | Scan system, program, and method for generating scan data for a scan system |
WO2021217433A1 (en) * | 2020-04-28 | 2021-11-04 | 青岛海信传媒网络技术有限公司 | Content-based voice playback method and display device |
CN111916055A (en) * | 2020-06-20 | 2020-11-10 | 中国建设银行股份有限公司 | Speech synthesis method, platform, server and medium for outbound system |
CN111883100B (en) * | 2020-07-22 | 2021-11-09 | 马上消费金融股份有限公司 | Voice conversion method, device and server |
CN115394282A (en) * | 2022-06-01 | 2022-11-25 | 北京网梯科技发展有限公司 | Information interaction method and device, teaching platform, electronic equipment and storage medium |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101196881A (en) * | 2006-12-08 | 2008-06-11 | 富士通株式会社 | Words symbolization processing method and system for number and special symbol string in text |
CN101859309A (en) * | 2009-04-07 | 2010-10-13 | 慧科讯业有限公司 | System and method for identifying repeated text |
CN102486801A (en) * | 2011-09-06 | 2012-06-06 | 上海博路信息技术有限公司 | Method for obtaining publication contents in voice recognition mode |
CN104166462A (en) * | 2013-05-17 | 2014-11-26 | 北京搜狗科技发展有限公司 | Input method and system for characters |
CN105404903A (en) * | 2014-09-15 | 2016-03-16 | 联想(北京)有限公司 | Information processing method and apparatus, and electronic device |
CN107657947A (en) * | 2017-09-20 | 2018-02-02 | 百度在线网络技术(北京)有限公司 | Method of speech processing and its device based on artificial intelligence |
CN107832457A (en) * | 2017-11-24 | 2018-03-23 | 国网山东省电力公司电力科学研究院 | Power transmission and transforming equipment defect dictionary method for building up and system based on TextRank algorithm |
CN108538286A (en) * | 2017-03-02 | 2018-09-14 | 腾讯科技(深圳)有限公司 | A kind of method and computer of speech recognition |
CN108763500A (en) * | 2018-05-30 | 2018-11-06 | 深圳壹账通智能科技有限公司 | Voice-based Web browser method, device, equipment and storage medium |
CN108962217A (en) * | 2018-07-28 | 2018-12-07 | 华为技术有限公司 | Phoneme synthesizing method and relevant device |
Also Published As
Publication number | Publication date |
---|---|
CN110136688A (en) | 2019-08-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110136688B (en) | Text-to-speech method based on speech synthesis and related equipment | |
CN1256714C (en) | Hierarchichal language models | |
CN110430476B (en) | Live broadcast room searching method, system, computer equipment and storage medium | |
KR20120107933A (en) | Speech translation system, control apparatus and control method | |
WO2022262487A1 (en) | Form generation method, apparatus and device, and medium | |
CN112883731B (en) | Content classification method and device | |
TW499671B (en) | Method and system for providing texts for voice requests | |
JP2012181358A (en) | Text display time determination device, text display system, method, and program | |
JP2003255992A (en) | Interactive system and method for controlling the same | |
KR20220130863A (en) | Apparatus for Providing Multimedia Conversion Content Creation Service Based on Voice-Text Conversion Video Resource Matching | |
CN114550718A (en) | Hot word speech recognition method, device, equipment and computer readable storage medium | |
CN112632950A (en) | PPT generation method, device, equipment and computer-readable storage medium | |
CN110738061A (en) | Ancient poetry generation method, device and equipment and storage medium | |
JP2019220098A (en) | Moving image editing server and program | |
CN115273840A (en) | Voice interaction device and voice interaction method | |
CN117558259B (en) | Digital man broadcasting style control method and device | |
CN116913278B (en) | Voice processing method, device, equipment and storage medium | |
CN113887244A (en) | Text processing method and device | |
CN117786095A (en) | Controllable news manuscript generation method, device and medium based on consistency discrimination | |
KR102462685B1 (en) | Apparatus for assisting webtoon production | |
CN113744369A (en) | Animation generation method, system, medium and electronic terminal | |
KR20220130864A (en) | A system for providing a service that produces voice data into multimedia converted contents | |
KR20210145536A (en) | Apparatus for managing minutes and method thereof | |
KR102435242B1 (en) | An apparatus for providing a producing service of transformed multimedia contents using matching of video resources | |
CN113096633B (en) | Information film generation method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||