CN110136688B - Text-to-speech method based on speech synthesis and related equipment - Google Patents
- Publication number
- CN110136688B (application CN201910298456.3A)
- Authority
- CN
- China
- Prior art keywords
- voice
- user
- text
- speech
- broadcasting
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/027—Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L13/10—Prosody rules derived from text; Stress or intonation
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N1/00—Scanning, transmission or reproduction of documents or the like, e.g. facsimile transmission; Details thereof
- H04N1/00127—Connection or combination of a still picture apparatus with another apparatus, e.g. for storage, processing or transmission of still picture signals or of information associated with a still picture
- H04N1/00204—Connection or combination of a still picture apparatus with another apparatus, e.g. for storage, processing or transmission of still picture signals or of information associated with a still picture with a digital computer or a digital computer system, e.g. an internet server
- H04N1/00209—Transmitting or receiving image data, e.g. facsimile data, via a computer, e.g. using e-mail, a computer network, the internet, I-fax
- H04N1/00222—Transmitting or receiving image data, e.g. facsimile data, via a computer, e.g. using e-mail, a computer network, the internet, I-fax details of image data generation or reproduction, e.g. scan-to-email or network printing
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N1/00—Scanning, transmission or reproduction of documents or the like, e.g. facsimile transmission; Details thereof
- H04N1/04—Scanning arrangements, i.e. arrangements for the displacement of active reading or reproducing elements relative to the original or reproducing medium, or vice versa
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L2013/083—Special characters, e.g. punctuation marks
Abstract
The application relates to voice synthesis technology in the field of voice semantics, and in particular to a text-to-voice method based on voice synthesis and related equipment. The method comprises the following steps: receiving a scanning request from a user, and scanning the characters to be identified selected by the user into an electronic text; converting the electronic text into a voice text through a text-to-voice system, and prompting the user by voice that the conversion succeeded; and acquiring a voice broadcasting request from the user, and broadcasting the voice text by voice. By iteratively calculating the weight of each word with the TextRank algorithm, the method allows the whole text to be quickly converted into voice.
Description
Technical Field
The application relates to the field of voice semantics, and in particular to a method for converting text into voice based on voice synthesis and related equipment.
Background
With the arrival of the big-data age, knowing what is happening around the world at any time and in any place has become an important part of people's daily lives. Big data brings huge amounts of information that change everyone's work and life, yet not everyone can keep track of the information they care about at any time and place; among those who cannot, the blind, children and the elderly account for a large proportion.
Existing reading-assistance equipment can read aloud a great deal of information, but it is concentrated on a limited range of matched reading material, is expensive, and reads only targeted information, so the content obtained is limited and cannot satisfy the listening needs of the blind, children and the elderly. At present, ordinary printed matter still dominates the market, and most blind people, children and the elderly depend on others as their source of information and cannot read independently, which greatly inconveniences improvement of their work and lives.
At present, the process of converting text into voice suffers from the problem that a whole text cannot be quickly converted into voice, so converting text materials such as books into voice consumes a great deal of time, and the efficiency of text-to-voice conversion is low.
Disclosure of Invention
Based on this, it is necessary to provide a text-to-speech method based on speech synthesis and related equipment for the problem that the whole text cannot be quickly converted into speech in the text-to-speech process.
A text-to-speech method based on speech synthesis includes:
receiving a scanning request of a user, calling a character scanning system, and scanning characters to be identified selected by the user into electronic texts;
reading punctuation marks in the electronic text through a regular expression, defining texts between two adjacent punctuation marks as single independent sentences, and segmenting the electronic text into a plurality of independent sentences;
determining keywords in the plurality of independent sentences respectively through a TextRank algorithm, and inserting a blank character between each keyword and the adjacent words using the string-splitting function split;
converting words in the independent sentences into speech waveforms through a preset word-to-speech system, and converting empty characters in the independent sentences into silence to form a single voice file packet;
and performing voice synthesis on each voice file packet to obtain a voice text corresponding to the electronic text.
In one possible design, the determining, by the TextRank algorithm, keywords in the plurality of independent sentences respectively includes:
performing word segmentation on the independent sentence and tagging the part of speech of each word; after tagging, retaining nouns, verbs, adjectives and adverbs, and constructing a word network of the independent sentence, wherein the word network is a relation network formed by the interactions between the words, and each word in the independent sentence serves as a node in the word network;
and iteratively calculating a weight ordering result of each word by using the TextRank algorithm, wherein the TextRank iterative calculation formula is:

WS(V_i) = (1 − d) + d × Σ_{V_j ∈ In(V_i)} [ w_{ji} / Σ_{V_k ∈ Out(V_j)} w_{jk} ] × WS(V_j)

wherein WS(V_i) is the weight of node V_i in the independent sentence; d is the damping coefficient, a preset constant; In(V_i) is the set of nodes pointing to node V_i; w_{ji} is the edge weight between node V_j and node V_i; Out(V_j) is the set of nodes pointed to by node V_j; node V_k is a node pointed to by node V_j; w_{jk} is the edge weight between node V_k and node V_j; and WS(V_j) is the weight of node V_j in the independent sentence;
dividing the weights of all nodes by the maximum weight among them to obtain the normalized weight of each node, and defining the words corresponding to nodes whose normalized weight is greater than a preset weight threshold as keywords.
In one possible design, after the voice synthesis is performed on each voice file package to obtain the voice text corresponding to the electronic text, the method further includes:
prompting the user to submit a voice parameter setting request through a voice interaction system, and automatically setting voice parameters according to the voice parameter setting request submitted by the user.
In one possible design, prompting the user through the voice interaction system to submit a voice parameter setting request, and automatically setting the voice parameters according to the request submitted by the user, includes:
after converting the electronic text into a voice text, inquiring through a preset voice interaction system whether the user wishes to set voice parameters, and prompting the user by voice to reply "yes" or "no", wherein the voice parameters include the broadcasting speed and the broadcasting voice;
when the user's spoken reply is "yes", prompting the user to select a broadcasting grade, wherein the broadcasting grades comprise 0.8× slow speed, normal speed, 1.5× fast speed, 2× fast speed and 3× fast speed; obtaining the broadcasting grade selected by the user, and automatically setting the broadcasting speed according to that grade;
then prompting the user to select a special voice, wherein the special voices comprise the original system voice, trending voices, celebrity voices and sound-effect voices; obtaining the special voice selected by the user, and automatically setting the broadcasting voice accordingly;
and when the user's spoken reply is "no", defaulting the broadcasting speed to normal speed and the broadcasting voice to the system voice.
In one possible design, the obtaining the voice broadcast request of the user, broadcasting the voice text through voice includes:
inquiring through the voice interaction system whether the user wishes to broadcast the voice text, and prompting the user by voice to reply "yes" or "no";
when the user's spoken reply is "yes", obtaining the voice parameters preset in the voice interaction system, and broadcasting the voice text through the voice interaction system according to those parameters;
and when the user's spoken reply is "no", prompting the user that the voice will not be broadcast for now and that the text can be rescanned if necessary.
Based on the same technical conception, the application also provides a text-to-speech device based on speech synthesis, which comprises:
the character scanning module is used for receiving a scanning request of a user, calling a character scanning system and scanning characters to be identified selected by the user into electronic texts;
the text conversion module is used for reading the punctuation marks in the electronic text through a regular expression, defining the text between two adjacent punctuation marks as a single independent sentence, and segmenting the electronic text into a plurality of independent sentences; determining keywords in the plurality of independent sentences respectively through a TextRank algorithm, and inserting a blank character between each keyword and the adjacent words using the string-splitting function split; converting the words in the independent sentences into speech waveforms through a preset text-to-speech system, and converting the blank characters in the independent sentences into silence to form single voice file packets; and performing voice synthesis on each voice file packet to obtain a voice text corresponding to the electronic text.
In one possible design, the text conversion module is further configured to:
prompting the user to submit a voice parameter setting request through a voice interaction system, and automatically setting voice parameters according to the voice parameter setting request submitted by the user.
In one possible design, the text conversion module is further configured to:
after converting the electronic text into a voice text, inquiring through a preset voice interaction system whether the user wishes to set voice parameters, and prompting the user by voice to reply "yes" or "no", wherein the voice parameters include the broadcasting speed and the broadcasting voice;
when the user's spoken reply is "yes", prompting the user to select a broadcasting grade, wherein the broadcasting grades comprise 0.8× slow speed, normal speed, 1.5× fast speed, 2× fast speed and 3× fast speed; obtaining the broadcasting grade selected by the user, and automatically setting the broadcasting speed according to that grade;
then prompting the user to select a special voice, wherein the special voices comprise the original system voice, trending voices, celebrity voices and sound-effect voices; obtaining the special voice selected by the user, and automatically setting the broadcasting voice accordingly;
and when the user's spoken reply is "no", defaulting the broadcasting speed to normal speed and the broadcasting voice to the system voice.
Based on the same conception, the application proposes a computer device comprising a memory and one or more processors, the memory storing computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to perform the steps of the above text-to-speech method based on speech synthesis.
Based on the same conception, the present application proposes a storage medium storing computer readable instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of the above-described text-to-speech method based on speech synthesis.
According to the text-to-speech method based on speech synthesis and the related equipment, a character scanning system is called upon receiving a scanning request from the user, and the characters to be identified selected by the user are scanned into an electronic text; the electronic text is converted into a voice text through a text-to-voice system, and the user is reminded by voice that the conversion succeeded; a voice broadcasting request from the user is then acquired, and the voice text is broadcast by voice. By combining character conversion and voice synthesis, the application converts text into voice and provides great help for the reading of the blind, children and the elderly; at the same time, keywords are accurately identified based on the TextRank algorithm, further improving the accuracy of the voice synthesis.
Drawings
FIG. 1 is a flow chart of a text-to-speech method based on speech synthesis according to an embodiment of the present application;
FIG. 2 is a diagram of a word network in accordance with one embodiment of the present application;
FIG. 3 is a flowchart of the voice broadcast of step S3 according to an embodiment of the present application;
fig. 4 is a schematic diagram of a text-to-speech apparatus based on speech synthesis according to an embodiment of the application.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of stated features, integers, steps, operations, elements and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof.
Fig. 1 is a flowchart of a text-to-speech method based on speech synthesis according to an embodiment of the present application, as shown in fig. 1, a text-to-speech method based on speech synthesis, comprising the steps of:
step S1, character scanning: and receiving a scanning request of a user, calling a character scanning system, and scanning characters to be identified selected by the user into electronic texts.
This step converts paper characters into a computer-readable electronic text through character scanning, effectively bridging paper text and computer text and improving the efficiency of converting characters into voice.
In one embodiment, step S1 includes the following specific steps:
step S101, scanning imaging: and receiving a scanning request of a user, calling a preset character scanning system, and scanning characters to be identified selected by the user into images through the character scanning system.
Various paper texts are read quickly through the preset character scanning system and serve as the basis of the text conversion; by scanning the paper text into an image, the switch between paper text and computer text is achieved quickly. In particular, scanning applications such as CamScanner, cloud-note tools or WPS text recognition may be selected.
For example: while reading in a library, user A finds a poetry collection of great literary value, but the collection is no longer in print and cannot be purchased. User A can photograph the content of interest in the collection with a mobile phone camera and save it as pictures, saving a great deal of transcription time and effort.
Step S102, text conversion: recognizing characters in the image through image-to-character software preset in the character scanning system, and forming the characters into corresponding electronic texts.
The scanned image is quickly subjected to character recognition: optical character recognition (OCR) technology in the image-to-character software converts the characters in the image into text format for further editing and processing. The recognition rate is high and the speed is fast, laying a good foundation for converting the text into voice.
For example: the user A wants to convert the shot picture into a text collection as a local electronic document, at the moment, the characters in the stored picture can be identified through OCR technology, and the identified characters can be copied and used in various computer texts.
This embodiment refines the character recognition technique: advanced OCR technology quickly ingests and processes paper characters to obtain an electronic text, reducing the error rate and labor cost of manual input, facilitating management of the electronic text, and enabling its rapid transmission and storage.
Step S2, independent sentence segmentation: and reading punctuation marks in the electronic text through a regular expression, defining texts between two adjacent punctuation marks as single independent sentences, and segmenting the electronic text into a plurality of independent sentences.
A regular expression is a logical formula for operating on character strings: a "regular string" is formed from preset specific characters and combinations thereof, and expresses filtering logic to be applied to the strings.
For example: in step S2, suppose the poetry collection identified by user A is the "Ten Admonishments" by Tsangyang Gyatso, and user A wants to share its graceful text with blind user B, so the recognized electronic text needs to be converted into voice. One section of the electronic text reads "The first, it is best not to meet, so as not to fall in love. The second, it is best not to know each other, so as not to miss each other. The third, it is best not to accompany each other, so as not to owe each other." The punctuation marks "，" and "。" in the electronic text can be identified by regular expressions, and the piece of electronic text is split into six independent sentences in the style of "The first, it is best not to meet".
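The sentence segmentation of step S2 can be illustrated with a minimal Python sketch. Python and the exact delimiter set are assumptions for illustration only; the patent does not specify an implementation language.

```python
import re

# Assumed set of sentence-ending punctuation marks (Chinese and Western).
PUNCT = re.compile(r"[，。！？；,.!?;]")

def split_sentences(text):
    """Treat the text between two adjacent punctuation marks as one
    independent sentence; drop empty fragments."""
    return [s.strip() for s in PUNCT.split(text) if s.strip()]

sentences = split_sentences(
    "第一最好不相见，如此便可不相恋。第二最好不相知，如此便可不相思。"
)
# Each comma- or period-delimited clause becomes one independent sentence.
```

On the two-sentence fragment above this yields four independent sentences, mirroring the six-sentence split of the full three-sentence example in the description.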
Step S3, keyword determination: determining keywords in the plurality of independent sentences respectively through a TextRank algorithm, and inserting a blank character between each keyword and the adjacent words using the string-splitting function split.
By identifying the keywords and adding blank characters between the keywords and the other words, this step creates short pauses during voice broadcasting, which on the one hand highlights the keywords and on the other adjusts the cadence of the broadcast, making it closer to natural human speech.
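The blank-character insertion around keywords can be sketched as follows. The helper name is hypothetical, and the sketch assumes word segmentation and keyword determination have already been performed; the claim's split-based mechanics are not specified in detail.

```python
def mark_keywords(words, keywords):
    """Insert a blank character (here a single space, later rendered as
    silence) before and after each keyword in the word sequence."""
    out = []
    for w in words:
        if w in keywords:
            out.append(" ")   # pause before the keyword
            out.append(w)
            out.append(" ")   # pause after the keyword
        else:
            out.append(w)
    # Collapse doubled blanks produced by adjacent keywords.
    return "".join(out).replace("  ", " ").strip()

line = mark_keywords(["第一", "最好", "不相见"], {"最好"})
```

The resulting string carries one blank character on each side of the keyword "最好", exactly the two pause positions discussed in the example of step S4 below.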
In one embodiment, step S3 may include the following steps: performing word segmentation on the independent sentence and tagging the part of speech of each word; after tagging, retaining nouns, verbs, adjectives and adverbs, and constructing a word network of the independent sentence, wherein the word network is a relation network formed by the interactions between the words, and each word in the independent sentence serves as a node in the word network; and iteratively calculating a weight ordering result of each word by using the TextRank algorithm, wherein the TextRank iterative calculation formula is:

WS(V_i) = (1 − d) + d × Σ_{V_j ∈ In(V_i)} [ w_{ji} / Σ_{V_k ∈ Out(V_j)} w_{jk} ] × WS(V_j)

wherein WS(V_i) is the weight of node V_i in the independent sentence; d is the damping coefficient, a preset constant; In(V_i) is the set of nodes pointing to node V_i; w_{ji} is the edge weight between node V_j and node V_i; Out(V_j) is the set of nodes pointed to by node V_j; node V_k is a node pointed to by node V_j; w_{jk} is the edge weight between node V_k and node V_j; and WS(V_j) is the weight of node V_j in the independent sentence. The weights of all nodes are then divided by the maximum weight among them to obtain the normalized weight of each node, and the words corresponding to nodes whose normalized weight is greater than a preset weight threshold are defined as keywords.
The damping coefficient in this step is generally preset to 0.85. By constructing the word network of each independent sentence, the weight ordering result of each word is iterated with the TextRank algorithm. For example, fig. 2 is a schematic diagram of the word network of "The first, it is best not to meet" in step S3; after the keywords in the independent sentence are identified from the relationships between the words in the word network, blank characters are added to the independent sentence to create the feel of human speech.
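The weight iteration can be sketched in Python on a toy word network. Unit edge weights, an undirected graph and the 0.9 normalization threshold are simplifying assumptions for illustration; d = 0.85 as stated in the description.

```python
def textrank(graph, d=0.85, iterations=50):
    """Iterate WS(Vi) = (1-d) + d * sum over neighbours Vj of
    (w_ji / sum_k w_jk) * WS(Vj); with unit edge weights the inner
    ratio reduces to 1 / degree(Vj)."""
    ws = {v: 1.0 for v in graph}
    for _ in range(iterations):
        ws = {
            v: (1 - d) + d * sum(ws[j] / len(graph[j]) for j in graph[v])
            for v in graph
        }
    return ws

# Toy word network for one independent sentence: "最好" interacts
# with both of the other words.
graph = {"第一": ["最好"], "最好": ["第一", "不相见"], "不相见": ["最好"]}
ws = textrank(graph)

# Normalize by the maximum weight and keep words above the threshold.
max_w = max(ws.values())
keywords = {v for v, w in ws.items() if w / max_w > 0.9}
```

The central word "最好" accumulates the largest weight and survives normalization and thresholding as the sentence's keyword, matching the example in the description.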
Step S4, voice file package formation: and converting the words in the independent sentences into speech waveforms through a preset word-to-speech system, and converting the empty characters in the independent sentences into silence to form a single voice file packet.
In the step, independent sentences are converted into speech waveforms with silence, so that a voice file packet for generating voice texts is formed, and a foundation is provided for voice generation.
For example: in step S3, "the first", "it is best" and "not to meet" form speech waveforms, and the keyword is "it is best", so blank characters are generated between "the first" and "it is best" and between "it is best" and "not to meet". These two blank characters are converted into silence, and the whole independent sentence forms a speech waveform with two short pauses; blind user B perceives a sense of rhythm while listening, and the voice broadcasting process is more natural.
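The conversion of blank characters into silence in step S4 can be sketched in pure Python, treating each word's audio as a list of samples and a pause as a run of zeros. The sample rate, pause length and dummy word audio below are assumed values, not taken from the patent.

```python
SAMPLE_RATE = 16000        # assumed sample rate in Hz
PAUSE_SECONDS = 0.2        # assumed length of one short pause

def synthesize_sentence(word_waveforms):
    """Concatenate word waveforms into one voice file packet; a None
    entry stands for a blank character and is rendered as a run of
    zero samples (silence)."""
    silence = [0.0] * int(SAMPLE_RATE * PAUSE_SECONDS)
    samples = []
    for w in word_waveforms:
        samples.extend(silence if w is None else w)
    return samples

# "word | pause | keyword | pause | word" with dummy word audio:
packet = synthesize_sentence(
    [[0.1] * 100, None, [0.2] * 100, None, [0.3] * 100]
)
```

The packet is one continuous sample stream containing two silent gaps, which is what gives the broadcast its two short pauses around the keyword.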
Step S5, generating the voice text: performing voice synthesis on each voice file packet to obtain a voice text corresponding to the electronic text.
This step uses voice synthesis to form, for a given section of electronic text, a complete voice text for voice broadcasting, avoiding the choppiness caused by overly short speech segments during broadcasting.
For example: the "first" in step S2 is preferably invisible, so that it is not loved. The second is preferably unknown, and thus can be ignored. The third is preferably not accompanied, and thus may not be sufficient. "will form phonetic text comprising six independent sentences.
This embodiment expands the process of converting an electronic text into a voice text and provides detailed technical support; at the same time, the weight of each word is calculated iteratively with the TextRank algorithm, making the calculation more accurate while quickly converting the whole text into voice.
In one embodiment, after step S5, the method further comprises:
prompting the user to submit a voice parameter setting request through a voice interaction system, and automatically setting voice parameters according to the voice parameter setting request submitted by the user.
This step satisfies different users' voice requirements through personalized settings, adding functions for personalizing the broadcasting speed and characteristic voices and improving the user experience.
Optionally, after the electronic text is converted into a voice text, the preset voice interaction system inquires whether the user wishes to set voice parameters and prompts the user by voice to reply "yes" or "no", the voice parameters including the broadcasting speed and the broadcasting voice. When the user's spoken reply is "yes", the user is prompted to select a broadcasting grade from 0.8× slow speed, normal speed, 1.5× fast speed, 2× fast speed and 3× fast speed; the selected grade is obtained and the broadcasting speed is set automatically according to it. The user is then prompted to select a special voice from the original system voice, trending voices, celebrity voices and sound-effect voices; the selected special voice is obtained and the broadcasting voice is set automatically according to it. When the user's spoken reply is "no", the broadcasting speed defaults to normal speed and the broadcasting voice defaults to the system voice.
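The parameter-setting dialogue reduces to a mapping from broadcasting grade to playback rate plus defaults for the "no" reply. A minimal sketch; the function name, grade labels and the wants_custom flag are hypothetical, while the rates and defaults come from the description.

```python
# Broadcasting grades named in the description, mapped to playback rates.
SPEED_LEVELS = {
    "slow": 0.8, "normal": 1.0, "fast": 1.5, "faster": 2.0, "fastest": 3.0,
}

def set_voice_parameters(wants_custom, level="normal", special_voice="system"):
    """Return (playback_rate, voice). When the user replies 'no' to the
    parameter-setting prompt, the defaults apply: normal speed and the
    system voice."""
    if not wants_custom:
        return 1.0, "system"
    return SPEED_LEVELS[level], special_voice

rate, voice = set_voice_parameters(True, "slow", "star")
```

Here a "yes" reply with the 0.8× grade and a celebrity voice yields the configuration used in the blind user B example below; a "no" reply falls back to the defaults.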
For example: the blind person B can set the broadcasting speed and characteristic voice of the voice text of the 'poetry with a caveat', and can select the voice of the news simulcast host at 0.8 times of the slow speed on the premise that the blind person B is a faithful listener of the news simulcast and also likes the broadcasting effect of a little slower.
The application adds voice parameter settings to meet the needs of different users. On the one hand, the broadcasting-speed setting satisfies the speed requirements of different groups in different environments; on the other hand, characteristic voices cater to the preferences of people of different ages, professions and backgrounds, improving user experience and customer satisfaction, enlarging market share and seizing market resources.
In one embodiment, after the step S5, the method further includes a step S6 of voice broadcasting, as shown in fig. 3, including the following specific steps:
step S601, a voice broadcast request: inquiring whether the user broadcasts the voice text or not through a voice interaction system, and prompting whether the user replies yes or no through voice.
Adding a voice inquiry function through the voice interaction system meets the needs of the blind, children and the elderly; it is especially helpful, and of great significance, for people who have difficulty reading.
Step S602, voice broadcast feedback: and when the voice replied by the user is obtained to be yes, obtaining voice parameters preset in a voice interaction system, and broadcasting the voice text through the voice interaction system according to the voice parameters.
In this step, voice broadcasting is triggered by a spoken reply: when the user cannot issue commands through other input means, broadcasting can be started directly by voice, providing a convenient mode of operation and improving broadcasting efficiency.
Step S603, voice pause feedback: and when the voice replied by the user is obtained to be negative, prompting the user to temporarily not broadcast the voice, and rescanning the text if necessary.
In order to prevent misoperation of a user, the voice broadcasting can be directly paused, character recognition can be carried out again under the condition of pausing by mouth error, and the response is more intelligent and convenient.
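As an illustrative sketch only, steps S601 to S603 above can be outlined as a small dialogue loop; the `listen`, `speak`, `broadcast` and `rescan` callables are assumed interfaces standing in for the voice interaction system, not details taken from this application:

```python
# Hypothetical sketch of steps S601-S603; all four callables are assumed
# interfaces standing in for the voice interaction system.
def broadcast_flow(listen, speak, broadcast, rescan, voice_text, params):
    # S601: ask the user whether to broadcast, expecting a yes/no voice reply
    speak("Broadcast the voice text? Please reply yes or no.")
    if listen() == "yes":
        # S602: broadcast with the preset voice parameters
        broadcast(voice_text, params)
        return "broadcast"
    # S603: pause, then optionally rescan the text
    speak("The voice text will not be broadcast for now.")
    return "rescan" if rescan() else "paused"
```

A caller would wire these callables to the actual microphone, speaker and scanner interfaces of the device.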
This embodiment distinguishes the different broadcasting demands of users, meets the requirements of various broadcasting environments, improves user experience, and makes the text-to-voice process more humanized.
Based on text scanning and speech synthesis, the embodiments of the application combine currently mature technologies to convert paper text into voice text, which greatly helps the blind, children and the elderly to read. The TextRank algorithm is adopted to accurately identify keywords, so that electronic text is converted into voice text with high accuracy, raising speech synthesis technology to a new level.
In one embodiment, a text-to-speech device based on speech synthesis is provided, as shown in fig. 4, comprising:
the character scanning module is used for receiving a scanning request of a user, calling a character scanning system, and scanning the characters to be recognized selected by the user into an electronic text;
the text conversion module is used for reading the punctuation marks in the electronic text through a regular expression, defining the text between two adjacent punctuation marks as a single independent sentence, and thereby segmenting the electronic text into a plurality of independent sentences; determining the keywords in the independent sentences through a TextRank algorithm, and using the string segmentation function split to add an empty character between each keyword and the adjacent words; converting the words in each independent sentence into speech waveforms through a preset word-to-speech system, and converting the empty characters in the independent sentence into silence to form a single voice file packet; and performing voice synthesis on each voice file packet to obtain a voice text corresponding to the electronic text.
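The splitting and empty-character steps performed by the text conversion module can be sketched as follows; the punctuation set, the choice of a full-width space as the "empty character", and the function names are illustrative assumptions of this sketch rather than details from the application:

```python
import re

# Illustrative punctuation set; a real system would also cover
# full-width Chinese punctuation marks.
PUNCT = re.compile(r"[,.;:!?]")
EMPTY_CHAR = "\u3000"  # assumed stand-in for the 'empty character'

def split_sentences(electronic_text):
    # Text between two adjacent punctuation marks is one independent sentence.
    return [p.strip() for p in PUNCT.split(electronic_text) if p.strip()]

def mark_keywords(sentence, keywords):
    # Append the empty character after each keyword; a later stage turns
    # that character into silence when synthesizing speech.
    return " ".join(w + EMPTY_CHAR if w in keywords else w
                    for w in sentence.split())
```

The word-to-speech stage would then map ordinary words to waveforms and each `EMPTY_CHAR` to a short silence before the per-sentence packets are synthesized.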
In one embodiment, the text conversion module is further configured to:
prompting the user to submit a voice parameter setting request through a voice interaction system, and automatically setting voice parameters according to the voice parameter setting request submitted by the user.
In one embodiment, the text conversion module is further configured to:
after converting the electronic text into a voice text, inquiring through a preset voice interaction system whether the user wants to set voice parameters, and prompting the user to reply yes or no by voice, wherein the voice parameters include broadcasting speed and broadcasting voice;
when the user's voice reply is yes, prompting the user to select a broadcasting level, wherein the broadcasting levels include 0.8-times slow speed, normal speed, 1.5-times fast speed, 2-times fast speed and 3-times fast speed, obtaining the broadcasting level selected by the user, and automatically setting the broadcasting speed according to that level;
prompting the user to select a special voice, wherein the special voices include the system's original voice, trending voices, star voices and sound-effect voices, obtaining the special voice selected by the user, and automatically setting the broadcasting voice accordingly;
and when the user's voice reply is no, defaulting the broadcasting speed to the normal speech speed and the broadcasting voice to the system sound.
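For illustration only, the broadcasting levels and defaults described in this embodiment could be modelled as below; the English level names and the `VoiceParams` structure are assumptions of this sketch, not part of the application:

```python
from dataclasses import dataclass

# Broadcasting levels named in the embodiment, mapped to speed multipliers.
SPEED_LEVELS = {
    "0.8x slow": 0.8,
    "normal": 1.0,
    "1.5x fast": 1.5,
    "2x fast": 2.0,
    "3x fast": 3.0,
}

@dataclass
class VoiceParams:
    speed: float = 1.0                 # default: normal speech speed
    voice: str = "system original"     # default: system sound

def set_voice_params(user_said_yes, level=None, special_voice=None):
    # When the user replies no, fall back to the defaults.
    if not user_said_yes:
        return VoiceParams()
    return VoiceParams(speed=SPEED_LEVELS[level], voice=special_voice)
```

The returned parameters would then be handed to the broadcasting step described later.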
In one embodiment, a computer device is provided, including a memory and a processor, where the memory stores computer readable instructions that, when executed by the processor, cause the processor to perform steps in a text-to-speech method based on speech synthesis in the above embodiments.
In one embodiment, a storage medium is provided, storing computer readable instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of the speech-synthesis-based text-to-speech method in the above embodiments.
Those skilled in the art will appreciate that all or part of the above-described methods may be implemented by a computer program stored in a computer-readable storage medium; when executed, the program may include the steps of the method embodiments described above. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, a read-only memory (ROM), or a random access memory (RAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, any combination that contains no contradiction should be considered within the scope of this description.
The foregoing examples illustrate only a few embodiments of the application and are described in relative detail, but they should not be construed as limiting the scope of the application. It should be noted that those skilled in the art can make several variations and modifications without departing from the spirit of the application, all of which fall within its scope of protection. Accordingly, the scope of protection of the present application is to be determined by the appended claims.
Claims (9)
1. A text-to-speech method based on speech synthesis, comprising:
receiving a scanning request of a user, calling a character scanning system, and scanning the characters to be recognized selected by the user into an electronic text;
reading the punctuation marks in the electronic text through a regular expression, defining the text between two adjacent punctuation marks as a single independent sentence, and thereby segmenting the electronic text into a plurality of independent sentences;
determining keywords in the independent sentences respectively through a TextRank algorithm, and using the string segmentation function split to add an empty character between each keyword and the adjacent words;
converting the words in each independent sentence into speech waveforms through a preset word-to-speech system, and converting the empty characters in the independent sentence into silence to form a single voice file packet;
performing voice synthesis on each voice file packet to obtain a voice text corresponding to the electronic text;
the determining the keywords in the independent sentences through the TextRank algorithm comprises the following steps:
performing word segmentation and part-of-speech tagging on the independent sentences, retaining the nouns, verbs, adjectives and adverbs after tagging, and constructing a word network of each independent sentence, wherein the word network is a relation network formed by the interaction between words, and each word in the independent sentence serves as a node in the word network;
and iteratively calculating a weight ordering result of each word by using the TextRank algorithm, wherein the TextRank iterative calculation formula is as follows:
WS(V_i) = (1 - d) + d × Σ_{V_j ∈ In(V_i)} [ w_ji / Σ_{V_k ∈ Out(V_j)} w_jk ] × WS(V_j)
wherein WS(V_i) is the weight of node V_i in the independent sentence, d is the damping coefficient (a preset constant), In(V_i) is the set of nodes pointing to node V_i, w_ji is the weight between node V_i and node V_j, Out(V_j) is the set of nodes pointed to by node V_j, node V_k is a node pointed to by node V_j, w_jk is the weight between node V_k and node V_j, and WS(V_j) is the weight of node V_j in the independent sentence;
dividing the weight of each node by the maximum weight among all the nodes to obtain the normalized weight of each node, and defining the words corresponding to the nodes whose normalized weight is greater than a preset weight threshold as keywords.
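A minimal sketch of the iterative weighting in claim 1 follows; the undirected co-occurrence graph, damping value and fixed iteration count are standard TextRank assumptions rather than details taken from this application:

```python
def textrank_weights(edges, d=0.85, iterations=50):
    # edges: {(word_i, word_j): weight} over an undirected word network,
    # so In(V) and Out(V) coincide and the out-weight sums are symmetric.
    nodes = {n for pair in edges for n in pair}
    out_sum = {n: 0.0 for n in nodes}
    for (i, j), w in edges.items():
        out_sum[i] += w
        out_sum[j] += w
    ws = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        new = {}
        for i in nodes:
            s = 0.0
            for (a, b), w in edges.items():
                if i in (a, b):
                    j = b if a == i else a
                    s += w / out_sum[j] * ws[j]  # w_ji / sum_k w_jk * WS(V_j)
            new[i] = (1 - d) + d * s
        ws = new
    top = max(ws.values())
    # Normalize by the maximum weight, as in the claim.
    return {n: v / top for n, v in ws.items()}
```

Words whose normalized weight exceeds the preset threshold would then be kept as keywords.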
2. The method for converting text to speech based on speech synthesis according to claim 1, wherein after speech synthesis is performed on each speech file packet to obtain a speech text corresponding to the electronic text, the method further comprises:
prompting the user to submit a voice parameter setting request through a voice interaction system, and automatically setting voice parameters according to the voice parameter setting request submitted by the user.
3. The method for converting text to speech based on speech synthesis according to claim 2, wherein said prompting said user to submit a speech parameter setting request via a speech interaction system, automatically setting speech parameters in accordance with said user-submitted speech parameter setting request, comprises:
after converting the electronic text into a voice text, inquiring through a preset voice interaction system whether the user wants to set voice parameters, and prompting the user to reply yes or no by voice, wherein the voice parameters include broadcasting speed and broadcasting voice;
when the user's voice reply is yes, prompting the user to select a broadcasting level, wherein the broadcasting levels include 0.8-times slow speed, normal speed, 1.5-times fast speed, 2-times fast speed and 3-times fast speed, obtaining the broadcasting level selected by the user, and automatically setting the broadcasting speed according to that level;
prompting the user to select a special voice, wherein the special voices include the system's original voice, trending voices, star voices and sound-effect voices, obtaining the special voice selected by the user, and automatically setting the broadcasting voice accordingly;
and when the user's voice reply is no, defaulting the broadcasting speed to the normal speech speed and the broadcasting voice to the system sound.
4. The method for converting text to speech based on speech synthesis according to claim 1, wherein after speech synthesis is performed on each speech file packet to obtain a speech text corresponding to the electronic text, the method further comprises a step of speech broadcasting, and specifically comprises:
inquiring through a voice interaction system whether the user wants the voice text broadcast, and prompting the user to reply yes or no by voice;
when the user's voice reply is yes, obtaining the voice parameters preset in the voice interaction system, and broadcasting the voice text through the voice interaction system according to the voice parameters;
and when the user's voice reply is no, prompting the user that the voice will not be broadcast for now, and that the text can be rescanned if necessary.
5. A text-to-speech apparatus based on speech synthesis, comprising:
the character scanning module is used for receiving a scanning request of a user, calling a character scanning system, and scanning the characters to be recognized selected by the user into an electronic text;
the text conversion module is used for reading the punctuation marks in the electronic text through a regular expression, defining the text between two adjacent punctuation marks as a single independent sentence, and thereby segmenting the electronic text into a plurality of independent sentences; determining the keywords in the independent sentences through a TextRank algorithm, and using the string segmentation function split to add an empty character between each keyword and the adjacent words; converting the words in each independent sentence into speech waveforms through a preset word-to-speech system, and converting the empty characters in the independent sentence into silence to form a single voice file packet; and performing voice synthesis on each voice file packet to obtain a voice text corresponding to the electronic text;
the text conversion module is further configured to perform word segmentation and part-of-speech tagging on the independent sentences, retain the nouns, verbs, adjectives and adverbs after tagging, and construct a word network of each independent sentence, wherein the word network is a relation network formed by the interaction between words, and each word in the independent sentence serves as a node in the word network; to iteratively calculate a weight ordering result for each word by using the TextRank algorithm, wherein the TextRank iterative calculation formula is as follows: WS(V_i) = (1 - d) + d × Σ_{V_j ∈ In(V_i)} [ w_ji / Σ_{V_k ∈ Out(V_j)} w_jk ] × WS(V_j), wherein WS(V_i) is the weight of node V_i in the independent sentence, d is the damping coefficient (a preset constant), In(V_i) is the set of nodes pointing to node V_i, w_ji is the weight between node V_i and node V_j, Out(V_j) is the set of nodes pointed to by node V_j, node V_k is a node pointed to by node V_j, w_jk is the weight between node V_k and node V_j, and WS(V_j) is the weight of node V_j in the independent sentence; and to divide the weight of each node by the maximum weight among all the nodes to obtain the normalized weight of each node, and define the words corresponding to the nodes whose normalized weight is greater than a preset weight threshold as keywords.
6. The speech synthesis-based text-to-speech apparatus of claim 5, wherein the text conversion module is further configured to:
prompting the user to submit a voice parameter setting request through a voice interaction system, and automatically setting voice parameters according to the voice parameter setting request submitted by the user.
7. The speech synthesis-based text-to-speech apparatus of claim 5, wherein the text conversion module is further configured to:
after converting the electronic text into a voice text, inquiring through a preset voice interaction system whether the user wants to set voice parameters, and prompting the user to reply yes or no by voice, wherein the voice parameters include broadcasting speed and broadcasting voice;
when the user's voice reply is yes, prompting the user to select a broadcasting level, wherein the broadcasting levels include 0.8-times slow speed, normal speed, 1.5-times fast speed, 2-times fast speed and 3-times fast speed, obtaining the broadcasting level selected by the user, and automatically setting the broadcasting speed according to that level;
prompting the user to select a special voice, wherein the special voices include the system's original voice, trending voices, star voices and sound-effect voices, obtaining the special voice selected by the user, and automatically setting the broadcasting voice accordingly;
and when the user's voice reply is no, defaulting the broadcasting speed to the normal speech speed and the broadcasting voice to the system sound.
8. A computer device comprising a memory and a processor, the memory having stored therein computer readable instructions which, when executed by one or more of the processors, cause the one or more processors to perform the steps of a speech synthesis based text-to-speech method as claimed in any one of claims 1 to 4.
9. A computer readable storage medium readable and writable by a processor, the storage medium storing computer readable instructions which when executed by one or more processors cause the one or more processors to perform the steps of a speech synthesis based text-to-speech method according to any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910298456.3A CN110136688B (en) | 2019-04-15 | 2019-04-15 | Text-to-speech method based on speech synthesis and related equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910298456.3A CN110136688B (en) | 2019-04-15 | 2019-04-15 | Text-to-speech method based on speech synthesis and related equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110136688A CN110136688A (en) | 2019-08-16 |
CN110136688B true CN110136688B (en) | 2023-09-29 |
Family
ID=67569915
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910298456.3A Active CN110136688B (en) | 2019-04-15 | 2019-04-15 | Text-to-speech method based on speech synthesis and related equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110136688B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP7467999B2 (en) * | 2020-03-10 | 2024-04-16 | セイコーエプソン株式会社 | Scan system, program, and method for generating scan data for a scan system |
WO2021217433A1 (en) * | 2020-04-28 | 2021-11-04 | 青岛海信传媒网络技术有限公司 | Content-based voice playback method and display device |
CN111916055A (en) * | 2020-06-20 | 2020-11-10 | 中国建设银行股份有限公司 | Speech synthesis method, platform, server and medium for outbound system |
CN111883100B (en) * | 2020-07-22 | 2021-11-09 | 马上消费金融股份有限公司 | Voice conversion method, device and server |
CN115394282A (en) * | 2022-06-01 | 2022-11-25 | 北京网梯科技发展有限公司 | Information interaction method and device, teaching platform, electronic equipment and storage medium |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101196881A (en) * | 2006-12-08 | 2008-06-11 | 富士通株式会社 | Words symbolization processing method and system for number and special symbol string in text |
CN101859309A (en) * | 2009-04-07 | 2010-10-13 | 慧科讯业有限公司 | System and method for identifying repeated text |
CN102486801A (en) * | 2011-09-06 | 2012-06-06 | 上海博路信息技术有限公司 | Method for obtaining publication contents in voice recognition mode |
CN104166462A (en) * | 2013-05-17 | 2014-11-26 | 北京搜狗科技发展有限公司 | Input method and system for characters |
CN105404903A (en) * | 2014-09-15 | 2016-03-16 | 联想(北京)有限公司 | Information processing method and apparatus, and electronic device |
CN107657947A (en) * | 2017-09-20 | 2018-02-02 | 百度在线网络技术(北京)有限公司 | Method of speech processing and its device based on artificial intelligence |
CN107832457A (en) * | 2017-11-24 | 2018-03-23 | 国网山东省电力公司电力科学研究院 | Power transmission and transforming equipment defect dictionary method for building up and system based on TextRank algorithm |
CN108538286A (en) * | 2017-03-02 | 2018-09-14 | 腾讯科技(深圳)有限公司 | A kind of method and computer of speech recognition |
CN108763500A (en) * | 2018-05-30 | 2018-11-06 | 深圳壹账通智能科技有限公司 | Voice-based Web browser method, device, equipment and storage medium |
CN108962217A (en) * | 2018-07-28 | 2018-12-07 | 华为技术有限公司 | Phoneme synthesizing method and relevant device |
Also Published As
Publication number | Publication date |
---|---|
CN110136688A (en) | 2019-08-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110136688B (en) | Text-to-speech method based on speech synthesis and related equipment | |
CN1256714C (en) | Hierarchichal language models | |
CN110430476B (en) | Live broadcast room searching method, system, computer equipment and storage medium | |
KR20120107933A (en) | Speech translation system, control apparatus and control method | |
WO2022262487A1 (en) | Form generation method, apparatus and device, and medium | |
CN112883731B (en) | Content classification method and device | |
TW499671B (en) | Method and system for providing texts for voice requests | |
JP2012181358A (en) | Text display time determination device, text display system, method, and program | |
JP2003255992A (en) | Interactive system and method for controlling the same | |
KR20220130863A (en) | Apparatus for Providing Multimedia Conversion Content Creation Service Based on Voice-Text Conversion Video Resource Matching | |
CN114550718A (en) | Hot word speech recognition method, device, equipment and computer readable storage medium | |
CN112632950A (en) | PPT generation method, device, equipment and computer-readable storage medium | |
CN110738061A (en) | Ancient poetry generation method, device and equipment and storage medium | |
JP2019220098A (en) | Moving image editing server and program | |
CN115273840A (en) | Voice interaction device and voice interaction method | |
CN117558259B (en) | Digital man broadcasting style control method and device | |
CN116913278B (en) | Voice processing method, device, equipment and storage medium | |
CN113887244A (en) | Text processing method and device | |
CN117786095A (en) | Controllable news manuscript generation method, device and medium based on consistency discrimination | |
KR102462685B1 (en) | Apparatus for assisting webtoon production | |
CN113744369A (en) | Animation generation method, system, medium and electronic terminal | |
KR20220130864A (en) | A system for providing a service that produces voice data into multimedia converted contents | |
KR20210145536A (en) | Apparatus for managing minutes and method thereof | |
KR102435242B1 (en) | An apparatus for providing a producing service of transformed multimedia contents using matching of video resources | |
CN113096633B (en) | Information film generation method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||