CN111199724A - Information processing method and device and computer readable storage medium

Information processing method and device and computer readable storage medium

Info

Publication number
CN111199724A
CN111199724A (application CN201911421920.XA)
Authority
CN
China
Prior art keywords
text
specified
pronunciation
modification
rule
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911421920.XA
Other languages
Chinese (zh)
Inventor
邵皓 (Shao Hao)
雷欣 (Lei Xin)
李志飞 (Li Zhifei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mobvoi Information Technology Co Ltd
Chumen Wenwen Information Technology Co Ltd
Original Assignee
Mobvoi Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mobvoi Information Technology Co Ltd
Priority to CN201911421920.XA
Publication of CN111199724A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 13/10 Prosody rules derived from text; Stress or intonation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers

Abstract

The invention discloses an information processing method, a device, and a computer-readable storage medium, where the method comprises: obtaining a specified pronunciation feature corresponding to a specified text, where the specified pronunciation feature is used to generate specified audio; modifying the specified pronunciation feature according to a modification rule to obtain a target pronunciation feature; and performing speech synthesis processing on the specified text based on the target pronunciation feature to generate target audio corresponding to the specified text. By applying the method provided by the invention, speech synthesis errors of a device can be corrected quickly and in a targeted manner.

Description

Information processing method and device and computer readable storage medium
Technical Field
The present invention relates to the field of speech synthesis technologies, and in particular, to an information processing method, an information processing apparatus, and a computer-readable storage medium.
Background
Speech synthesis, i.e., "text-to-speech" technology, draws on linguistics and psychology and, supported by built-in chips and neural-network models, intelligently converts text into a natural speech stream. However, because of the complexity of natural language, the same word may have different pronunciations and meanings in different contexts, and prosodic pauses also affect the meaning a sentence expresses, so the pronunciation a device computes through speech synthesis cannot be guaranteed to match the intended expression of every sentence in a text.
Disclosure of Invention
Embodiments of the present invention provide an information processing method, an information processing device, and a computer-readable storage medium, which make it possible to correct audio obtained by speech synthesis.
One aspect of the present invention provides an information processing method, including: obtaining a specified pronunciation feature corresponding to a specified text, where the specified pronunciation feature is used to generate specified audio; modifying the specified pronunciation feature according to a modification rule to obtain a target pronunciation feature; and performing speech synthesis processing on the specified text based on the target pronunciation feature to generate target audio corresponding to the specified text.
In one embodiment, the modification rule includes at least one of the following types: a first rule for modifying polyphonic characters, a second rule for modifying prosody, a third rule for modifying numeric symbols, and a fourth rule for modifying pauses.
In one embodiment, modifying the specified pronunciation feature according to the modification rule to obtain the target pronunciation feature includes: obtaining a first modification instruction; executing the first modification instruction to determine the type of the modification rule; generating a modified pronunciation feature set corresponding to the specified text based on the type of the modification rule, where the modified pronunciation feature set includes a plurality of modified pronunciation features; obtaining a second modification instruction; and executing the second modification instruction to determine one of the modified pronunciation features as the target pronunciation feature.
In one embodiment, before obtaining the specified pronunciation feature corresponding to the specified text, the method further includes: obtaining an initial text, where the initial text includes the specified text; judging whether a word-selection gesture is collected; when it is judged that the word-selection gesture is collected, obtaining a first acquisition instruction corresponding to the word-selection gesture; and executing the first acquisition instruction to determine the specified text.
In one embodiment, after determining the specified text, the method further includes: judging whether a button is triggered; when it is determined that the button is triggered, obtaining a playing instruction; executing the playing instruction and judging whether the specified text corresponds to target audio; when it is judged that the specified text corresponds to target audio, playing the target audio corresponding to the specified text; and when it is judged that the specified text does not correspond to target audio, playing the specified audio corresponding to the specified text.
Another aspect of the present invention provides an information processing device, including: an obtaining module, configured to obtain a specified pronunciation feature corresponding to a specified text, where the specified pronunciation feature is used to generate specified audio; a modification module, configured to modify the specified pronunciation feature according to a modification rule to obtain a target pronunciation feature; and a synthesis module, configured to perform speech synthesis processing on the specified text based on the target pronunciation feature and generate target audio corresponding to the specified text.
In one embodiment, the modification rule includes at least one of the following types: a first rule for modifying polyphonic characters, a second rule for modifying prosody, a third rule for modifying numeric symbols, and a fourth rule for modifying pauses.
In one embodiment, the modification module includes: an obtaining submodule, configured to obtain a first modification instruction; an execution submodule, configured to execute the first modification instruction to determine the type of the modification rule; and a generation submodule, configured to generate a modified pronunciation feature set corresponding to the specified text based on the type of the modification rule, where the modified pronunciation feature set includes a plurality of modified pronunciation features. The obtaining submodule is further configured to obtain a second modification instruction, and the execution submodule is further configured to execute the second modification instruction to determine one of the modified pronunciation features as the target pronunciation feature.
In one embodiment, the obtaining module is further configured to obtain an initial text, where the initial text includes the specified text. The device further includes: a collection module, configured to judge whether a word-selection gesture is collected, and further configured to obtain a first acquisition instruction corresponding to the word-selection gesture when it is judged that the gesture is collected; and an execution module, configured to execute the first acquisition instruction to determine the specified text.
In one embodiment, the device further includes a determining module, configured to judge whether a button is triggered. The obtaining module is further configured to obtain a playing instruction when it is determined that the button is triggered; the execution module is further configured to execute the playing instruction and judge whether the specified text corresponds to target audio; and a playing module is configured to play the target audio corresponding to the specified text when it is judged that the specified text corresponds to target audio, and to play the specified audio corresponding to the specified text when it is judged that it does not.
Another aspect of the present invention provides a computer-readable storage medium including a set of computer-executable instructions that, when executed, perform any one of the information processing methods described above.
With the information processing method, device, and computer-readable storage medium provided by these embodiments, the device modifies the specified pronunciation feature according to the modification rule and performs speech synthesis processing on the specified text with the modified target pronunciation feature, so that target audio with correct pronunciation and pauses can be obtained, achieving the purpose of correcting erroneous audio produced during speech synthesis.
Drawings
The above and other objects, features, and advantages of exemplary embodiments of the present invention will become apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which the same or corresponding reference numerals indicate the same or corresponding parts.
FIG. 1 is a schematic flow chart of an information processing method according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of feature modification in an information processing method according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart of determining the specified text in an information processing method according to an embodiment of the present invention;
FIG. 4 is a schematic flow chart of playing audio in an information processing method according to an embodiment of the present invention;
FIG. 5 is a scene diagram of an application scenario of an information processing method according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of the modules of an information processing device according to an embodiment of the present invention.
Detailed Description
To make the objects, features, and advantages of the present invention clearer and easier to understand, the technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Apparently, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art based on the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
Fig. 1 is a schematic flow chart illustrating an implementation of an information processing method according to an embodiment of the present invention.
Referring to fig. 1, in one aspect, an embodiment of the present invention provides an information processing method, the method including: step 101, obtaining a specified pronunciation feature corresponding to a specified text, where the specified pronunciation feature is used to generate specified audio; step 102, modifying the specified pronunciation feature according to a modification rule to obtain a target pronunciation feature; and step 103, performing speech synthesis processing on the specified text based on the target pronunciation feature to generate target audio corresponding to the specified text.
The information processing method provided by this embodiment is applied to an information processing device. When the device converts text into audio through speech synthesis, the complexity of natural language makes the resulting specified audio prone to errors in pronunciation, pauses, and so on, which affects its accuracy. The device therefore modifies the specified pronunciation feature according to a modification rule and then performs speech synthesis processing on the specified text with the modified target pronunciation feature, so that target audio with correct pronunciation and pauses is obtained and erroneous audio produced during speech synthesis is corrected. It should be understood that a specified text may correspond to a plurality of specified pronunciation features, which form a specified pronunciation feature set; some of these features are modified based on the modification rule to obtain target pronunciation features, and the target pronunciation features, together with the unmodified specified pronunciation features, are then used to synthesize the target audio corresponding to the specified text. The information processing device is a device with a speech synthesis function, used to convert text into speech. The specified pronunciation feature and the target pronunciation feature are the feature information used by the specified text during speech synthesis, such as prosodic features and grapheme-to-phoneme features.
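The patent describes this pipeline only abstractly. As a minimal sketch of how the three steps could fit together in code (all names below are hypothetical, and the synthesis back end is stubbed), consider:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class PronFeature:
    token: str   # text unit the feature belongs to, e.g. a character or word
    kind: str    # "phonetic" | "prosody" | "number" | "pause"
    value: str   # e.g. "kan4", or a segmentation with "/" marking pauses

def get_specified_features(text: str) -> list[PronFeature]:
    """Step 101: derive the specified pronunciation features for the text.
    A real device would run TTS front-end analysis; one entry is faked here."""
    return [PronFeature(token="看守", kind="phonetic", value="kan4 shou3")]

def modify(features: list[PronFeature], token: str, new_value: str) -> list[PronFeature]:
    """Step 102: replace one token's feature to obtain the target features,
    leaving the unmodified specified features untouched."""
    return [replace(f, value=new_value) if f.token == token else f for f in features]

def synthesize(text: str, features: list[PronFeature]) -> bytes:
    """Step 103: synthesize target audio from the text and the (partly
    modified) features. Stubbed: a real system would call a TTS back end."""
    return f"{text}|{features}".encode("utf-8")

specified = get_specified_features("看守大门的守卫喜欢看小说")
target = modify(specified, token="看守", new_value="kan1 shou3")
audio = synthesize("看守大门的守卫喜欢看小说", target)
```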
The method includes obtaining a specified pronunciation feature corresponding to a specified text, the specified pronunciation feature being used to generate specified audio. The specified text may be part or all of the content of an initial text. The initial text may be obtained by the device actively collecting a signal that determines it, by receiving such a signal, or by the device determining it at random, and its source may be the Internet, a portable storage medium, or elsewhere. In one case, the initial text comes from the Internet: the device receives website information entered by a user and collects text from the corresponding web page. In another case, the initial text comes from a USB drive: the device connects to the drive and reads the text information stored on it. In yet another case, the device provides a text box in which the user enters content by keyboard, stylus, voice input, text import, or the like, and the device obtains the initial text by collecting the content of the text box. The ways of obtaining the initial text include, but are not limited to, those described above. The specified pronunciation feature is generated by the device and is the feature information used when performing speech synthesis on the specified text, such as its phonetic features and prosody-control features. For example, when the specified text is "将文本转换语音" ("convert the text to speech"), the corresponding prosodic feature may be "将/文本/转换/语音", and the corresponding phonetic feature may be "jiang1, wen2, ben3, zhuan3, huan4, yu3, yin1". It should be understood that the specified text is the text content whose corresponding audio the user needs to modify; before this step, the user has already identified the specified text and the erroneous portion of the specified audio.
The method further includes modifying the specified pronunciation feature according to the modification rule to obtain the target pronunciation feature. When a specified pronunciation feature produced by the device is wrong, the specified audio corresponding to it is wrong as well. The specified pronunciation feature therefore needs to be modified to correct the error; the target pronunciation feature is likewise feature information of the specified text used during speech synthesis. This feature information includes phonetic features, prosody-control features, and other feature types, and the correction needed differs by type, so a corresponding error-correction rule is set for each feature type. In this way, the modification rule matching the actual error in the specified pronunciation feature can be selected, and the correct target pronunciation feature finally determined.
Specifically, the modification rule includes at least one of the following types: a first rule for modifying polyphonic characters, a second rule for modifying prosody, a third rule for modifying numeric symbols, and a fourth rule for modifying pauses.
The first rule is used to modify polyphonic characters in speech synthesis: after a mispronounced polyphonic character in the specified audio corresponding to the specified text is identified, the first rule provides the character's possible pronunciations so that the correct one can be determined. For example, suppose the specified text is "看守大门的守卫喜欢看小说" ("the guard watching the gate likes to read novels"), in which "看" is a polyphonic character. When the specified audio pronounces "看守" as "kan4, shou3", the first rule yields the two readings "kan1" and "kan4" for "看", and "kan1" is determined as the target pronunciation feature according to a subsequent instruction.
The second rule is used to modify prosody in speech synthesis: when the specified audio corresponding to the specified text is found to have wrong prosody, the second rule provides several prosody options so that the correct one can be determined. For example, take the specified text "马无故亡而入胡人皆吊之" (a classical sentence meaning "the horse ran off into the land of the Hu for no reason, and everyone came to console him"), for which speech synthesis may produce a wrongly segmented reading. The second rule yields candidate segmentations such as "马无故亡而入胡/人皆吊之", "马无故亡而入/胡人皆吊之", and "马无故亡而入胡人/皆吊之", and based on a subsequent instruction "马无故亡而入胡/人皆吊之" is determined as the target pronunciation feature. Here "/" indicates a prosodic pause.
The third rule is used to modify the pronunciation of numeric symbols. Depending on context, a number may need to be read digit by digit, read as a whole, or skipped, so audio errors easily occur during speech synthesis. For example, when the specified text is "room number 333", speech synthesis may render "333" as "three hundred and thirty-three". The third rule yields candidate readings of "333" such as "three hundred and thirty-three", "thirty-three three", and "three three three", and "three three three" is determined as the target pronunciation feature according to a subsequent instruction.
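To make the third rule concrete, the following sketch generates candidate readings for a digit string (hypothetical helper names; English renderings stand in for the Chinese ones in the example):

```python
ONES = "zero one two three four five six seven eight nine".split()
TEENS = ("ten eleven twelve thirteen fourteen fifteen "
         "sixteen seventeen eighteen nineteen").split()
TENS = ("", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety")

def below_hundred(n: int) -> str:
    # Read 0..99 as a cardinal number.
    if n < 10:
        return ONES[n]
    if n < 20:
        return TEENS[n - 10]
    tens, ones = divmod(n, 10)
    return TENS[tens] + (f"-{ONES[ones]}" if ones else "")

def below_thousand(n: int) -> str:
    # Read 0..999 as a cardinal number.
    hundreds, rest = divmod(n, 100)
    if hundreds and rest:
        return f"{ONES[hundreds]} hundred and {below_hundred(rest)}"
    if hundreds:
        return f"{ONES[hundreds]} hundred"
    return below_hundred(rest)

def digit_reading_candidates(digits: str) -> list[str]:
    """Third-rule candidates: digit by digit, or as one whole number."""
    readings = [" ".join(ONES[int(d)] for d in digits)]   # "three three three"
    if digits and len(digits) <= 3:
        readings.append(below_thousand(int(digits)))      # "three hundred and thirty-three"
    return readings

print(digit_reading_candidates("333"))
# ['three three three', 'three hundred and thirty-three']
```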
The fourth rule is used to modify pauses; unlike the second rule, these may be pauses between paragraphs. It should be understood that the types of modification rules include, but are not limited to, the above four, and may cover other feature information affecting the target audio; the number of modification rules is likewise not limited to four and may be three, five, six, and so on, which will not be detailed further below.
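One natural way to organize the rule types, offered purely as an illustration and not as the patent's own implementation, is a registry mapping each type to a candidate generator (hypothetical names; digit_reading_candidates is reused from the sketch above):

```python
from typing import Callable

# Minimal polyphone table covering the example above; a real system would
# ship a full pronunciation lexicon.
POLYPHONE_TABLE: dict[str, list[str]] = {"看": ["kan1", "kan4"]}

def prosody_segmentations(span: str) -> list[str]:
    """Naive second-rule candidates: one pause inserted at each internal position."""
    return [span[:i] + "/" + span[i:] for i in range(1, len(span))]

# Each modification-rule type maps to a function that produces the candidate
# ("modified") pronunciation features for a span of the specified text.
RULE_GENERATORS: dict[str, Callable[[str], list[str]]] = {
    "polyphone": lambda span: POLYPHONE_TABLE.get(span, []),           # first rule
    "prosody": prosody_segmentations,                                  # second rule
    "number": digit_reading_candidates,                                # third rule
    "pause": lambda span: ["pause:none", "pause:short", "pause:long"], # fourth rule
}
```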
After the target pronunciation feature is determined, the method further includes performing speech synthesis processing on the specified text with the target pronunciation feature to generate the target audio corresponding to the specified text. Speech synthesis itself belongs to the prior art and is not further described here. It should be understood that when the specified text is only part of the initial text, the method may further include splicing the target audio into the initial audio corresponding to the initial text to obtain modified audio for the whole initial text. The method further includes storing the modified audio in a specified format at a specified location, which may be a removable storage module, the device's own storage module, a cloud storage module, or another module with a storage function; the specific location is determined as needed or preset in advance. The specified format of the modified audio may be CD, WAV, MP3, or another storage format, likewise determined as needed or preset in advance. The target audio, the initial audio, and the specified audio can also each be stored or played separately in the same manner. When the target audio or the specified audio is stored separately, the device may automatically generate a name for the target audio according to the position of the specified text within the initial text.
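The splicing step could look like the following sketch, under the assumption that both clips are WAV files sharing the same audio parameters and that the frame range of the specified text within the initial audio is known:

```python
import wave

def splice_target_audio(initial_wav: str, target_wav: str,
                        start_frame: int, end_frame: int, out_wav: str) -> None:
    """Replace frames [start_frame, end_frame) of the initial audio with the
    re-synthesized target audio and store the modified audio as WAV (the CD
    and MP3 formats mentioned above would need other encoders)."""
    with wave.open(initial_wav, "rb") as src:
        params = src.getparams()
        frames = src.readframes(src.getnframes())
    with wave.open(target_wav, "rb") as tgt:
        target_frames = tgt.readframes(tgt.getnframes())
    frame_size = params.sampwidth * params.nchannels  # bytes per frame
    spliced = (frames[:start_frame * frame_size]
               + target_frames
               + frames[end_frame * frame_size:])
    with wave.open(out_wav, "wb") as out:
        out.setparams(params)       # nframes is corrected on close
        out.writeframes(spliced)
```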
Fig. 2 is a schematic flow chart illustrating implementation of feature modification of an information processing method according to an embodiment of the present invention.
Referring to fig. 2, in the embodiment of the present invention, step 102, modifying the specified pronunciation feature according to the modification rule to obtain the target pronunciation feature, includes: step 1021, obtaining a first modification instruction; step 1022, executing the first modification instruction to determine the type of the modification rule; step 1023, generating a modified pronunciation feature set corresponding to the specified text based on the type of the modification rule, where the modified pronunciation feature set includes a plurality of modified pronunciation features; step 1024, obtaining a second modification instruction; and step 1025, executing the second modification instruction to determine one of the modified pronunciation features as the target pronunciation feature.
When modifying the specified pronunciation feature, the method includes obtaining a first modification instruction. The first modification instruction may come from another device, or the device may collect a user gesture and be controlled through it. In one case, the trigger condition of the first modification instruction is a button press, where the button may be a touch button or a physical button. After the specified text to be modified is determined, the device judges whether the button corresponding to the first modification instruction is triggered, and when it is, the subsequent modification operations are performed on the specified pronunciation feature. Furthermore, since there are several types of modification rules, there are several buttons; in this embodiment each modification rule corresponds to one button, so that when the user triggers the first modification instruction through a button, the corresponding modification rule type is determined. That is, the first modification instruction may modify the specified pronunciation feature based on the first, second, third, or fourth rule. The choice of button determines the specific content of the first modification instruction and hence the modified pronunciation feature set it produces; different rules yield different modified pronunciation feature sets. After the device obtains the first modification instruction by collecting the button action, it executes the instruction and determines the type of the modification rule accordingly. Based on the determined rule type, a modified pronunciation feature set corresponding to the specified text is generated, containing a plurality of modified pronunciation features. For example, when the rule type is the first rule, a table of the possible pronunciations of a polyphonic character in the specified text is generated: when the polyphonic character is "看", the corresponding modified pronunciation feature set contains the two modified pronunciation features "kan1" and "kan4". When the rule type is the second rule, the various segmentations of the specified text are obtained, each segmentation corresponding to one modified pronunciation feature. The third and fourth rules are analogous. It should be added that the number of buttons is determined by the number of modification rule types; it likewise includes, but is not limited to, four, and may be three, five, and so on.
The method further includes obtaining a second modification instruction, which instructs the device to determine one of the modified pronunciation features in the modified pronunciation feature set as the target pronunciation feature. The second modification instruction is obtained in the same way as the first and may likewise be triggered by a third-party device or a button. The displayed modified pronunciation features may also themselves act as buttons: when the user touches one of them, the device determines that the trigger condition is satisfied and issues the second modification instruction.
The method further includes the device executing the second modification instruction to determine one of the modified pronunciation features as the target pronunciation feature. That is, when the modified pronunciation features act directly as buttons, the one the user touches can be determined directly as the target pronunciation feature. It should be added that when the specified text has several specified pronunciation features to be modified, each is modified in turn according to the above steps, yielding a corresponding number of target pronunciation features; the number of target pronunciation features used in the subsequent speech synthesis is therefore not limited.
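Steps 1021 to 1025 can be summarized in one hypothetical function (reusing RULE_GENERATORS from the sketch above; the choose callback stands in for the user's tap, i.e. the second modification instruction):

```python
from typing import Callable

def modify_specified_feature(span: str, rule_type: str,
                             choose: Callable[[list[str]], int]) -> str | None:
    """First modification instruction: rule_type was fixed by the button press.
    The device generates the modified pronunciation feature set, then the
    second modification instruction picks one entry as the target feature."""
    candidates = RULE_GENERATORS[rule_type](span)  # modified pronunciation feature set
    if not candidates:
        return None
    return candidates[choose(candidates)]

# The user taps the first candidate shown on screen:
target = modify_specified_feature("看", "polyphone", choose=lambda c: 0)  # -> "kan1"
```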
Fig. 3 is a schematic flow chart illustrating an implementation process of determining the specified text in an information processing method according to an embodiment of the present invention.
Referring to fig. 3, in the embodiment of the present invention, before step 101 of obtaining the specified pronunciation feature corresponding to the specified text, the method further includes: step 301, obtaining an initial text, where the initial text includes the specified text; step 302, judging whether a word-selection gesture is collected; step 303, when it is judged that the word-selection gesture is collected, obtaining a first acquisition instruction corresponding to the word-selection gesture; and step 304, executing the first acquisition instruction to determine the specified text.
After the initial text is obtained, the specified text needs to be determined, which can be done through a word-selection gesture. It should be understood that after the initial text is obtained, the method further includes generating initial audio corresponding to it; when the device collects an audio playing instruction, it plays the audio of the initial text so that the user can locate the erroneous audio. The audio playing instruction can be triggered through a button provided on the device: when the button is triggered, the device plays the audio of the initial text. It should be added that the audio played in response to this instruction is the most recently modified audio of the initial text; that is, when the button is triggered after the user has obtained target audio through modification, the played audio contains the target audio segment.
The method further includes judging whether a word-selection gesture is collected and, when it is, obtaining the first acquisition instruction corresponding to the gesture. When the device obtains the initial text, it displays it with the text content made selectable; when the word-selection gesture is the user touching a span of text, that span is selected and the specified text is thereby determined.
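Reduced to code, the gesture handling of steps 302 to 304 is essentially mapping the selected span back into the displayed initial text (a sketch; the start and end indices are assumed to come from the display layer):

```python
def on_word_selection_gesture(initial_text: str, start: int, end: int) -> str:
    """Steps 302-304: a word-selection gesture over the displayed text yields
    an acquisition instruction carrying the selected span; executing it
    determines the specified text."""
    return initial_text[start:end]
```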
Fig. 4 is a schematic flow chart illustrating an implementation process of playing audio by an information processing method according to an embodiment of the present invention.
Referring to fig. 4, in the embodiment of the present invention, after step 304 of determining the specified text, the method further includes: step 401, judging whether a button is triggered; step 402, when it is determined that the button is triggered, obtaining a playing instruction; step 403, executing the playing instruction and judging whether the specified text corresponds to target audio; step 404, when it is judged that the specified text corresponds to target audio, playing the target audio corresponding to the specified text; and step 405, when it is judged that the specified text does not correspond to target audio, playing the specified audio corresponding to the specified text.
Furthermore, for ease of operation, since the device determines the specified text through the word-selection gesture in order to modify the specified pronunciation feature, it can also play the audio of the specified text selected by that gesture. The method therefore includes judging whether the button is triggered. It should be understood that this button is distinct from the buttons for other instructions, though it may be triggered in the same way. When it is determined that the button is triggered, a playing instruction is obtained, instructing the device to play the audio corresponding to the specified text. To avoid the device still playing the original specified audio after the target pronunciation feature has been determined, the method further includes judging whether the specified text corresponds to target audio: if it does, the target audio corresponding to the specified text is played; if not, the specified audio corresponding to the specified text is played. In other words, whether the user wants to audition the specified text to confirm that it is wrong, or to confirm that the target audio synthesized from the target pronunciation feature is correct, both can be done through the word-selection gesture. It should be added that, to avoid accidental operation, a confirmation button may be provided on top of the word-selection gesture; that is, between step 402 and step 403 the device checks whether the confirmation button is triggered and executes the playing instruction only when it is. It should also be understood that, with a word-selection gesture, the specified text may be selected or deselected according to the direction of the gesture, and that the button may be displayed at all times or only when the word-selection gesture is collected.
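The decision in steps 403 to 405 is a simple preference for corrected audio when it exists, as in this sketch (hypothetical audio stores keyed by the specified text):

```python
def on_play_instruction(specified_text: str,
                        target_audio: dict[str, bytes],
                        specified_audio: dict[str, bytes]) -> bytes:
    """Steps 403-405: play the corrected target audio if the specified text
    already has one, otherwise fall back to the originally synthesized audio."""
    if specified_text in target_audio:         # steps 403-404
        return target_audio[specified_text]
    return specified_audio[specified_text]     # step 405
```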
Fig. 5 is a scene diagram of an embodiment of an information processing method according to the present invention.
Referring to fig. 5, a specific implementation scenario is described below to facilitate further understanding of the above embodiments.
The scenario includes a device applying the information processing method. The device includes a processing module and a display; the processing module is communicatively connected to the display and is used to control it and to process information. The device interface is shown in fig. 5. A document import button 501 is arranged at the upper right corner of the interface. The device judges whether the document import button 501 is triggered; when it is, the device obtains a document import instruction and imports an initial text, which is displayed in a text box 502 on the display, and the device performs speech synthesis on the initial text to generate the corresponding initial audio. An audio playing button 503 for the initial text is arranged at the lower right corner of the display; the device judges whether this button is triggered and, when it is, plays the audio of the initial text. Based on that audio, the user determines the specified text that needs to be modified in text box 502.
The upper left corner of the display provides a set of trigger buttons 504 for the first modification instruction: a polyphonic-character error correction button 5041 corresponding to the first rule, a prosody error correction button 5042 corresponding to the second rule, a numeric-symbol error correction button 5043 corresponding to the third rule, and a paragraph-pause error correction button 5044 corresponding to the fourth rule.
When the device detects no word-selection gesture on the initial text, it defaults to treating the whole initial text as the specified text. In that case, when the polyphonic-character error correction button 5041 is triggered, the device finds all polyphonic characters in the initial text and displays the pronunciation set 505 for each; the user modifies a polyphonic character by tapping the target pronunciation in the pronunciation set 505, and characters whose target pronunciation is not tapped are left unmodified by default. Similarly, when no word-selection gesture is detected and any of the other buttons is triggered, the initial text is treated as the specified text by default.
When the device collects a word-selection gesture through the display, it determines the specified text from the gesture. After the specified text is determined, a "listen" button 506 for triggering the playing instruction is displayed after it; the user can instruct the device to play the specified text by triggering this button, and can also trigger any of the polyphonic-character, prosody, numeric-symbol, or paragraph-pause error correction buttons to modify the specified pronunciation feature of the specified text. The user then taps any modified pronunciation feature in the modified pronunciation feature set to trigger the second modification instruction, and based on that instruction the tapped modified pronunciation feature is determined as the target pronunciation feature.
The target pronunciation feature may directly replace the specified pronunciation feature, after which the target audio is generated from it. When the user then selects the same specified text again and listens, the device plays the target audio corresponding to it. After the user has finished correcting the initial text, the user clicks the button for generating the complete audio file; the device splices the target audio into the audio corresponding to the initial text, replacing the erroneous segments, and obtains and stores the modified audio file.
Furthermore, when playing audio, the device also lets the user choose the timbre of the synthesized audio by selecting a speaker, and the speaking rate through a speech-rate setting.
Fig. 6 is a schematic diagram of the modules of an information processing device according to an embodiment of the present invention.
Referring to fig. 6, another aspect of the embodiments of the present invention provides an information processing device, including: an obtaining module 601, configured to obtain a specified pronunciation feature corresponding to a specified text, where the specified pronunciation feature is used to generate specified audio; a modification module 602, configured to modify the specified pronunciation feature according to a modification rule to obtain a target pronunciation feature; and a synthesis module 603, configured to perform speech synthesis processing on the specified text based on the target pronunciation feature and generate target audio corresponding to the specified text.
In an embodiment of the invention, the modification rule comprises at least one of the following types: a first rule for modifying polyphonic characters, a second rule for modifying prosody, a third rule for modifying numeric symbols, and a fourth rule for modifying pauses.
In the embodiment of the present invention, the modification module 602 includes: an obtaining submodule 6021, configured to obtain a first modification instruction; an execution submodule 6022, configured to execute the first modification instruction to determine the type of the modification rule; and a generation submodule 6023, configured to generate a modified pronunciation feature set corresponding to the specified text based on the type of the modification rule, where the modified pronunciation feature set includes a plurality of modified pronunciation features. The obtaining submodule 6021 is further configured to obtain a second modification instruction, and the execution submodule 6022 is further configured to execute the second modification instruction to determine one of the modified pronunciation features as the target pronunciation feature.
In the embodiment of the present invention, the obtaining module 601 is further configured to obtain an initial text, where the initial text includes the specified text. The device further includes: a collection module 604, configured to judge whether a word-selection gesture is collected; the obtaining module 601 being further configured to obtain a first acquisition instruction corresponding to the word-selection gesture when it is judged that the gesture is collected; and an execution module 605, configured to execute the first acquisition instruction to determine the specified text.
In the embodiment of the present invention, the device further includes a determining module 606, configured to judge whether a button is triggered. The obtaining module 601 is further configured to obtain a playing instruction when it is determined that the button is triggered; the execution module 605 is further configured to execute the playing instruction and judge whether the specified text corresponds to target audio; and a playing module 607 is configured to play the target audio corresponding to the specified text when it is judged that the specified text corresponds to target audio, and to play the specified audio corresponding to the specified text when it is judged that it does not.
Another aspect of the embodiments of the present invention provides a computer-readable storage medium including a set of computer-executable instructions that, when executed, perform any one of the information processing methods described above.
In the description herein, reference to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. The particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples, and those skilled in the art can combine the different embodiments or examples, and the features thereof, described in this specification, provided they do not contradict one another.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
The above descriptions are merely specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any change or substitution that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. An information processing method, characterized in that the method comprises:
obtaining a specified pronunciation feature corresponding to a specified text, wherein the specified pronunciation feature is used to generate specified audio;
modifying the specified pronunciation feature according to a modification rule to obtain a target pronunciation feature;
and performing speech synthesis processing on the specified text based on the target pronunciation feature to generate target audio corresponding to the specified text.
2. The method of claim 1, wherein the modification rule comprises at least one of the following types: a first rule for modifying polyphonic characters, a second rule for modifying prosody, a third rule for modifying numeric symbols, and a fourth rule for modifying pauses.
3. The method according to claim 2, wherein the modifying the specified pronunciation feature according to the modification rule to obtain the target pronunciation feature comprises:
obtaining a first modification instruction;
executing the first modification instruction to determine the type of the modification rule;
generating a modified pronunciation feature set corresponding to the specified text based on the type of the modification rule, wherein the modified pronunciation feature set comprises a plurality of modified pronunciation features;
obtaining a second modification instruction;
and executing the second modification instruction to determine one of the modified pronunciation features as the target pronunciation feature.
4. The method of claim 1, wherein before obtaining the specified pronunciation feature corresponding to the specified text, the method further comprises:
obtaining an initial text, wherein the initial text comprises the specified text;
judging whether a word-selection gesture is collected;
when it is judged that the word-selection gesture is collected, obtaining a first acquisition instruction corresponding to the word-selection gesture;
and executing the first acquisition instruction to determine the specified text.
5. The method of claim 4, wherein after determining the specified text, the method further comprises:
judging whether a button is triggered;
when the button is determined to be triggered, obtaining a playing instruction;
executing the playing instruction, and judging whether the specified text corresponds to target audio;
when it is judged that the specified text corresponds to target audio, playing the target audio corresponding to the specified text;
and when it is judged that the specified text does not correspond to target audio, playing the specified audio corresponding to the specified text.
6. An information processing device, characterized in that the device comprises:
an obtaining module, configured to obtain a specified pronunciation feature corresponding to a specified text, wherein the specified pronunciation feature is used to generate specified audio;
a modification module, configured to modify the specified pronunciation feature according to a modification rule to obtain a target pronunciation feature;
and a synthesis module, configured to perform speech synthesis processing on the specified text based on the target pronunciation feature and generate target audio corresponding to the specified text.
7. The device of claim 6, wherein the modification rule comprises at least one of the following types: a first rule for modifying polyphonic characters, a second rule for modifying prosody, a third rule for modifying numeric symbols, and a fourth rule for modifying pauses.
8. The device of claim 7, wherein the modification module comprises:
an obtaining submodule, configured to obtain a first modification instruction;
an execution submodule, configured to execute the first modification instruction to determine the type of the modification rule;
a generation submodule, configured to generate a modified pronunciation feature set corresponding to the specified text based on the type of the modification rule, wherein the modified pronunciation feature set comprises a plurality of modified pronunciation features;
the obtaining submodule being further configured to obtain a second modification instruction;
and the execution submodule being further configured to execute the second modification instruction to determine one of the modified pronunciation features as the target pronunciation feature.
9. The device of claim 6, wherein:
the obtaining module is further configured to obtain an initial text, wherein the initial text comprises the specified text;
and the device further comprises:
a collection module, configured to judge whether a word-selection gesture is collected;
the collection module being further configured to obtain a first acquisition instruction corresponding to the word-selection gesture when it is judged that the word-selection gesture is collected;
and an execution module, configured to execute the first acquisition instruction to determine the specified text.
10. A computer-readable storage medium comprising a set of computer-executable instructions which, when executed, perform the information processing method of any one of claims 1 to 5.
CN201911421920.XA 2019-12-31 2019-12-31 Information processing method and device and computer readable storage medium Pending CN111199724A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911421920.XA CN111199724A (en) 2019-12-31 2019-12-31 Information processing method and device and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911421920.XA CN111199724A (en) 2019-12-31 2019-12-31 Information processing method and device and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN111199724A (en) 2020-05-26

Family

ID=70746373

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911421920.XA Pending CN111199724A (en) 2019-12-31 2019-12-31 Information processing method and device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111199724A (en)


Citations (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020178007A1 (en) * 2001-02-26 2002-11-28 Benjamin Slotznick Method of displaying web pages to enable user access to text information that the user has difficulty reading
JP2003337592A (en) * 2002-05-21 2003-11-28 Toshiba Corp Method and equipment for synthesizing voice, and program for synthesizing voice
US20050187772A1 (en) * 2004-02-25 2005-08-25 Fuji Xerox Co., Ltd. Systems and methods for synthesizing speech using discourse function level prosodic features
US20050222907A1 (en) * 2004-04-01 2005-10-06 Pupo Anthony J Method to promote branded products and/or services
CN1731509A (en) * 2005-09-02 2006-02-08 清华大学 Mobile speech synthesis method
CN1783212A (en) * 2004-10-29 2006-06-07 微软公司 System and method for converting text to speech
US20080167875A1 (en) * 2007-01-09 2008-07-10 International Business Machines Corporation System for tuning synthesized speech
US20080177994A1 (en) * 2003-01-12 2008-07-24 Yaron Mayer System and method for improving the efficiency, comfort, and/or reliability in Operating Systems, such as for example Windows
CN101263491A (en) * 2005-07-29 2008-09-10 诺基亚公司 Conversion of number into text and speech
JP2008256942A (en) * 2007-04-04 2008-10-23 Toshiba Corp Data comparison apparatus of speech synthesis database and data comparison method of speech synthesis database
JP2010282612A (en) * 2009-06-05 2010-12-16 Voiceware Co Ltd Web reader system using tts server and method thereof
CN103366731A (en) * 2012-03-31 2013-10-23 盛乐信息技术(上海)有限公司 Text to speech (TTS) method and system
JP2015118222A (en) * 2013-12-18 2015-06-25 株式会社日立超エル・エス・アイ・システムズ Voice synthesis system and voice synthesis method
CN104882139A (en) * 2015-05-28 2015-09-02 百度在线网络技术(北京)有限公司 Voice synthesis method and device
CN109255113A (en) * 2018-09-04 2019-01-22 郑州信大壹密科技有限公司 Intelligent critique system
CN110264992A (en) * 2019-06-11 2019-09-20 百度在线网络技术(北京)有限公司 Speech synthesis processing method, device, equipment and storage medium
CN110399315A (en) * 2019-06-05 2019-11-01 北京梧桐车联科技有限责任公司 A kind of processing method of voice broadcast, device, terminal device and storage medium
CN110767209A (en) * 2019-10-31 2020-02-07 标贝(北京)科技有限公司 Speech synthesis method, apparatus, system and storage medium


Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112786008A (en) * 2021-01-20 2021-05-11 北京有竹居网络技术有限公司 Speech synthesis method, device, readable medium and electronic equipment
CN112786007A (en) * 2021-01-20 2021-05-11 北京有竹居网络技术有限公司 Speech synthesis method, device, readable medium and electronic equipment
WO2022156544A1 (en) * 2021-01-20 2022-07-28 北京有竹居网络技术有限公司 Speech synthesis method and apparatus, and readable medium and electronic device
WO2022156464A1 (en) * 2021-01-20 2022-07-28 北京有竹居网络技术有限公司 Speech synthesis method and apparatus, readable medium, and electronic device
CN112786007B (en) * 2021-01-20 2024-01-26 北京有竹居网络技术有限公司 Speech synthesis method and device, readable medium and electronic equipment
CN112786008B (en) * 2021-01-20 2024-04-12 北京有竹居网络技术有限公司 Speech synthesis method and device, readable medium and electronic equipment
WO2023065641A1 (en) * 2021-10-22 2023-04-27 平安科技(深圳)有限公司 Speech synthesis method and system based on text editor, and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200526)