CN116153289A - Processing method and related device for speech synthesis marked text - Google Patents

Processing method and related device for speech synthesis marked text

Info

Publication number
CN116153289A
Authority
CN
China
Prior art keywords
text
mark
voice synthesis
speech synthesis
markup
Prior art date
Legal status
Pending
Application number
CN202211279014.2A
Other languages
Chinese (zh)
Inventor
叶启松
郭剑霓
吴海英
郭江
刘磊
Current Assignee
Mashang Consumer Finance Co Ltd
Original Assignee
Mashang Consumer Finance Co Ltd
Priority date
Filing date
Publication date
Application filed by Mashang Consumer Finance Co Ltd filed Critical Mashang Consumer Finance Co Ltd
Priority to CN202211279014.2A priority Critical patent/CN116153289A/en
Publication of CN116153289A publication Critical patent/CN116153289A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G06F40/117 Tagging; Marking up; Designating a block; Setting of attributes


Abstract

The disclosure provides a processing method and a related device for speech synthesis markup text. The method includes the following steps: determining a first speech synthesis mark corresponding to a speech synthesis mark addition request; determining a second speech synthesis mark corresponding to the first speech synthesis mark; adding the second speech synthesis mark to the original text to obtain a second speech synthesis markup text containing the second speech synthesis mark, where the second speech synthesis markup text is displayed in a text markup interface; and converting the second speech synthesis mark contained in the second speech synthesis markup text into the corresponding first speech synthesis mark to obtain a first speech synthesis markup text, which is provided to a speech synthesis device for speech synthesis processing. The method improves the readability of the marked original text and reduces the error rate.

Description

Processing method and related device for speech synthesis marked text
Technical Field
The present disclosure relates to the field of language processing technologies, and in particular, to a method and an apparatus for processing a speech synthesis markup text.
Background
Chinese is a tonal language with very complex prosodic characteristics: the prosodic parameters of the same syllable differ in different contexts. As a result, speech synthesized from plain text may be less than ideal; it can sound insufficiently human or natural, sometimes harsh, and errors can occur, such as mispronounced polyphonic characters, misread brand names, and indistinguishable readings of numbers.
To address these problems, the related art marks the reading of each word with speech synthesis markup text, i.e., markup text that assists speech synthesis. The speech synthesis markup text can annotate the pronunciation, pauses, breaks, intonation, and similar attributes of words in the text to be synthesized. When speech synthesis is performed based on such markup text, pronunciations can be fine-tuned so that they sound more natural, common mispronunciations can be corrected, features such as breaks and pauses can be added, and the speech can be sped up, slowed down, or adjusted in pitch.
However, in conventional marking tools the speech synthesis marks are mixed into the marked original text, and the browser may be unable to parse and render these marks within the original text. The readability of the marked original text is therefore poor, and the editing process is error-prone.
Disclosure of Invention
The disclosure provides a processing method, a processing apparatus, and an electronic device for speech synthesis markup text, which are used to improve the readability of the marked original text and to reduce the error rate.
In a first aspect, the present disclosure provides a method for processing speech synthesis markup text, including the steps of:
determining, in response to a speech synthesis mark addition request triggered for an original text in a text markup interface, a first speech synthesis mark corresponding to the original text;
determining a second speech synthesis mark corresponding to the first speech synthesis mark according to a preset mark mapping relation;
adding the second speech synthesis mark to the original text to obtain a second speech synthesis markup text containing the second speech synthesis mark, where the second speech synthesis markup text is displayed in the text markup interface; and
converting the second speech synthesis mark contained in the second speech synthesis markup text into the corresponding first speech synthesis mark to obtain a first speech synthesis markup text, where the first speech synthesis markup text is provided to a speech synthesis device for speech synthesis processing; the first speech synthesis mark is implemented based on a first markup language and the second speech synthesis mark is implemented based on a second markup language; the first markup language is a language that the browser does not support rendering, and the second markup language is a language that the browser supports rendering.
In a second aspect, the present disclosure provides a processing apparatus for speech synthesis markup text, including:
a determining module adapted to determine, in response to a speech synthesis mark addition request triggered for an original text in a text markup interface, a first speech synthesis mark corresponding to the original text;
the determining module being further adapted to determine a second speech synthesis mark corresponding to the first speech synthesis mark according to a preset mark mapping relation;
an adding module adapted to add the second speech synthesis mark to the original text to obtain a second speech synthesis markup text containing the second speech synthesis mark, where the second speech synthesis markup text is displayed in the text markup interface; and
a conversion module adapted to convert the second speech synthesis mark contained in the second speech synthesis markup text into the corresponding first speech synthesis mark to obtain a first speech synthesis markup text, where the first speech synthesis markup text is provided to a speech synthesis device for speech synthesis processing; the first speech synthesis mark is implemented based on a first markup language and the second speech synthesis mark is implemented based on a second markup language; the first markup language is a language that the browser does not support rendering, and the second markup language is a language that the browser supports rendering.
In a third aspect, the present disclosure provides an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores one or more computer programs executable by the at least one processor to enable the at least one processor to perform the above-described method.
In a fourth aspect, the present disclosure provides a computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor/processing core implements the above-described method.
In the embodiments provided by the disclosure, a mapping relation between a first speech synthesis mark and a second speech synthesis mark is preset, where the first speech synthesis mark is implemented in a first markup language that the browser does not support rendering, and the second speech synthesis mark is implemented in a second markup language that the browser does support rendering. According to the embodiments of the present application, two speech synthesis markup texts can be obtained: a first speech synthesis markup text composed of the original text and the first speech synthesis marks, and a second speech synthesis markup text composed of the original text and the second speech synthesis marks. Since the first speech synthesis mark cannot be rendered by the browser, the first speech synthesis markup text is not displayed to the user; conversely, the second speech synthesis mark can be rendered by the browser, so the second speech synthesis markup text can be displayed in the text markup interface for the user to view. The first speech synthesis markup text is provided to the speech synthesis device so that the speech synthesis device performs speech synthesis based on it.
In this way, the markup text displayed in the text markup interface and the markup text provided to the speech synthesis device for speech synthesis can be of different types, which improves the readability of the original text and reduces the editing error rate.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The accompanying drawings are included to provide a further understanding of the disclosure, and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure, without limitation to the disclosure. The above and other features and advantages will become more readily apparent to those skilled in the art by describing in detail exemplary embodiments with reference to the attached drawings, in which:
FIG. 1 is a flow chart of a method for processing speech synthesis markup text provided in one embodiment of the present disclosure;
FIG. 2 is a flow chart of a method for processing speech synthesis markup text according to another embodiment of the present disclosure;
FIG. 3 shows a schematic diagram of a text markup interface;
FIG. 4 illustrates a flow chart of a method of processing speech synthesis markup text provided by a specific example;
FIG. 5 illustrates a flow diagram of one specific application scenario of an example;
FIG. 6 is a block diagram of a processing device for speech synthesis markup text provided by an embodiment of the present disclosure;
Fig. 7 is a block diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
For a better understanding of the technical solutions of the present disclosure, exemplary embodiments of the present disclosure will be described below with reference to the accompanying drawings, in which various details of the embodiments of the present disclosure are included to facilitate understanding, and they should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Embodiments of the disclosure and features of embodiments may be combined with each other without conflict.
As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used herein, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. The terms "connected" or "connected," and the like, are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and the present disclosure, and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The processing method of the speech synthesis markup text according to the embodiments of the present disclosure may be performed by an electronic device such as a terminal device or a server, where the terminal device may be a vehicle-mounted device, a user equipment (UE), a mobile device, a user terminal, a cellular phone, a cordless phone, a personal digital assistant (PDA), a handheld device, a computing device, a wearable device, or the like; the server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud computing services. The method may in particular be implemented by a processor calling a computer program stored in a memory.
In the related art, speech synthesis markup text is used to mark the reading of each character, so that pronunciations can be fine-tuned to sound more natural, common mispronunciations can be corrected, features such as breaks and pauses can be added, and the speech can be sped up, slowed down, or adjusted in pitch. However, in conventional marking tools the speech synthesis marks are mixed into the marked text, and the marks themselves are not readable, so the text and the marks become confused, the readability of the text is poor, and marking errors occur easily. To solve this problem, the present application proposes a processing method for speech synthesis markup text that converts a first speech synthesis mark for the original text into a second speech synthesis mark, so that the second speech synthesis mark is displayed in the text markup interface while the first speech synthesis mark is provided to a speech synthesis device for speech synthesis processing. In this way, the marks in the text markup interface and the marks provided to the speech synthesis device can be of different types, which improves the readability of the text and reduces the error rate.
Fig. 1 is a flowchart of a processing method of a speech synthesis markup text according to an embodiment of the present disclosure. Referring to fig. 1, the method includes:
Step S110: in response to a speech synthesis mark addition request triggered for an original text in a text markup interface, a first speech synthesis mark corresponding to the original text is determined.
The text markup interface provides a speech synthesis marking function for an original text. The original text is the text to be marked that is displayed in the text markup interface; it may be input by a user. The speech synthesis mark addition request may be triggered through a marking tool entry contained in the text markup interface, which may take the form of a button, a shortcut, a hot zone, or the like.
Optionally, the speech synthesis mark addition request includes a first mark type of the first speech synthesis mark, which indicates what type of mark is to be added to the original text, for example a pause type, a number reading type, or a polyphone reading type, determined by the type of marking tool entry used to trigger the request. That is, there are various types of marking tools for triggering the speech synthesis mark addition request, and when a marking tool of any type is triggered, the mark type corresponding to that tool type is added to the original text.
In this embodiment, the first speech synthesis mark is implemented based on a first markup language, which is a language that the browser does not support rendering; that is, the first speech synthesis mark cannot be rendered and displayed by the browser, and displaying it directly may produce garbled content.
Step S120: and determining a second voice synthesis mark corresponding to the first voice synthesis mark according to a preset mark mapping relation.
Specifically, a mark mapping relation between first speech synthesis marks and second speech synthesis marks is preset. The second speech synthesis mark and the first speech synthesis mark are implemented in different markup languages. The second speech synthesis mark is a mark that is convenient to present in the text markup interface: it is implemented based on a second markup language that the browser supports rendering, so the second speech synthesis mark can be rendered by the browser and displayed without garbling. One possible form of this mapping is sketched below.
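The following is a minimal TypeScript sketch of one possible mark mapping relation. The interface and constant names are illustrative assumptions; only the number-to-phone and multiple-pinyin correspondences are taken from the examples given later in this disclosure.

// Minimal sketch of a preset mark mapping relation (names are assumptions).
interface MarkMapping {
  ssmlTag: string;                    // first markup language (SSML) tag
  ssmlAttrs: Record<string, string>;  // fixed SSML attributes, if any
  htmlClass: string;                  // second markup language (HTML) span class
}

const MARK_MAP: Record<string, MarkMapping> = {
  telephoneNumber: {
    ssmlTag: "say-as",
    ssmlAttrs: { "interpret-as": "telephone" },
    htmlClass: "number-to-phone",
  },
  polyphone: {
    ssmlTag: "phoneme",
    ssmlAttrs: { alphabet: "py" }, // the pinyin itself travels as per-mark data
    htmlClass: "multiple-pinyin",
  },
};

// Step S120: look up the second (HTML) mark type for a first (SSML) mark type.
function toSecondMarkType(firstMarkType: string): string | undefined {
  return MARK_MAP[firstMarkType]?.htmlClass;
}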
Step S130: adding a second voice synthesis mark to the original text to obtain a second voice synthesis mark text containing the second voice synthesis mark; the second voice synthesis mark text is used for being displayed in a text mark interface.
Since the second speech synthesis mark is convenient to present in the text markup interface, the second speech synthesis markup text obtained by adding it to the original text is easy to display. For example, second speech synthesis marks of different types can be displayed in different styles in the text markup interface, which avoids confusion with the original text and improves readability.
Step S140: converting the second voice synthesis mark contained in the second voice synthesis mark text into a corresponding first voice synthesis mark to obtain a first voice synthesis mark text; the first voice synthesis mark text is used for being provided for a voice synthesis device to carry out voice synthesis processing.
Because the speech synthesis device mainly supports the first speech synthesis mark, the second speech synthesis marks contained in the second speech synthesis markup text are further converted into the corresponding first speech synthesis marks to obtain the first speech synthesis markup text, which facilitates processing by the speech synthesis device. Through this conversion between the two kinds of marks, the readability of the second speech synthesis markup text displayed in the text markup interface is ensured on the one hand; on the other hand, the processing logic of the speech synthesis device is unchanged, as it still processes the first speech synthesis markup text.
It can be seen that, in this embodiment, the mapping relation between the first speech synthesis mark and the second speech synthesis mark is preset, so that the first speech synthesis mark can be converted into the second speech synthesis mark for display in the text markup interface, while the first speech synthesis mark is provided to the speech synthesis device for speech synthesis processing. In this way, the marks in the text markup interface and the marks provided to the speech synthesis device can be of different types, which improves the readability of the marked original text and reduces the error rate.
Fig. 2 is a flowchart of a processing method of a speech synthesis markup text according to another embodiment of the present disclosure. Referring to fig. 2, the method includes:
Step S210: in response to a text selection operation triggered by the user, the target annotation text corresponding to the text selection operation and contained in the original text in the text markup interface is determined.
The original text to be marked is displayed in the text markup interface and may consist of one or more sentences. When the original text is marked, specific content such as words or numbers needs to be annotated. Accordingly, the user selects specific words, numbers, or other content in the original text as the target annotation text through a text selection operation. The target annotation text is therefore the text content in the original text to which a speech synthesis mark is to be added. For example, FIG. 3 shows a schematic diagram of a text markup interface in which the original text contains "119" twice: the first occurrence is a telephone number, so the first "119" is selected as the target annotation text and a speech synthesis mark is added for it.
Step S220: in response to a voice synthesis tag addition request triggered for a target markup text contained in an original text in a text markup interface, a first voice synthesis tag corresponding to the target markup text is determined.
The speech synthesis mark addition request can be triggered through a marking tool entry contained in the text markup interface. As shown in FIG. 3, the marking tool entry may be a marking tool button. The text markup interface contains several types of marking tool buttons, and different buttons correspond to different types of first speech synthesis marks. For example, FIG. 3 includes a pause type button, a substitution type button, a polyphone type button, and a number reading type button. The number reading button further comprises several sub-buttons, such as integer, independent number, telephone number, address reading, punctuation reading, year-month-day reading, date abbreviation, time reading, match score, fraction reading, and mark removal.
It can be seen that the speech synthesis mark addition request includes the first mark type of the first speech synthesis mark, which indicates what type of mark is to be added to the original text, for example a pause type, a number reading type, or a polyphone reading type, determined by the type of marking tool entry used to trigger the request.
In an alternative implementation, to avoid user misoperation and improve marking accuracy, the following operations are performed after the first speech synthesis mark corresponding to the target annotation text is determined: determining the text type of the target annotation text and the mark type of the first speech synthesis mark; checking whether the text type matches the mark type (i.e., the first mark type); and, if not, generating prompt information to prompt the user to trigger the speech synthesis mark addition request again. Correspondences between text types and first speech synthesis marks are preset. For example, the text type of the target annotation text corresponding to a polyphone-type first speech synthesis mark can only be Chinese characters, not digits or English; likewise, the text type corresponding to a number-reading-type first speech synthesis mark can only be digits, not Chinese characters or English. By presetting the correspondence between text types and first mark types, the validity of a speech synthesis mark addition request can be verified after the user triggers it, and the subsequent steps are executed only if the verification passes. If the verification fails, the user is prompted to re-trigger the request until the re-triggered request conforms to the verification rule. A minimal sketch of this check is shown below.
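The following TypeScript sketch assumes simple regular-expression detection of the text type; the type names, rules, and prompt text are illustrative assumptions, not taken from the disclosure.

// Sketch of the validity check (detection rules and names are assumptions).
type TextType = "chinese" | "digits" | "latin" | "mixed";

function detectTextType(text: string): TextType {
  if (/^[0-9\-()]+$/.test(text)) return "digits";
  if (/^[\u4e00-\u9fff]+$/.test(text)) return "chinese";
  if (/^[A-Za-z]+$/.test(text)) return "latin";
  return "mixed";
}

// Preset correspondence between first mark types and permitted text types.
const ALLOWED_TEXT_TYPES: Record<string, TextType[]> = {
  polyphone: ["chinese"],      // polyphone marks: Chinese characters only
  telephoneNumber: ["digits"], // number-reading marks: digits only
};

function checkMarkRequest(target: string, firstMarkType: string): boolean {
  const ok =
    ALLOWED_TEXT_TYPES[firstMarkType]?.includes(detectTextType(target)) ?? false;
  if (!ok) {
    // Generate prompt information so the user re-triggers the add request.
    console.warn(`The "${firstMarkType}" mark does not support the selected text.`);
  }
  return ok;
}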
In yet another alternative implementation, after the target annotation text corresponding to the text selection operation is determined, the following operations are performed: judging whether the target annotation text already contains a second speech synthesis mark; if so, displaying in the text markup interface a modification entry corresponding to that second speech synthesis mark, where the modification entry is used to modify the second speech synthesis mark, and placing the marking tool buttons contained in the text markup interface in an unavailable state; if not, determining the marking tool buttons that match the text type of the target annotation text and setting them to an available state.
If the target annotation text already contains a second speech synthesis mark, the user has already added a speech synthesis mark to it; adding another mark would interfere with the previous one, so that the speech synthesis device could not determine which mark to use, causing a speech synthesis error. To avoid this, in this embodiment, when it is determined that the target annotation text already contains a second speech synthesis mark, a modification entry corresponding to that mark is presented in the text markup interface, through which the user can modify the mark. The modification entry specifically includes a deletion entry or a change-mark-type entry; for example, the user may delete the added second speech synthesis mark through the deletion entry. This judgment avoids errors caused by marking already-marked content twice, prevents user misoperation, and improves marking efficiency.
In addition, if the target annotation text does not contain a second speech synthesis mark, no speech synthesis mark has yet been added to it, and the user can trigger a speech synthesis mark addition request through a marking tool button in the text markup interface. To make marking easier and spare the user the time of searching through many buttons, preferably the marking tool buttons matching the text type of the target annotation text are determined and set to an available state, while the buttons that do not match are set to an unavailable state. For example, if the text type of the selected target annotation text is numeric, the number-reading marking tool button is set to available and the polyphone marking tool button is set to unavailable.
Step S230: and determining a second mark type of the second voice synthesis mark corresponding to the first mark type of the first voice synthesis mark according to a preset mark mapping relation.
Specifically, a mark mapping relation between first speech synthesis marks and second speech synthesis marks is preset. The second speech synthesis mark and the first speech synthesis mark are implemented in different markup languages, and the second speech synthesis mark is a mark that is convenient to present in the text markup interface.
For example, the first speech synthesis mark is implemented based on a first markup language and the second speech synthesis mark based on a second markup language, where the first markup language is a language the browser does not support rendering and the second markup language is a language the browser supports rendering. In this embodiment, the first markup language is a speech synthesis markup language and the second markup language is a hypertext markup language.

Specifically, the first markup language is SSML (Speech Synthesis Markup Language). Because of the complexity of Chinese, converting text into phonemes often requires considering special symbols, number readings in different contexts, polyphones, sentence pauses, and similar scenarios, and SSML can define the final pronunciation of the synthesized audio more accurately and specifically. The second markup language is HTML (HyperText Markup Language). HTML tags are the most basic units of the HTML language, and in modern browsers they can compose rich style presentations, such as text color, background color, and underlining. Because HTML is a language the browser supports rendering, the second speech synthesis marks can be rendered in the text markup interface displayed by the browser, with different second speech synthesis marks rendered in different display styles for convenient browsing.

SSML, however, is a language the browser does not support rendering, so the text markup interface presented by the browser cannot render the first speech synthesis mark. If the first speech synthesis mark were displayed directly in the text markup interface in the conventional way, its markup code would be confused with the original text. For example, if the original text is "In case of fire, please dial 119." and the number reading of "119" is marked, the text markup interface would display something like "In case of fire, please dial <say-as interpret-as="telephone">119</say-as>." When the first speech synthesis mark is used directly, the markup code is thus mixed with the original text and readability is poor. To solve this problem, in this embodiment, the second mark type of the second speech synthesis mark corresponding to the first mark type of the first speech synthesis mark is determined according to the preset mark mapping relation. The example below shows the same mark in both languages.
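For illustration, here is the same telephone-number mark for "119" in both markup languages, following the tag forms given in the conversion examples later in this disclosure; the surrounding sentence is illustrative.

// First mark (SSML): consumed by the speech synthesis device,
// not renderable by the browser.
const firstMarkText =
  'In case of fire, please dial <say-as interpret-as="telephone">119</say-as>.';

// Second mark (HTML): rendered by the browser in the text markup interface.
const secondMarkText =
  'In case of fire, please dial <span class="number-to-phone">119</span>.';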
Step S240: adding a second voice synthesis mark of a second mark type to a target mark text in the original text to obtain a second voice synthesis mark text containing the second voice synthesis mark; the second voice synthesis mark text is used for being displayed in a text mark interface.
Since the second speech synthesis mark is convenient to present in the text markup interface, the second speech synthesis markup text obtained by adding a second speech synthesis mark of the second mark type to the original text is easy to display. For example, second speech synthesis marks of different types can be displayed in different styles, which avoids confusion with the original text and improves readability.
Specifically, a second speech synthesis mark of the second mark type is added to the target annotation text contained in the original text; a mark presentation style corresponding to the second mark type is determined; and, in the text markup interface, the text style of the target annotation text is adjusted to that mark presentation style, where different second mark types correspond to different mark presentation styles. That is, by determining the mark type of the second speech synthesis mark and the presentation style corresponding to it, the text style of the target annotation text can be adjusted accordingly in the text markup interface. For example, when the target annotation text "119" carries a telephone number mark type, its text style is adjusted to a red presentation style; when it carries an independent number mark type, its style is adjusted to bold. In summary, different presentation styles are preset for second speech synthesis marks of different types; when the user adds a speech synthesis mark to the target annotation text, the mark originally realized in the SSML language is converted into a mark realized in the HTML language, yielding the second speech synthesis markup text in HTML form, and the browser's ability to render HTML is used to render the target annotation text in the presentation style corresponding to the second mark type, as in the sketch below.
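A minimal sketch of this style adjustment follows; the style values follow the red/bold example above, while the "number-to-digit" class name for the independent-number type and the function name are assumptions.

// Presentation styles per second mark type ("number-to-digit" is assumed).
const MARK_STYLE: Record<string, Partial<CSSStyleDeclaration>> = {
  "number-to-phone": { color: "red" },       // telephone number mark type
  "number-to-digit": { fontWeight: "bold" }, // independent number mark type
};

// Adjust the text style of a rendered mark to its mark presentation style.
function applyMarkStyle(markEl: HTMLElement): void {
  const style = MARK_STYLE[markEl.className];
  if (style) Object.assign(markEl.style, style);
}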
Step S250: converting the second voice synthesis mark contained in the second voice synthesis mark text into a corresponding first voice synthesis mark to obtain a first voice synthesis mark text; the first voice synthesis mark text is used for being provided for a voice synthesis device to carry out voice synthesis processing.
In one implementation, the second speech synthesis marks contained in the second speech synthesis markup text are converted into the corresponding first speech synthesis marks in response to a received mark save instruction or speech synthesis instruction. For example, such an instruction may be triggered by the "listen test" button in FIG. 3: when the user is detected clicking it, each second speech synthesis mark contained in the second speech synthesis markup text is converted to obtain the corresponding first speech synthesis mark, and the first speech synthesis markup text containing the first speech synthesis marks is provided to the speech synthesis device for speech synthesis processing.
The second speech synthesis markup text, consisting of the original text and the second speech synthesis marks (i.e., the HTML-form speech synthesis marks) added to each target annotation text in it, is stored in a background database. During conversion, each second speech synthesis mark is converted into the corresponding first speech synthesis mark according to the preset mark mapping relation, yielding the first speech synthesis markup text in which each target annotation text in the original text carries its first speech synthesis mark. A minimal conversion sketch follows.
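The following TypeScript sketch covers only the two mappings spelled out later in this disclosure (number-to-phone to say-as, multiple-pinyin to phoneme); the helper name is an assumption.

// Step S250 sketch: rewrite each HTML temporary mark into its SSML form.
function htmlToSsml(htmlContent: string): string {
  const container = document.createElement("div");
  container.innerHTML = htmlContent;
  container.querySelectorAll("span.number-to-phone").forEach((span) => {
    span.outerHTML =
      `<say-as interpret-as="telephone">${span.textContent ?? ""}</say-as>`;
  });
  container.querySelectorAll("span.multiple-pinyin").forEach((span) => {
    const ph = (span as HTMLElement).dataset.ph ?? "";
    span.outerHTML =
      `<phoneme alphabet="py" ph="${ph}">${span.textContent ?? ""}</phoneme>`;
  });
  return container.innerHTML;
}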
The second speech synthesis markup text stored in the background database determines how the second speech synthesis marks are rendered in the browser, while the first speech synthesis markup text is provided to a speech synthesis device for Text To Speech (TTS) processing. The core of TTS is converting text into sound so that a machine speaks like a human: it simulates the human process of understanding natural language, namely text normalization, word segmentation, grammatical analysis, and semantic analysis, and then outputs the final synthesized speech after prosody processing and acoustic processing.
As can be seen from the foregoing, in this embodiment the mapping relation between the first and second speech synthesis marks is preset, so the first speech synthesis mark can be converted into the second speech synthesis mark for display in the text markup interface, while the first speech synthesis mark is provided to the speech synthesis device for speech synthesis processing. The marks in the text markup interface and the marks provided to the speech synthesis device can thus be of different types, which makes text and marks easier to distinguish, improves the readability of the text, and reduces the error rate. Moreover, the correspondence between the target annotation text and the speech synthesis mark can be checked automatically to prevent erroneous user input, and whether the selected target annotation text has already been marked can be judged in advance to prevent repeated marking, improving marking accuracy and efficiency.
In the following, the implementation details of this embodiment are described through a specific example.
First, a brief description will be made of an implementation manner in the related art:
In one related art, a user creates SSML markup text directly in an ordinary text box or multi-line text box (textarea) without any assistance, specifically comprising the steps of:
(1) Inputting the original text, i.e., the text to which SSML marks need to be added, into the text box;
(2) Manually adding the SSML tags, i.e., manually typing the corresponding markup text at each position where a mark is needed;
(3) Obtaining the SSML markup text, i.e., outputting the markup text with the tags manually entered by the user and submitting it to a TTS module to obtain the pronunciation audio.
The above-mentioned manner has at least the following drawbacks:
(1) Low creation efficiency: each SSML tag requires entering a large amount of extra content such as tags, attributes, and attribute values, so editing efficiency is low;
(2) High error rate: there are more than 10 types of SSML tags, which are difficult to write by hand, and with no error-correction assistance the hand-written SSML markup text has a high error rate;
(3) Poor readability: after a large number of SSML tags are mixed with the original text, reading the original text is disrupted, and the meaning of the tags themselves is difficult for other users to understand.
In still another related art, a basic SSML editing function is implemented by a markup editor, specifically comprising the steps of:
(1) Inputting original text: i.e., text to be SSML tagged;
(2) Selecting a text region: namely selecting words to be marked;
(3) Selecting a marking tool: selecting a preset tool from the editor;
(4) Automatically inserting SSML tags: according to the selected tool, the corresponding SSML mark is automatically inserted at the corresponding position of the text and displayed visually.
The above approach still has the following drawbacks:
(1) Poor readability: after a large number of SSML tags are mixed with the original text, reading the original text is disrupted, and the meaning of the tags themselves is difficult for other users to understand;
(2) High error rate: SSML tags are themselves text within the text, so if the user selects content inside an existing tag, tags of other types can still be added, producing nested-tag errors; and the lack of validity checking easily leads to tag errors;
(3) Poor secondary-editing capability: because the text contains the original SSML marks, the editor cannot identify them, cannot quickly delete them, and cannot intelligently judge tag nesting, so the secondary-editing experience is poor and inefficient.
To solve the above problems, fig. 4 shows a flowchart of the processing method of speech synthesis markup text provided by a specific example. As shown in fig. 4, the method includes the following steps:
Step S41: the original text to be marked is entered.
Step S42: a specified text region (i.e., the target annotation text) in the original text is selected. The specified text region is the text content or location that requires an SSML mark.
Step S43: it is determined whether the specified text region contains an existing mark, and existing marks are highlighted according to the judgment result.
Step S44: if so, the existing mark contained in the specified text region is highlighted and the marking function is disabled. For example, if the specified text region is determined to contain an existing mark, the region cannot be marked again; the existing mark is therefore highlighted and a deletion entry is provided so that the user can delete it. The marking function is disabled until the user deletes the existing mark, i.e., the marking tool buttons are placed in an unavailable state.
Step S45: if not, a marking request triggered by the user through a marking tool button is detected.
Step S46: a validity check is performed on the marking request.
Specifically, it is judged whether the mark type contained in the marking request matches the text type of the specified text region. If they match, the validity check passes; if the check fails, the user is prompted to mark again.
Step S47: when the validity check passes, marking is performed by means of HTML temporary marks.
The HTML temporary mark is the mark made with the second speech synthesis mark.
Step S48: when a save operation is detected, the HTML temporary marks are converted into SSML marks.
The HTML temporary mark is the second speech synthesis mark, and the SSML mark is the first speech synthesis mark.
Step S49: the SSML markup text is obtained from the converted SSML marks.
The SSML markup text is the first speech synthesis markup text, and it is provided to the speech synthesis device.
It follows that, in this example, using HTML temporary marks as an intermediate marking state makes it possible to quickly recognize whether the selected text has already been marked and to display marked content in a highlighted state.
Table 1 shows the correspondence between HTML tags and SSML tags:
TABLE 1
[Table 1 appears as an image in the original publication; it lists each HTML temporary mark and its corresponding SSML mark.]
For example, when the mouse selection falls on "119" or contains it, the class name of the mark is obtained through event.target.className. If the class name is "number-to-phone", the text has already been marked; if not, the selected text has not been marked. A sketch of this check follows.
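A TypeScript sketch of this check, generalized slightly over the known mark classes; the function name is an assumption.

// Classes used by the system's marks (from the mapping relation above).
const MARK_CLASSES = ["number-to-phone", "multiple-pinyin"];

// True if the node under the mouse event already carries a mark.
function isAlreadyMarked(event: MouseEvent): boolean {
  const className = (event.target as HTMLElement).className;
  return MARK_CLASSES.includes(className);
}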
Additionally, in the above example, the marking tools can be disabled when the selected region of text has already been marked. Because SSML does not support nesting, the other marking tools are disabled once text is marked, and the existing mark can be quickly deleted.
In addition, the user can select a marking tool. The available SSML marking tool buttons package the various complex, non-intuitive SSML tags into an intuitive, easy-to-use presentation. Table 2 shows the correspondence between tool buttons and SSML marks.
TABLE 2
[Table 2 appears as an image in the original publication; it lists each tool button and the SSML mark it inserts.]
In addition, in this example, a text validity check can be performed: before an SSML mark is made, the system automatically checks the validity of the selected text or location to prevent mark misuse. For example, selecting letters prohibits the use of pinyin marks, and selecting text that is not a number disables the [phone number reading] mark. If the validity check of the text against the target SSML tag fails, the mark is not made, the failure is prompted immediately, and the user is told which text the mark supports, e.g., for phone number reading: "Phone number reading only supports the digits and the characters -, (, )". If the validity check passes, the mark is completed by the HTML temporary mark module. Exploiting the fact that HTML marks cannot be edited by the user in a modern browser, the SSML mark is presented in an HTML style, which is characterized by: a prominent style (such as background color, bold text, text color, or hover reminders), easy reading without interfering information (such as tag words that are difficult for users to understand), and convenient deletion (such as one-click deletion of the style). When marking is completed and confirmed, all marking operations and results can be saved.
In the above example, an HTML-SSML conversion module is used to implement the HTML-SSML conversion function. It automatically converts the mark styles into standard SSML markup text and stores the correspondence between the two mark forms, according to which the system converts automatically in both directions, for example the correspondence between the SSML marks and HTML marks for phone number reading and for polyphones. Table 3 shows the conversion relationship between SSML marks and HTML marks.
TABLE 3 Table 3
[Table 3 appears as an image in the original publication; it lists each SSML mark and the HTML mark it converts to and from.]
When the user performs a save operation, the system outputs two fields of content (HTMLcontent and SSMLcontent), namely the HTML markup text content and the SSML markup text content. The HTML markup text is used for rendering and display in the browser, giving a more user-friendly display experience; the SSML markup text is delivered to the TTS interface service to correct the actual pronunciation.
Fig. 5 shows a flow diagram of a specific application scenario of this example. As shown in fig. 5, the processing system for speech synthesis markup text specifically includes a text region display module, an SSML marking tool, an HTML temporary mark module, and an HTML-SSML conversion module, and the processing steps are as follows:
Step S501: the original text to be marked is entered.
This step is completed by the text region display module.
Step S502: the target annotation text contained in the original text is selected.
Step S503: the HTML style of the target annotation text is analyzed.
For example, obtaining the HTML style of the selected text: the Range object in the DOM (a standard Web API object, https://developer.mozilla.org/zh-CN/docs/Web/API/Range) represents the content of the mouse-selected area and contains information such as the start position of the selection, the selected visible text, and the HTML tags of the selected area. The Node nodes within the Range are obtained, and the class and corresponding dataset object of each Node are read, thereby determining the HTML style of the target annotation text. A sketch follows.
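A TypeScript sketch of step S503 using the standard Selection and Range APIs; the return shape and function name are assumptions.

// Collect class and dataset info for marks inside the current selection.
function getSelectionMarkInfo(): { className: string; ph?: string }[] {
  const selection = window.getSelection();
  if (!selection || selection.rangeCount === 0) return [];
  const range = selection.getRangeAt(0);
  const fragment = range.cloneContents(); // copy of the selected area
  const info: { className: string; ph?: string }[] = [];
  fragment.querySelectorAll("span").forEach((node) => {
    info.push({ className: node.className, ph: node.dataset.ph });
  });
  return info;
}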
Step S504: and judging whether the HTML style of the target marked text is a marked style or not.
Step S505: if yes, determining a used marking tool corresponding to the target marking text, and highlighting the used marking tool.
For example, in the HTML temporary mark module, according to the tag-class mapping relation in the HTML-SSML conversion module, the system searches for whether one of its SSML mark types has been applied. If so, the current selection contains an SSML mark: the corresponding marking tool is highlighted as a prompt and the other marking tools are disabled; otherwise, any marking tool may be used.
Table 4 shows the tag-class mapping relation in the HTML-SSML conversion module.
TABLE 4 Table 4
[Table 4 appears as an image in the original publication; it lists each SSML tag and the HTML class it maps to.]
For example, if a node in the selection contains the class "number-to-phone", the selection contains content marked as [phone number reading], and the corresponding button can be highlighted to indicate the [marked] state.
Step S506: and receiving a user-triggered mark clearing instruction. Specifically, if the user clicks the "clear mark" button, a clear mark command is triggered.
Step S507: and deleting the added mark according to the mark clearing instruction.
For example, when a mark is present and the [clear mark] tool is clicked, the HTML tag carrying the mark class is removed directly and the text inside it is retained.
Step S508: if not, highlighting all marking tools for selection by a user.
Step S509: it is detected that the user clicks the "phone number reading" button.
Step S510: a telephone number style template is obtained.
Step S511: telephone number indicia are presented within the selection field.
For example, if no mark exists in the selected area and the [phone number reading] button is clicked, the HTML temporary mark module obtains the class-mark-type mapping relation from the HTML-SSML conversion module and finds that the class corresponding to this SSML mark type is "number-to-phone", the tag is span, and the data attribute is empty. It then automatically wraps the selected area in the HTML tag <span class="number-to-phone">119</span>, and the editor automatically renders the CSS style of the class without showing the tag's text to the user. A sketch follows.
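A TypeScript sketch of step S511; single-node selections only, error handling omitted, function name assumed.

// Wrap the current selection in the HTML temporary mark for [phone number reading].
function applyPhoneNumberMark(): void {
  const selection = window.getSelection();
  if (!selection || selection.rangeCount === 0) return;
  const range = selection.getRangeAt(0);
  const span = document.createElement("span");
  span.className = "number-to-phone";
  // Throws if the range partially selects a non-text node; a full editor
  // would split the range first.
  range.surroundContents(span);
}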
Step S512: HTML markup text is obtained.
Step S513: the HTML markup text is converted to SSML markup text.
When the standard SSML markup text is required, the HTML-SSML conversion module automatically converts <span class="number-to-phone">119</span> into <say-as interpret-as="telephone">119</say-as>, and the result is provided for the user to save or directly to the TTS interface to preview the sound effect in time.
Step S514: SSML tagged text is obtained for provision to a speech synthesis device.
In summary, through the above manner, at least the following beneficial effects can be achieved:
(1) The HTML tag can be interpreted and rendered into a corresponding style by the browser instead of being displayed to the user as raw tag text. Using HTML tags as temporary marks therefore both records the mark and reduces the number of visible characters, so the original text is not disturbed and the visual display experience is optimized. The core is to build a correspondence table between HTML tags and SSML tags in the system, i.e., the correspondence table of the HTML-SSML conversion module. Table 5 illustrates this with 2 SSML tags as examples.
TABLE 5
[Table 5 appears as an image in the original publication; it shows the two example correspondences between SSML tags and HTML tags described below.]
The characteristics of the two mark types are as follows:
HTML tags are interpreted and rendered by the browser. For example, for "119" the actual HTML tag is <span class="number-to-phone">119</span>, and when the HTML is rendered, different styles are displayed according to the CSS style set for the class in the system, for example a CSS style set for this class as follows:
[CSS listing shown as an image in the original publication; an assumed equivalent is sketched below.]
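Since the CSS listing itself is an image in the original, the following TypeScript sketch injects an assumed equivalent rule so that every number-to-phone mark is rendered prominently; all style values are invented for illustration.

// Inject an assumed CSS rule for the number-to-phone class (values invented).
const styleEl = document.createElement("style");
styleEl.textContent = `
  .number-to-phone {
    background-color: #fff3cd; /* assumed highlight color */
    color: #c00000;            /* assumed text color */
    font-weight: bold;
  }
`;
document.head.appendChild(styleEl);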
In contrast, SSML tags are not interpreted and rendered; they are shown as-is, e.g. <say-as interpret-as="telephone">119</say-as>, which causes considerable interference for ordinary users.
The key points of the conversion are as follows: each tag has one [tag-class] conversion and several [attribute] conversions. The span tag is uniformly adopted as the basic tag on the HTML side and need not be embodied in the conversion relation. The [tag-class] conversion is the correspondence between the SSML tag and the class of the span; the [attribute] conversion is the correspondence between an SSML attribute under a tag and an HTML attribute, and some SSML attributes are allowed to be empty in HTML.
For example, <say-as interpret-as="telephone">119</say-as> converts to HTML as follows:
[tag-class] say-as converts to span.number-to-phone, i.e., <say-as></say-as> corresponds to <span class="number-to-phone"></span>;
[attribute] interpret-as="telephone" has no correspondence in HTML, so no conversion is required.
<phoneme alphabet="py" ph="dang4">当</phoneme> converts to HTML as follows:
[tag-class] phoneme converts to span.multiple-pinyin, i.e., <phoneme></phoneme> becomes <span class="multiple-pinyin"></span>;
[attribute] alphabet="py" has no corresponding HTML attribute and is not converted;
[attribute] ph converts to data-ph with the attribute value carried over directly, i.e., ph="dang4" converts to data-ph="dang4"; the whole <phoneme alphabet="py" ph="dang4">当</phoneme> thus converts to <span class="multiple-pinyin" data-ph="dang4">当</span>.
<span class="number-to-phone">119</span> converts to SSML as follows:
[tag-class] number-to-phone converts to say-as, i.e., <span class="number-to-phone"></span> corresponds to <say-as></say-as>;
[attribute] interpret-as="telephone" is mandatory in SSML but has no counterpart in HTML, so it is added to the SSML directly as-is.
<span class="multiple-pinyin" data-ph="dang4">当</span> converts to SSML as follows:
[tag-class] multiple-pinyin converts to phoneme, i.e., <span class="multiple-pinyin"></span> becomes <phoneme></phoneme>;
[attribute] alphabet="py" has no corresponding HTML attribute, so it is added to the SSML directly;
[attribute] data-ph converts to ph with the attribute value carried over directly, i.e., data-ph="dang4" converts to ph="dang4"; the whole <span class="multiple-pinyin" data-ph="dang4">当</span> thus converts to <phoneme alphabet="py" ph="dang4">当</phoneme>.
(2) Because HTML tags follow the browser standard specification, the system can very easily judge whether the selected text is marked through the standard DOM API, can quickly remove a specified HTML tag through the DOM API, and can implement operations such as removing an existing mark or adding a new mark with a single button click, avoiding the complicated operations and missing-character errors caused by users directly editing complex text. Common Web API methods include the following; a usage sketch follows the list.
e.target.className gets and sets the class name;
document.createElement('span') creates an HTML tag;
e.target.innerText gets the plain text in the node;
e.target.removeChild(targetNode) removes the specified HTML tag.
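Using these methods, the one-click deletion described above can be sketched as follows (replacing the mark's span with its plain text; the function name is an assumption).

// Remove an existing mark but keep its text (cf. step S507 above).
function clearMark(markEl: HTMLElement): void {
  const text = document.createTextNode(markEl.innerText); // plain text in node
  markEl.parentNode?.replaceChild(text, markEl);          // drop the HTML tag
}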
(3) SSML markup reads poorly and is difficult for ordinary users to recognize and memorize. By mapping different marks to different display patterns through HTML tags, the user experience of recognition and display is improved; the technical details are hidden from ordinary users and handled entirely and automatically by the HTML-SSML conversion module, so that non-specialist users can complete SSML text markup conveniently and efficiently, improving markup productivity.
In other alternative implementations, a Delta format may be used instead of HTML for the text markup style presentation. Delta is a data object that encapsulates a corresponding object-operation interface; through its API, data nodes can be deleted, added, and moved. It is a data abstraction layer above HTML, and the data is ultimately still converted into HTML to be rendered in the browser. One Delta data format is as follows:
(Figure: an example Delta data object.)
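A minimal sketch of such a Delta object, modeled on the Quill editor's Delta format, might look as follows; the attribute name mirrors the span class used above and is an illustrative assumption:

    {
        "ops": [
            { "insert": "Please dial " },
            { "insert": "119", "attributes": { "number-to-telephone": true } },
            { "insert": " in case of fire.\n" }
        ]
    }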
The relationship among the three is SSML-Delta-HTML. The interconversion of Delta and HTML is supported by corresponding packages, so once the SSML-Delta conversion relation is established and implemented, the same effect as in this example can be achieved. In summary, this example can create, convert, and manage SSML markup text through HTML styles and HTML operations. On this basis, a browser-based SSML+HTML markup editor can be built, realizing functions such as quickly inserting an SSML mark (for example, the telephone reading of a number), displaying it visually, and quickly deleting it.
It will be appreciated that the above-mentioned method embodiments of the present disclosure may be combined with one another to form combined embodiments without departing from their principles and logic; for brevity, these combinations are not described in this disclosure. It will also be appreciated by those skilled in the art that, in the above methods of the embodiments, the specific execution order of the steps should be determined by their functions and possible internal logic.
In addition, the disclosure further provides a processing apparatus for speech synthesis markup text, an electronic device, and a computer-readable storage medium, each of which may be used to implement any of the processing methods for speech synthesis markup text provided in the disclosure; for the corresponding technical solutions and descriptions, refer to the method sections, which are not repeated here.
Fig. 6 is a block diagram of a processing device for speech synthesis markup text according to an embodiment of the present disclosure.
Referring to fig. 6, an embodiment of the present disclosure provides a processing apparatus 60 for speech synthesis markup text, the processing apparatus 60 including:
a determining module 61, adapted to determine, in response to a speech synthesis mark addition request triggered for an original text in a text markup interface, a first speech synthesis mark corresponding to the original text;
the determining module 61 is further adapted to determine, according to a preset mark mapping relationship, a second speech synthesis mark corresponding to the first speech synthesis mark;
an adding module 62, adapted to add the second speech synthesis mark to the original text to obtain a second speech synthesis markup text containing the second speech synthesis mark; the second speech synthesis markup text is used for display in the text markup interface;
a conversion module 63, adapted to convert the second speech synthesis mark contained in the second speech synthesis markup text into the corresponding first speech synthesis mark to obtain a first speech synthesis markup text; the first speech synthesis markup text is used for provision to a speech synthesis apparatus for speech synthesis processing; the first speech synthesis mark is implemented based on a first markup language and the second speech synthesis mark is implemented based on a second markup language; the first markup language is a language that browsers do not support rendering, and the second markup language is a language that browsers support rendering.
Optionally, the first markup language is a speech synthesis markup language, and the second markup language is a hypertext markup language.
Optionally, the determining module is further adapted to:
respond, to a text selection operation triggered by a user, by determining the target annotation text, contained in the original text, that corresponds to the text selection operation;
wherein the speech synthesis mark addition request is triggered for the target annotation text contained in the original text, via a marking tool button contained in the text markup interface;
the text markup interface is used for displaying the original text and contains multiple types of marking tool buttons, where different marking tool buttons correspond to different types of first speech synthesis marks.
Optionally, the adding module is specifically adapted to:
determine the mark type to which the second speech synthesis mark belongs and the mark display style corresponding to that mark type;
and, in the text markup interface, adjust the text style of the target annotation text contained in the original text to that mark display style.
Optionally, the determining module is further adapted to:
determine the text type of the target annotation text and the mark type to which the first speech synthesis mark belongs;
check whether the text type matches the mark type to which the first speech synthesis mark belongs;
if not, generate prompt information to prompt the user to trigger the speech synthesis mark addition request again.
Optionally, the determining module is further adapted to:
determine whether the target annotation text already contains a second speech synthesis mark;
if so, display in the text markup interface a modification entry corresponding to that second speech synthesis mark, wherein the modification entry is used to modify the second speech synthesis mark; and set the marking tool buttons contained in the text markup interface to an unavailable state;
if not, determine, according to the text type of the target annotation text, the marking tool buttons matching that text type, and set those buttons to an available state.
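A rough sketch of this availability logic, in which the button selector, the data attribute, and the helper functions detectTextType and showModificationEntry are all hypothetical:

    // Hypothetical: disable all marking-tool buttons when the selection is
    // already marked (showing the modification entry instead); otherwise
    // enable only the buttons matching the selected text's type.
    function updateToolbarState(selectedNode) {
        const buttons = document.querySelectorAll('.mark-tool-button');
        if (selectedNode.closest('span.number-to-telephone, span.multiple-pinyin')) {
            buttons.forEach((b) => { b.disabled = true; });
            showModificationEntry(selectedNode);                 // hypothetical helper
        } else {
            const textType = detectTextType(selectedNode.textContent); // e.g. 'number'
            buttons.forEach((b) => { b.disabled = (b.dataset.textType !== textType); });
        }
    }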
Optionally, the conversion module is specifically adapted to:
in response to receiving a mark save instruction or a speech synthesis instruction, convert the second speech synthesis mark contained in the second speech synthesis markup text into the corresponding first speech synthesis mark.
Fig. 7 is a block diagram of an electronic device according to an embodiment of the present disclosure.
Referring to fig. 7, an embodiment of the present disclosure provides an electronic device including: at least one processor 501; at least one memory 502, and one or more I/O interfaces 503, coupled between the processor 501 and the memory 502; wherein the memory 502 stores one or more computer programs executable by the at least one processor 501, the one or more computer programs being executed by the at least one processor 501 to perform the above-described processing method of speech synthesis markup text.
The disclosed embodiments also provide a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor/processing core, implements the above-described processing method of speech synthesis markup text. The computer-readable storage medium may be a volatile or nonvolatile computer-readable storage medium.
Embodiments of the present disclosure also provide a computer program product comprising computer-readable code, or a non-transitory computer-readable storage medium carrying computer-readable code, which, when executed in a processor of an electronic device, performs the above-described processing method of speech synthesis markup text.
Those of ordinary skill in the art will appreciate that all or some of the steps, systems, functional modules/units in the apparatus, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed cooperatively by several physical components. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer-readable storage media, which may include computer storage media (or non-transitory media) and communication media (or transitory media).
The term computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable program instructions, data structures, program modules or other data, as known to those skilled in the art. Computer storage media includes, but is not limited to, Random Access Memory (RAM), Read-Only Memory (ROM), Erasable Programmable Read-Only Memory (EPROM), Static Random Access Memory (SRAM), flash memory or other memory technology, portable Compact Disc Read-Only Memory (CD-ROM), Digital Versatile Discs (DVD) or other optical disc storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. Furthermore, as is well known to those of ordinary skill in the art, communication media typically embodies computer-readable program instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and may include any information delivery media.
The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmissions, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge servers. The network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.
Computer program instructions for performing the operations of the present disclosure can be assembly instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk or C++ and conventional procedural programming languages such as the "C" programming language or similar programming languages. The computer-readable program instructions may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present disclosure are implemented by personalizing electronic circuitry, such as programmable logic circuitry, Field-Programmable Gate Arrays (FPGAs), or Programmable Logic Arrays (PLAs), with state information of computer-readable program instructions, which electronic circuitry can execute the computer-readable program instructions.
The computer program product described herein may be embodied in hardware, software, or a combination thereof. In an alternative embodiment, the computer program product is embodied as a computer storage medium, and in another alternative embodiment, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK), or the like.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Example embodiments have been disclosed herein, and although specific terms are employed, they are used and should be interpreted in a generic and descriptive sense only and not for purpose of limitation. In some instances, it will be apparent to one skilled in the art that features, characteristics, and/or elements described in connection with a particular embodiment may be used alone or in combination with other embodiments unless explicitly stated otherwise. It will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope of the disclosure as set forth in the appended claims.

Claims (10)

1. A method for processing a speech synthesis markup text, comprising:
determining a first speech synthesis mark corresponding to an original text in a text markup interface in response to a speech synthesis mark addition request triggered for the original text;
determining a second speech synthesis mark corresponding to the first speech synthesis mark according to a preset mark mapping relationship;
adding the second speech synthesis mark to the original text to obtain a second speech synthesis markup text containing the second speech synthesis mark; the second speech synthesis markup text is used for display in the text markup interface;
converting the second speech synthesis mark contained in the second speech synthesis markup text into the corresponding first speech synthesis mark to obtain a first speech synthesis markup text; the first speech synthesis markup text is used for provision to a speech synthesis apparatus for speech synthesis processing; the first speech synthesis mark is implemented based on a first markup language and the second speech synthesis mark is implemented based on a second markup language; the first markup language is a language that browsers do not support rendering, and the second markup language is a language that browsers support rendering.
2. The method of claim 1, wherein the first markup language is a speech synthesis markup language and the second markup language is a hypertext markup language.
3. The method of claim 1, wherein, prior to responding to the speech synthesis mark addition request triggered for the original text in the text markup interface, the method further comprises:
in response to a text selection operation triggered by a user, determining the target annotation text, contained in the original text, that corresponds to the text selection operation;
wherein the speech synthesis mark addition request is triggered for the target annotation text contained in the original text, via a marking tool button contained in the text markup interface;
the text markup interface is used for displaying the original text and contains multiple types of marking tool buttons, where different marking tool buttons correspond to different types of first speech synthesis marks.
4. The method of claim 3, wherein said adding the second speech synthesis mark to the original text comprises:
determining the mark type to which the second speech synthesis mark belongs and the mark display style corresponding to that mark type;
and, in the text markup interface, adjusting the text style of the target annotation text contained in the original text to that mark display style.
5. The method of claim 3, wherein, after said determining the first speech synthesis mark corresponding to the original text, the method further comprises:
determining the text type of the target annotation text and the mark type to which the first speech synthesis mark belongs;
checking whether the text type matches the mark type to which the first speech synthesis mark belongs;
if not, generating prompt information to prompt the user to trigger the speech synthesis mark addition request again.
6. The method of claim 3, wherein, after said determining the target annotation text contained in the original text that corresponds to the text selection operation, the method further comprises:
determining whether the target annotation text already contains a second speech synthesis mark;
if so, displaying in the text markup interface a modification entry corresponding to that second speech synthesis mark, wherein the modification entry is used to modify the second speech synthesis mark; and setting the marking tool buttons contained in the text markup interface to an unavailable state;
if not, determining, according to the text type of the target annotation text, the marking tool buttons matching that text type, and setting those buttons to an available state.
7. The method of claim 1, wherein said converting the second speech synthesis mark contained in the second speech synthesis markup text into the corresponding first speech synthesis mark comprises:
in response to receiving a mark save instruction or a speech synthesis instruction, converting the second speech synthesis mark contained in the second speech synthesis markup text into the corresponding first speech synthesis mark.
8. A processing apparatus for speech synthesis markup text, comprising:
a determining module, adapted to determine, in response to a speech synthesis mark addition request triggered for an original text in a text markup interface, a first speech synthesis mark corresponding to the original text;
the determining module is further adapted to determine, according to a preset mark mapping relationship, a second speech synthesis mark corresponding to the first speech synthesis mark;
an adding module, adapted to add the second speech synthesis mark to the original text to obtain a second speech synthesis markup text containing the second speech synthesis mark; the second speech synthesis markup text is used for display in the text markup interface;
a conversion module, adapted to convert the second speech synthesis mark contained in the second speech synthesis markup text into the corresponding first speech synthesis mark to obtain a first speech synthesis markup text; the first speech synthesis markup text is used for provision to a speech synthesis apparatus for speech synthesis processing; the first speech synthesis mark is implemented based on a first markup language and the second speech synthesis mark is implemented based on a second markup language; the first markup language is a language that browsers do not support rendering, and the second markup language is a language that browsers support rendering.
9. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores one or more computer programs executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
10. A computer readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the method according to any of claims 1-7.
CN202211279014.2A 2022-10-19 2022-10-19 Processing method and related device for speech synthesis marked text Pending CN116153289A (en)



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination