CN114863907B - Marking method and device for text-to-speech processing

Info

Publication number: CN114863907B
Application number: CN202210791141.4A
Authority: CN
Inventors: 刘丹; 王荔; 汤跃忠; 陈龙; 杨静波
Original assignee: Third Research Institute Of China Electronics Technology Group Corp; Beijing Zhongdian Huisheng Technology Co ltd
Current assignee: Third Research Institute Of China Electronics Technology Group Corp; Beijing Zhongdian Huisheng Technology Co ltd
Priority date: 2022-07-07
Filing date: 2022-07-07
Publication date: 2022-10-28
Anticipated expiration: 2042-07-07
Also published as: CN114863907A

Abstract

The invention discloses a marking method and a marking device for text-to-speech processing, wherein the marking method comprises the following steps: providing a plurality of marking menu items, each marking menu item having a marking tool with a type of function; selecting a first target text, and adding a mark of a corresponding function to the selected first target text based on a mark menu item; providing a temporary marking area; acquiring a copy instruction after adding a target mark to the first target text so as to temporarily store the target mark in a temporary mark area; and under the condition that the second target text is selected, presenting a target mark in association with the second target text based on the temporary mark area, so as to endow the target mark to the second target text after acquiring a confirmation instruction of the user. According to the embodiment of the application, the mark which the user expects to copy is temporarily stored in the temporary mark area, so that mark copy is realized, the interaction frequency in the mark process is greatly reduced, and the mark efficiency in the text-to-speech process is improved.

Description

Marking method and device for text-to-speech processing

Technical Field

The invention relates to the technical field of voice transcription, in particular to a marking method and a marking device for text-to-voice processing.

Background

In text-to-speech software, adding pronunciation and prosody marks to the text can improve the accuracy and naturalness of the synthesized speech.

In prior-art marking methods, a mark icon in the text can be deleted, or, when clicked, opens a pop-up window or pull-down menu, but the mark itself cannot be selected and copied. When the user needs to add the same mark at different positions, the user must click the function icon again and then select from a pull-down menu or type into a pop-up window each time. The marking process in the prior art therefore involves too many operation steps, and marking efficiency is low.

Disclosure of Invention

The embodiment of the invention provides a marking method and a marking device for text-to-speech processing, which are used for providing a marking copy function and greatly improving marking efficiency.

The embodiment of the invention provides a marking method for text-to-speech processing, which comprises the following steps:

providing a plurality of marking menu items, each marking menu item having a marking tool of a type of function;

selecting a first target text, and adding a mark of a corresponding function to the selected first target text based on a mark menu item;

providing a temporary marking area;

acquiring a copy instruction after adding a target mark to the first target text so as to temporarily store the target mark to the temporary mark area;

and under the condition that a second target text is selected, presenting the target mark in association with the second target text based on the temporary mark area, so that the target mark is endowed to the second target text after a confirmation instruction of a user is obtained.

Optionally, presenting the target mark in association with the second target text is achieved by a mark pop-up window;

and when the mark pop-up window is closed, or the mark given to the second target text is different from the target mark in the temporary mark area, or the operation on the second target text is inconsistent with the precondition operation corresponding to the target mark in the temporary mark area, the mark pop-up window is not displayed when text after the second target text is subsequently selected.

Optionally, the method further includes: and after a third target text is selected, acquiring a selection instruction of the temporary marking area so as to endow the target mark in the temporary marking area to the third target text.

Optionally, the plurality of marking menu items includes at least: a pause mark, a continuous reading mark, a polyphone mark, a local volume mark, a reread mark, and an alias mark.

Optionally, at least some of the plurality of marking menu items are provided with corresponding customization functions.

Optionally, after adding a mark of a corresponding function to the selected first target text, the method further includes:

and acquiring the clicking operation of the mark of the first target text to modify the mark of the first target text.

Optionally, the method further includes:

providing a delete key in association with the mark of the first target text, so that the corresponding mark can be deleted via the delete key.

The present application further provides a text-to-speech processing labeling apparatus, which includes a processor and a memory, where the memory stores a computer program, and the computer program implements the steps of the text-to-speech processing labeling method when executed by the processor.

The present application also proposes a computer-readable storage medium having stored thereon a computer program which, when being executed by a processor, implements the steps of the aforementioned labeling method for text-to-speech processing.

The embodiment of the invention provides a plurality of marking menu items and provides a temporary marking area, so that marks which are expected to be copied by a user are temporarily stored in the temporary marking area, thereby realizing mark copying, greatly reducing the interaction frequency in the marking process and improving the marking efficiency in the text-to-speech process.

The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:

FIG. 1 is a basic flow diagram of a tagging method of an embodiment of the present application;

FIG. 2 is an example of marker replication for an embodiment of the present application;

FIG. 3 is an example of pasting a mark based on the temporary mark area in an embodiment of the present application;

FIG. 4 is an example of the state after the mark pop-up window is closed in an embodiment of the present application.

Detailed Description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

The speech synthesis software is mainly used for synthesizing texts into audios, but the accuracy and naturalness of the existing machine speech synthesis effect are still insufficient, so that various text pronunciation marks and rhythm adjustment marks need to be added to improve the synthesis effect, such as adding "aliases", "polyphones", "pauses", "continuous reading", "rereading", "local speed change", "local volume", and the like in the texts.

The embodiment of the invention provides a marking method for text-to-speech processing, which comprises the following steps:

In step S101, a plurality of marking menu items are provided, each having a marking tool for one type of function. Text marks are used to correct problems such as incorrect pronunciation or unnatural pauses and other prosodic defects when a machine synthesizes speech. Referring specifically to FIG. 2, in some examples, the plurality of marking menu items includes at least: a pause mark, a continuous reading mark, a polyphone mark, a local volume mark, a reread mark, and an alias mark.

In step S102, a first target text is selected, and a mark of the corresponding function is added to the selected first target text based on a marking menu item. In actual use, the user clicks a marking menu item to apply the mark of the corresponding function. For example, clicking the polyphone mark allows a pronunciation to be set for the selected text. The other marking tools are used in a similar way and are not described again here.
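Steps S101 and S102 can be sketched as a small data model. The following is a minimal illustration only, not the patented implementation: the names `MarkType`, `Mark`, `Document` and `add_mark`, the use of character offsets for the selection, and the pinyin string `"hang2"` are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum


class MarkType(Enum):
    """One entry per marking menu item listed in step S101."""
    PAUSE = "pause"
    CONTINUOUS_READING = "continuous reading"
    POLYPHONE = "polyphone"
    LOCAL_VOLUME = "local volume"
    REREAD = "reread"
    ALIAS = "alias"


@dataclass(frozen=True)
class Mark:
    mark_type: MarkType
    value: str  # hypothetical payload, e.g. a pinyin reading


@dataclass
class Document:
    text: str
    # Maps a (start, end) character range to the mark attached to it.
    marks: dict[tuple[int, int], Mark] = field(default_factory=dict)

    def add_mark(self, start: int, end: int, mark: Mark) -> None:
        """Step S102: attach a mark to the selected target text."""
        if not (0 <= start <= end <= len(self.text)):
            raise ValueError("selection is outside the text")
        self.marks[(start, end)] = mark


doc = Document("银行")
# Select the second character and set its reading (illustrative value).
doc.add_mark(1, 2, Mark(MarkType.POLYPHONE, "hang2"))
```

Keying marks by character range keeps the text itself unchanged, which is one simple way to let a mark be modified or deleted later without re-parsing the text.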

In step S103, a temporary mark area is provided. See the "temporary marks" area in FIG. 2, which does not yet contain any mark.

In step S104, a copy instruction issued after a target mark has been added to the first target text is acquired, so as to temporarily store the target mark in the temporary mark area. Referring to FIG. 2 and FIG. 3, for example, after the mark "0.9x" is copied in FIG. 2, it is temporarily stored in the temporary mark area and can then be pasted from there. The copy action itself can be performed via the right mouse button or the shortcut Ctrl+C.
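The temporary mark area of steps S103 and S104 behaves like a one-slot clipboard for marks. The sketch below is one possible reading; the class and method names are hypothetical, and treating "0.9x" as a "local speed" mark is an assumption, since the text does not name the mark type shown in the figures.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Mark:
    kind: str   # assumed label, e.g. "local speed"
    value: str  # e.g. "0.9x"


class TemporaryMarkArea:
    """Holds at most one mark: the most recently copied one."""

    def __init__(self) -> None:
        self._mark: Optional[Mark] = None  # step S103: the area starts empty

    def copy(self, mark: Mark) -> None:
        """Step S104: store the copied mark, replacing any earlier one."""
        self._mark = mark

    def peek(self) -> Optional[Mark]:
        """Return the stored mark without removing it."""
        return self._mark

    def is_empty(self) -> bool:
        return self._mark is None


area = TemporaryMarkArea()
area.copy(Mark("local speed", "0.9x"))
```

Because `peek` does not clear the slot, the same mark can be pasted any number of times until another mark is copied.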

In step S105, when a second target text is selected, the target mark is presented in association with the second target text based on the temporary mark area, so that the target mark is given to the second target text after a confirmation instruction of the user is acquired. The second target text in this example may be text that follows the first target text in the passage. FIG. 3 shows an example of pasting a mark based on the temporary mark area: after the second target text is selected, the mark in the temporary mark area, for example "0.9x" in FIG. 3, is presented in association with the selected text. In this way, when the same mark needs to be used repeatedly, the user can copy it via the temporary mark area, which greatly improves the efficiency of adding marks in the text-to-speech process.
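Step S105 separates presenting the stored mark from applying it: the mark is given to the second target text only after the user confirms. A minimal sketch under assumed names follows; `offer_paste` and the `confirm` callback are hypothetical stand-ins for the user interface and are not taken from the patent.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Selection = tuple[int, int]  # (start, end) character offsets, an assumed encoding


@dataclass(frozen=True)
class Mark:
    kind: str
    value: str


def offer_paste(
    marks: dict[Selection, Mark],
    selection: Selection,
    stored: Optional[Mark],
    confirm: Callable[[Mark], bool],
) -> bool:
    """Present the stored mark for `selection`; apply it only if confirmed.

    Returns True when the mark was given to the selected text.
    """
    if stored is None:       # nothing has been copied yet
        return False
    if not confirm(stored):  # user did not confirm the presented mark
        return False
    marks[selection] = stored
    return True


marks: dict[Selection, Mark] = {}
speed = Mark("local speed", "0.9x")
# A callback that always confirms stands in for the user's click.
offer_paste(marks, (5, 9), speed, confirm=lambda mark: True)
```

Passing the confirmation as a callback keeps the sketch independent of any particular user-interface toolkit.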

Optionally, presenting the target mark in association with the second target text is achieved by a mark pop-up window. For example, in FIG. 3, an icon of the copied mark "0.9x" automatically pops up near the cursor. Clicking this floating icon pastes the mark, after which the pop-up window disappears.

When the mark pop-up window is closed, or the mark given to the second target text is different from the target mark in the temporary mark area, or the operation on the second target text (selecting a text region or positioning the cursor) is inconsistent with the precondition operation corresponding to the target mark in the temporary mark area, the mark pop-up window is not displayed when text after the second target text is subsequently selected.

Specifically, for example, in FIG. 4, in the text segment "not innocent from our deep," the mark "0.9x" has been pasted onto the earlier target text "from" through the mark pop-up window. The user then clicks the "X" in the pop-up window to close it, so no pop-up window appears when the later target text "deep" is selected. In other examples, if the user has given "deep" a mark other than "0.9x", no further pop-up is shown. Likewise, if the user does not select text but only clicks in the text with the mouse, the pop-up is not shown.

In other examples, when the operation on the second target text is inconsistent with the precondition operation corresponding to the target mark in the temporary mark area, the mark pop-up window is not shown after the second target text is selected. The precondition operation of a mark is the operation that must precede adding that mark. For example, the polyphone mark is added by selecting a single character and then clicking the "polyphone" mark, so selecting a single character is the precondition operation of the polyphone mark. A pause mark is added by clicking a position in the text to place the cursor and then clicking the pause mark, so cursor positioning is the precondition operation of the pause mark. In this way, the method better fits the user's usage scenario: the mark pop-up window appears only at an appropriate time, which improves the user's marking efficiency.
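The three conditions under which the mark pop-up window is withheld can be sketched as a small state holder. This is an illustrative reading only. The names are hypothetical, and two behaviours are assumptions the text does not settle: that a new copy re-enables the pop-up, and that a precondition mismatch only skips the pop-up for that one operation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Operation(Enum):
    SELECT_TEXT = "select text"
    POSITION_CURSOR = "position cursor"


@dataclass(frozen=True)
class Mark:
    kind: str
    value: str
    precondition: Operation  # operation required before adding this mark


class PopupController:
    def __init__(self) -> None:
        self.stored: Optional[Mark] = None
        self.suppressed = False

    def copy(self, mark: Mark) -> None:
        # Assumption: copying a mark re-enables the pop-up.
        self.stored = mark
        self.suppressed = False

    def close_popup(self) -> None:
        """Condition 1: the user closed the pop-up window."""
        self.suppressed = True

    def mark_applied(self, applied: Mark) -> None:
        """Condition 2: the user gave the text a different mark."""
        if self.stored is not None and applied != self.stored:
            self.suppressed = True

    def should_popup(self, operation: Operation) -> bool:
        """Condition 3: the operation must match the mark's precondition."""
        if self.stored is None or self.suppressed:
            return False
        return operation == self.stored.precondition


controller = PopupController()
controller.copy(Mark("local speed", "0.9x", Operation.SELECT_TEXT))
```

With this shape, a pause mark stored with `Operation.POSITION_CURSOR` would be offered after a cursor click but not after a text selection.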

In some embodiments, the method further comprises: after a third target text is selected, acquiring a selection instruction on the temporary mark area, so as to give the target mark in the temporary mark area to the third target text. Specifically, as shown in FIG. 4, after the mark pop-up window is closed, the temporary mark area still holds the mark "0.9x". During subsequent marking, the user can click the temporary mark area to retrieve the mark stored there and give it to the selected third target text.
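The point of this embodiment is that clicking the temporary mark area still works after the pop-up window has been dismissed. A minimal sketch, with hypothetical names (`Editor`, `click_temporary_area`, `popup_suppressed`) that do not come from the patent:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Mark:
    kind: str
    value: str


class Editor:
    def __init__(self, text: str) -> None:
        self.text = text
        self.marks: dict[tuple[int, int], Mark] = {}
        self.temporary_mark: Optional[Mark] = None
        self.popup_suppressed = False

    def click_temporary_area(self, selection: tuple[int, int]) -> bool:
        """Give the stored mark to the selected text.

        The pop-up state is deliberately not consulted: the stored mark
        stays usable after the pop-up window has been closed.
        """
        if self.temporary_mark is None:
            return False
        self.marks[selection] = self.temporary_mark
        return True


editor = Editor("not innocent from our deep")
editor.temporary_mark = Mark("local speed", "0.9x")
editor.popup_suppressed = True  # the user closed the pop-up earlier
start = editor.text.index("deep")
editor.click_temporary_area((start, start + len("deep")))
```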

In some embodiments, at least some of the plurality of marking menu items are provided with corresponding customization functions.

After text is selected by dragging or the cursor is positioned in the text, clicking a toolbar icon either generates a mark directly or requires a further step: selecting from a drop-down menu, typing into an input box, or adjusting a slider to the desired value.

Specifically, pause mark: to insert a pause mark into the text, position the cursor in the text and click the pause mark menu item. Then either select an option from the drop-down menu that pops up (custom pause duration, no pause, 0.05 s, 0.1 s, 0.15 s, 0.2 s) or enter a custom pause duration in the input box. The pause mark is then inserted.

Continuous reading mark: after part of the text (more than two characters) is selected by dragging, click the continuous reading mark menu item, and the continuous reading mark appears in the text immediately.

Polyphone mark: select a character by dragging, click the "polyphone" mark menu item, and either select a pinyin reading offered by the system in the drop-down menu or enter a custom pinyin reading in the input box.

Local volume mark: select the text whose volume is to be adjusted, click the "local volume" mark menu item, and either adjust the volume to the desired value with the slider or enter a custom value in the input box, then click the "confirm" button.
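The preset-or-custom pattern described for the pause and local volume marks can be sketched as two small helpers. Everything beyond the preset pause durations is an assumption here: the function names, representing "no pause" as 0.0, the 0 to 100 volume range, the clamping of out-of-range values, and giving the typed value priority over the slider.

```python
from typing import Optional

# Drop-down presets from the text, in seconds; 0.0 stands for "no pause".
PAUSE_PRESETS_SECONDS = (0.0, 0.05, 0.1, 0.15, 0.2)


def resolve_pause(
    preset_index: Optional[int] = None,
    custom_seconds: Optional[float] = None,
) -> float:
    """Return the pause duration from a custom entry or a preset."""
    if custom_seconds is not None:
        if custom_seconds < 0:
            raise ValueError("pause duration cannot be negative")
        return custom_seconds
    if preset_index is None:
        raise ValueError("choose a preset or enter a custom duration")
    return PAUSE_PRESETS_SECONDS[preset_index]


def resolve_volume(
    slider_value: Optional[int] = None,
    custom_value: Optional[int] = None,
    minimum: int = 0,     # assumed lower bound
    maximum: int = 100,   # assumed upper bound
) -> int:
    """Return the volume, preferring the typed value over the slider."""
    value = custom_value if custom_value is not None else slider_value
    if value is None:
        raise ValueError("move the slider or enter a custom value")
    return max(minimum, min(maximum, value))  # clamp into the assumed range


pause = resolve_pause(preset_index=2)      # third preset in the menu
volume = resolve_volume(slider_value=120)  # clamped to the assumed maximum
```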

In some embodiments, after the mark of the corresponding function is added to the selected first target text, the method further includes: acquiring a clicking operation on the mark of the first target text to modify that mark. In some embodiments, the method further includes: providing a delete key in association with the mark of the first target text, so that the corresponding mark can be deleted via the delete key. Specifically, after a mark is generated, clicking the mark in the text allows it to be modified, and clicking the delete button in the mark deletes it.
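Modifying and deleting an existing mark can be sketched as two operations on a mapping from text ranges to marks. The storage layout and the function names `modify_mark` and `delete_mark` are assumptions for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Mark:
    kind: str
    value: str


Marks = dict[tuple[int, int], Mark]


def modify_mark(marks: Marks, selection: tuple[int, int], new_value: str) -> None:
    """Clicking a mark: replace its value while keeping its kind."""
    current = marks[selection]  # raises KeyError if no mark is attached
    marks[selection] = replace(current, value=new_value)


def delete_mark(marks: Marks, selection: tuple[int, int]) -> None:
    """Clicking the delete key of a mark: remove it if present."""
    marks.pop(selection, None)


marks: Marks = {(5, 9): Mark("local speed", "0.9x")}
modify_mark(marks, (5, 9), "1.1x")
```

`dataclasses.replace` returns a new object, so a mark previously copied to a temporary area is not altered when the mark in the text is edited.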

The method of this embodiment can copy marks quickly, intelligently recognize when the user is about to paste a mark, and prompt the user to paste it. The most recently copied mark is stored in the temporary mark area, where the user can click it to use it directly. This reduces the time needed to apply many identical marks, improves the efficiency of text marking, and thereby improves the efficiency of text-to-speech synthesis.

The present application further provides a labeling apparatus for text-to-speech processing, which includes a processor and a memory, where the memory stores a computer program, and the computer program implements the steps of the labeling method for text-to-speech processing when executed by the processor.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of another like element in a process, method, article, or apparatus that comprises the element.

The above-mentioned serial numbers of the embodiments of the present invention are only for description, and do not represent the advantages and disadvantages of the embodiments.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention or portions thereof contributing to the prior art may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the methods according to the embodiments of the present invention.

While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims

1. A marking method for text-to-speech processing is characterized by comprising the following steps:

providing a temporary marking area;

under the condition that a second target text is selected, presenting the target mark in association with the second target text based on the temporary mark area, and endowing the target mark to the second target text after acquiring a confirmation instruction of a user;

presenting the target mark in association with the second target text is accomplished by a mark pop-up window;

2. The method for labeling text-to-speech processing as recited in claim 1, further comprising: and after a third target text is selected, acquiring a selection instruction of the temporary marking area so as to endow the target mark in the temporary marking area to the third target text.

3. The method for text-to-speech tagging of claim 1, wherein the plurality of tagging menu items comprises at least: pause flag, continuous reading flag, polyphone flag, local volume flag, reread flag, alias flag.

4. The method for tagging of text to speech processing of claim 3, wherein at least some of the plurality of tagging menu items are provided with corresponding custom functions.

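5. The method for labeling text-to-speech processing according to claim 1, wherein after adding a label corresponding to a function to the selected first target text, further comprising:

acquiring a clicking operation on the mark of the first target text to modify the mark of the first target text.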

6. The method for text-to-speech processing tagging of claim 5, further comprising:

a delete key is associatively provided based on the marking of the first target text to delete the corresponding marking based on the delete key.

7. A text-to-speech processing tagging device comprising a processor and a memory, the memory having stored thereon a computer program which, when executed by the processor, performs the steps of the tagging method of text-to-speech processing according to any one of claims 1 to 6.

8. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, carries out the steps of the text-to-speech processing labeling method according to any one of claims 1 to 6.