CN116092477A - Voice synthesis system mark memory library-based audio generation method and device - Google Patents

Voice synthesis system mark memory library-based audio generation method and device

Info

Publication number
CN116092477A
Authority
CN
China
Prior art keywords
text
memory
audio file
user
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310322513.3A
Other languages
Chinese (zh)
Inventor
杨静波
汤跃忠
陈龙
刘丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Third Research Institute Of China Electronics Technology Group Corp
Beijing Zhongdian Huisheng Technology Co ltd
Original Assignee
Third Research Institute Of China Electronics Technology Group Corp
Beijing Zhongdian Huisheng Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Third Research Institute Of China Electronics Technology Group Corp, Beijing Zhongdian Huisheng Technology Co ltd filed Critical Third Research Institute Of China Electronics Technology Group Corp
Priority to CN202310322513.3A priority Critical patent/CN116092477A/en
Publication of CN116092477A publication Critical patent/CN116092477A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/334: Query execution
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01: Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/048: Interaction techniques based on graphical user interfaces [GUI]
    • G06F3/0481: Interaction techniques based on graphical user interfaces [GUI] based on specific properties of the displayed interaction object or a metaphor-based environment, e.g. interaction with desktop elements like windows or icons, or assisted by a cursor's changing behaviour or appearance
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01: Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/048: Interaction techniques based on graphical user interfaces [GUI]
    • G06F3/0484: Interaction techniques based on graphical user interfaces [GUI] for the control of specific functions or operations, e.g. selecting or manipulating an object, an image or a displayed text element, setting a parameter value or selecting a range
    • G06F3/04842: Selection of displayed objects or displayed text elements
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/103: Formatting, i.e. changing of presentation of documents
    • G06F40/117: Tagging; Marking up; Designating a block; Setting of attributes
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04: Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention provides an audio generation method and device based on a speech synthesis system mark memory bank, wherein the method comprises the following steps: acquiring a text to be retrieved; retrieving the text to be retrieved against a pre-configured memory-bank text to obtain the marked text in the text to be retrieved that matches the memory-bank text; taking the marking information of the marked text in the memory-bank text as the marking information of the marked text in the text to be retrieved; generating a corresponding audio file based on the marked text with its marking information; and determining, based on an interaction process with the user, whether the audio file meets the user's requirements. Through its memory function the invention automatically retrieves the text content to be synthesized, and once the retrieval result matches the memory-bank content, the text-marking function and scheme stored in the memory bank are invoked, reproducing the speech synthesis effect recorded in the memory bank, sparing the user from repeatedly adding the same marks manually, and greatly reducing the user's speech synthesis workload.

Description

Voice synthesis system mark memory library-based audio generation method and device
Technical Field
The invention relates to the technical field of speech synthesis, in particular to an audio generation method and device based on a speech synthesis system mark memory bank.
Background
Currently, a speech synthesis system marks text content during use according to each user's personalized requirements, for example with pause marks, continuous-reading marks, stress marks, polyphonic-character marks, alias marks, and the like. However, a user's personalized requirements tend to recur: when the next item is synthesized after the speech synthesis of one item is completed, and content that needs the same marks is encountered again, the same text content has to be marked anew. This mode of operation causes repeated labor, and the process is complex and tedious.
Disclosure of Invention
The technical problem the invention aims to solve is to simplify the repeated marking process in speech synthesis. In view of this, the present invention provides an audio generation method and device based on a speech synthesis system mark memory bank.
The technical scheme adopted by the invention is an audio generation method based on a speech synthesis system mark memory bank, comprising the following steps (a minimal sketch of the flow is given after this list):
acquiring a text to be retrieved;
retrieving the text to be retrieved against a pre-configured memory-bank text to obtain the marked text in the text to be retrieved that matches the memory-bank text;
taking the marking information of the marked text in the memory-bank text as the marking information of the marked text in the text to be retrieved;
generating a corresponding audio file based on the marked text with its marking information;
determining, based on an interaction process with the user, whether the audio file meets the user's requirements.
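For illustration only, the following minimal Python sketch strings these steps together end to end. Every name in it (generate_audio, synthesize, the phrase-keyed dictionary format of the memory bank) is an assumption made for this sketch; the disclosure does not prescribe a data format or an engine interface.

```python
def synthesize(text, marked_spans):
    # Stand-in for a real TTS engine, which the disclosure does not specify.
    return f"<audio: {text!r}, {len(marked_spans)} marked span(s)>"

def generate_audio(text, memory_bank):
    """The method in miniature. memory_bank maps phrase -> list of
    (mark_kind, mark_value) tuples; this storage format is assumed here."""
    # Retrieval step: find the memory-bank phrases that occur in the text.
    marked_spans = [(p, marks) for p, marks in memory_bank.items() if p in text]
    # Mark-transfer step: the stored marking information is reused unchanged.
    return synthesize(text, marked_spans)  # synthesis step; user confirmation omitted

bank = {"蹄疾步稳": [("continuous-reading", "")]}
print(generate_audio("蹄疾步稳推进改革", bank))
```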
In one embodiment, the method further comprises:
setting corresponding marking information for part of the text in the memory-bank text.
In one embodiment, the method further comprises:
performing at least one of adding, editing, and deleting operations on the marking information in the memory-bank text.
In one embodiment, the determining, based on the interaction process with the user, whether the audio file meets the user's requirements comprises:
outputting the current audio file in response to the user's confirmation of the audio file.
In one embodiment, the determining, based on the interaction process with the user, whether the audio file meets the user's requirements comprises:
reconfiguring, in response to the user's negative feedback on the audio file, the marked text corresponding to the audio file; and
synthesizing the reconfigured marked text into an audio file for further interaction with the user.
In one embodiment, the reconfiguring, in response to the user's negative feedback on the audio file, the marked text corresponding to the audio file comprises:
reconfiguring, in response to the user's negative feedback on the audio file, the marking information in the marked text corresponding to the audio file.
In one embodiment, the reconfiguring, in response to the user's negative feedback on the audio file, the marking information in the marked text corresponding to the audio file comprises:
performing, in response to the user's negative feedback on the audio file, at least one of adding, deleting, and modifying operations on the marking information in the marked text corresponding to the audio file.
The invention also provides an audio generation device based on a speech synthesis system mark memory bank, comprising:
an acquisition unit configured to acquire a text to be retrieved;
a retrieval unit configured to retrieve the text to be retrieved against a pre-configured memory-bank text, so as to obtain the marked text in the text to be retrieved that matches the memory-bank text;
a calling unit configured to take the marking information of the marked text in the memory-bank text as the marking information of the marked text in the text to be retrieved;
an audio synthesis unit configured to generate a corresponding audio file based on the marked text with its marking information; and
an interaction unit configured to determine, based on an interaction process with the user, whether the audio file meets the user's requirements.
Another aspect of the present invention also provides an electronic device, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the audio generation method based on the speech synthesis system mark memory bank described in any of the above.
Another aspect of the present invention also provides a computer storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the audio generation method based on the speech synthesis system mark memory bank described in any of the above.
By adopting the above technical scheme, the audio generation method based on the speech synthesis system mark memory bank provided by the invention, owing to its built-in memory-bank function, can automatically retrieve the text content to be synthesized; once the retrieval result matches the memory-bank content, the text-marking function and scheme stored in the memory bank are invoked, reproducing the speech synthesis effect recorded in the memory bank and sparing the user from repeatedly adding the same marks manually. The user's speech synthesis workload is greatly reduced.
Drawings
FIG. 1 is a flowchart of an audio generation method based on a speech synthesis system mark memory bank according to an embodiment of the invention;
FIG. 2 is a flowchart of another audio generation method based on a speech synthesis system mark memory bank according to an embodiment of the present invention;
FIG. 3 is a screenshot of the page before any marks are edited, in an application example of the audio generation method based on a speech synthesis system mark memory bank according to an embodiment of the present invention;
FIG. 4 is a screenshot of the page after a "continuous reading" mark is added, in an application example of the method according to an embodiment of the present invention;
FIG. 5 is a screenshot of the page for adding an entry to the memory bank, in an application example of the method according to an embodiment of the present invention;
FIG. 6 is a screenshot of the page where a memory-bank marking scheme is invoked, in an application example of the method according to an embodiment of the present invention;
FIG. 7 is a screenshot of the memory-bank management function, in an application example of the method according to an embodiment of the present invention;
FIG. 8 is a screenshot of the re-editing page of the speech synthesis system mark memory bank, in an application example of the method according to an embodiment of the present invention;
FIG. 9 is a block diagram of the structure of an audio generation device based on a speech synthesis system mark memory bank according to an embodiment of the present invention;
FIG. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to further describe the technical means and effects adopted by the present invention for achieving the intended purpose, the following detailed description of the present invention is given with reference to the accompanying drawings and preferred embodiments.
In the drawings, the thickness, size, and shape of objects may be slightly exaggerated for convenience of explanation. The figures are merely examples and are not drawn to scale.
It will be further understood that the terms "comprises," "comprising," "includes," "including," "has," "having," and/or "containing," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. Furthermore, when a statement such as "at least one of" appears after a list of features, it modifies the entire list rather than an individual element in the list. Furthermore, when describing embodiments of the present application, the use of "may" means "one or more embodiments of the present application." Also, the term "exemplary" is intended to refer to an example or illustration.
As used herein, the terms "substantially," "about," and the like are used as terms of approximation rather than terms of degree, and are intended to account for the inherent deviations in measured or calculated values that would be recognized by those of ordinary skill in the art.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other. The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
In the prior art, the speech synthesis system supports editing with a variety of marks, including pause marks, continuous-reading marks, stress marks, polyphonic-character marks, number-reading marks, English-reading marks, alias marks, local speed marks, and local volume marks. When using the speech synthesis system, a user can personalize the marked content according to actual needs. For example, in the phrase "蹄疾步稳推进" ("advance with swift hooves and steady steps"), the trailing "稳推进" can also be read as an independent phrase, so the machine cannot accurately judge the prosody of the expression during speech synthesis, and word segmentation and phrase breaking go wrong: the synthesized "蹄疾步|稳推进" places a slight pause between "步" and "稳", which is inconsistent with the intended grouping "蹄疾步稳|推进". In this case the user can manually add a continuous-reading mark so that "蹄疾步稳" is read as one continuous unit. After this manual intervention the synthesized speech pauses slightly between "蹄疾步稳" and "推进", meeting the user's requirement. However, when the same expression appears later in the same synthesis task, or in a subsequent task, the same problem recurs and the user must again intervene manually and add the same marks. Such operations impose significant labor costs on the user.
In a first embodiment of the present invention, as shown in FIG. 1, an audio generation method based on a speech synthesis system mark memory bank comprises the following steps:
Step S1: acquiring a text to be retrieved;
Step S2: retrieving the text to be retrieved against a pre-configured memory-bank text to obtain the marked text in the text to be retrieved that matches the memory-bank text;
Step S3: taking the marking information of the marked text in the memory-bank text as the marking information of the marked text in the text to be retrieved;
Step S4: generating a corresponding audio file based on the marked text with its marking information;
Step S5: determining, based on an interaction process with the user, whether the audio file meets the user's requirements.
The method provided in this embodiment will be described in detail below with reference to FIG. 1 and FIG. 2.
Step S1: acquiring a text to be retrieved.
In this embodiment, the text to be retrieved may be obtained directly by copying, importing, or similar means, or may be manually edited and input. The text to be retrieved may include Chinese characters, English characters, punctuation marks, numeric characters, or any other character information that can exist in text form, which is not limited here.
Step S2: retrieving the text to be retrieved against the pre-configured memory-bank text, so as to obtain the marked text in the text to be retrieved that matches the memory-bank text.
In this embodiment, corresponding marking information may be set in advance for part of the text in the memory-bank text.
Specifically, the configuration process for the memory bank may include performing at least one of adding, editing, and deleting operations on the marking information in the memory-bank text.
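As a sketch only, the configuration operations named above (add, edit, delete) might look as follows; the class name MemoryBankConfig and the phrase-keyed dictionary of (mark_kind, mark_value) tuples are illustrative assumptions, not structures defined by the disclosure.

```python
class MemoryBankConfig:
    """Illustrative add/edit/delete operations on memory-bank mark entries."""

    def __init__(self):
        self.entries = {}  # phrase -> list of (mark_kind, mark_value) tuples

    def add(self, phrase, mark):
        self.entries.setdefault(phrase, []).append(mark)

    def edit(self, phrase, index, new_mark):
        self.entries[phrase][index] = new_mark

    def delete(self, phrase, mark=None):
        if mark is None:
            self.entries.pop(phrase, None)     # remove the whole entry
        elif mark in self.entries.get(phrase, []):
            self.entries[phrase].remove(mark)  # remove a single mark

bank = MemoryBankConfig()
bank.add("蹄疾步稳", ("continuous-reading", ""))
bank.edit("蹄疾步稳", 0, ("continuous-reading", "strict"))
```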
In this embodiment, the marked text in the text to be retrieved that matches the memory-bank text may be the text information that the text to be retrieved and the memory-bank text have in common.
Step S3: taking the marking information of the marked text in the memory-bank text as the marking information of the marked text in the text to be retrieved.
That is, when text information that overlaps or matches the memory-bank text is found in the text to be retrieved, the marking information of the corresponding memory-bank text can be applied directly to the corresponding text information in the text to be retrieved.
Step S4: generating a corresponding audio file based on the marked text with its marking information.
It will be appreciated that the audio file generated based on the marked text with its marking information is an audio file generated from the marked text that realizes the continuous-reading and/or pause effects (the marking information) during synthesis.
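The disclosure does not name a markup format for the synthesis engine. Purely as an assumption, the marks could be rendered as SSML-like tags before synthesis; the mapping of mark types onto the break, s, and sub tags below is illustrative only, and the spans are assumed sorted and non-overlapping as produced by transfer_marks above.

```python
def to_ssml(text, spans):
    """Render the marked text in SSML-like form for the TTS engine."""
    out, cursor = [], 0
    for start, end, marks in spans:
        out.append(text[cursor:start])
        segment = text[start:end]
        for kind, value in marks:
            if kind == "pause":
                segment += f'<break time="{value or "200ms"}"/>'
            elif kind == "continuous-reading":
                segment = f"<s>{segment}</s>"  # one prosodic unit, no internal break
            elif kind == "alias":
                segment = f'<sub alias="{value}">{segment}</sub>'
        out.append(segment)
        cursor = end
    out.append(text[cursor:])
    return "<speak>" + "".join(out) + "</speak>"
```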
Step S5: determining, based on an interaction process with the user, whether the audio file meets the user's requirements.
In one embodiment, determining whether the audio file meets the user's requirements based on an interaction process with the user comprises: outputting the current audio file in response to the user's confirmation of the audio file.
In one embodiment, determining whether the audio file meets the user's requirements based on an interaction process with the user comprises: reconfiguring, in response to the user's negative feedback on the audio file, the marked text corresponding to the audio file; and synthesizing the reconfigured marked text into an audio file for further interaction with the user.
Specifically, reconfiguring the marked text corresponding to the audio file in response to the user's negative feedback comprises: reconfiguring the marking information in the marked text corresponding to the audio file.
Illustratively, reconfiguring the marking information in the marked text corresponding to the audio file in response to the user's negative feedback comprises: performing at least one of adding, deleting, and modifying operations on the marking information in the marked text corresponding to the audio file.
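Putting step S5 together with the earlier sketches, the interaction can be viewed as a confirm-or-revise loop. Here confirm and revise are placeholders for the UI actions described above (confirmation, or add/delete/modify of marks); their signatures are assumptions, and the helpers transfer_marks and to_ssml are the ones sketched earlier.

```python
def interactive_loop(text, entries, confirm, revise):
    """Resynthesize until the user confirms the audio (step S5)."""
    spans = transfer_marks(text, entries)   # steps S2 + S3
    while True:
        audio = to_ssml(text, spans)        # step S4: synthesize from marked text
        if confirm(audio):                  # user accepts: output this file
            return audio
        spans = revise(spans)               # user rejects: reconfigure the marks

# Example run that auto-accepts on the first pass.
result = interactive_loop("蹄疾步稳推进", {"蹄疾步稳": [("continuous-reading", "")]},
                          confirm=lambda a: True, revise=lambda s: s)
```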
In this embodiment, after a user of the speech synthesis system finishes marking text content according to actual needs, the marks can be added to the memory bank. When the user performs a speech synthesis task again, the system automatically searches the text content, and if text matching the memory bank appears, the same marks from the memory bank are added automatically, so that a whole class of problems is solved once. The categories of speech synthesis tasks in a given user's workplace are roughly consistent, for example politics, culture, or military affairs, and the problems encountered are also roughly the same. The design and implementation of the mark memory bank function can therefore effectively reduce the user's workload, increase the degree of automation of the system, and improve the user's perception of the product.
That is, the present embodiment has at least the following advantages:
the system implements a memory-bank function that automatically retrieves the text content to be synthesized; once the retrieval result matches the memory-bank content, the text-marking function and scheme stored in the memory bank are invoked, reproducing the speech synthesis effect recorded in the memory bank and sparing the user from repeatedly adding the same marks manually. The user's speech synthesis workload is greatly reduced.
A second embodiment of the present invention, corresponding to the first embodiment, presents an application example of the audio generation method based on the speech synthesis system mark memory bank provided in the first embodiment.
In this embodiment, after logging into the system successfully, the user clicks the [Real-time Synthesis] function menu to enter the real-time synthesis page, edits the text content, and then clicks the [Audition] button to listen to the synthesized audio, as shown in FIG. 3. It should be understood that the system mentioned in this embodiment is a system implementing the method provided in the first embodiment; it may be implemented on a computer in software form, and related designs such as the page appearance are merely exemplary in this embodiment and are not intended to limit the scope of the present invention.
After auditioning the audio file, the user finds, for example, a slight pause between "步" and "稳" in "蹄疾步|稳推进", which does not fit the intended reading. The user can slide to select the text "蹄疾步稳" in the system and click the [Continuous Reading] function button. A continuous-reading mark is then added to "蹄疾步稳", as shown in FIG. 4, and on auditioning again the result meets the user's requirements.
Further, the user slides to select the phrase "蹄疾步稳推进" carrying the continuous-reading mark, clicks the [Memory] function button, and adds it to the memory bank, as shown in FIG. 5.
Whenever the phrase "蹄疾步稳推进" appears in a new synthesis task, the system automatically retrieves the matching memory entry and automatically invokes the memory-bank text-marking scheme to add the continuous-reading mark to "蹄疾步稳", as shown in FIG. 6.
In addition, the memory-bank management function supports re-editing of stored entries, as shown in FIG. 7.
Specifically, clicking the [Edit] button on the memory-bank management page reopens the detailed editing page, where the marking scheme stored in the memory bank can be modified, as shown in FIG. 8.
A third embodiment of the present invention, corresponding to the first embodiment, presents an audio generation device based on a speech synthesis system mark memory bank, as shown in FIG. 9, comprising the following components:
an acquisition unit configured to acquire a text to be retrieved;
a retrieval unit configured to retrieve the text to be retrieved against the pre-configured memory-bank text, so as to obtain the marked text in the text to be retrieved that matches the memory-bank text;
a calling unit configured to take the marking information of the marked text in the memory-bank text as the marking information of the marked text in the text to be retrieved;
an audio synthesis unit configured to generate a corresponding audio file based on the marked text with its marking information; and
an interaction unit configured to determine, based on an interaction process with the user, whether the audio file meets the user's requirements.
In one embodiment, the device further comprises a configuration module configured to set corresponding marking information for part of the text in the memory-bank text.
In one embodiment, the configuration module is further configured to perform at least one of adding, editing, and deleting operations on the marking information in the memory-bank text.
In one embodiment, the interaction unit is further configured to output the current audio file in response to the user's confirmation of the audio file.
In one embodiment, the interaction unit is further configured to: reconfigure, in response to the user's negative feedback on the audio file, the marked text corresponding to the audio file; and synthesize the reconfigured marked text into an audio file for further interaction with the user.
In one embodiment, the interaction unit is further configured to reconfigure, in response to the user's negative feedback on the audio file, the marking information in the marked text corresponding to the audio file.
In one embodiment, the interaction unit is further configured to perform, in response to the user's negative feedback on the audio file, at least one of adding, deleting, and modifying operations on the marking information in the marked text corresponding to the audio file.
A fourth embodiment of the present invention, as shown in FIG. 10, can be understood as a physical device comprising a processor and a memory storing processor-executable instructions which, when executed by the processor, perform the following operations:
Step S1: acquiring a text to be retrieved;
Step S2: retrieving the text to be retrieved against a pre-configured memory-bank text to obtain the marked text in the text to be retrieved that matches the memory-bank text;
Step S3: taking the marking information of the marked text in the memory-bank text as the marking information of the marked text in the text to be retrieved;
Step S4: generating a corresponding audio file based on the marked text with its marking information;
Step S5: determining, based on an interaction process with the user, whether the audio file meets the user's requirements.
In a fifth embodiment of the present invention, the flow of the audio generation method based on the speech synthesis system mark memory bank is the same as in the first, second, or third embodiment. The difference is that, in engineering terms, this embodiment may be implemented by means of software plus a necessary general-purpose hardware platform, or of course by hardware, although in many cases the former is the preferred implementation. Based on this understanding, the method of the present invention may be embodied in the form of a computer software product stored on a storage medium (e.g., ROM/RAM, magnetic disk, or optical disk) and comprising instructions for causing a device to perform the method of the embodiments of the present invention.
While the invention has been described in connection with specific embodiments and drawings thereof, it is to be understood that the invention is not limited thereto, and that modifications falling within the spirit and scope of the invention are intended to be included.

Claims (10)

1. An audio generation method based on a speech synthesis system mark memory bank, characterized by comprising the following steps:
acquiring a text to be retrieved;
retrieving the text to be retrieved against a pre-configured memory-bank text to obtain the marked text in the text to be retrieved that matches the memory-bank text;
taking the marking information of the marked text in the memory-bank text as the marking information of the marked text in the text to be retrieved;
generating a corresponding audio file based on the marked text with its marking information; and
determining, based on an interaction process with the user, whether the audio file meets the user's requirements.
2. The audio generation method based on a speech synthesis system mark memory bank of claim 1, further comprising:
setting corresponding marking information for part of the text in the memory-bank text.
3. The audio generation method based on a speech synthesis system mark memory bank of claim 2, further comprising:
performing at least one of adding, editing, and deleting operations on the marking information in the memory-bank text.
4. The audio generation method based on a speech synthesis system mark memory bank of claim 1, wherein the determining, based on the interaction process with the user, whether the audio file meets the user's requirements comprises:
outputting the current audio file in response to the user's confirmation of the audio file.
5. The audio generation method based on a speech synthesis system mark memory bank of claim 1, wherein the determining, based on the interaction process with the user, whether the audio file meets the user's requirements comprises:
reconfiguring, in response to the user's negative feedback on the audio file, the marked text corresponding to the audio file; and
synthesizing the reconfigured marked text into an audio file for further interaction with the user.
6. The audio generation method based on a speech synthesis system mark memory bank of claim 5, wherein the reconfiguring, in response to the user's negative feedback on the audio file, the marked text corresponding to the audio file comprises:
reconfiguring, in response to the user's negative feedback on the audio file, the marking information in the marked text corresponding to the audio file.
7. The audio generation method based on a speech synthesis system mark memory bank of claim 6, wherein the reconfiguring, in response to the user's negative feedback on the audio file, the marking information in the marked text corresponding to the audio file comprises:
performing, in response to the user's negative feedback on the audio file, at least one of adding, deleting, and modifying operations on the marking information in the marked text corresponding to the audio file.
8. An audio generation device based on a speech synthesis system mark memory bank, characterized by comprising:
an acquisition unit configured to acquire a text to be retrieved;
a retrieval unit configured to retrieve the text to be retrieved against a pre-configured memory-bank text, so as to obtain the marked text in the text to be retrieved that matches the memory-bank text;
a calling unit configured to take the marking information of the marked text in the memory-bank text as the marking information of the marked text in the text to be retrieved;
an audio synthesis unit configured to generate a corresponding audio file based on the marked text with its marking information; and
an interaction unit configured to determine, based on an interaction process with the user, whether the audio file meets the user's requirements.
9. An electronic device, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the audio generation method based on a speech synthesis system mark memory bank of any one of claims 1 to 7.
10. A computer storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the audio generation method based on a speech synthesis system mark memory bank of any one of claims 1 to 7.
CN202310322513.3A 2023-03-30 2023-03-30 Voice synthesis system mark memory library-based audio generation method and device Pending CN116092477A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310322513.3A CN116092477A (en) 2023-03-30 2023-03-30 Voice synthesis system mark memory library-based audio generation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310322513.3A CN116092477A (en) 2023-03-30 2023-03-30 Voice synthesis system mark memory library-based audio generation method and device

Publications (1)

Publication Number Publication Date
CN116092477A true CN116092477A (en) 2023-05-09

Family

ID=86204824

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310322513.3A Pending CN116092477A (en) 2023-03-30 2023-03-30 Voice synthesis system mark memory library-based audio generation method and device

Country Status (1)

Country Link
CN (1) CN116092477A (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160147853A1 (en) * 2013-06-23 2016-05-26 T-Jat Systems (2006) Ltd. Method and system for consolidating data retrieved from different sources
US20170004821A1 (en) * 2014-10-30 2017-01-05 Kabushiki Kaisha Toshiba Voice synthesizer, voice synthesis method, and computer program product
US10140973B1 (en) * 2016-09-15 2018-11-27 Amazon Technologies, Inc. Text-to-speech processing using previously speech processed data
US20210110811A1 (en) * 2019-10-11 2021-04-15 Samsung Electronics Company, Ltd. Automatically generating speech markup language tags for text
CN113516963A (en) * 2020-04-09 2021-10-19 菜鸟智能物流控股有限公司 Audio data generation method and device, server and intelligent loudspeaker box
CN112634858A (en) * 2020-12-16 2021-04-09 平安科技(深圳)有限公司 Speech synthesis method, speech synthesis device, computer equipment and storage medium
CN114999438A (en) * 2021-05-08 2022-09-02 中移互联网有限公司 Audio playing method and device
CN114863906A (en) * 2022-07-07 2022-08-05 北京中电慧声科技有限公司 Method and device for marking alias of text-to-speech processing

Similar Documents

Publication Publication Date Title
US7788590B2 (en) Lightweight reference user interface
JP4651613B2 (en) Voice activated message input method and apparatus using multimedia and text editor
CN109597976B (en) Document editing method and device
CN107798123B (en) Knowledge base and establishing, modifying and intelligent question and answer methods, devices and equipment thereof
JP6165913B1 (en) Information processing apparatus, information processing method, and program
US20150024351A1 (en) System and Method for the Relevance-Based Categorizing and Near-Time Learning of Words
US11295069B2 (en) Speech to text enhanced media editing
WO2006046523A1 (en) Document analysis system and document adaptation system
CN111142667A (en) System and method for generating voice based on text mark
EP2682931B1 (en) Method and apparatus for recording and playing user voice in mobile terminal
JP4094777B2 (en) Image communication system
JP2009140466A (en) Method and system for providing conversation dictionary services based on user created dialog data
CN112084756A (en) Conference file generation method and device and electronic equipment
US20240169972A1 (en) Synchronization method and apparatus for audio and text, device, and medium
CN102323858B (en) Identify the input method of modification item in input, terminal and system
US11119727B1 (en) Digital tutorial generation system
CN110297965B (en) Courseware page display and page set construction method, device, equipment and medium
KR102643902B1 (en) Apparatus for managing minutes and method thereof
KR20000024318A (en) The TTS(text-to-speech) system and the service method of TTS through internet
CN116092477A (en) Voice synthesis system mark memory library-based audio generation method and device
JP2005173999A (en) Device, system and method for searching electronic file, program, and recording media
CN114841178A (en) Method, device, electronic equipment and storage medium for realizing session translation
CN112578965A (en) Processing method and device and electronic equipment
JP2002073662A (en) Information presenting device and recording medium with information presenting program recorded thereon
KR20210132115A (en) Editing Support Programs, Editing Support Methods, and Editing Support Devices

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination