CN116959452A - Visual adjustment method, device, equipment, medium and product for synthesized audio - Google Patents


Info

Publication number
CN116959452A
CN116959452A (application CN202310970213.6A)
Authority
CN
China
Prior art keywords
character
audio
adjustment
adjusted
tone
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310970213.6A
Other languages
Chinese (zh)
Inventor
段志毅
戴世昌
范志强
周文君
翁超
李广之
卞衍尧
张桥
杜念冬
欧阳才晟
甄帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202310970213.6A priority Critical patent/CN116959452A/en
Publication of CN116959452A publication Critical patent/CN116959452A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16 Sound input; Sound output
    • G06F3/165 Management of the audio stream, e.g. setting of volume, audio stream path
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16 Sound input; Sound output
    • G06F3/167 Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The application discloses a visual adjustment method, device, equipment, medium and product for synthesized audio, belonging to the technical field of audio. The method comprises the following steps: displaying an audio editing area, where the synthesized audio is obtained by converting text content, the audio editing area includes adjustment controls corresponding to the characters in the text content, and an adjustment control is used to adjust the audio attribute corresponding to its character; in response to an adjustment operation on the adjustment control corresponding to a character, displaying the adjusted audio attribute, which is the audio attribute obtained after adjusting the audio attribute parameter value corresponding to the character; and displaying the updated synthesized audio based on the adjusted audio attribute. By manipulating the visual adjustment control corresponding to each character, adjustment of the synthesized audio is realized, the adjustment operation steps are simplified, and the difficulty of producing high-quality synthesized audio is reduced.

Description

Visual adjustment method, device, equipment, medium and product for synthesized audio
Technical Field
The embodiment of the application relates to the technical field of audio, in particular to a visual adjustment method, device, equipment, medium and product for synthesized audio.
Background
Text-to-speech (TTS) technology refers to technology that converts text content into audio for output. When a user plays synthesized audio converted from text content, in order to meet the user's listening requirements, the output attributes of the characters in the synthesized audio can be adjusted during generation of the synthesized audio, so as to achieve a personalized playback effect.
In the related art, synthesized audio is adjusted by adding Speech Synthesis Markup Language (SSML) to the text content. For example, to adjust the speech rate of the synthesized audio, markup such as <speech rate="200"> is added to the text content, and during conversion of the text content into synthesized audio, the computer device adjusts the speech rate of the synthesized audio based on the SSML.
However, the adjustment method in the related art relies on an experienced operator to add SSML to the text content. Because SSML is difficult to understand, adjusting the synthesized audio is difficult and cumbersome to operate.
Disclosure of Invention
The application provides a visual adjustment method, device, equipment, medium and product for synthesized audio. The technical solution is as follows:
according to an aspect of the present application, there is provided a visual adjustment method of synthesized audio, the method including:
Displaying an audio editing area, wherein the synthesized audio is obtained by converting text content, the audio editing area comprises an adjustment control corresponding to characters in the text content, and the adjustment control is used for adjusting audio attributes corresponding to the characters;
responding to the adjustment operation on the adjustment control corresponding to the character, and displaying an adjustment audio attribute, wherein the adjustment audio attribute is an audio attribute obtained after adjusting the audio attribute parameter value corresponding to the character;
and displaying the updated new synthesized audio based on the adjusted audio attribute.
According to an aspect of the present application, there is provided a visual adjustment apparatus for synthesizing audio, the apparatus comprising:
the display module is used for displaying an audio editing area, the synthesized audio is obtained by converting text content, the audio editing area comprises an adjustment control corresponding to characters in the text content, and the adjustment control is used for adjusting audio attributes corresponding to the characters;
the display module is used for responding to the adjustment operation on the adjustment control corresponding to the character and displaying an adjustment audio attribute, wherein the adjustment audio attribute is an audio attribute obtained after adjusting the audio attribute parameter value corresponding to the character;
And the display module is used for displaying the updated new synthesized audio based on the adjusted audio attribute.
According to another aspect of the present application, there is provided a computer apparatus comprising: a processor and a memory, the memory storing at least one computer program, the at least one computer program being loaded and executed by the processor to implement the method of visual adjustment of synthesized audio as described in the above aspect.
According to another aspect of the present application, there is provided a computer storage medium having stored therein at least one computer program loaded and executed by a processor to implement the method of visual adjustment of synthesized audio as described in the above aspect.
According to another aspect of the present application, there is provided a computer program product comprising a computer program stored in a computer readable storage medium; the computer program is read from the computer-readable storage medium and executed by a processor of a computer device, so that the computer device performs the visual adjustment method of synthesized audio as described in the above aspect.
The technical solution provided by the application has at least the following beneficial effects:
displaying an audio editing area; in response to an adjustment operation on the adjustment control corresponding to a character, displaying the adjusted audio attribute obtained after adjusting the audio attribute parameter value corresponding to the character; and displaying the updated synthesized audio based on the adjusted audio attribute. In the application, manipulating the visual adjustment control corresponding to each character realizes adjustment of the synthesized audio, simplifies the adjustment operation steps, and reduces the difficulty of producing high-quality synthesized audio.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for describing the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; for a person skilled in the art, other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a schematic diagram of a method for visual adjustment of synthesized audio according to an exemplary embodiment of the present application;
FIG. 2 is a schematic diagram of the architecture of a computer system provided by an exemplary embodiment of the present application;
FIG. 3 is a flow chart of a method for visual adjustment of synthesized audio provided by an exemplary embodiment of the present application;
FIG. 4 is a flow chart of a method for visual adjustment of synthesized audio provided by an exemplary embodiment of the present application;
FIG. 5 is a schematic diagram of a content input area provided by an exemplary embodiment of the present application;
FIG. 6 is a schematic illustration of an audio editing area provided by an exemplary embodiment of the present application;
FIG. 7 is a schematic diagram of a manner of selecting a plurality of characters provided by an exemplary embodiment of the present application;
FIG. 8 is a schematic diagram of a tone adjustment region provided by an exemplary embodiment of the present application;
FIG. 9 is a schematic diagram of a pause interval adjustment area provided by an exemplary embodiment of the application;
FIG. 10 is a schematic diagram of a mood adjustment area provided in accordance with an exemplary embodiment of the present application;
FIG. 11 is a schematic diagram of a text adjustment area provided by an exemplary embodiment of the present application;
FIG. 12 is a schematic diagram of a pronunciation adjustment window provided by an exemplary embodiment of the present application;
FIG. 13 is a schematic diagram of a character duration adjustment area provided by an exemplary embodiment of the present application;
FIG. 14 is a schematic diagram of an audio preview region provided by an exemplary embodiment of the present application;
FIG. 15 is a flowchart of a method for visual adjustment of synthesized audio provided by an exemplary embodiment of the present application;
FIG. 16 is a block diagram of a visual adjustment apparatus for synthesized audio provided by an exemplary embodiment of the present application;
FIG. 17 is a schematic diagram of a computer device provided in an exemplary embodiment of the application;
FIG. 18 is a schematic diagram of a computer device according to an exemplary embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings. Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the application. Rather, they are merely examples of apparatus and methods consistent with aspects of the application as detailed in the accompanying claims.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used in this disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in this disclosure to describe various information, these information should not be limited to these terms. These terms are only used to distinguish one type of information from another.
First, several terms involved in the embodiments of the application are briefly introduced:
Speech Synthesis Markup Language (SSML) is an XML-based markup language that can be used to fine-tune the output attributes of text-to-speech, such as tone, pronunciation, speech rate, volume, pause interval, mood, text content, and character duration.
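As a concrete illustration, such markup can be generated programmatically. The sketch below is a minimal example assuming the standard W3C SSML `<speak>`, `<prosody>` and `<break>` elements; the function name `to_ssml` and its parameters are illustrative, not taken from the patent.

```python
from xml.sax.saxutils import escape

def to_ssml(text, rate="medium", pitch="+0st", pause_ms=None):
    """Wrap `text` in SSML that fine-tunes speech rate and pitch,
    optionally followed by a pause of `pause_ms` milliseconds."""
    body = f'<prosody rate="{rate}" pitch="{pitch}">{escape(text)}</prosody>'
    if pause_ms is not None:
        body += f'<break time="{pause_ms}ms"/>'
    return f"<speak>{body}</speak>"

# e.g. speed up to 200% and raise the pitch by two semitones
print(to_ssml("Nice to meet you!", rate="200%", pitch="+2st", pause_ms=300))
```

Writing such markup by hand is exactly the expert step the visual controls described below are meant to replace.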
The embodiment of the application provides a visual adjustment method for synthesized audio, shown schematically in FIG. 1. The method can be executed by a computer device, and the computer device can be a terminal or a server.
Illustratively, the content input area 10, the audio editing area 20, and the audio preview area 30 are displayed in the computer device.
The content input area 10 is used for displaying text content to be converted.
Optionally, text content for conversion is included in the content input area 10.
The audio editing area 20 is used to edit audio attributes of the synthesized audio. The synthesized audio is synthesized based on the text content.
Optionally, the audio attribute includes at least one of timbre, audio emotion, audio speech rate, tone, pause interval between characters, and character duration, but is not limited thereto; the embodiment of the present application is not particularly limited in this respect.
The audio preview area 30 is used to display the overall playback effect of the multiple audio segments, spliced by role or in sequence, after the text content is converted into audio.
Illustratively, the audio editing area 20 is displayed; the computer device displays the adjusted audio attribute in response to an adjustment operation on the adjustment control corresponding to a character, and updates the synthesized audio based on the adjusted audio attribute.
The synthesized audio is audio obtained by converting text contents.
The audio editing area includes adjustment controls corresponding to the characters in the text content, i.e., each character in the text content corresponds to one visual adjustment control, and the adjustment control is used to adjust the audio attribute corresponding to the character.
The adjusted audio attribute is the audio attribute obtained after adjusting the audio attribute parameter value corresponding to the character.
Optionally, the form of the adjustment control includes at least one of a slider control, an input-box control and a selection control, but is not limited thereto; the embodiment of the present application is not particularly limited in this respect.
For example, audio speech rate and tone correspond to slider controls, i.e., the tone is adjusted by sliding a slider up and down or left and right; the pause interval between characters and the character duration correspond to input-box controls, i.e., they are adjusted by entering a value into the input box or modifying the value in the input box; timbre and audio emotion correspond to selection controls, and are adjusted by triggering the selection control of the corresponding option. This is not limiting; the embodiment of the application is merely illustrative.
Optionally, the timbre includes at least one of a loli voice, a young girl voice, a mature female voice, an elderly female voice, a queen voice, a young boy voice, a teenager voice, a young man voice, an uncle voice, and an elderly man voice, but the embodiment of the application is not limited thereto.
Optionally, the audio emotion includes at least one of entertainment, pleasure, beauty, relaxation, sadness, fantasy, victory, anxiety, fear, vexation, resistance, and excitement, but is not limited thereto, and the embodiment of the present application is not particularly limited thereto.
Optionally, the adjustment operation includes at least one of sliding the adjustment control, modifying a parameter value of the adjustment control, and inputting a parameter value of the adjustment control, but is not limited thereto, and the embodiment of the present application is not particularly limited thereto.
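A hedged sketch of how such per-character controls might be modeled in code (all field names, defaults, and units here are illustrative assumptions, not taken from the patent): slider controls map to numeric fields, input-box controls to user-entered values, and selection controls to enumerated choices.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CharacterAdjustment:
    """Audio-attribute parameter values attached to one character."""
    char: str
    tone: int = 0                      # slider control: relative tone offset
    speech_rate: float = 1.0           # slider control: playback-rate multiplier
    pause_after_ms: int = 0            # input-box control: pause interval after the character
    duration_ms: Optional[int] = None  # input-box control: character duration
    timbre: str = "default"            # selection control
    emotion: str = "neutral"           # selection control

# An adjustment operation on one character's control updates only that record.
adj = CharacterAdjustment("你", tone=3, pause_after_ms=200)
print(adj.tone, adj.pause_after_ms)
```

Keeping one such record per character is what lets the interface regenerate the synthesized audio from the full set of parameter values after any single adjustment.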
Illustratively, the computer device determines the characters to be adjusted in response to a selection operation on characters in the text content, and displays the adjusted audio attribute in response to an adjustment operation on the adjustment control corresponding to a character to be adjusted.
Optionally, the selection operation of the character includes selection of a single character, selection of a plurality of characters, and selection of all characters.
Optionally, the selection manner of a single character includes at least one of single-clicking, double-clicking, triple-clicking, sliding over, and circling the character, but is not limited thereto; the embodiment of the present application is not particularly limited in this respect.
Optionally, the selection of a plurality of characters includes at least one of the following manners, but is not limited thereto:
selecting a start character and an end character, the characters between the start character and the end character being automatically determined as the characters to be adjusted;
selecting a start character and an end character, then selecting excluded characters between them; the characters between the start character and the end character, other than the excluded characters, are automatically determined as the characters to be adjusted, where an excluded character refers to a character that does not need to be adjusted;
performing a sliding operation in the text content, the characters covered by the sliding operation being determined as the characters to be adjusted;
performing a circling operation in the text content, the characters inside the circle drawn by the circling operation being determined as the characters to be adjusted.
In some embodiments, the computer device determines a single character as the character to be adjusted in response to a selection operation on that character in the text content, and displays the adjusted audio attribute corresponding to the single character upon an adjustment operation on the adjustment control corresponding to it; that is, for a single character, the audio attribute corresponding to that character is adjusted through its own adjustment control.
For example, the text content is "Hello, nice to meet you!", where the tone value corresponding to the character "you" is 123; tone adjustment of this single character can be achieved by sliding the adjustment control corresponding to "you" or by modifying the tone value corresponding to "you".
In some embodiments, in response to selection of a start character and an end character in the text content, the computer device determines the characters between the start character and the end character as the characters to be adjusted; in response to an adjustment operation on the adjustment control corresponding to a first character between the start character and the end character, the computer device displays the audio attribute of each second character between the start character and the end character changing in linkage with the audio attribute of the first character.
The first character refers to any one character between the start character and the end character.
The second characters refer to the characters between the start character and the end character other than the first character.
The manner in which the audio attribute changes in linkage includes at least one of the following, but is not limited thereto:
the magnitude of change in the audio attribute of the second character is the same as the magnitude of change in the audio attribute of the first character, i.e., however much the audio attribute of the first character changes, the audio attribute of the second character changes by the same amount; for example, if the audio attribute of the first character increases by 50, the audio attribute of the second character likewise increases by 50;
the magnitude of change in the audio attribute of the second character decreases correspondingly, relative to the magnitude of change of the first character, as the distance value between the second character and the first character increases, i.e., the farther from the first character, the smaller the corresponding change; for example, the audio attribute of the first character increases by 50, and the audio attribute of a second character several character intervals away from the first character increases by 10;
the magnitude of change in the audio attribute of the second character decreases arithmetically as the distance value between the second character and the first character increases, i.e., the farther from the first character, the smaller the change, decreasing by a fixed common difference; for example, the audio attribute of the first character increases by 100 with a common difference of 10, and the audio attribute of a second character separated from the first character by 3 character intervals increases by 60;
the magnitude of change in the audio attribute of the second character decreases geometrically as the distance value between the second character and the first character increases, i.e., the farther from the first character, the smaller the change, decreasing by a fixed common ratio; for example, the audio attribute of the first character increases by 100 with a common ratio of 2, and the audio attribute of a second character separated from the first character by 1 character interval increases by 25;
the magnitude of change in the audio attribute of the second character increases as the distance value between the second character and the first character increases, i.e., the farther from the first character, the greater the corresponding change; for example, the audio attribute of the first character increases by 50, and the audio attribute of a second character several character intervals away from the first character increases by 100;
the magnitude of change in the audio attribute of the second character increases arithmetically as the distance value between the second character and the first character increases, i.e., the farther from the first character, the greater the change, increasing by a fixed common difference; for example, the audio attribute of the first character increases by 100 with a common difference of 10, and the audio attribute of a second character separated from the first character by 3 character intervals increases by 140;
the magnitude of change in the audio attribute of the second character increases geometrically as the distance value between the second character and the first character increases, i.e., the farther from the first character, the greater the change, increasing by a fixed common ratio; for example, the audio attribute of the first character increases by 100 with a common ratio of 2, and the audio attribute of a second character separated from the first character by 1 character interval increases by 400.
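The linkage rules above can be sketched as a single function. This is a hypothetical illustration: the patent gives only numeric examples, so the distance convention used here (a character separated by k intervening characters is taken to be at distance k+1, which reproduces those examples) and the mode names are assumptions.

```python
def linked_delta(first_delta, distance, mode="uniform", diff=10, ratio=2.0):
    """Change applied to a second character `distance` positions away
    from the adjusted first character, under one of the linkage modes."""
    if mode == "uniform":              # same change for every character
        return first_delta
    if mode == "arithmetic_decrease":  # smaller by `diff` per position
        return first_delta - diff * distance
    if mode == "geometric_decrease":   # divided by `ratio` per position
        return first_delta / (ratio ** distance)
    if mode == "arithmetic_increase":  # larger by `diff` per position
        return first_delta + diff * distance
    if mode == "geometric_increase":   # multiplied by `ratio` per position
        return first_delta * (ratio ** distance)
    raise ValueError(f"unknown linkage mode: {mode}")

# 3 intervening characters -> distance 4: 100 - 10*4 = 60, as in the example
print(linked_delta(100, 4, "arithmetic_decrease", diff=10))
```

Applying this function to every second character in the selected range yields the linked attribute updates in one pass.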
In summary, the method provided in this embodiment displays an audio editing area; in response to an adjustment operation on the adjustment control corresponding to a character, displays the adjusted audio attribute obtained after adjusting the audio attribute parameter value corresponding to the character; and displays the updated synthesized audio based on the adjusted audio attribute. By manipulating the visual adjustment control corresponding to each character, the application realizes adjustment of the synthesized audio, simplifies the adjustment operation steps, and reduces the difficulty of producing high-quality synthesized audio.
FIG. 2 is a schematic diagram of a computer system according to an embodiment of the present application. The computer system may include: a terminal 100 and a server 200.
The terminal 100 may be an electronic device such as a mobile phone, tablet computer, vehicle-mounted terminal, wearable device, personal computer (PC), aircraft, or unmanned vending terminal. A target application program may be installed in the terminal 100; the target application may be an application dedicated to visual adjustment of synthesized audio, or another application provided with a visual adjustment function for synthesized audio, which is not limited in the present application. The present application also does not limit the form of the target application, which includes, but is not limited to, an application (App) installed in the terminal 100, an applet, a web page, and the like.
The server 200 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), and big data and artificial intelligence platforms. The server 200 may be a background server of the target application, configured to provide background services for clients of the target application.
Cloud technology (Cloud technology) refers to a hosting technology that unifies hardware, software, network and other resources in a wide area network or local area network to realize the computation, storage, processing and sharing of data. Cloud technology is a general term for the network technology, information technology, integration technology, management platform technology, application technology and the like applied on the basis of the cloud computing business model; it can form a resource pool that is used on demand, flexibly and conveniently. Cloud computing technology will become an important support: background services of technical network systems, such as video websites, picture websites and other portal websites, require a large amount of computing and storage resources. With the rapid development and application of the internet industry, each item may have its own identification mark in the future, which will need to be transmitted to a background system for logical processing; data of different levels will be processed separately, and all kinds of industry data require strong system backing, which can only be realized through cloud computing.
In some embodiments, the servers described above may also be implemented as nodes in a blockchain system. Blockchain (Blockchain) is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated in association using cryptographic methods, where each data block contains the information of a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product services layer, and an application services layer.
Communication between the terminal 100 and the server 200 may be performed through a network, such as a wired or wireless network.
In the visual adjustment method for synthesized audio provided by the embodiment of the application, the execution subject of each step may be a computer device, i.e., an electronic device with data computing, processing and storage capabilities. Taking the implementation environment shown in FIG. 2 as an example, the method may be performed by the terminal 100 (for example, by a client of the target application installed and running in the terminal 100), by the server 200, or by the terminal 100 and the server 200 in interactive cooperation, which is not limited in the present application.
Fig. 3 is a flowchart of a method for visual adjustment of synthesized audio according to an exemplary embodiment of the present application. The method may be performed by a computer device, which may be a terminal or a server. The method comprises the following steps:
step 302: and displaying the audio editing area.
The audio editing area is used for adjusting the audio attribute of the synthesized audio.
The synthesized audio is audio obtained by converting text content; alternatively, the synthesized audio is audio obtained by converting text content through an AI model.
It should be noted that the embodiments of the present application take the adjustment of synthesized audio as an example. Besides synthesized audio, the audio editing area may also be used to adjust the audio attributes of recorded audio, and the visual adjustment process for recorded audio may refer to the visual adjustment process for synthesized audio.
The recorded audio can be obtained directly from a database or obtained by recording through computer equipment.
The audio editing area comprises an adjustment control corresponding to characters in the text content, and the adjustment control is used for adjusting audio attributes corresponding to the characters.
Optionally, the audio attribute includes at least one of a tone color, an audio emotion, an audio speech rate, a tone, a pause interval between characters, and a character duration, but is not limited thereto, and the embodiment of the present application is not particularly limited thereto.
Wherein the manner of acquiring the synthesized audio includes at least one of:
1. the computer device receives synthesized audio, for example: the computer device receives synthesized audio synthesized by other devices.
2. The computer device retrieves synthesized audio from a stored database.
3. The computer device performs speech synthesis based on the input text content to obtain synthesized audio.
It should be noted that the above manner of obtaining the synthesized audio is merely an illustrative example, and the embodiments of the present application are not limited thereto.
Step 304: and responding to the adjustment operation on the adjustment control corresponding to the character, and displaying the adjustment audio attribute.
The audio editing area comprises adjustment controls corresponding to the characters in the text content, that is, each character in the text content corresponds to one visualized adjustment control, and the adjustment control is used for adjusting the audio attribute corresponding to the character.
The adjusted audio attribute is the audio attribute obtained after adjusting the audio attribute parameter value corresponding to the character.
Optionally, the form of the adjustment control includes at least one of a slide bar control, an input box control and a selection control, but is not limited thereto, and the embodiment of the present application is not limited thereto in particular.
For example, the adjustment controls for the audio speech rate and the tone are slide bar controls, that is, the tone is adjusted by sliding a slide bar up and down or left and right; the adjustment controls for the pause interval between characters and the character duration are input box controls, that is, these attributes are adjusted by entering a numerical value into the input box or modifying the numerical value in the input box; the adjustment controls for the timbre and the audio emotion are selection controls, that is, these attributes are adjusted by triggering the selection control corresponding to an option.
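The pairing of audio attributes with control forms described above can be summarized in a small sketch; the attribute keys and form names are illustrative assumptions, not terms from the original:

```python
# Illustrative mapping from each audio attribute to the form of its
# adjustment control, following the pairing described in the text.
CONTROL_FORMS = {
    "speech_rate": "slider",          # adjusted by sliding a slide bar
    "tone": "slider",
    "pause_interval": "input_box",    # adjusted by entering/modifying a value
    "character_duration": "input_box",
    "timbre": "selection",            # adjusted by triggering an option
    "audio_emotion": "selection",
}

def control_form(attribute):
    """Look up which kind of adjustment control an attribute uses."""
    return CONTROL_FORMS[attribute]
```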
Optionally, the adjustment operation includes at least one of sliding the adjustment control, modifying a parameter value of the adjustment control, and inputting a parameter value of the adjustment control, but is not limited thereto, and the embodiment of the present application is not particularly limited thereto.
Illustratively, the computer device obtains the adjusted audio attribute corresponding to the character based on the adjustment operation on the adjustment control corresponding to the character.
Step 306: and displaying the updated new synthesized audio based on the adjusted audio attribute.
The way to update the synthesized audio includes: based on the adjusted audio attribute, new synthesized audio is synthesized again, so that the synthesized audio is updated.
Illustratively, the computer device re-synthesizes the audio based on the adjusted audio attributes and the text content, thereby obtaining new synthesized audio.
In summary, the method provided in this embodiment displays the audio editing area; in response to an adjustment operation on the adjustment control corresponding to a character, displays the adjusted audio attribute obtained after adjusting the audio attribute parameter value corresponding to the character; and displays the updated new synthesized audio based on the adjusted audio attribute. In the present application, the adjustment control corresponding to each character is operated in a visual manner, thereby realizing the adjustment of the synthesized audio, simplifying the adjustment operation steps for synthesized audio, and reducing the difficulty of producing high-quality synthesized audio.
Fig. 4 is a flowchart of a method for visual adjustment of synthesized audio provided in an exemplary embodiment of the present application. The method may be performed by a computer device, which may be a terminal or a server. The method comprises the following steps:
step 402: and displaying the audio editing area.
The audio editing area is used for adjusting the audio attribute of the synthesized audio.
The synthesized audio is audio obtained by converting text content; alternatively, the synthesized audio is audio obtained by converting text content through an AI model.
The audio editing area comprises an adjustment control corresponding to the characters in the text content, and the adjustment control is used for adjusting the audio attribute corresponding to a character. By setting a corresponding adjustment control for each character in the text content, the audio attribute of a specific character can be adjusted in a targeted manner, so that the audio attributes of the adjusted synthesized audio change more flexibly and diversely, the audio rhythm of the adjusted synthesized audio is more natural, the expressed meaning is more accurate, and the listening effect and quality of AI speech-synthesized audio are greatly improved.
Optionally, the audio attribute includes at least one of a tone color, an audio emotion, an audio speech rate, a tone, a pause interval between characters, and a character duration, but is not limited thereto, and the embodiment of the present application is not particularly limited thereto.
In some embodiments, a content input area is also displayed in the computer device, the content input area being for displaying text content to be converted.
Illustratively, as shown in the schematic diagram of the content input area of fig. 5, the content input area includes a sequence adjustment button 501, a text input box 502, an audio track naming box 503, a copy button 504, a text mode button 505, and a delete dialog button 506.
The sequence adjustment button 501 is used to perform sequence adjustment when there are a plurality of content input areas. For example, when there are a plurality of content input areas corresponding to different audio tracks, the order of the content input areas can be adjusted through the sequence adjustment button 501 in each content input area. For example, the text contents in three content input areas as shown in fig. 1 are "Hello, very happy to meet you!", "Me too, we have met before", and "Is that so?", respectively; the user can adjust the order of the three content input areas by dragging the sequence adjustment button 501 in a content input area.
The text input box 502 is used to input the text content to be converted into synthesized audio; it supports a maximum of 500 characters and dynamically shows the character count and the character limit.
The copy button 504 is used to copy the text content in the text input box 502.
The track naming box 503 is used to set a corresponding audio track for the content of the text input box 502. It displays track A by default in a pull-down menu, supports direct text input (in the first version, the input is stored locally and becomes invalid after the page is refreshed), and displays tracks 1, 2, and 3 in the menu.
The text mode button 505 is a switch between the plain text mode and the SSML mode. It is turned off by default, showing the plain text state; if text with tags is to be entered, the text mode button 505 needs to be turned on first. If an SSML tag is entered in the plain text mode, it is processed according to English word recognition.
After the audio editing area adjusts the synthesized audio, the text content with the SSML added will be presented after the text mode button 505 is turned on.
For example, the text content is "I have a pair of shoes". When a 3 s pause is added between "a pair of" and "shoes", SSML is automatically added to the text content, and the text content with SSML added can be expressed as: "I have a pair of <break strength="weak" time="3s"/> shoes".
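The SSML tagging step just illustrated can be sketched in Python. The helper name and the insert-after-index convention are illustrative assumptions; the `<break>` attributes (`strength`, `time`) follow standard SSML:

```python
def insert_break(text, index, seconds, strength="weak"):
    """Insert an SSML <break> tag immediately after the character at
    position `index`, producing text with a pause of `seconds` seconds."""
    tag = f'<break strength="{strength}" time="{seconds:g}s"/>'
    return text[: index + 1] + tag + text[index + 1 :]

# Add a 3 s pause before "shoes" (index 16 is the space before it)
tagged = insert_break("I have a pair of shoes", 16, 3)
```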
Illustratively, as shown in the schematic diagram of the audio editing area in fig. 6, the converted synthesized audio, the adjustment controls corresponding to the characters in the synthesized audio, and the audio attributes are displayed in the audio editing area; if the text content has not been converted, the adjustment controls and the audio attributes are not displayed.
The audio editing area includes a top bar function section 601, an audio editing section 602, and a bottom function section 603.
The top bar function section 601 is used to adjust the audio properties of the piece of audio. The top bar function section 601 includes: at least one of tone color of the whole audio, tone emotion of the whole audio (0-100, customizable input), emotion degree of the whole audio (0-100, customizable input), and speech rate control of the whole audio (0-100, customizable input), but the embodiment of the application is not limited thereto.
The audio editing section 602 displays adjustment controls corresponding to characters in the text content, the adjustment controls being used to adjust audio properties corresponding to the characters. The audio attribute includes at least one of a tone, a mood, a pause interval, text content, and a duration, but is not limited thereto, and the embodiment of the present application is not particularly limited thereto.
The bottom functional partition 603 is used to control the play or synthesis functions of the synthesized audio. The bottom functional section 603 includes, in order from left to right, a composition/play button, a download button, a withdraw button, a redo button, a restore button, and a play progress bar button. The synthesis/play button is used to re-synthesize new synthesized audio or play synthesized audio or pause synthesized audio.
Step 404: and determining the character to be adjusted in response to a selection operation of the character in the text content.
The character to be adjusted refers to a character whose audio attribute needs to be adjusted.
In some embodiments, the selection operation of a character includes selection of a single character, selection of multiple characters, and selection of all characters.
Illustratively, the computer device determines the individual character as the character to be adjusted in response to a selection operation of the individual character in the text content.
Optionally, the selection manner of the single character includes at least one of a single click character, a double click character, a triple click character, a sliding character and a circle character, but is not limited thereto, and the embodiment of the present application is not particularly limited thereto.
The computer device determines, as characters to be adjusted, characters between the start character and the end character in response to selecting the start character and the end character in the text content.
Optionally, the selection manner of the plurality of characters includes at least one of the following manners, but is not limited thereto:
selecting a start character and an end character, and automatically determining the characters between the start character and the end character as the characters to be adjusted;
selecting a start character and an end character, selecting discarded characters between them, and automatically determining the characters between the start character and the end character other than the discarded characters as the characters to be adjusted, where a discarded character is a character that does not need to be adjusted. Illustratively, in the schematic diagram of the multi-character selection manner shown in fig. 7, the text content is: "The weather is really good today, unlike that day when it rained all the time". "The weather" is determined as the start character 701 by a single click, "today" is determined as the end character 702 by a single click, and "really" is determined as the discarded character 703 by a double click; the computer device then automatically determines the characters between the start character 701 and the end character 702 other than the discarded character 703 as the characters to be adjusted, that is, "The weather is good today";
Optionally, the manner of selecting the start character 701, the end character 702, and the discarded character 703 includes: mode one: the start character 701 and the end character 702 are selected by a single click, and the discarded character 703 is selected by a double click; mode two: the start character 701 is selected by a single click, the end character 702 by a double click, and the discarded character 703 by a triple click; mode three: the start character 701 and the end character 702 are selected by clicking, and the discarded character 703 is selected by circling; mode four: the start character 701 and the end character 702 are selected by clicking, and the discarded character 703 is selected by crossing out. However, the embodiments of the present application are not particularly limited thereto. By selecting the discarded characters between the start character and the end character, the characters to be adjusted can be selected quickly and accurately; especially when the number of characters to be adjusted is large, this reverse operation of selecting the discarded characters allows the characters to be adjusted to be determined quickly, improving selection efficiency;
Performing sliding operation in the text content, and determining the character corresponding to the sliding operation as the character to be adjusted;
And carrying out circling operation in the text content, and determining the characters in the circle corresponding to the circling operation as characters to be adjusted.
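The range-plus-discard selection rule above can be sketched as follows, assuming characters are addressed by index; the helper name is an illustrative assumption:

```python
def chars_to_adjust(start, end, discarded=()):
    """Indices of the characters to adjust: every character between the
    start character and the end character (inclusive), excluding any
    discarded characters that do not need adjustment."""
    skip = set(discarded)
    return [i for i in range(start, end + 1) if i not in skip]

# start character at index 0, end character at index 5, discard index 4
selected = chars_to_adjust(0, 5, {4})
```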
Step 406: and responding to the adjustment operation on the adjustment control corresponding to the character to be adjusted, and displaying the audio attribute to be adjusted.
The audio editing area comprises adjustment controls corresponding to the characters in the text content, that is, each character in the text content corresponds to one visualized adjustment control, and the adjustment control is used for adjusting the audio attribute corresponding to the character.
The adjusted audio attribute is the audio attribute obtained after adjusting the audio attribute parameter value corresponding to the character.
Optionally, the form of the adjustment control includes at least one of a slide bar control, a button control, an input box control, and a selection control, but is not limited thereto, and the embodiment of the present application is not limited thereto in particular.
Optionally, the adjustment operation includes at least one of sliding the adjustment control, modifying a parameter value of the adjustment control, and inputting a parameter value of the adjustment control, but is not limited thereto, and the embodiment of the present application is not particularly limited thereto.
In some embodiments, the adjustment operations on the adjustment controls corresponding to the characters to be adjusted include an adjustment operation on the adjustment control corresponding to the single character and an adjustment operation on the adjustment control corresponding to the plurality of characters.
For example, the computer device displays, in response to an adjustment operation on an adjustment control corresponding to a single character, an adjusted audio attribute corresponding to the single character, i.e., for the single character, the audio attribute corresponding to the single character is adjusted by the adjustment control corresponding to the single character.
For example, the text content is "Hello, very happy to meet you!", in which the tone value corresponding to the character "you" is 345. The tone of this single character can be adjusted by sliding the adjustment control corresponding to the character "you" or by modifying the tone value corresponding to the character "you".
In response to an adjustment operation on the adjustment control corresponding to a first character, the computer device displays that the audio attribute of a second character changes in linkage with the audio attribute of the first character.
The first character refers to any one character between a start character and a stop character. The second character refers to the other characters than the first character among the characters between the start character and the end character.
The manner in which the audio attribute is changed in linkage includes at least one of the following manners, but is not limited thereto:
The magnitude of the change in the audio attribute of the second character is the same as the magnitude of the change in the audio attribute of the first character, i.e., how much the audio attribute of the first character changes, how much the audio attribute of the second character likewise changes, e.g., the audio attribute of the first character increases by 50, the audio attribute of the second character likewise increases by 50;
the magnitude of the change in the audio attribute of the second character decreases correspondingly, based on the magnitude of the change in the audio attribute of the first character, as the distance value between the second character and the first character increases, i.e., the further away from the first character the corresponding magnitude of the change is smaller, e.g., the audio attribute of the first character increases by 50 and the audio attribute of the second character, separated from the first character by a plurality of character intervals, increases by 10, based on the magnitude of the change in the audio attribute of the first character;
the magnitude of the change in the audio attribute of the second character decreases arithmetically as the distance value between the second character and the first character increases, i.e., the further from the first character, the smaller the corresponding change; e.g., the audio attribute of the first character increases by 100 with a common difference of 10, and the audio attribute of a second character separated from the first character by 3 character intervals increases by 60;
the magnitude of the change in the audio attribute of the second character decreases geometrically as the distance value between the second character and the first character increases, i.e., the further from the first character, the smaller the corresponding change; e.g., the audio attribute of the first character increases by 100 with a common ratio of 2, and the audio attribute of a second character separated from the first character by 1 character interval increases by 25;
the magnitude of the change in the audio attribute of the second character increases as the distance value between the second character and the first character increases, i.e., the further from the first character the magnitude of the change in the audio attribute of the first character corresponds to a greater magnitude of the change, e.g., 50 increases in the audio attribute of the first character and 100 increases in the audio attribute of the second character separated from the first character by a plurality of character intervals;
the magnitude of the change in the audio attribute of the second character increases arithmetically as the distance value between the second character and the first character increases, based on the magnitude of the change of the first character, i.e., the further from the first character, the greater the corresponding change; e.g., the audio attribute of the first character increases by 100 with a common difference of 10, and the audio attribute of a second character separated from the first character by 3 character intervals increases by 140;
the magnitude of the change in the audio attribute of the second character increases geometrically as the distance value between the second character and the first character increases, based on the magnitude of the change of the first character, i.e., the further from the first character, the greater the corresponding change; e.g., the audio attribute of the first character increases by 100 with a common ratio of 2, and the audio attribute of a second character separated from the first character by 1 character interval increases by 300.
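The linkage modes listed above can be summarized in one sketch. This sketch applies the common difference or common ratio once per character interval, which may differ slightly from the interval counting implied by the numeric examples in the text; the function and mode names are illustrative assumptions:

```python
def linked_change(base_delta, distance, mode="same", step=10, ratio=2):
    """Change applied to a second character `distance` character intervals
    away from the adjusted first character, per the linkage modes above."""
    if mode == "same":        # same magnitude for every linked character
        return base_delta
    if mode == "arith_dec":   # arithmetic decrease with distance
        return base_delta - step * distance
    if mode == "geom_dec":    # geometric decrease with distance
        return base_delta / ratio ** distance
    if mode == "arith_inc":   # arithmetic increase with distance
        return base_delta + step * distance
    if mode == "geom_inc":    # geometric increase with distance
        return base_delta * ratio ** distance
    raise ValueError(f"unknown linkage mode: {mode}")
```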
In some embodiments, the audio attributes include at least one of timbre, audio emotion, audio pace, tone, pause interval between characters, and character duration, but are not limited thereto, and embodiments of the present application are not particularly limited thereto.
Adjustment of the tone (pitch) of the synthesized audio.
Illustratively, the computer device displays the adjusted adjustment tone in response to an adjustment operation on a tone adjustment control corresponding to the character to be adjusted.
The audio editing area further includes at least one of a tone change trend and a tone value corresponding to the characters in the text content; the computer device performs an adjustment operation on the tone adjustment control corresponding to the character to be adjusted based on at least one of the tone change trend and the tone value, and displays the adjusted tone.
Taking the slide bar control shown in fig. 8 as an example, the tone is adjusted by dragging the slide bar control corresponding to the character to be adjusted up and down based on at least one of the tone change trend and the tone value, or by directly modifying the tone value corresponding to the character; the adjusted tone is displayed after the adjustment is completed.
The pitch refers to the level of sound, and the frequency of sound determines the level of pitch.
The tone variation trend refers to a variation trend of a tone corresponding to a character in text content.
For example, taking the tone shown in fig. 8 as an example, the text content is "Hello, very happy to meet you!", where the tone corresponding to the character "you" is 478, the tone corresponding to "good" is 310, and the tone corresponding to "o" is 248; the curve formed by connecting the tones corresponding to the different characters constitutes the tone change trend.
The computer device adjusts an audio frequency in the audio signal corresponding to the character to be adjusted based on at least one of the pitch variation trend and the pitch value, resulting in an adjusted pitch.
The pitch trend is characterized by a curve of the pitch value corresponding to the characters in the text content.
The tone adjustment control is used for adjusting the audio frequency in the audio signal corresponding to the character to be adjusted, and the audio frequency refers to the sampling frequency corresponding to the character to be adjusted.
Illustratively, the audio frequency in the audio signal corresponding to the character to be adjusted is determined by adjusting the number of sampling points played per unit time. For example, 44.1k sampling points are played in one second; doubling the speed to 88.2k sampling points played in one second raises the tone.
The sampling point refers to the measured acoustic value at the sampling time point.
The measured acoustic wave value refers to the amplitude corresponding to the analog signal of the audio.
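The rate–duration relationship in the 44.1k/88.2k example above can be checked with a small sketch. This is a simplification (doubling the playback rate halves the duration and raises the pitch by an octave, ignoring timbre artifacts); the function name is an illustrative assumption:

```python
def played_duration(num_samples, samples_per_second):
    """Seconds needed to play `num_samples` at the given playback rate."""
    return num_samples / samples_per_second

one_second_clip = 44_100  # samples recorded at 44.1 kHz
normal = played_duration(one_second_clip, 44_100)   # original pitch
doubled = played_duration(one_second_clip, 88_200)  # raised pitch
```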
The manner of adjusting the tone includes, but is not limited to, the following two types:
in response to an adjustment operation on the tone adjustment control corresponding to the single character, the computer device displays an adjusted tone corresponding to the single character, i.e., adjusts only the tone corresponding to the single character, without the tones corresponding to the other characters being changed;
in response to an adjustment operation on the tone adjustment control corresponding to the first character between the start character and the end character, the computer device displays that the audio attribute of the second character between the start character and the end character changes in linkage with the tone of the first character, i.e., adjusts the tone adjustment control of any one character, and the tone corresponding to the other characters changes in linkage with the tone of the first character.
For example, as shown in the schematic diagram of the tone adjustment area in fig. 8, the tone adjustment area 801 includes a tone value 802 and a tone change trend 803 corresponding to the current character. The scale (not shown) on the left dynamically scales according to the tone value 802, showing 0-100 by default. The user can drag the tone adjustment control up and down directly in the tone adjustment area 801 to adjust the tone, or directly modify the tone value 802 corresponding to the character; in this case, the tones corresponding to the characters on the left and right do not change in linkage. Alternatively, the user selects the characters to be adjusted in the text display area: clicking characters one by one in the displayed order to determine a word or sentence, or clicking non-adjacent characters as start/end points to determine a word/sentence. When the tone adjustment area 801 is then dragged, the tones of the selected characters change in linkage (the change amounts may be the same or different).
After the synthesized audio is synthesized for the first time, the tone of the synthesized audio is displayed in the tone adjustment area 801. The user adjusts the tone by the slider; the computer device adds an SSML tone tag at the adjusted text content (a single character or multiple continuous characters) according to the tone adjusted by the user, and maps the value into the SSML tone tag. The SSML tone tag corresponding to the tone can be expressed as: <prosody pitch="*">, where "*" is a specific tone value, and <prosody pitch="*"> is used to adjust the tone of the synthesized audio. For example, when the tone corresponding to the character "you" in the text content is reduced to 132, an SSML tone tag <prosody pitch="132"> is added after the character "you" in the text content. After the user clicks synthesis the next time, the computer device parses the SSML tone tag and then re-synthesizes the audio according to the tone adjustment result to obtain new synthesized audio.
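The pitch-tag mapping above can be sketched as follows. Note that standard SSML wraps the affected text in a paired `<prosody>` element, which is assumed here in place of the tag-after-character form in the example; the function name is an illustrative assumption:

```python
def tag_pitch(text, start, end, pitch):
    """Wrap text[start:end] in an SSML <prosody> element carrying
    the adjusted pitch value."""
    return (text[:start]
            + f'<prosody pitch="{pitch}">'
            + text[start:end]
            + "</prosody>"
            + text[end:])

# Lower the pitch of the word "you" to 132
tagged = tag_pitch("you are welcome", 0, 3, 132)
```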
Adjustment of the pause interval of the synthesized audio.
Illustratively, the computer device displays the adjusted pause interval in response to an adjustment operation on a pause interval adjustment control corresponding to the character to be adjusted.
And the computer equipment performs adjustment operation on the pause interval adjustment control corresponding to the character to be adjusted based on the pause interval change trend, and displays the adjusted pause interval.
Based on the change trend of the pause interval, the computer equipment adjusts the pause interval duration value corresponding to the character to be adjusted to obtain the adjusted pause interval.
The pause interval duration value refers to the pause duration between characters.
The pause interval variation trend refers to the value variation trend of pause time length among characters to be adjusted.
The pause interval change trend includes, for example, but is not limited to: the pause duration between characters gradually increases, the pause duration between characters gradually decreases, the pause duration between characters changes in an arithmetic progression, and the pause duration between characters changes in a geometric progression.
For example, the computer device determines a modification value corresponding to the character to be adjusted based on the change trend of the pause interval, and the modification value corresponding to the character to be adjusted has a positive correlation with the change trend of the pause interval; and modifying the pause interval duration value corresponding to the character to be adjusted based on the modification value.
The manner of determining the modification value includes at least one of the following manners, but is not limited thereto:
determining a change trend function corresponding to the pause interval change trend based on the pause interval change trend corresponding to the characters; calculating the modification value corresponding to the character to be adjusted based on the change trend function; by adjusting the pause interval duration value corresponding to the character to be adjusted, the pause interval change trend fits the change trend function more closely, so that the audio rhythm of the adjusted synthesized audio is more natural and the expressed meaning is more accurate;
querying a pause interval change trend table based on the pause interval change trend corresponding to the characters, and determining a standard pause interval change trend corresponding to the pause interval change trend; determining the modification value corresponding to the character to be adjusted based on the standard pause interval duration value in the standard pause interval change trend; by adjusting the pause interval duration value corresponding to the character to be adjusted, the pause interval change trend fits the standard pause interval change trend more closely, so that the audio rhythm of the adjusted synthesized audio is more natural and the expressed meaning is more accurate.
The pause interval change trend table comprises at least one preset standard pause interval change trend.
For example, when the playing speed of the synthesized audio becomes faster and faster, the pause duration corresponding to the character to be adjusted is gradually reduced, so the pause duration between any two characters is adjusted based on the gradually decreasing pause interval change trend. In this way, the pause duration between any two characters fits the pause interval change trend more closely, and the audio rhythm of the adjusted synthesized audio is more natural and the expressed meaning is more accurate.
For example, when the playing style of the synthesized audio is an exciting surge, the pause duration corresponding to the character to be adjusted increases geometrically, so the pause duration between any two characters is adjusted based on the geometrically increasing pause interval change trend. In this way, the pause duration between any two characters fits the geometrically increasing pause interval change trend more closely, and the audio rhythm of the adjusted synthesized audio is more natural and the expressed meaning is more accurate.
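The two trend examples just given (gradually decreasing and geometrically increasing pause durations) can be sketched as a generator of per-gap durations; the mode names, step, and ratio defaults are illustrative assumptions:

```python
def pause_durations(num_gaps, base, mode, step=0.25, ratio=2):
    """Pause duration (seconds) for each inter-character gap, following
    the chosen pause-interval change trend."""
    durations = []
    for i in range(num_gaps):
        if mode == "arith_dec":       # speech speeds up: pauses shrink
            durations.append(max(base - step * i, 0.0))
        elif mode == "geom_inc":      # rising excitement: pauses grow
            durations.append(base * ratio ** i)
        else:
            raise ValueError(f"unknown trend: {mode}")
    return durations
```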
Optionally, the type of the pause interval includes at least one of a long pause, a medium pause, a short pause, a custom duration, and no pause, but the embodiment of the present application is not limited thereto.
Illustratively, adjusting the pause interval duration value corresponding to the character to be adjusted includes: a pause interval is inserted between two characters or the pause interval duration value of the pause interval between two characters is modified.
For example, in the case where there is no pause interval between two characters, the two characters are divided, the pause interval is inserted between the two characters, and the pause interval is adjusted by inputting the pause interval duration value of the pause interval.
For example, in the case where there is a pause interval between two characters, the pause interval is adjusted by modifying the pause interval duration value of the pause interval.
In response to a triggering operation on the pause interval adjustment control between characters, the computer device pops up a pause interval type selection box corresponding to the characters; after a pause interval type is selected in the selection box, the adjusted pause interval is displayed.
The manner of adjusting the pause interval includes, but is not limited to, the following two types:
the computer device, in response to a selection operation on a single character, pops up a pause interval type selection box corresponding to that character, from which any one of long pause, medium pause, short pause, custom duration and no pause can be selected;
the computer device, in response to a selection operation on a start character and an end character, pops up a pause interval type selection box corresponding to those characters, from which any one of long pause, medium pause, short pause, custom duration and no pause can be selected; after a pause interval type is selected, each character between the start character and the end character is set to the selected pause interval type.
For example, as shown in the schematic diagram of the pause interval adjustment area in fig. 9, the pause interval adjustment area 901 includes a pause interval adjustment control 902 corresponding to the current character. The computer device pops up the pause interval type selection box 903 corresponding to the character in response to a triggering operation on the pause interval adjustment control 902 corresponding to the character; clicking a blank area closes the pause interval type selection box 903.
The user can directly select the pause interval type corresponding to a character in the pause interval adjustment area 901. The user selects the character to be adjusted in the text display area: clicking characters one by one in the text display order determines a single character, at which point a pause interval type selection box 903 corresponding to that single character pops up for the user to choose a pause interval type; alternatively, clicking two non-adjacent characters as the start and end points determines a word or sentence, at which point a pause interval type selection box 903 corresponding to multiple characters pops up, a pause interval type is chosen from it, and the pause intervals of the selected characters then change accordingly.
After the synthesized audio is synthesized for the first time, the pause interval between characters is displayed in the pause interval adjustment area 901, and the user decides whether to add a pause interval by selection. The computer device adds an SSML pause interval tag at the adjusted text content (a single character or multiple characters) according to the pause interval adjusted by the user, and maps the value onto the SSML pause interval tag. The SSML pause interval tag corresponding to the pause interval can be expressed as: <break strength="weak" time="*"/>, where "*" is the specific pause duration; this tag is used to adjust the pause duration between characters of the synthesized audio. After the user clicks synthesize next time, the computer device parses the SSML pause interval tag and re-synthesizes the audio according to the pause interval adjustment result to obtain new synthesized audio.
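As a concrete illustration of the tag mapping step, the sketch below inserts a pause interval tag of the form quoted above into the text content. The helper name and the millisecond formatting are assumptions, not the patent's implementation.

```python
# Hypothetical helper: insert an SSML pause interval tag after the character
# at `index`, mapping the user's pause duration value onto the tag.

def insert_break(text, index, time_ms, strength="weak"):
    """Insert a <break/> tag after text[index]."""
    tag = f'<break strength="{strength}" time="{time_ms}ms"/>'
    return text[:index + 1] + tag + text[index + 1:]

print(insert_break("hello world", 4, 200))
# -> hello<break strength="weak" time="200ms"/> world
```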
Adjustment of the mood of the synthesized audio.
Illustratively, the computer device displays the adjusted mood in response to an adjustment operation on a mood adjustment control corresponding to the character to be adjusted.
The computer device performs the adjustment operation on the mood adjustment control corresponding to the character to be adjusted based on the mood change trend, and displays the adjusted mood.
The mood is used to represent the vocal coloring of the audio.
The computer device adjusts the mood by comprehensively adjusting at least two of the loudness, character duration and pitch of the synthesized audio, thereby realizing changes in the rise and fall, urgency and delay, and cadence of the speech.
The rise and fall of the synthesized audio are achieved by adjusting the loudness of the audio.
The urgency and delay of the synthesized audio are achieved by adjusting the character duration of the audio.
The cadence of the synthesized audio is achieved by adjusting the pitch variation of the characters of the audio.
Optionally, the types of the mood include at least one of a strengthened mood, a strong mood, and a weakened mood, but the embodiment of the present application is not limited thereto.
The strengthened mood refers to a combined increase of the loudness, character duration and pitch of the synthesized audio within a first amplitude range.
The strong mood refers to a combined increase of the loudness, character duration and pitch of the synthesized audio within a second amplitude range, wherein the second amplitude range is greater than the first amplitude range.
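The two amplitude ranges can be pictured as gain factors applied jointly to loudness, character duration and pitch. The numeric factors and mood names below are illustrative assumptions; the patent does not specify concrete values.

```python
# Illustrative gains for the two amplitude ranges: "strong" uses a larger
# range than "strengthened". The numeric factors are assumptions.

MOOD_GAIN = {
    "strengthened": 1.1,  # first amplitude range
    "strong": 1.3,        # second amplitude range (greater than the first)
    "weakened": 0.9,
}

def apply_mood(char_attrs, mood):
    """Scale loudness, character duration and pitch together by the mood gain."""
    gain = MOOD_GAIN[mood]
    return {key: value * gain for key, value in char_attrs.items()}

print(apply_mood({"loudness": 1.0, "duration_ms": 200, "pitch_hz": 220.0}, "strong"))
```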
By selecting the mood type in the mood adjustment control corresponding to a character, the computer device adjusts the mood corresponding to that character.
The ways of adjusting the mood include, but are not limited to, the following two ways:
the computer device, in response to a selection operation on a single character, selects the mood type corresponding to the single character;
the computer device, in response to a selection operation on a start character and an end character, adjusts the mood type set for the first character, and the mood type of the second character between the start character and the end character changes in linkage with the mood type of the first character.
For example, as shown in the schematic diagram of the mood adjustment area in fig. 10, the mood adjustment area 1001 includes a mood adjustment control corresponding to the current character. The mood types include a strengthened mood, a strong mood and a weakened mood; the user can select among the mood adjustment controls corresponding to the current character, with black indicating a selected state and white an unselected state.
The user can directly select the mood type corresponding to a character in the mood adjustment area 1001. The user selects the characters to be adjusted in the text display area: clicking characters one by one in the text display order determines a single character, at which point the mood type corresponding to that single character can be selected; alternatively, clicking two non-adjacent characters as the start and end points determines a word or sentence, the corresponding mood type is selected for the first character, and the mood type of all characters between the start character and the end character is displayed as the selected mood type.
After the synthesized audio is synthesized for the first time, the mood adjustment area 1001 displays the mood type of the character, and the user determines the mood type corresponding to the character by selection. The computer device adds an SSML mood tag at the adjusted text content (a single character or multiple characters) according to the mood type adjusted by the user, and maps the type onto the SSML mood tag. The SSML mood tag corresponding to the mood can be expressed as: <tone="*">, where "*" is the specific mood type; this tag is used to adjust the mood corresponding to the characters of the synthesized audio. After the user clicks synthesize next time, the computer device parses the SSML mood tag and re-synthesizes the audio according to the mood adjustment result to obtain new synthesized audio.
Adjustment of the text content of the synthesized audio.
Illustratively, the computer device displays the adjusted text in response to an adjustment operation on a text adjustment control corresponding to the character to be adjusted.
After the character to be adjusted is selected, a text adjustment window corresponding to the character to be adjusted is displayed. The text adjustment window includes at least one of the intonation, speech speed, pronunciation type and personalized pronunciation setting corresponding to the character, and the computer device displays the adjusted text in response to an adjustment of the text content in the text adjustment window.
For example, as shown in the schematic diagram of the text adjustment area in fig. 11, the pronunciation adjustment control 1102 corresponding to the current character is included in the text adjustment area 1101.
The user may modify the pronunciation corresponding to the character directly in the text adjustment region 1101. After triggering the pronunciation adjustment control 1102, a pronunciation adjustment window is displayed, as shown in the schematic of the pronunciation adjustment window in fig. 12, which supports both setting pronunciation and reading rules. The user may make modifications to the pronunciation of the character or characters.
After the synthesized audio is first synthesized, text content is presented in the text adjustment area 1101, and the user displays the pronunciation adjustment control 1102 by selecting a determined character. The computer device adds an SSML pronunciation tag at the adjusted text content (a single character or multiple characters) based on the pronunciation adjusted by the user, and maps the data onto the SSML pronunciation tag. The SSML tags corresponding to the text content include, but are not limited to, <sub alias="#"> and <phoneme alphabet="py" ph="xing2">. <sub alias="#"> is used to replace text content, and <phoneme alphabet="py" ph="xing2"> is used to adjust the pronunciation of a character. After the user clicks synthesize next time, the computer device parses the SSML pronunciation tag and re-synthesizes the audio according to the pronunciation adjustment result to obtain new synthesized audio.
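A minimal sketch of the pronunciation mapping, using the <phoneme> tag quoted above with the pinyin alphabet. The helper name is hypothetical, and the example character 行 (read xing2 here) is chosen for illustration only.

```python
# Hypothetical helper: wrap the first occurrence of `target` in a phoneme tag
# so the engine reads it with the given pinyin, e.g. the character 行 as xing2.

def tag_pronunciation(text, target, pinyin):
    """Annotate the first occurrence of `target` with its pinyin reading."""
    tag = f'<phoneme alphabet="py" ph="{pinyin}">{target}</phoneme>'
    return text.replace(target, tag, 1)

print(tag_pronunciation("一行人", "行", "xing2"))
# -> 一<phoneme alphabet="py" ph="xing2">行</phoneme>人
```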
Adjustment of the character duration of the synthesized audio.
Illustratively, the computer device displays the adjusted character duration in response to an adjustment operation on a character duration adjustment control corresponding to the character to be adjusted.
The character duration refers to the pronunciation duration corresponding to each character in the audio.
The ways of adjusting the character duration include, but are not limited to, the following two ways:
the computer equipment responds to the selection operation of the single character and adjusts the character duration corresponding to the single character;
the computer device adjusts the character duration of the first character in response to the selection operation of the start character and the end character, the character duration of the second character between the start character and the end character varying in tandem with the character duration of the first character.
It should be noted that the above audio attributes may be adjusted independently to update the synthesized audio, or may be adjusted in combination to update the synthesized audio.
For example, as shown in the schematic diagram of the character duration adjustment area in fig. 13, the character duration adjustment area 1301 includes the audio time consumption corresponding to the current character in ms.
The user may modify the audio time consumption corresponding to the current character directly in the character duration adjustment area 1301. The user selects the characters to be adjusted in the text display area: clicking characters one by one in the text display order determines a single character, at which point the audio time consumption or audio time consumption ratio corresponding to that single character can be adjusted; alternatively, clicking two non-adjacent characters as the start and end points determines a word or sentence, and if the corresponding audio time consumption or audio time consumption ratio is adjusted for the first character, the audio time consumption or audio time consumption ratio of all characters between the start character and the end character is modified synchronously.
After the synthesized audio is synthesized for the first time, the character duration adjustment area 1301 displays the audio time consumption of each character, and the user determines the character duration corresponding to the character through selection. The computer device adds an SSML character duration tag at the adjusted text content (a single character or multiple characters) according to the character duration adjusted by the user, and maps the value onto the SSML character duration tag. The SSML character duration tag corresponding to the character duration can be expressed as: <duration="#">, where "#" is the specific duration; this tag is used to adjust the playing duration corresponding to the characters of the synthesized audio. After the user clicks synthesize next time, the computer device parses the SSML character duration tag and re-synthesizes the audio according to the character duration adjustment result to obtain new synthesized audio.
Step 408: and displaying the updated new synthesized audio based on the adjusted audio attribute.
The way to update the synthesized audio includes: based on the adjusted audio attribute, new synthesized audio is synthesized again, so that the synthesized audio is updated.
Illustratively, the computer device adds SSML corresponding to the adjusted audio attribute to the text content based on the adjusted audio attribute; the computer device re-synthesizes the new synthesized audio based on the text content and the added SSML.
For example, the text content is "This pair of sneakers was given to me by friend C", and the text content after adding the SSML corresponding to the adjusted audio attributes is: This pair of <w>sneakers</w> is what friend C<break strength="weak"/> gave to <prosody rate="50%">me</prosody>, where <w>, </w>, <break strength="weak"/> and <prosody rate="50%"> are SSML; <w> and </w> are used to indicate that the specified phrase is not to be split, ensuring no pause within it; <break strength="weak"/> is used to indicate that there is a pause between C and "gave"; and <prosody rate="50%"> is used to indicate that the speech speed is slowed down by half.
The computer equipment analyzes the SSML to obtain adjustment parameters corresponding to the audio attributes; the computer device re-synthesizes the newly synthesized audio based on the text content and the adjustment parameters.
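The parsing step described above can be sketched with a small regex-based extractor that returns each tag together with its adjustment parameters. A production system would more likely use a real XML parser, so treat this as illustration only; all names are hypothetical.

```python
# Regex-based sketch of the SSML parsing step: return each tag name together
# with its adjustment parameters, skipping closing tags such as </prosody>.
import re

TAG_RE = re.compile(r'<(\w+)((?:\s+\w+="[^"]*")*)\s*/?>')
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')

def extract_params(annotated_text):
    """Collect (tag_name, {attr: value}) pairs from inline SSML tags."""
    return [
        (m.group(1), dict(ATTR_RE.findall(m.group(2))))
        for m in TAG_RE.finditer(annotated_text)
    ]

print(extract_params('C<break strength="weak"/> gave <prosody rate="50%">me</prosody>'))
# -> [('break', {'strength': 'weak'}), ('prosody', {'rate': '50%'})]
```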
In some embodiments, the computer device determines an SSML location and SSML content corresponding to the adjustment character based on the adjustment audio attribute; based on the SSML location and the SSML content, the computer device adds SSML corresponding to the adjusted audio attributes to the text content.
SSML locations are used to represent the locations of addition of SSML in text content.
SSML content is used to represent the corresponding audio attribute adjustment content of SSML in text content.
In some embodiments, the computer device determines the SSML location based on the location of the adjustment character corresponding to the adjustment audio attribute in the text content; the computer device determines SSML content based on the audio attribute adjustment content corresponding to the adjustment audio attribute.
For a given audio attribute, the computer device obtains the SSML format content corresponding to the audio attribute by querying the SSML type lookup table based on the attribute type, and adjusts the parameters in the SSML format content based on the audio attribute adjustment content to obtain the SSML content.
The SSML type lookup table includes correspondences between audio attributes and SSML format content. For example, when the audio attribute is the mood, the SSML format content in the SSML type lookup table is: <tone="*">, where "*" is the parameter to be modified.
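The lookup table itself reduces to a mapping from audio attribute to SSML format content with a "*" placeholder for the parameter to be modified. The entries below mirror the tag forms quoted in this document and should otherwise be treated as assumptions.

```python
# Minimal SSML type lookup table: audio attribute -> SSML format content with
# a "*" placeholder for the parameter to be modified.

SSML_FORMATS = {
    "mood": '<tone="*">',
    "pause": '<break strength="weak" time="*"/>',
    "duration": '<duration="*">',
}

def ssml_for(attribute, value):
    """Query the table and fill the placeholder with the adjusted parameter."""
    return SSML_FORMATS[attribute].replace("*", str(value))

print(ssml_for("pause", "300ms"))
# -> <break strength="weak" time="300ms"/>
```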
In some embodiments, the computer device further displays an audio preview region comprising track information to which the synthesized audio belongs; at least one of a play start point and a play end point of the synthesized audio is adjusted based on the track information.
For example, as shown in the schematic diagram of the audio preview area in fig. 14, the audio preview area is used to distinguish, by audio track, the synthesized audio generated from different text contents, so that the overall effect can be experienced and evaluated. Clicking the play button plays the synthesized audio on the multiple audio tracks in sequence, and the position of the play start point can be adjusted.
The text contents corresponding to different audio tracks are entered in the cards of different content input areas. After the synthesized audio is synthesized, it is displayed in the audio preview area according to its audio track, and the audio segments are arranged one after another according to the order of the different content input areas, without overlapping in time.
In summary, the method provided in this embodiment displays the audio editing area; responding to the adjustment operation on the adjustment control corresponding to the character, and displaying an adjusted audio attribute obtained after adjusting the audio attribute parameter value corresponding to the character; based on the adjusted audio attributes, updated newly synthesized audio is displayed. According to the application, the adjustment control corresponding to each character is adjusted in a visual mode, so that the adjustment of the synthesized audio is realized, the adjustment operation steps of the synthesized audio are simplified, and the difficulty in producing high-quality synthesized audio is reduced.
The method provided by this embodiment provides two ways of selecting characters, wherein the audio attributes corresponding to a single character or multiple characters are adjusted, and the updated new synthesized audio is displayed based on the adjusted audio attributes. According to the application, the characters to be adjusted can be quickly selected in a visual manner, and the adjustment control corresponding to each character can be adjusted in a targeted manner, so that the adjustment of the synthesized audio is realized, the adjustment operation steps of the synthesized audio are simplified, and the difficulty of producing high-quality synthesized audio is reduced.
According to the method provided by the embodiment, the various audio attributes corresponding to the characters are adjusted in a visual mode, and the updated new synthesized audio is displayed based on the adjusted audio attributes. According to the method and the device, the adjustment control corresponding to each character is adjusted in a visual mode, so that the adjustment of the synthesized audio is realized, the adjustment operation steps of the synthesized audio are simplified, and the difficulty in producing high-quality synthesized audio is reduced.
According to the method provided by this embodiment, the pitch fluctuation and occupied duration of each character in the AI speech synthesized audio are displayed in a visual manner, and the audio attribute parameters of the synthesized audio are modified through buttons and curve adjustment, so that the user can modify the effect of the synthesized audio faster and better, the rhythm of the synthesized audio is more natural, the meaning is conveyed more accurately, the listening effect and quality of the AI speech synthesized audio are greatly improved, and the application range is enlarged.
Fig. 15 is a flowchart of a method for visual adjustment of synthesized audio provided by an exemplary embodiment of the present application. The method may be performed by a computer device comprising a terminal and a speech synthesis system. The method comprises the following steps:
Step 1501: text content is entered.
The user enters text content in the client interface. Text content refers to text to be speech synthesized.
Step 1502: the text content is displayed.
After the user inputs the text content, the text content is displayed in an interface of the terminal.
Step 1503: the text content is submitted.
The terminal submits the text content to the speech synthesis system.
Alternatively, the speech synthesis system may be located in the terminal or in the server.
Step 1504: synthesizing audio.
The speech synthesis system generates the synthesis parameters of the synthesized audio from the text content using an AI speech synthesis model, and synthesizes the audio with an engine.
Step 1505: returning the synthesized audio and audio attribute parameters.
In the case where the speech synthesis system synthesizes the synthesized audio, the synthesized audio and the audio attribute parameters of the synthesized audio are returned to the terminal.
Step 1506: and displaying the audio attribute parameters.
And after receiving the synthesized audio and the audio attribute parameters of the synthesized audio, the terminal displays the audio attribute parameters of the synthesized audio in the interface.
Step 1507: the audio attributes are adjusted.
The user modifies the audio attribute parameters of the synthesized audio according to the user's expectations or goals based on the audio attribute parameters of the synthesized audio displayed in the interface.
Optionally, the audio attribute includes at least one of tone color, audio emotion, audio speech rate, tone, pause interval between characters, and character duration, but the embodiment of the present application is not limited thereto.
Step 1508: parameters are submitted that adjust the audio properties.
After the user modifies the parameters of the audio attributes in the interface, the terminal submits the modified parameters for adjusting the audio attributes to the speech synthesis system.
Step 1509: newly synthesized audio is synthesized again.
The voice synthesis system adds SSML corresponding to the adjusted audio attribute into the text content based on the adjusted audio attribute; the voice synthesis system analyzes the SSML to obtain adjustment parameters corresponding to the audio attributes; the speech synthesis system re-synthesizes the newly synthesized audio based on the text content and the adjustment parameters.
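Step 1509 can be summarized as: annotate the text with SSML, then hand the annotated text to the engine. The sketch below stubs out the engine with a placeholder function; only the control flow follows the document, and all names are hypothetical.

```python
# High-level sketch of step 1509: add the SSML for the adjusted audio
# attributes to the text content, then re-synthesize via the engine.

def add_ssml(text, adjustments):
    """adjustments: (char_index, ssml_tag) pairs; applied right-to-left so
    earlier insertions do not shift later indices."""
    for index, tag in sorted(adjustments, reverse=True):
        text = text[:index + 1] + tag + text[index + 1:]
    return text

def resynthesize(text, adjustments, synthesize):
    annotated = add_ssml(text, adjustments)
    return synthesize(annotated)  # the engine parses the SSML and renders audio

fake_engine = lambda annotated: f"audio({annotated})"
print(resynthesize("abc", [(0, "<break/>")], fake_engine))
# -> audio(a<break/>bc)
```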
Step 1510: returning new synthesized audio and new audio attribute parameters.
After re-synthesizing the new synthesized audio, the speech synthesis system returns the new synthesized audio and new audio attribute parameters.
Step 1511: new audio attribute parameters of the new synthesized audio are displayed.
After receiving the new synthesized audio and the new audio attribute parameters, the terminal displays the new audio attribute parameters of the new synthesized audio in the interface.
Fig. 16 is a schematic structural view showing a visual adjustment apparatus for synthesized audio according to an exemplary embodiment of the present application. The apparatus may be implemented as all or part of a computer device by software, hardware, or a combination of both, the apparatus comprising:
the display module 1601 is configured to display an audio editing area, where the synthesized audio is an audio obtained by converting text content, and the audio editing area includes an adjustment control corresponding to a character in the text content, where the adjustment control is configured to adjust an audio attribute corresponding to the character;
the display module 1601 is configured to respond to an adjustment operation on the adjustment control corresponding to the character, and display an adjusted audio attribute, where the adjusted audio attribute is an audio attribute obtained after adjusting an audio attribute parameter value corresponding to the character;
a display module 1601 for updating the synthesized audio based on the adjusted audio attribute.
In some embodiments, the apparatus further comprises an adjustment module 1602, the adjustment module 1602 for determining a character to be adjusted in response to a selection operation of the character in the text content.
In some embodiments, the display module 1601 is configured to display the adjusted audio attribute in response to the adjustment operation on the adjustment control corresponding to the character to be adjusted.
In some embodiments, the adjustment module 1602 is configured to determine a single character in the text content as the character to be adjusted in response to a selection operation of the single character.
In some embodiments, the display module 1601 is configured to display the adjusted audio attribute corresponding to the single character in response to the adjustment operation on the adjustment control corresponding to the single character.
In some embodiments, the adjustment module 1602 is configured to determine a character between a start character and a stop character as the character to be adjusted in response to selecting the start character and the stop character in the text content.
In some embodiments, the display module 1601 is configured to display, in response to the adjustment operation on the adjustment control corresponding to a first character between the start character and the end character, that an audio attribute of a second character between the start character and the end character changes in linkage with the audio attribute of the first character.
The first character refers to any one character between the initial character and the termination character, and the second character refers to other characters except the first character in the characters between the initial character and the termination character.
In some embodiments, the display module 1601 is configured to display the adjusted tone in response to the adjustment operation on the tone adjustment control corresponding to the character to be adjusted.
In some embodiments, the display module 1601 is configured to perform the adjustment operation on the tone adjustment control corresponding to the character to be adjusted based on at least one of the tone variation trend and the tone value, and display the adjusted tone.
In some embodiments, the adjusting module 1602 is configured to adjust an audio frequency in an audio signal corresponding to the character to be adjusted based on at least one of the pitch variation trend and the pitch value, to obtain the adjusted pitch.
The tone adjustment control is used for adjusting the audio frequency in the audio signal corresponding to the character to be adjusted.
In some embodiments, the display module 1601 is configured to display the adjusted pause interval in response to the adjustment operation on the pause interval adjustment control corresponding to the character to be adjusted.
In some embodiments, the display module 1601 is configured to perform the adjustment operation on the pause interval adjustment control corresponding to the character to be adjusted based on the pause interval variation trend, and display the adjusted pause interval.
In some embodiments, the adjusting module 1602 is configured to adjust the pause interval duration value corresponding to the character to be adjusted based on the pause interval variation trend, so as to obtain the adjusted pause interval.
In some embodiments, the apparatus further includes a synthesizing module 1603, the synthesizing module 1603 configured to add a speech synthesis markup language SSML corresponding to the adjusted audio attribute to the text content based on the adjusted audio attribute; based on the text content and the added SSML, new synthesized audio is re-synthesized.
In some embodiments, the synthesizing module 1603 is configured to parse the SSML to obtain adjustment parameters corresponding to the audio attribute; and re-synthesizing the new synthesized audio based on the text content and the adjustment parameters.
In some embodiments, the display module 1601 is configured to display an audio preview area, where the audio preview area includes audio track information to which the synthesized audio belongs.
In some embodiments, the adjusting module 1602 is configured to adjust at least one of a play start point and a play end point of the synthesized audio based on the audio track information.
Fig. 17 shows a block diagram of a computer device 1700 in accordance with an exemplary embodiment of the present application. The computer device may be implemented as a server in the above-described aspects of the present application. The computer apparatus 1700 includes a central processing unit (Central Processing Unit, CPU) 1701, a system Memory 1704 including a random access Memory (Random Access Memory, RAM) 1702 and a Read-Only Memory (ROM) 1703, and a system bus 1705 connecting the system Memory 1704 and the central processing unit 1701. The computer device 1700 also includes a mass storage device 1706 for storing an operating system 1709, application programs 1710, and other program modules 1711.
The mass storage device 1706 is connected to the central processing unit 1701 through a mass storage controller (not shown) connected to the system bus 1705. The mass storage device 1706 and its associated computer-readable media provide non-volatile storage for the computer device 1700. That is, the mass storage device 1706 may include a computer readable medium (not shown) such as a hard disk or a compact disk-Only (CD-ROM) drive.
The computer readable medium may include computer storage media and communication media without loss of generality. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for the storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media include RAM, erasable programmable read-only memory (Erasable Programmable Read Only Memory, EPROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other solid state memory technology, CD-ROM, digital versatile discs (Digital Versatile Disc, DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage, or other magnetic storage devices. Of course, those skilled in the art will recognize that computer storage media are not limited to the above. The system memory 1704 and mass storage device 1706 described above may be collectively referred to as memory.
According to various embodiments of the disclosure, the computer device 1700 may also operate through a remote computer connected over a network, such as the Internet. That is, the computer device 1700 may be connected to the network 1708 through a network interface unit 1707 coupled to the system bus 1705, or other types of networks or remote computer systems (not shown) may be connected through the network interface unit 1707.
The memory further stores at least one computer program, and the central processing unit 1701 implements all or part of the steps of the visual adjustment method for synthesized audio shown in the above embodiments by executing the at least one program.
Fig. 18 shows a block diagram of a computer device 1800 provided by an exemplary embodiment of the present application. The computer device 1800 may be a portable mobile terminal, such as a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, or an MP4 (Moving Picture Experts Group Audio Layer IV) player. The computer device 1800 may also be referred to as a user device, a portable terminal, or the like.
In general, the computer device 1800 includes: a processor 1801 and a memory 1802.
The processor 1801 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 1801 may be implemented in at least one hardware form of a DSP (Digital Signal Processor), an FPGA (Field Programmable Gate Array), or a PLA (Programmable Logic Array). The processor 1801 may also include a main processor and a coprocessor. The main processor is a processor for processing data in an awake state, also referred to as a CPU (Central Processing Unit); the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 1801 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content that the display screen needs to display. In some embodiments, the processor 1801 may also include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
The memory 1802 may include one or more computer-readable storage media, which may be tangible and non-transitory. The memory 1802 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 1802 is used to store at least one instruction for execution by processor 1801 to implement a method of visual adjustment of synthesized audio provided in embodiments of the present application.
In some embodiments, the computer device 1800 may also optionally include: a peripheral interface 1803 and at least one peripheral. Specifically, the peripheral device includes: at least one of a radio frequency interface 1804, a touch display screen 1805, a camera assembly 1806, audio circuitry 1807, and a power supply 1808.
The peripheral interface 1803 may be used to connect at least one I/O (Input/Output)-related peripheral device to the processor 1801 and the memory 1802. In some embodiments, the processor 1801, the memory 1802, and the peripheral interface 1803 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 1801, the memory 1802, and the peripheral interface 1803 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The radio frequency interface 1804 is used to receive and transmit RF (Radio Frequency) signals, also known as electromagnetic signals. The radio frequency interface 1804 communicates with communication networks and other communication devices via electromagnetic signals. The radio frequency interface 1804 converts electrical signals into electromagnetic signals for transmission, or converts received electromagnetic signals into electrical signals. Optionally, the radio frequency interface 1804 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and the like. The radio frequency interface 1804 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocol includes, but is not limited to: the world wide web, metropolitan area networks, intranets, generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency interface 1804 may also include NFC (Near Field Communication) related circuitry, which is not limited in the present application.
The touch display screen 1805 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. The touch display screen 1805 also has the ability to collect touch signals on or above its surface. A touch signal may be input as a control signal to the processor 1801 for processing. The touch display screen 1805 is used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one touch display screen 1805, disposed on the front panel of the computer device 1800; in other embodiments, there may be at least two touch display screens 1805, disposed on different surfaces of the computer device 1800 or in a folded design; in some embodiments, the touch display screen 1805 may be a flexible display disposed on a curved surface or a folded surface of the computer device 1800. The touch display screen 1805 may even be arranged in a non-rectangular irregular pattern, that is, an irregularly shaped screen. The touch display screen 1805 may be made of materials such as an LCD (Liquid Crystal Display) or an OLED (Organic Light-Emitting Diode).
The camera assembly 1806 is used to capture images or video. Optionally, the camera assembly 1806 includes a front camera and a rear camera. In general, the front camera is used for video calls or self-portraits, and the rear camera is used for taking pictures or videos. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, and a wide-angle camera, so that the main camera and the depth-of-field camera can be fused to realize a background blurring function, and the main camera and the wide-angle camera can be fused to realize panoramic shooting and VR (Virtual Reality) shooting functions. In some embodiments, the camera assembly 1806 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash, and can be used for light compensation under different color temperatures.
The audio circuitry 1807 is used to provide an audio interface between the user and the computer device 1800. The audio circuitry 1807 may include a microphone and a speaker. The microphone is used to collect sound waves from the user and the environment, convert the sound waves into electrical signals, and input them to the processor 1801 for processing, or to the radio frequency interface 1804 for voice communication. For stereo acquisition or noise reduction purposes, there may be multiple microphones, each disposed at a different location of the computer device 1800. The microphone may also be an array microphone or an omnidirectional pickup microphone. The speaker is used to convert electrical signals from the processor 1801 or the radio frequency interface 1804 into sound waves. The speaker may be a conventional thin-film speaker or a piezoelectric ceramic speaker. A piezoelectric ceramic speaker can convert electrical signals not only into sound waves audible to humans, but also into sound waves inaudible to humans for purposes such as ranging. In some embodiments, the audio circuitry 1807 may also include a headphone jack.
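The ranging use of a piezoelectric ceramic speaker mentioned above relies on time of flight: an inaudible pulse is emitted and its echo's round-trip time is measured. A minimal illustrative sketch of the distance computation (the function name and the fixed speed-of-sound constant are assumptions for illustration, not part of this application):

```python
SPEED_OF_SOUND_M_PER_S = 343.0  # approximate speed of sound in air at 20 degrees Celsius

def echo_distance_m(round_trip_s: float) -> float:
    # The pulse travels to the obstacle and back, so halve the total path length.
    return SPEED_OF_SOUND_M_PER_S * round_trip_s / 2.0
```

For example, an echo returning after 10 ms corresponds to a distance of about 1.7 m.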
A power supply 1808 is used to power the various components in the computer device 1800. The power supply 1808 may be an alternating current, a direct current, a disposable battery, or a rechargeable battery. When the power supply 1808 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. The wired rechargeable battery is a battery charged through a wired line, and the wireless rechargeable battery is a battery charged through a wireless coil. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, the computer device 1800 also includes one or more sensors 1809. The one or more sensors 1809 include, but are not limited to: acceleration sensor 1810, gyro sensor 1811, pressure sensor 1812, optical sensor 1813, and proximity sensor 1814.
The acceleration sensor 1810 may detect the magnitude of acceleration on the three coordinate axes of a coordinate system established with the computer device 1800. For example, the acceleration sensor 1810 may be used to detect the components of gravitational acceleration along the three coordinate axes. The processor 1801 may control the touch display screen 1805 to display the user interface in a landscape view or a portrait view based on the gravitational acceleration signals acquired by the acceleration sensor 1810. The acceleration sensor 1810 may also be used to acquire motion data for games or for the user.
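The landscape/portrait decision described above can be sketched by comparing the gravity components along the device's two screen axes; the function below is a hypothetical illustration under that assumption, not the application's actual logic:

```python
def orientation_from_gravity(gx: float, gy: float) -> str:
    # gx: gravity component along the device's short edge (x axis)
    # gy: gravity component along the device's long edge (y axis)
    # When gravity pulls mostly along the long edge, the device is held upright.
    return "portrait" if abs(gy) >= abs(gx) else "landscape"
```

A device held upright (gravity almost entirely on the y axis) yields "portrait"; tilted on its side, "landscape".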
The gyro sensor 1811 may detect the body direction and rotation angle of the computer device 1800, and may cooperate with the acceleration sensor 1810 to collect 3D actions performed by the user on the computer device 1800. Based on the data collected by the gyro sensor 1811, the processor 1801 may implement the following functions: motion sensing (for example, changing the UI according to a tilting operation by the user), image stabilization during shooting, game control, and inertial navigation.
The pressure sensor 1812 may be disposed on a side bezel of the computer device 1800 and/or in an underlying layer of the touch display screen 1805. When the pressure sensor 1812 is disposed on a side bezel, it can detect the user's grip signal on the computer device 1800, and left/right-hand recognition or shortcut operations can be performed according to the grip signal. When the pressure sensor 1812 is disposed in the underlying layer of the touch display screen 1805, operability controls on the UI can be controlled according to the user's pressure operations on the touch display screen 1805. The operability controls include at least one of a button control, a scroll bar control, an icon control, and a menu control.
The optical sensor 1813 is used to collect the ambient light intensity. In one embodiment, the processor 1801 may control the display brightness of the touch display screen 1805 based on the intensity of ambient light collected by the optical sensor 1813. Specifically, when the ambient light intensity is high, the display brightness of the touch display screen 1805 is turned up; when the ambient light intensity is low, the display brightness of the touch display screen 1805 is turned down. In another embodiment, the processor 1801 may also dynamically adjust the shooting parameters of the camera assembly 1806 based on the intensity of ambient light collected by the optical sensor 1813.
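The brightness policy above (brighter surroundings, brighter screen) amounts to a clamped linear mapping from ambient light intensity to display brightness. A hedged sketch, with the lux thresholds chosen arbitrarily for illustration:

```python
def display_brightness(ambient_lux: float, lo: float = 10.0, hi: float = 1000.0) -> float:
    # Map ambient light intensity to a display brightness level in [0.0, 1.0],
    # clamped below `lo` lux and above `hi` lux.
    if ambient_lux <= lo:
        return 0.0
    if ambient_lux >= hi:
        return 1.0
    return (ambient_lux - lo) / (hi - lo)
```

A real implementation would typically smooth the sensor readings and apply a perceptual (e.g. logarithmic) curve rather than a linear one.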
A proximity sensor 1814, also known as a distance sensor, is typically provided on the front of the computer device 1800. Proximity sensor 1814 is used to collect the distance between the user and the front of computer device 1800. In one embodiment, when the proximity sensor 1814 detects a gradual decrease in the distance between the user and the front of the computer device 1800, the processor 1801 controls the touch display screen 1805 to switch from the bright screen state to the off-screen state; when the proximity sensor 1814 detects that the distance between the user and the front of the computer device 1800 gradually increases, the touch display 1805 is controlled by the processor 1801 to switch from the off-screen state to the on-screen state.
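The proximity behavior above is a threshold switch; using two thresholds (hysteresis) avoids the screen flickering when the measured distance hovers near a single cutoff. The function name and threshold values below are assumptions for illustration:

```python
def screen_on(distance_cm: float, currently_on: bool,
              off_below_cm: float = 3.0, on_above_cm: float = 5.0) -> bool:
    # Turn the screen off when the user is very close (e.g. phone held to the ear),
    # and back on only once the distance has clearly increased again.
    if currently_on and distance_cm < off_below_cm:
        return False
    if not currently_on and distance_cm > on_above_cm:
        return True
    return currently_on  # inside the hysteresis band: keep the current state
```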
Those skilled in the art will appreciate that the architecture shown in fig. 18 is not limiting and that more or fewer components than shown may be included or that certain components may be combined or that a different arrangement of components may be employed.
The embodiment of the application also provides a computer device, which comprises a processor and a memory, wherein at least one program is stored in the memory, and the at least one program is loaded and executed by the processor to realize the visual adjustment method of the synthesized audio provided by the above method embodiments.
The embodiment of the application also provides a computer readable storage medium, wherein at least one computer program is stored in the storage medium, and the at least one computer program is loaded and executed by a processor to realize the visual adjustment method for the synthesized audio provided by each method embodiment.
Embodiments of the present application also provide a computer program product comprising a computer program stored in a computer-readable storage medium; a processor of a computer device reads the computer program from the computer-readable storage medium and executes it, so that the computer device performs the visual adjustment method for synthesized audio provided by the above method embodiments.
It will be appreciated that, in the specific embodiments of the present application, data related to user data processing may be involved, such as data related to user identity or characteristics (for example, historical data and user portraits). When the above embodiments of the present application are applied to specific products or technologies, user approval or consent is obtained, and the collection, use, and processing of the related data comply with the relevant laws, regulations, and standards of the relevant countries and regions.
It is noted that all terms used in the claims are to be construed in accordance with their ordinary meaning in the technical field unless explicitly defined otherwise herein. All references to "an element, device, component, apparatus, step, etc." are to be interpreted openly as referring to at least one instance of the element, device, component, apparatus, step, etc., unless explicitly stated otherwise. The steps of any method disclosed herein need not be performed in the exact order disclosed, unless explicitly stated.
It should be understood that references herein to "a plurality" mean two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate: A exists alone, A and B exist together, or B exists alone. The character "/" generally indicates that the associated objects are in an "or" relationship.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing description of the preferred embodiments of the present application is not intended to limit the application to the precise form disclosed; any modifications, equivalent replacements, and improvements made within the spirit and principles of the application shall fall within the protection scope of the application.

Claims (17)

1. A method of visual adjustment of synthesized audio, the method comprising:
displaying an audio editing area, wherein the synthesized audio is obtained by converting text content, the audio editing area comprises an adjustment control corresponding to characters in the text content, and the adjustment control is used for adjusting audio attributes corresponding to the characters;
responding to the adjustment operation on the adjustment control corresponding to the character, and displaying an adjusted audio attribute, wherein the adjusted audio attribute is an audio attribute obtained after adjusting the audio attribute parameter value corresponding to the character;
and displaying the updated new synthesized audio based on the adjusted audio attribute.
2. The method of claim 1, wherein the displaying an adjustment audio attribute in response to an adjustment operation on the adjustment control corresponding to the character comprises:
determining a character to be adjusted in response to a selection operation of the character in the text content;
and in response to the adjustment operation on the adjustment control corresponding to the character to be adjusted, displaying the adjusted audio attribute.
3. The method of claim 2, wherein the determining a character to be adjusted in response to the selection of the character in the text content comprises:
determining a single character in the text content as the character to be adjusted in response to a selection operation of the single character;
the response to the adjustment operation on the adjustment control corresponding to the character to be adjusted, displaying the adjusted audio attribute, includes:
and responding to the adjustment operation on the adjustment control corresponding to the single character, and displaying the adjustment audio attribute corresponding to the single character.
4. The method of claim 2, wherein the determining a character to be adjusted in response to the selection of the character in the text content comprises:
determining characters between the start character and the end character as the characters to be adjusted in response to selection of the start character and the end character in the text content;
the response to the adjustment operation on the adjustment control corresponding to the character to be adjusted, displaying the adjusted audio attribute, includes:
in response to the adjustment operation on the adjustment control corresponding to a first character between the initial character and the termination character, displaying that the audio attribute of a second character between the initial character and the termination character changes along with the audio attribute of the first character;
the first character refers to any one character between the initial character and the termination character, and the second character refers to other characters except the first character in the characters between the initial character and the termination character.
5. The method of claim 2, wherein the audio attribute comprises a tone, the adjustment control comprises a tone adjustment control corresponding to the character, and the adjusted audio attribute comprises an adjusted tone; the method further comprises:
and responding to the adjustment operation on the tone adjustment control corresponding to the character to be adjusted, and displaying the adjusted tone.
6. The method of claim 5, wherein the audio editing area further comprises at least one of a pitch trend and a pitch value corresponding to characters in the text content;
the step of responding to the adjustment operation on the tone adjustment control corresponding to the character to be adjusted, displaying the adjusted tone comprises the following steps:
and performing, based on at least one of the tone variation trend and the tone value, the adjustment operation on the tone adjustment control corresponding to the character to be adjusted, and displaying the adjusted tone.
7. The method of claim 6, wherein the performing the adjustment operation on the tone adjustment control corresponding to the character to be adjusted based on at least one of the tone variation trend and the tone value, displaying the adjusted tone, comprises:
adjusting the audio frequency in the audio signal corresponding to the character to be adjusted based on at least one of the tone variation trend and the tone value to obtain the adjusted tone;
the tone adjustment control is used for adjusting the audio frequency in the audio signal corresponding to the character to be adjusted.
8. The method of claim 2, wherein the audio attribute comprises a pause interval, the adjustment control comprises a pause interval adjustment control corresponding to the character, and the adjusted audio attribute comprises an adjusted pause interval; the method further comprises:
and responding to the adjustment operation on the pause interval adjustment control corresponding to the character to be adjusted, and displaying the adjusted pause interval.
9. The method of claim 8, wherein the audio editing area further comprises a pause interval variation trend corresponding to the characters in the text content;
the step of responding to the adjustment operation on the pause interval adjustment control corresponding to the character to be adjusted, displaying the adjusted pause interval comprises the following steps:
and carrying out the adjustment operation on the pause interval adjustment control corresponding to the character to be adjusted based on the pause interval change trend, and displaying the adjusted pause interval.
10. The method according to claim 9, wherein the performing the adjustment operation on the pause interval adjustment control corresponding to the character to be adjusted based on the pause interval variation trend, displaying the adjusted pause interval, includes:
and adjusting the pause interval duration value corresponding to the character to be adjusted based on the pause interval change trend to obtain the adjusted pause interval.
11. The method of any of claims 1 to 10, wherein displaying updated new synthesized audio based on the adjusted audio attributes comprises:
based on the adjusted audio attribute, adding a speech synthesis markup language SSML corresponding to the adjusted audio attribute in the text content;
and re-synthesizing the new synthesized audio based on the text content and the added SSML.
12. The method of claim 11, wherein the re-synthesizing new synthesized audio based on the text content and the SSML added comprises:
analyzing the SSML to obtain adjustment parameters corresponding to the audio attributes;
and re-synthesizing the new synthesized audio based on the text content and the adjustment parameters.
13. The method according to any one of claims 1 to 10, further comprising:
displaying an audio preview area, wherein the audio preview area comprises audio track information to which the synthesized audio belongs;
and adjusting at least one of a playing start point and a playing end point of the synthesized audio based on the audio track information.
14. A visual adjustment apparatus for synthesizing audio, the apparatus comprising:
the display module is used for displaying an audio editing area, the synthesized audio is obtained by converting text content, the audio editing area comprises an adjustment control corresponding to characters in the text content, and the adjustment control is used for adjusting audio attributes corresponding to the characters;
the display module is used for responding to the adjustment operation on the adjustment control corresponding to the character and displaying an adjusted audio attribute, wherein the adjusted audio attribute is an audio attribute obtained after adjusting the audio attribute parameter value corresponding to the character;
and the display module is used for displaying the updated new synthesized audio based on the adjusted audio attribute.
15. A computer device, the computer device comprising: a processor and a memory, said memory having stored therein at least one computer program, at least one of said computer programs being loaded and executed by said processor to implement the method of visual adjustment of synthesized audio according to any one of claims 1 to 13.
16. A computer-readable storage medium, characterized in that at least one computer program is stored in the computer-readable storage medium, the at least one computer program being loaded and executed by a processor to implement the method of visual adjustment of synthesized audio according to any one of claims 1 to 13.
17. A computer program product, characterized in that the computer program product comprises a computer program, the computer program being stored in a computer readable storage medium; the computer program is read from the computer-readable storage medium and executed by a processor of a computer device, so that the computer device performs the visual adjustment method of synthesized audio according to any one of claims 1 to 13.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310970213.6A CN116959452A (en) 2023-08-02 2023-08-02 Visual adjustment method, device, equipment, medium and product for synthesized audio


Publications (1)

Publication Number | Publication Date
CN116959452A | 2023-10-27

Family

ID=88456481

Family Applications (1)

Application Number | Title
CN202310970213.6A | Visual adjustment method, device, equipment, medium and product for synthesized audio

Country Status (1)

Country | Link
CN | CN116959452A (en)


Legal Events

Code | Title
PB01 | Publication